Sorted String Table in Apache Cassandra

From my previous blog post you already know that data written to Apache Cassandra is eventually persisted into so-called Sorted String Table (SSTable) files. In this article I’m going to explain the Apache Cassandra data directory and the SSTable format in more detail. That kind of knowledge comes in handy in several situations:

  • Initial cluster setup & disk size capacity planning
  • Schema modelling
  • Debugging SSTables
  • Understanding CQL limitations

Please note that the storage engine of Apache Cassandra was refactored in version 3. This article refers to the new implementation.

Demo Setup

Because from a client’s perspective all data is stored in a tabular format, let’s start by creating a demo keyspace, one table and by inserting a few rows into it. The demo setup is listed below.

At this point, if you have just executed the above statements, the data has not been written to any SSTable on disk yet. It has only been written to the commit log and to an in-memory structure (the memtable). But we can force the system to create a new SSTable and flush the above data into it, for example with nodetool flush.

Cassandra Data Directory

The data is stored in a data directory. By default this is $CASSANDRA_HOME/data/data (if in doubt, see your cassandra.yaml configuration file). Below it, a subdirectory is created for each keyspace, which in turn contains one subdirectory per CQL table. Each table directory contains a file ending in TOC.txt for every SSTable. That table of contents lists the components of the given SSTable, because each SSTable is made up of multiple components in separate files. The table below provides a brief description of each component.

File | Description
mc-1-big-TOC.txt | A file that lists the components of the given SSTable.
mc-1-big-Digest.crc32 | A file that consists of a checksum of the data file.
mc-1-big-CompressionInfo.db | A file that contains metadata for the compression algorithm, if enabled.
mc-1-big-Statistics.db | A file that holds statistical metadata about the SSTable.
mc-1-big-Index.db | A file that contains the primary index data.
mc-1-big-Summary.db | A file that provides summary data of the primary index, e.g. index boundaries, and is meant to be held in memory.
mc-1-big-Filter.db | A file that holds a Bloom filter, a data structure kept in memory to check whether a partition might exist in this SSTable, which minimizes access to data on disk.
mc-1-big-Data.db | A file that contains the base data itself. Note: all the other component files can be regenerated from the base data file.

You might have noticed that all of those files follow a naming convention, i.e. they are prefixed with something like "mc-1-big-". That prefix provides additional information about the SSTable. For example, "mc" indicates the SSTable format version, "1" is a generation counter, and "big" names the SSTable format. Because SSTables are immutable, the counter is incremented whenever a new SSTable is written, e.g. by a flush or a compaction.
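The naming convention can be taken apart mechanically. The following Python sketch splits a component file name into its parts. The function name, the field names and the regular expression are my own choices for illustration and only cover names shaped like the ones in the table above.

```python
import re
from typing import NamedTuple


class SSTableComponent(NamedTuple):
    version: str      # SSTable format version, e.g. "mc"
    generation: int   # counter that distinguishes the SSTables of one table
    fmt: str          # format name, e.g. "big"
    component: str    # component file, e.g. "Data.db"


# Matches names such as "mc-1-big-Data.db" (the layout shown in this article).
_NAME = re.compile(
    r"^(?P<version>[a-z]+)-(?P<generation>\d+)-(?P<fmt>[a-z]+)-(?P<component>.+)$"
)


def parse_component_name(filename: str) -> SSTableComponent:
    """Split an SSTable component file name into its parts."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"not an SSTable component name: {filename!r}")
    return SSTableComponent(
        version=match["version"],
        generation=int(match["generation"]),
        fmt=match["fmt"],
        component=match["component"],
    )
```

For example, parse_component_name("mc-1-big-Data.db") yields version "mc", generation 1, format "big" and component "Data.db".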

For some more background information I recommend a paper called Bigtable: A Distributed Storage System for Structured Data. Bigtable itself uses the SSTable file format to store data. Initially, Apache Cassandra borrowed the SSTable file format from Bigtable.

SSTable Data Model

Now, to get an idea of the SSTable data model, I recommend a tool called sstabledump, which is distributed along with Apache Cassandra. It has been available since version 3.0.4. For version 2.2 and earlier, use another tool called sstable2json. Both tools can export the content of an SSTable file to JSON.

The JSON export shows that SSTable data files are composed of partitions and their rows. This differs from earlier versions of Apache Cassandra (before 3.0), where SSTable files were composed of partitions and their cells. Each partition is identified by its partition key, which consists of one or more columns. Each row is identified by one or more clustering values. If you have worked with CQL before, you should be quite comfortable with the new layout, as it is pretty much how data is represented in CQL.

To provide a better understanding of what is shown in the above JSON, I created an entity-relationship diagram (listed below) with the primary entities derived from the above JSON and how they relate to each other. Each partition is followed by an arbitrary number of rows, and each row is followed by an arbitrary number of cells. Row and Cell are weak entities because a partition is required for them to be identified.

Furthermore, a row is described by type, position, clustering values, liveness information and cells. Note that rows within one partition are sorted by their clustering values. The listing above shows rows of type "row" only. Another row type, for example, is "static_block". That type is used for CQL static columns: instead of repeating static column values for every row, they are stored only once per partition.
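To make the partition, row and cell structure more tangible, here is a small Python sketch that walks a JSON document shaped like an sstabledump export. The sample data, the column names "author" and "content" and the function name are hypothetical and only serve as an illustration.

```python
import json

# Hypothetical excerpt shaped like an sstabledump export; all values are made up.
SAMPLE = """
[
  {
    "partition": {"key": ["1"], "position": 0},
    "rows": [
      {
        "type": "row",
        "position": 30,
        "clustering": ["2017-01-01 10:00:00.000Z"],
        "liveness_info": {"tstamp": "2017-01-01T10:00:00.000001Z"},
        "cells": [
          {"name": "author", "value": "Alice"},
          {"name": "content", "value": "First comment"}
        ]
      },
      {
        "type": "row",
        "position": 70,
        "clustering": ["2017-01-02 10:00:00.000Z"],
        "liveness_info": {"tstamp": "2017-01-02T10:00:00.000001Z"},
        "cells": [{"name": "content", "value": "Second comment"}]
      }
    ]
  }
]
"""


def summarize(dump):
    """Return (partition key, clustering values, cell names) for every row."""
    summary = []
    for partition in dump:
        key = tuple(partition["partition"]["key"])
        for row in partition["rows"]:
            if row["type"] != "row":
                continue  # this sketch skips other row types such as static blocks
            clustering = tuple(row.get("clustering", []))
            cells = [cell["name"] for cell in row["cells"]]
            summary.append((key, clustering, cells))
    return summary


rows = summarize(json.loads(SAMPLE))
```

Note that the second sample row has no "author" cell at all: a column without a value simply does not appear in the list of cells.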

For each cell, the column name and its value are stored. Before 3.0, the clustering column values were repeated for each cell. I recommend this DataStax blog post if you are especially interested in the storage format changes made in Apache Cassandra 3.0 and how those changes affect disk space usage.

The liveness information can be overridden for each cell. By default, however, the row-level liveness information applies to every cell of the row and is stored only once.
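The fallback from cell-level to row-level liveness information can be sketched in a few lines of Python. The sketch assumes a row in the shape of an sstabledump export, where a cell carries its own "tstamp" only if it differs from the row's. The sample row and the function name are hypothetical.

```python
def effective_timestamps(row: dict) -> dict:
    """Map each cell name to the write timestamp that applies to it."""
    row_tstamp = row.get("liveness_info", {}).get("tstamp")
    # A cell-level "tstamp" overrides the row-level one; otherwise fall back.
    return {cell["name"]: cell.get("tstamp", row_tstamp) for cell in row["cells"]}


# Hypothetical row: "content" was updated later than the row was inserted.
row = {
    "type": "row",
    "liveness_info": {"tstamp": "2017-01-01T10:00:00.000001Z"},
    "cells": [
        {"name": "author", "value": "Alice"},
        {
            "name": "content",
            "value": "edited",
            "tstamp": "2017-01-02T08:30:00.000002Z",
        },
    ],
}

timestamps = effective_timestamps(row)
```

Here "author" inherits the row-level timestamp, while "content" keeps its own, more recent one.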

Storage Format and CQL

Although the partitions above look as if they were sorted by key (first key "1", second key "2"), that is just accidental: partitions are ordered by the token that the partitioner derives from the partition key, not by the key value itself. Rows within one partition, on the other hand, are always sorted by their clustering column values, i.e. the "created_on" column. By default those values are sorted in ascending order, but the order can be descending as well.

Because rows are sorted by clustering column values on disk, time range queries, or range queries on clustering columns within one partition in general, perform well with Apache Cassandra.
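Why sorted rows make range queries cheap can be shown with an in-memory analogy in Python. This is not how Cassandra reads SSTables from disk. It only demonstrates that, on sorted data, a range is a contiguous slice that two binary searches can locate. The sample rows and the function name are hypothetical.

```python
from bisect import bisect_left, bisect_right
from datetime import date

# Rows of one partition, already sorted by clustering value (hypothetical data).
rows = [
    (date(2017, 1, 1), "first"),
    (date(2017, 1, 5), "second"),
    (date(2017, 2, 1), "third"),
    (date(2017, 3, 1), "fourth"),
]
clustering_values = [created_on for created_on, _ in rows]


def range_query(start: date, end: date):
    """Return rows with start <= created_on <= end using two binary searches."""
    lo = bisect_left(clustering_values, start)
    hi = bisect_right(clustering_values, end)
    return rows[lo:hi]
```

Without the sort order, every row of the partition would have to be inspected to answer the same query.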

The storage format also explains why you usually must restrict your queries to one or more partitions, for example if you want to use the ORDER BY clause: scanning all data (possibly across multiple nodes) would be too expensive.

Furthermore, null values for regular cells are not stored on disk at all (illustrated by comment author "Bob" from the demo setup above). That is worth knowing for schema modeling.

Final remarks

The sstabledump tool mentioned above provides some more options. One of them ("-d") prints an internal representation of your SSTable data file with one CQL row per line. That output is more compact and more precise, but also a bit more difficult to understand.

As usual, I hope you enjoyed reading this article.