Software Engineering

Wide Column Database Schema Design

esentri AG

Relational database management systems have dominated IT for over 30 years, so it is hardly surprising that although a vast amount of literature and material on database schema design is available, most of it focuses on those traditional database systems. However, since NoSQL systems rose in popularity around 2007, knowledge of relational schema design alone is no longer sufficient. In this post I am going to outline the differences and the aspects to consider when designing schemas for wide column database systems (and especially Apache Cassandra).

Before going into any details, I will provide a brief recap on data modelling in general as well as techniques targeting traditional RDBMS. After reading this post, you will be able to understand why data modelling for NoSQL systems like Apache Cassandra follows different rules and what is necessary to design schemas for those systems.

Data Modelling in General

Database schema design is usually based on some groundwork: the conceptual data model. That model provides a logical overview of all data that should be persisted; it describes, and is a model of, a specific part of reality. It is commonly designed using some kind of ERD notation or UML. The last thing to note about conceptual models is that they are independent of both the application and the database system, and thus also independent of the database system’s data model (relational, document store, graph database …).

The next step is the transformation of the conceptual model into a logical data model. This process, the model notation and the design objectives all depend on the logical data structures supported by the targeted database system. In general terms, it is the transformation of the entities, attributes and relationships derived from the conceptual model into storage objects (or data structures) supported by the given database system.

The final step is the transformation of the logical model into a physical model. It is the vendor-specific implementation of the logical model targeting a specific database management system. Some kind of data definition language (DDL) is used to describe the physical model. In the case of relational database systems, for example, that is almost certainly SQL (or some vendor-specific SQL dialect).

The following sections focus on the transformation of conceptual models into logical/physical models with regard to different data models, i.e. the relational and wide column data models.

Relational Schema Design

With regard to relational database systems, the logical model is designed by creating relations (tables) based on the entity types, attributes and relationships derived from the conceptual model. Furthermore, the foreign key concept is applied, as is normalization. In a nutshell, the objectives of this approach are to ensure data consistency (and integrity) and to avoid data anomalies.

The physical model is created by transforming the logical model into executable statements. For relational database systems this is almost certainly SQL or some vendor-specific SQL dialect. Besides a statement for each table of the logical model, the physical model also contains data type definitions for each column and statements for each defined data integrity condition. Statements for index structures are often added as well to ensure faster data access.
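As an illustration, a relational physical model for a hypothetical music catalogue (all table and column names here are invented for this example) could combine table definitions, data types, integrity conditions and an index like this:

```sql
-- Normalized relational schema for a hypothetical music catalogue
CREATE TABLE album (
    album_id  INTEGER PRIMARY KEY,
    title     VARCHAR(255) NOT NULL,
    released  DATE
);

CREATE TABLE song (
    song_id     INTEGER PRIMARY KEY,
    title       VARCHAR(255) NOT NULL,
    duration_s  INTEGER CHECK (duration_s > 0),    -- data integrity condition
    album_id    INTEGER NOT NULL REFERENCES album  -- foreign key concept
);

-- Index structure added for faster access on a frequent access path
CREATE INDEX idx_song_album ON song (album_id);
```

Note how queries play no role in this model apart from the optional index: the schema is derived from the data, not from the access patterns.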

Overall, the relational schema design process is very data-centric. It is assumed that you can always write and execute a query that fits your needs, using sub-queries, join statements and so forth. Query patterns are not a primary design objective. This carries the risk that queries are simply ignored until performance begins to degrade at a later time.

Wide Column Schema Design

Regarding wide column stores (and especially with Apache Cassandra in mind), things are different. As described in a previous blog post, wide column stores often imply query limitations due to their storage format, and therefore access patterns cannot be ignored during schema design. Quite the opposite is true: access patterns are first-class citizens when modelling wide column store schemata.

Application Query Pattern

Thus, to create the (wide column) logical data model, it is recommended to first identify all queries the application requires to fulfil its purpose. Each query should be assigned a unique label for easier reference. To ensure no query has been forgotten, it can be useful to arrange the list of queries into an application workflow diagram in which each activity must be supported by one query.
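For a hypothetical music catalogue application, such a labelled query list (the queries themselves are invented for illustration) might look like this:

```
Q1: Find all albums of a given artist.
Q2: Find all songs of a given album, ordered by track number.
Q3: Find a single song by its identifier.
```

Each label can then be attached to an activity in the workflow diagram and, later, to the table that serves it.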

Compared to the relational schema design process, identifying and listing application queries is an additional step in the wide column schema design process.

Wide Column Logical Model

Next, the logical data model is designed by creating one table for each query previously identified. Table names typically begin with the name of the primary entity type served by a query. If a query includes filter attributes, a constant separator value _by_ is appended, followed by those filter attribute names. The result is table names like songs_by_album (where song is the primary entity type and album is the supported filter attribute).

Furthermore, the wide column data model requires the definition of a primary key for each table. The primary key is used to uniquely identify a table entry. It consists of a partition key and optionally of one or more clustering columns. Wide column stores other than Apache Cassandra might use a slightly different terminology. HBase for example refers to row keys and column families instead.

As a rule of thumb, designers will designate filter attributes as partition keys and sort order attributes as clustering columns. If the primary key definition does not yet guarantee uniqueness, additional clustering columns must be added. Remaining attributes of the query’s primary entity type that are not yet part of the table definition are added as regular columns. The attributes available to be added are derived from the conceptual model.
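Applied to the songs_by_album example, this rule of thumb could yield the following primary key clause (a sketch in CQL syntax; the column names are assumptions made for this example):

```sql
-- Sketch only: column names are invented for illustration
PRIMARY KEY ((album), track_number, song_id)
-- (album)       -> partition key: the filter attribute of the query
-- track_number  -> clustering column: the sort order attribute
-- song_id       -> clustering column added only to guarantee uniqueness
```

The inner parentheses mark the partition key; everything after it is a clustering column in declaration order.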

For Apache Cassandra, the most popular notation for logical models is the Chebotko notation. Each table is modelled as a rectangle including its name, columns and queries supported or enabled by that table. Symbols are used to clearly identify partition keys (K) and clustering columns (C↑ or C↓). In another blog post I am going to describe the Chebotko notation in more detail.
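A Chebotko-style table rectangle for the songs_by_album example could be sketched in plain text as follows, assuming a query Q2 ("find all songs of a given album, ordered by track number") and invented column names:

```
songs_by_album               |  Q2
-----------------------------+-----
album         TEXT   K       |
track_number  INT    C↑      |
song_id       UUID   C↑      |
title         TEXT           |
duration_s    INT            |
```

K marks the partition key, C↑ an ascending clustering column, and the right-hand side lists the queries the table supports.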

Please note: Do not be misled by the rather short description of how to choose each table’s partition key and clustering columns. Designing partition and primary keys is a crucial step for wide column stores. Primary keys not only determine (un-)supported queries but may also cause cluster hot spots, accidental data overwrites and partition overflows. It is therefore highly recommended to perform a schema refinement, once the physical model is ready.

Wide Column Physical Model

In general terms, the last step, creating the (wide column) physical model, is the same as for relational database systems: it is the transformation of the logical model into executable DDL statements. The only difference is that physical models targeting wide column stores cannot use SQL as a notation. So far, there is no standard notation for this; the available notations depend on the wide column store implementation. Schemas targeting Apache Cassandra, for example, will almost certainly use the Cassandra Query Language (CQL).
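For the songs_by_album example, the corresponding CQL statement could look like this (a sketch only; the column names and data types are assumptions):

```sql
-- Sketch of a CQL table definition for the songs_by_album example
CREATE TABLE songs_by_album (
    album        text,
    track_number int,
    song_id      uuid,
    title        text,
    duration_s   int,
    PRIMARY KEY ((album), track_number, song_id)
) WITH CLUSTERING ORDER BY (track_number ASC);
```

The statement maps directly onto the logical model: the partition key, clustering columns and regular columns are carried over unchanged, and the clustering order makes the sort direction explicit.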

Of course, the physical model might include many more statements and configuration properties than just the table definitions. But once again, those are implementation-specific. Examples include statements for materialized views (as a substitute for additional query tables) and indexes, as well as the replication strategy and replication factor.
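Two such implementation-specific additions could be sketched in CQL like this (all names are invented for illustration): a keyspace definition carrying the replication strategy and factor, and a materialized view that serves a further query over the songs_by_album data without a second, manually maintained table:

```sql
-- Replication strategy and factor are configured per keyspace
CREATE KEYSPACE music
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

-- Materialized view serving the same data under a different primary key;
-- every primary key column of the view must be restricted with IS NOT NULL
CREATE MATERIALIZED VIEW music.songs_by_title AS
  SELECT album, track_number, song_id, title
  FROM music.songs_by_album
  WHERE title IS NOT NULL AND album IS NOT NULL
    AND track_number IS NOT NULL AND song_id IS NOT NULL
  PRIMARY KEY ((title), album, track_number, song_id);
```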

Summary

Overall, the wide column schema design process has a different focus than the one targeting relational systems. It is a query-first, or schema-by-query-pattern, approach and is less data-centric. Foreign keys and join operations are not supported, so normalization is not a design objective.
