Alvin's Big Data Notebook : Data Modeling in Cassandra

In Cassandra, as in other NoSQL databases, you need to set aside the way you model data in a relational database. You need to focus on the application as well as the data itself. Cassandra data modeling is result-oriented and rests on a clear understanding of how a query works internally in Cassandra.

The most important difference is that a relational database models data by relationships whereas Cassandra models data by query. A query is always the starting point of designing a Cassandra data model. As an analogy, a query is a question and the data model is the answer.

In a relational database, the primary goal is to remove data duplication through normalization so that each piece of data has a single source. With only one copy of each value to update, consistency is easier to maintain within ACID transactions. The storage space required is also reduced.

Conversely, Cassandra does not support joins, and denormalization is the recommended practice. This means that we do not mind storing the same value, such as a stock name, in every table that will be queried. A rule of thumb is one table for one query; it is as simple as that.
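As a rough illustration of "one table for one query", the sketch below uses plain Python dictionaries as stand-ins for two Cassandra tables. The table names, the stock symbol, and the sample values are all hypothetical. The only point is that the stock name is written into each table, so every query is answered from a single lookup without a join.

```python
# Hypothetical stand-ins for two Cassandra tables, each shaped for one query.
# Query 1: "What are the quotes for a given stock symbol?"
quotes_by_symbol = {}
# Query 2: "Which stocks are on a given watch list?"
stocks_by_watchlist = {}


def record_quote(symbol, name, day, close):
    """Store a quote; the stock name is duplicated on every entry."""
    quotes_by_symbol.setdefault(symbol, []).append(
        {"name": name, "day": day, "close": close}
    )


def add_to_watchlist(watchlist, symbol, name):
    """Store a watch list entry; the stock name is duplicated here as well."""
    stocks_by_watchlist.setdefault(watchlist, []).append(
        {"symbol": symbol, "name": name}
    )


record_quote("XYZ", "Example Corp", "2015-01-23", 10.5)
add_to_watchlist("tech", "XYZ", "Example Corp")

# Each query reads exactly one "table"; no join is needed to get the name.
print(quotes_by_symbol["XYZ"][0]["name"])
print(stocks_by_watchlist["tech"][0]["name"])
```

The cost of this design is that a change to the stock name has to be written to both tables, which is the trade-off denormalization accepts in exchange for simple reads.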

Row

A column-oriented store is a multidimensional map.
Map<RowKey, SortedMap<ColumnKey, ColumnValue>>

The row key can be of any built-in data type and must be unique. Columns are stored in sorted order by their column names. Sort order is extremely important because Cassandra cannot sort by value as we do in a relational database.
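The nested map can be sketched in Python as follows. This is a toy model, not Cassandra's storage engine: the `Row` class, its method names, and the sample data are assumptions made for illustration. It shows why keeping column names sorted matters: a range of columns can be read as one contiguous slice.

```python
import bisect


class Row:
    """One row: column names kept in sorted order, each mapped to a value."""

    def __init__(self):
        self._names = []   # column names, always sorted
        self._values = {}  # column name -> column value

    def put(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def columns(self):
        """Return (name, value) pairs ordered by column name."""
        return [(n, self._values[n]) for n in self._names]

    def column_slice(self, start, end):
        """Return columns whose names fall in [start, end], in name order."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


store = {}  # row key -> Row
row = store.setdefault("XYZ", Row())
row.put("2015-01-23", 10.5)  # inserted out of order on purpose
row.put("2015-01-21", 10.1)
row.put("2015-01-22", 10.3)

print(row.columns())
print(row.column_slice("2015-01-22", "2015-01-23"))
```

Because the ordering is fixed by the column names at write time, the column name has to be chosen to match the order in which the data will be read.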

A row cannot be split across two nodes in the cluster: if a row exists on a node, the entire row exists on that node. Different rows in a column family may have different column names. That is why Cassandra is both row oriented and column oriented.

A row used dynamically, whose set of columns grows as data is written, is called a wide row, in contrast to a row with a fixed set of static columns, which is termed a skinny row. The first column in the primary key definition is the row key.

If the primary key contains only one column, the row is a skinny row.
If the primary key contains more than one column, it is called a compound primary key and the row is a wide row.
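The difference between the two cases can be sketched with a small Python model. The helper `to_storage_rows` and the sample records are hypothetical, and the model is a simplification: it assumes that the first primary key column becomes the row key and that any remaining primary key columns become a prefix of the stored column names.

```python
from collections import defaultdict


def to_storage_rows(records, primary_key):
    """Group records into rows keyed by the first primary key column.

    The remaining primary key columns, if any, become a prefix of the
    stored column names, so several records can share one wide row.
    """
    row_key_col, *rest = primary_key
    rows = defaultdict(dict)
    for record in records:
        prefix = tuple(record[c] for c in rest)
        for col, value in record.items():
            if col not in primary_key:
                rows[record[row_key_col]][prefix + (col,)] = value
    # Sort the columns of each row by name, as the storage model does.
    return {k: dict(sorted(v.items())) for k, v in rows.items()}


quotes = [
    {"symbol": "XYZ", "day": "2015-01-22", "close": 10.3},
    {"symbol": "XYZ", "day": "2015-01-23", "close": 10.5},
]

# Single-column primary key: a skinny row with a fixed set of columns.
# Both records share the row key, so the second overwrites the first.
skinny = to_storage_rows(quotes, ["symbol"])
# Compound primary key: both records live side by side in one wide row.
wide = to_storage_rows(quotes, ["symbol", "day"])

print(skinny)
print(wide)
```

In the wide case the row for "XYZ" gains one more column for every new day, which is the dynamic usage described above.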

A keyspace contains the replication settings that control how data is distributed and replicated in the cluster. Very often, one cluster contains just one keyspace.

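The comparator of the column family dictates how columns are ordered within a row on reads: columns are sorted by their column names according to the comparator. The order of the rows themselves is determined by the partitioner rather than by the comparator.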

No Foreign Keys and Joins

Foreign keys are used in a relational database to define the relationship between two tables and to maintain referential integrity between them. Foreign keys and joins are the product of normalization in a relational data model.

Cassandra has neither foreign keys nor joins. Instead, it encourages a denormalized data model and performs best with one. Denormalization addresses the poor performance of highly complex relational queries that involve a large number of table joins.

Foreign keys and joins can be avoided in Cassandra with proper data modeling.

No Sequences

In a relational database, sequences are usually used to generate unique values for a surrogate key. Cassandra has no sequences because they are extremely difficult to implement in a peer-to-peer distributed system. Instead, it uses a UUID, a 128-bit number represented by 32 lowercase hexadecimal digits, displayed in five groups separated by hyphens.
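The textual form of a UUID can be checked with Python's standard `uuid` module. This is only a client-side sketch: the variable names are illustrative, and the random version 4 UUID used here is one common choice rather than the only one.

```python
import uuid

# uuid4() draws a random (version 4) UUID; no coordination between
# nodes or clients is needed to generate it.
surrogate_key = uuid.uuid4()

text = str(surrogate_key)
groups = text.split("-")

print(text)
print(len(groups))               # five groups separated by hyphens
print(sum(len(g) for g in groups))  # 32 hexadecimal digits in total
```

Because each client can generate such a key on its own, no node has to act as the single owner of a counter, which is what makes sequences hard in a peer-to-peer system.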

Time-To-Live (TTL)

TTL is set on columns only. The unit is seconds. When set on a column, it automatically counts down, and the column then expires on the server side without any intervention from the client application. Typical use cases are the generation of security tokens and one-time tokens, automatic purging of outdated columns, and so on.
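The behaviour can be imitated with a toy Python class. This is not how Cassandra implements expiry; the class `TtlColumns`, the injected clock, and the sample column names are assumptions made so that the example runs without waiting. In CQL itself the TTL is supplied with a `USING TTL` clause on the write.

```python
import time


class TtlColumns:
    """Toy column store in which each column may carry a TTL in seconds."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._data = {}  # column name -> (value, expiry time or None)

    def put(self, name, value, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._data[name] = (value, expires_at)

    def get(self, name):
        value, expires_at = self._data.get(name, (None, None))
        if expires_at is not None and self._clock() >= expires_at:
            # Expired: the reader sees nothing and never has to clean up.
            self._data.pop(name, None)
            return None
        return value


now = [1000.0]  # fake clock so the example runs instantly
columns = TtlColumns(clock=lambda: now[0])
columns.put("login_token", "abc123", ttl=30)
columns.put("username", "alvin")  # no TTL: never expires

print(columns.get("login_token"))  # read within the 30 seconds
now[0] += 31
print(columns.get("login_token"))  # read after the TTL has passed
print(columns.get("username"))
```

The client only writes the token with a TTL and reads it back; it never issues a delete, which is what makes TTL convenient for one-time tokens.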

Saturday, 24 January 2015
