Alvin's Big Data Notebook : Row Partitioning in Cassandra

A keyspace in Cassandra is analogous to a schema in a relational database. Typically, a cluster holds one keyspace, but one cluster can contain more than one keyspace.

A client can issue read or write requests to any node. The node that receives a request becomes the coordinator, acting as a proxy for the client and carrying out the steps explained previously. Data is distributed across the cluster, and the mechanism that maps data to nodes is called consistent hashing.

Consistent hashing allows each node in the cluster to independently determine which nodes are replicas for a given row key. A node hashes the row key and then compares that hash value to the token of each node in the cluster. If the hash value falls between the token of the previous node in the ring and a node's own token (tokens are assigned to nodes in a clockwise direction), that node is the first replica for that row.
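The token comparison described above can be sketched in a few lines of Python. This is only an illustration: the node names, the small integer tokens, and the `owner_of` helper are assumptions made for readability, not Cassandra's actual code or token values.

```python
from bisect import bisect_left

# Hypothetical ring: node name -> token (small integers for readability).
RING = {"node-a": 25, "node-b": 50, "node-c": 75, "node-d": 100}


def owner_of(hash_value, ring=RING):
    """Return the node whose range (previous token, own token] holds hash_value."""
    tokens = sorted(ring.values())
    by_token = {token: node for node, token in ring.items()}
    # Find the first token >= hash_value. A hash past the largest token
    # wraps around the ring to the node with the smallest token.
    index = bisect_left(tokens, hash_value) % len(tokens)
    return by_token[tokens[index]]


if __name__ == "__main__":
    for value in (10, 26, 100, 101):
        print(value, "->", owner_of(value))
```

Because every node knows the same set of tokens, each one can run this lookup on its own and arrive at the same answer without asking a central directory.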

Each row has a row key, which a partitioner uses to calculate a hash value. The hash value determines the node that stores the first replica of the row. The partitioner is just a hash function applied to the row key, and the choice of partitioner also affects how evenly the data is distributed across the cluster. When a write occurs, the first replica of the row is always placed on the node whose token range contains the row key's hash value.
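To make the partitioner's role concrete, here is a rough Python sketch of a hash-based partitioner and of how it spreads keys over equal token ranges. The MD5 hash, the 32-bit `TOKEN_SPACE`, and the `partition_hash` and `spread` helpers are assumptions chosen for the example; they are not meant to reproduce the token range or hash function of any real Cassandra partitioner.

```python
import hashlib
from collections import Counter

# Assumed token space for this sketch only, not Cassandra's real range.
TOKEN_SPACE = 2 ** 32


def partition_hash(row_key: str) -> int:
    """Map a row key to a token by hashing it (MD5 is an assumption here)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % TOKEN_SPACE


def spread(row_keys, node_count=4):
    """Count how many keys fall into each of node_count equal token ranges."""
    range_size = TOKEN_SPACE // node_count
    counts = Counter()
    for key in row_keys:
        # Clamp so a remainder at the top of the token space
        # still lands in the last range.
        bucket = min(partition_hash(key) // range_size, node_count - 1)
        counts[bucket] += 1
    return counts


if __name__ == "__main__":
    keys = [f"user:{i}" for i in range(1000)]
    print(spread(keys))
```

The same key always hashes to the same token, which is what lets any node locate a row, while a well-mixed hash keeps the per-range counts roughly balanced.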

Each row is replicated to a number of nodes set by a parameter called the replication factor, and the additional replicas go to other nodes in the ring. The replication strategy is the method of determining which nodes the replicas are placed on. Cassandra provides several options, such as rack-aware, rack-unaware, and network-topology-aware strategies.
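As an illustration of how the replication factor and a placement strategy interact, the sketch below assumes the simplest rack-unaware approach: after the first replica, keep walking clockwise and take the next nodes on the ring. The ring contents and the `replicas_for` helper are hypothetical, and a rack-aware or topology-aware strategy would pick nodes differently.

```python
from bisect import bisect_left

# Hypothetical ring as (token, node) pairs.
RING = [(25, "node-a"), (50, "node-b"), (75, "node-c"), (100, "node-d")]


def replicas_for(hash_value, replication_factor, ring=RING):
    """Return the nodes holding a row, first replica first.

    Assumes a rack-unaware placement: the remaining replicas are simply
    the next nodes clockwise from the first replica.
    """
    ordered = sorted(ring)
    if not 1 <= replication_factor <= len(ordered):
        raise ValueError("replication factor must be between 1 and the node count")
    tokens = [token for token, _ in ordered]
    # The first replica is the node whose token range contains the hash.
    start = bisect_left(tokens, hash_value) % len(ordered)
    return [ordered[(start + i) % len(ordered)][1] for i in range(replication_factor)]


if __name__ == "__main__":
    print(replicas_for(60, 3))
```

With a replication factor of 3, a row whose hash is 60 lands first on `node-c` and is then copied to `node-d` and, wrapping around the ring, `node-a`.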

Friday, 23 January 2015
