Alvin's Big Data Notebook : Secondary Index In Cassandra

Secondary Index

The secondary index in Cassandra is not identical to that in a relational database. The secondary index in Cassandra can be created to query a column that is not a part of the primary key. The benefit is a fast, efficient lookup of data matching the indexed columns in the WHERE clause. Each table can have more than one secondary index.

Behind the scenes, it is implemented as a separate hidden table which is maintained automatically by Cassandra's internal process.The major difference between a primary key and a secondary index is that the former is a distributed index while the latter is a local index.

The primary key is used to determine the node location and so, for a given row key, its node location can be found immediately.
The secondary index is used just to index data on the local node, and it might not be possible to know immediately the locations of all matched rows without having examined all the nodes in the cluster. Hence, the performance is unpredictable.
As with relational databases, keeping secondary indexes up to date is not free, so unnecessary indexes should be avoided.

Secondary indexes in Cassandra are NOT equivalent to those in the traditional RDBMS. They are not akin to a B-tree index in RDBMS. They are mostly like a hash. So, the range queries do not work on secondary indexes in Cassandra, only equality queries work on secondary indexes.

CREATE INDEX index_name ON base_table (indexed_column);

Best practice to create a secondary index:

Don't index on high-cardinality columns: Cassandra stores secondary indexes only for local rows in the data node as a hash-multimap or as bitmap indexes. The more unique values that exist in a particular column, the more overhead you will have to query and maintain the index.
Don't use index in tables having a counter column
Don't index on a frequently updated or deleted column: During compaction, marked columns are deleted in the index (a hidden table) until the tombstone limit reaches 100 K cells. After exceeding this limit, the query that uses the indexed value will fail.
Don't abuse the index to look for a row in a large partition

Alvin's Big Data Notebook

Saturday, 24 January 2015

Secondary Index In Cassandra

No comments:

Post a Comment