Tuesday 22 September 2015

Bulk Load from HDFS to Cassandra

Cassandra offers options for bulk importing data from other data sources (such as HDFS) into a Cassandra cluster by building entire SSTables and then streaming them into the cluster. Streaming pre-built SSTables is much simpler, faster and more efficient than sending millions of individual INSERT statements for the data you want to load into Cassandra.

The Cassandra bulk loader, also called sstableloader, provides the ability to bulk load external data into a cluster, load existing SSTables into another cluster with a different number of nodes, and restore snapshots. The sstableloader streams a set of SSTable data files to a live cluster; it does not simply copy the set of SSTables to every node, but transfers the relevant part of the data to each node, conforming to the replication strategy of the cluster. The table into which the data is loaded does not need to be empty.
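As a minimal sketch of the plain sstableloader case: you point it at one or more live nodes and at a directory laid out as keyspace/table containing the SSTable files. The seed node address and the data directory path below are assumptions for illustration.

# Stream the SSTables in the given table directory to the cluster
# reachable through the (hypothetical) seed node 10.0.0.1.
sstableloader -d 10.0.0.1 /var/lib/cassandra/data/my_keyspace/my_table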

Spotify has open-sourced hdfs2cass, a tool built on top of the bulk SSTable loader, for the specific use case of ingesting data from HDFS.

For example,

JAR=target/spotify-hdfs2cass-2.0-SNAPSHOT-jar-with-dependencies.jar
CLASS=com.spotify.hdfs2cass.Hdfs2Cass
INPUT=/example/path/songstreams
OUTPUT="cql://cassandra-seed-node/keyspace/tablename?reducers=5&columnnames=id,name,price"
hadoop jar $JAR $CLASS --input $INPUT --output $OUTPUT
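The output URI names a seed node, a keyspace and a table; the reducers and columnnames query parameters presumably set the number of reduce tasks that build SSTables and the CQL columns the input fields map to. The host, keyspace and table names above are placeholders; see the hdfs2cass README for the exact options it supports.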

References:

http://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsBulkloader_t.html
https://github.com/spotify/hdfs2cass
https://labs.spotify.com/2015/01/09/personalization-at-spotify-using-cassandra/
