Cassandra supports bulk importing data from other data sources
(such as HDFS) by building entire SSTables and then streaming them
into the cluster. Streaming SSTables is much simpler, faster and more
efficient than sending millions of individual INSERT statements for
all of the data you want to load into Cassandra.
The Cassandra bulk loader, also called sstableloader, provides the ability to:
Bulk load external data into a cluster.
Load existing SSTables into another cluster with a different number of nodes.
Restore snapshots.
The sstableloader streams a set of SSTable data files to a live cluster;
it does not simply copy the set of SSTables to every node, but transfers
the relevant part of the data to each node, conforming to the replication strategy of the cluster.
The table into which the data is loaded does not need to be empty.
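As a rough sketch of what an invocation might look like: sstableloader takes a comma-separated list of initial contact nodes via -d and a directory of SSTable files, and it reads the target keyspace and table from the last two components of that directory path. The node addresses, keyspace, table and path below are hypothetical placeholders, not values from a real cluster, and the command is only printed rather than executed.

```shell
#!/bin/sh
# Hypothetical values for illustration only; replace with your own.
CONTACT_NODES="10.0.0.1,10.0.0.2"             # live nodes used to discover the ring
SSTABLE_DIR="/tmp/bulk/my_keyspace/my_table"  # path must end in <keyspace>/<table>

# sstableloader infers the target keyspace and table from the directory path.
KEYSPACE=$(basename "$(dirname "$SSTABLE_DIR")")
TABLE=$(basename "$SSTABLE_DIR")
echo "Would stream SSTables into ${KEYSPACE}.${TABLE}"

# Build the command as a dry run; execute it on a host that has
# Cassandra's tools installed and can reach the cluster.
LOADER_CMD="sstableloader -d $CONTACT_NODES $SSTABLE_DIR"
echo "$LOADER_CMD"
```

Printing the command first makes it easy to confirm that the directory layout maps to the intended keyspace and table before any data is streamed.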
Spotify has open-sourced a tool built on top of the bulk SSTable loader, named
hdfs2cass, for the specific use case of ingesting data from HDFS.
For example,
JAR=target/spotify-hdfs2cass-2.0-SNAPSHOT-jar-with-dependencies.jar
CLASS=com.spotify.hdfs2cass.Hdfs2Cass
INPUT=/example/path/songstreams
OUTPUT="cql://cassandra-seed-node/keyspace/tablename?reducers=5&columnnames=id,name,price"
hadoop jar "$JAR" "$CLASS" --input "$INPUT" --output "$OUTPUT"
References:
http://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsBulkloader_t.html
https://github.com/spotify/hdfs2cass
https://labs.spotify.com/2015/01/09/personalization-at-spotify-using-cassandra/