Thursday 15 January 2015

Use the Spark Cassandra Connector in spark-shell

Use the connector from the spark-shell command line to access Cassandra from Spark.

$ spark-shell --jars projects/spark-cassandra-assembly-1.0.0-SNAPSHOT-jar-with-dependencies.jar
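
The assembly ("fat") jar passed to --jars bundles the connector together with its dependencies; the path above is just an example from my machine. One way to produce such a jar (an assumption, following the connector's quick-start guide referenced below) is to build the assembly inside the spark-cassandra-connector source tree:

$ sbt assembly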

// configure a new SparkContext pointing at the Cassandra host
scala> import org.apache.spark.SparkContext
scala> import org.apache.spark.SparkContext._
scala> import org.apache.spark.SparkConf
scala> sc.stop
scala> val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
scala> val sc = new SparkContext("local[2]", "test", conf)

// access Cassandra
scala> import com.datastax.spark.connector._
scala> val rdd = sc.cassandraTable("test", "kv")
scala> println(rdd.first)

scala> val collection = sc.parallelize(Seq(("key3", 3), ("key4", 4)))
scala> collection.saveToCassandra("test", "kv", SomeColumns("key", "value"))
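
Beyond reading whole rows, the connector can also push column selection and key-based filtering down to Cassandra. A minimal sketch, assuming the test.kv table has a text key column and an int value column (as in the connector quick start) and that key3 was inserted above:

scala> val row = sc.cassandraTable("test", "kv").select("key", "value").where("key = ?", "key3").first
scala> println(row.getString("key") + " -> " + row.getInt("value"))
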
You may have some issues during this process.

1. scala> val rdd = sc.cassandraTable("test", "kv")
java.lang.AbstractMethodError
at org.apache.spark.Logging$class.log(Logging.scala:52)
at com.datastax.spark.connector.cql.CassandraConnector$.log(CassandraConnector.scala:145)

Solution: This is due to an incompatibility between connector 1.0 and Spark 1.1.

Upgrade the connector to spark-cassandra-connector_2.10-1.1.1.jar.
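
If you build your application (or the connector assembly) with sbt rather than downloading the jar, the matching coordinate would look roughly like this (the version number here is an assumption matching the jar above):

libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.1"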


2. scala> val sc = new SparkContext("spark://127.0.0.1:7077", "test", conf)
SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use
java.net.BindException: Address already in use

Solution: 
In the Spark shell, a special interpreter-aware SparkContext is already created for you, in the variable called sc. Making your own SparkContext will not work.

Either use the default sc, or stop it and create your own:
scala> sc.stop
scala> val sc = new SparkContext("local[2]", "test", conf)
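
A quick way to confirm the re-created context can reach Cassandra (just a sanity check against the same test.kv table used above):

scala> sc.cassandraTable("test", "kv").count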


3. scala> val rdd = sc.cassandraTable("test", "kv")
error: bad symbolic reference. A signature in CassandraRDD.class refers to term core
in package com.datastax.driver which is not available.

Solution:
This is caused by the command below, which is missing the connector's other dependencies:
spark-shell --jars spark-cassandra-connector_2.10-1.1.1.jar

The connector is published as "com.datastax.spark" %% "spark-cassandra-connector" % Version. Add the spark-cassandra-connector jar and its dependency jars to:
  • the classpath of your project
  • the classpath of every Spark cluster node
Put the following jars in /your/path/to/spark/libs

cassandra-clientutil-2.0.9.jar
cassandra-driver-core-2.0.4.jar
cassandra-thrift-2.0.9.jar
guava-15.0.jar
joda-convert-1.2.jar
joda-time-2.3.jar
libthrift-0.9.1.jar
spark-cassandra-connector_2.10-1.1.0-rc4.jar

In $SPARK_HOME/conf/spark-env.sh,
SPARK_CLASSPATH=/your/path/to/spark/libs/*
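
Note that, as one of the comments below points out, SPARK_CLASSPATH is deprecated from Spark 1.0 onwards. A roughly equivalent setup (a sketch; the path is a placeholder) is to set the extra classpath properties in $SPARK_HOME/conf/spark-defaults.conf instead:

spark.driver.extraClassPath     /your/path/to/spark/libs/*
spark.executor.extraClassPath   /your/path/to/spark/libs/*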


Reference:
http://stackoverflow.com/questions/25837436/how-to-load-spark-cassandra-connector-in-the-shell
https://github.com/AlvinCJin/spark-cassandra-connector/blob/master/doc/0_quick_start.md
https://github.com/datastax/spark-cassandra-connector/issues/44

5 comments:

  1. Thanks a lot. Very good content, really helpful.

  2. Thanks. Needed help for setting up Cassandra and Spark.

  3. Using SPARK_CLASSPATH has been deprecated in Spark 1.0+

     Reply: yes, I am getting the same deprecation message too. Did you resolve it?

  4. This was really helpful, trust me. The only tutorial that actually works! 100% working. Just use the latest versions of the jars if you are using the latest versions of Cassandra and Spark.