Thursday 17 September 2015

Enable HiveContext in Spark


HiveContext is a superset of SQLContext in Spark: it supports HiveQL queries and Hive built-in functions,
and it adds support for finding tables registered in the Hive MetaStore.
Users who do not have an existing Hive deployment can still create a HiveContext.
When not configured by hive-site.xml, the context automatically creates metastore_db
and a warehouse directory in the current directory.
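
For example, the following runs out of the box in spark-shell (where sc is the predefined SparkContext); the table name src is only illustrative:

import org.apache.spark.sql.hive.HiveContext

// No Hive installation required: without a hive-site.xml on the classpath,
// Spark creates metastore_db/ and a warehouse directory in the current
// working directory.
val hiveContext = new HiveContext(sc)
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")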

1. To enable HiveContext, add the spark-hive library to build.sbt:
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.0"  // match your Spark version

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

def collectList(hiveContext: HiveContext, dataset: DataFrame,
                groupByCol: String, aggCol: String): DataFrame = {
  // Register the DataFrame as a temporary table so it is visible to HiveQL.
  dataset.registerTempTable("tmptable")
  // COLLECT_LIST is a Hive built-in aggregate that a plain SQLContext
  // cannot resolve.
  hiveContext.sql(
    s"SELECT $groupByCol, COLLECT_LIST($aggCol) AS list FROM tmptable GROUP BY $groupByCol")
}
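
For example, given a hypothetical DataFrame df with columns user and item, collecting each user's items into a list looks like:

val grouped = collectList(hiveContext, df, "user", "item")
grouped.show()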


2. To enable HiveContext, we also have to give the JVM more PermGen space, because loading Hive's classes exhausts the default.
For example, set the VM option in IntelliJ:
-XX:MaxPermSize=2G
Or pass it on the command line:
--driver-java-options -XX:MaxPermSize=2G
Otherwise, you will get the error below:

Exception in thread "main" java.lang.reflect.InvocationTargetException
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.s
Caused by: java.lang.OutOfMemoryError: PermGen space
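
Alternatively, the same flag can go through Spark's own configuration; a sketch with spark-submit (the class and jar names are hypothetical):

spark-submit \
  --class com.example.HiveContextApp \
  --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=2G" \
  hive-context-example.jar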


3. When testing locally without Hive settings, the program will generate a metastore_db directory in the current working directory.
On a cluster, the driver also likely needs hive-site.xml in order to determine its JDBC connection settings for the metastore, plus the JDBC driver jar on its classpath:
--driver-class-path /classpath/mysql-connector-java-5.1.30.jar:hive-site.xml --files /home/hadoop/spark/conf/hive-site.xml
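
For reference, the JDBC settings the driver reads from hive-site.xml look roughly like this (the host, database name, and credentials are placeholders):

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>secret</value>
  </property>
</configuration>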


Reference:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/sql/hive/HiveFromSpark.scala
https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/examples/using-hivecontext-yarn-cluster.md
