Friday 1 August 2014

Insert Data from Regular hive table to Parquet hive table

If we want to insert data from a regular Hive table into a Parquet Hive table, we have to set the following in the Hive script:

SET mapred.child.java.opts=-Xmx4G -XX:+UseConcMarkSweepGC;  -- CDH4.7
SET hive.exec.dynamic.partition.mode=nonstrict;
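With these settings in place, the insert itself is an ordinary dynamic-partition INSERT ... SELECT. A minimal sketch, assuming Hive 0.13+/CDH5 where STORED AS PARQUET is available (older CDH4 builds use the parquet.hive.serde.ParquetHiveSerDe SerDe syntax instead); the table and column names are hypothetical:

-- Hypothetical Parquet target table.
CREATE TABLE parquet_events (
  event_id BIGINT,
  payload  STRING
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET;

-- Dynamic-partition insert from the regular (e.g. text-backed) table.
-- This is why hive.exec.dynamic.partition.mode=nonstrict is set above:
-- no static value is supplied for the dt partition.
INSERT OVERWRITE TABLE parquet_events PARTITION (dt)
SELECT event_id, payload, dt
FROM src_events;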

Check the job.xml of the submitted job to make sure the settings took effect.
Job File:  hdfs://xxxx/user/xxx/.staging/job_201407251347_0506/job.xml

-Xms, -Xmx: place boundaries on the heap size to increase the predictability of garbage collection.

-XX:+UseConcMarkSweepGC: enables the concurrent collector together with the parallel young-generation collector. If you would rather keep GC pauses short at the expense of more total CPU time spent on GC, and you have more than one CPU, use the concurrent collector.

In CDH5 (MRv2/YARN), the property names change:

SET mapreduce.map.java.opts=-Xmx4G;
SET mapreduce.reduce.java.opts=-Xmx4G;
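On YARN the JVM heap also has to fit inside the task container, or the NodeManager will kill the task for exceeding its memory limit. A hedged sketch of sizing the containers slightly above the 4G heap; the 5120 MB figure is an illustrative assumption, not a value from the original job:

-- Container sizes must exceed the -Xmx heap; values are illustrative.
SET mapreduce.map.memory.mb=5120;
SET mapreduce.reduce.memory.mb=5120;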

Otherwise, the job fails with an OutOfMemoryError (Java heap space):
2014-08-01 08:50:21,899 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
at parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
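The stack trace points at Parquet's write buffer: the Parquet writer holds an entire row group in memory before flushing it, and a dynamic-partition insert keeps one writer open per output partition per task, so buffers multiply quickly. Besides raising the heap, a hedged alternative is to shrink the row-group size; the 64 MB value below is an illustrative assumption:

-- Smaller row groups mean smaller in-memory write buffers per open Parquet file.
SET parquet.block.size=67108864;  -- 64 MB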

Reference:

http://stackoverflow.com/questions/220388/java-concurrent-and-parallel-gc
