Friday, 11 July 2014

Parquet OutOfMemory Issue

When we write job output in Parquet format on CDH 4.7, the task fails with:

FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space

Below is the explanation from the parquet-dev Google Groups thread:

1 Gig is pretty small for a MR task; Parquet gets much of its efficiency because it buffers a bunch of data in memory, organizes it to be stored more efficiently, and then flushes -- so some non-negligible amount of RAM is expected to be used by the writers.
I am not familiar with the Hive side of things, but suspect it will buffer on a per-partition basis, so if you have a lot of partitions being written simultaneously, you would get multiple buffers.
Generally speaking, fewer files with more data per file would be a good thing.

So I suppose the recommendation (again, not knowing much about the Hive implementation) would be to increase memory allocation, and/or reduce number of partitions, and/or reduce parquet.page.size which is your minimum unit of data that gets written out.

---Dmitriy V Ryaboy
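Based on those suggestions, here is a minimal sketch (not from the original thread) of how the three knobs might be applied to a Parquet-writing MapReduce job on CDH4/MR1: more child heap, a smaller row group, and a smaller page. The parquet.block.size and parquet.page.size keys are the ones read by parquet-mr's ParquetOutputFormat; the sizes below are illustrative assumptions, not tested recommendations.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch only: tuning knobs for a Parquet-writing MR job on CDH4/MR1.
// Property names are the standard parquet-mr / MR1 keys; the sizes chosen
// here are illustrative assumptions, not tested recommendations.
public class ParquetJobTuning {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();

        // Give each map/reduce child more heap than the ~1 GB default.
        conf.set("mapred.child.java.opts", "-Xmx2048m");

        // parquet-mr buffers roughly one row group per open writer before
        // flushing, so a smaller block (row group) size lowers peak memory.
        conf.setLong("parquet.block.size", 256L * 1024 * 1024); // 256 MB

        // The page is the minimum unit written out; shrinking it reduces
        // per-column buffers at some cost in encoding efficiency.
        conf.setInt("parquet.page.size", 512 * 1024);           // 512 KB

        return conf;
    }
}
```

In a Hive session the same properties can usually be set with SET statements before the INSERT, which is easier than repackaging the job.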



Parquet keeps all the data for a row within the same data file, to ensure that the columns for a row are always available on the same node for processing. What Parquet does is to set an HDFS block size and a maximum data file size of 1 GB, to ensure that I/O and network transfer requests apply to large batches of data.
Within that gigabyte of space, the data for a set of rows is rearranged so that all the values from the first column are organized in one contiguous block, then all the values from the second column, and so on. Putting the values from the same column next to each other lets the query engine apply effective compression and encoding techniques to the values in that column.
Loading data into Parquet tables is a memory-intensive operation, because the incoming data is buffered until it reaches 1 GB in size, then that chunk of data is organized and compressed in memory before being written out. The memory consumption can be larger when inserting data into partitioned Parquet tables, because a separate data file is written for each combination of partition key column values, potentially requiring several 1 GB chunks to be manipulated in memory at once.
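To see why partitioned inserts are the painful case, a rough back-of-envelope estimate helps: each partition a task writes to keeps roughly one row group buffered at a time, as described above. The counts and sizes in this sketch are hypothetical.

```java
// Back-of-envelope sketch: estimated writer memory for a partitioned insert.
// Assumes each open partition buffers about one row group (parquet.block.size)
// before flushing; the counts and sizes are hypothetical.
public class ParquetInsertMemoryEstimate {
    public static void main(String[] args) {
        long rowGroupBytes = 1024L * 1024 * 1024; // 1 GB row group, as in the docs above
        int openPartitions = 3;                   // partitions one task writes simultaneously

        long estimatedHeapBytes = rowGroupBytes * openPartitions;
        System.out.printf("~%.1f GB of write buffer for %d simultaneous partitions%n",
                estimatedHeapBytes / (1024.0 * 1024 * 1024), openPartitions);
        // Against a 1 GB child heap this cannot fit, which matches the
        // OutOfMemoryError above: shrink parquet.block.size, raise the heap,
        // or arrange the insert so each task writes fewer partitions at once.
    }
}
```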

Hive can skip the data files for certain partitions entirely, based on the comparisons in the WHERE clause that refer to the partition key columns. For example, queries on partitioned tables often analyze data for time intervals based on columns such as YEAR, MONTH, and/or DAY, or for geographic regions. Remember that Parquet data files use a 1 GB block size, so when deciding how finely to partition the data, try to find a granularity where each partition contains 1 GB or more of data, rather than creating a large number of smaller files split among many partitions.
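At the file-system level, partition pruning is just directory selection: a predicate on the partition key columns maps to a subset of HDFS directories, and files under the other partitions are never opened. The table path and year=/month= layout below are hypothetical, for illustration only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: what "skipping data files for certain partitions" looks like on HDFS.
// A WHERE year = 2014 AND month = 7 predicate narrows the scan to one subtree.
public class PartitionPruningSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical partitioned table layout: .../year=YYYY/month=M/
        Path prunedPartition = new Path("/user/hive/warehouse/events/year=2014/month=7");

        for (FileStatus file : fs.listStatus(prunedPartition)) {
            System.out.println(file.getPath() + "  " + file.getLen() + " bytes");
        }
        // Files under year=2013/... or month=6 are never touched, which is why
        // a few well-sized partitions beat many directories of tiny files.
    }
}
```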

The Parquet data files have an HDFS block size of 1 GB, the same as the maximum Parquet data file size, to ensure that each data file is represented by a single HDFS block, and the entire file can be processed on a single node without requiring any remote reads. If the block size is reset to a lower value during a file copy, you will see lower performance for queries involving those files, and the PROFILE statement will reveal that some I/O is being done suboptimally, through remote reads.
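A quick way to check for that problem is to compare each Parquet file's length with its HDFS block size: a file larger than its block spans multiple blocks and may force remote reads. The directory path below is a hypothetical example; when copying such files, preserving the block size (for example with distcp's -pb option) keeps the one-file-one-block property described above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: report whether each Parquet file in a directory still fits inside
// a single HDFS block. The path is a hypothetical example.
public class ParquetBlockSizeCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/user/hive/warehouse/events/year=2014/month=7");

        for (FileStatus file : fs.listStatus(dir)) {
            boolean singleBlock = file.getLen() <= file.getBlockSize();
            System.out.printf("%s  len=%d  blockSize=%d  fitsInOneBlock=%b%n",
                    file.getPath().getName(), file.getLen(),
                    file.getBlockSize(), singleBlock);
        }
    }
}
```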

References:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_parquet.html#parquet_partitioning_unique_2


https://groups.google.com/forum/#!msg/parquet-dev/elPJyQHARpI/LMiAGzNmfkIJ
