Tuesday, 11 November 2014

Avro vs. Parquet Formats

Avro is a row-based storage format for Hadoop.
Parquet is a column-based storage format for Hadoop.
If your use case typically scans or retrieves all of the fields in a row in each query, Avro is usually the best choice.
If your dataset has many columns, and your use case typically involves working with a subset of those columns rather than entire records, Parquet is optimized for that kind of work.
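
As a rough illustration, here is a hedged Pig sketch of the difference. The paths, field names, and schema are placeholders; parquet.pig.ParquetLoader can take a requested schema, so only the listed columns are read from disk:

-- Avro stores whole rows, so every query reads entire records.
events = LOAD '/data/events_avro'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- Parquet can project a subset of columns: pass a requested schema
-- and the remaining columns are never read off disk.
subset = LOAD '/data/events_parquet'
    USING parquet.pig.ParquetLoader('user_id:chararray, ts:long');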
Parquet Compression Comparisons
There are three compression options: uncompressed, snappy, and gzip. The default in Impala is snappy; in Pig it is uncompressed, as noted below. You can specify one of them once, before the first STORE instruction in a Pig script:
SET parquet.compression gzip;
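
For example, a minimal end-to-end script might look like the following sketch (the jar name, input path, and schema are assumptions; parquet.pig.ParquetStorer is the Pig storer that ships with parquet-mr):

REGISTER parquet-pig-bundle.jar;  -- assumed jar name
SET parquet.compression gzip;     -- must appear before the first STORE
raw = LOAD '/data/input' AS (id:long, name:chararray);
STORE raw INTO '/data/output_parquet' USING parquet.pig.ParquetStorer();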
For example, the on-disk sizes of the same one-billion-row table stored with each codec:
23.1 G  /user/hive/warehouse/parquet_compression.db/parquet_snappy
13.5 G  /user/hive/warehouse/parquet_compression.db/parquet_gzip
32.8 G  /user/hive/warehouse/parquet_compression.db/parquet_none
At the same time, the less aggressive the compression, the faster the data can be decompressed. For this table, a query that evaluates all the values for a particular column runs faster with no compression than with Snappy compression, and faster with Snappy compression than with gzip compression.
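
A hedged sketch of the kind of single-column scan being timed (the table path comes from the listing above; the column name val is an assumption):

-- Load only one column from the Snappy table and aggregate it.
t = LOAD '/user/hive/warehouse/parquet_compression.db/parquet_snappy'
    USING parquet.pig.ParquetLoader('val:long');
g = GROUP t ALL;
s = FOREACH g GENERATE SUM(t.val);
DUMP s;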
In Pig, if you don't set parquet.compression, the default is uncompressed.

References:
https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetOutputFormat.java
https://www.youtube.com/watch?v=AY1dEfyFeHc&list=PLGzsQf6UXBR-BJz5BGzJb2mMulWTfTu99&index=4
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_parquet.html#parquet_compression_unique_1
