Monday 23 June 2014

Convert the Same Pig and Avro Output Schema into Parquet Format

The question is whether Pig and Avro data with the same output schema are converted into the same Parquet schema.

I recently looked into this question by reading the source code of the Parquet converters for Pig and Avro.
The type mappings are compared in the tables below.


Avro type   Parquet type
---------   ------------
null        no type (the field is not encoded in Parquet), unless a null union
boolean     boolean
int         int32
long        int64
float       float
double      double
bytes       binary
string      binary (with original type UTF8)
record      group containing nested fields
enum        binary (with original type ENUM)
array       group (with original type LIST) containing one repeated group field
map         group (with original type MAP) containing one repeated group field (with original type MAP_KEY_VALUE) of (key, value)
fixed       fixed_len_byte_array
union       an optional type, in the case of a null union, otherwise not supported
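To make the Avro table concrete, here is a minimal sketch in Python that applies the primitive rows and the null-union row to a flat Avro record schema and prints the resulting Parquet schema text. It is not the real converter: the function names and the example `User` schema are made up for illustration, and it assumes that a field which is not a null union is emitted as `required`, which the table does not state.

```python
import json

# Primitive rows copied from the Avro table above.
AVRO_PRIMITIVES = {
    "boolean": "boolean",
    "int": "int32",
    "long": "int64",
    "float": "float",
    "double": "double",
    "bytes": "binary",
    "string": "binary",
}


def parquet_field(name, avro_type):
    """Render one Avro field as a Parquet schema line (primitives only)."""
    # Assumption: fields that are not null unions are marked "required".
    repetition = "required"
    # A union of exactly ["null", T] becomes an optional T, per the table.
    if isinstance(avro_type, list):
        non_null = [t for t in avro_type if t != "null"]
        if "null" not in avro_type or len(non_null) != 1:
            raise ValueError(f"unsupported union for field {name!r}")
        avro_type = non_null[0]
        repetition = "optional"
    if avro_type not in AVRO_PRIMITIVES:
        raise ValueError(f"type {avro_type!r} is outside this sketch")
    # Strings carry the UTF8 original type; other primitives have none.
    annotation = " (UTF8)" if avro_type == "string" else ""
    return f"  {repetition} {AVRO_PRIMITIVES[avro_type]} {name}{annotation};"


def avro_record_to_parquet(schema_json):
    """Convert a flat Avro record schema (JSON text) to Parquet schema text."""
    schema = json.loads(schema_json)
    lines = [f"message {schema['name']} {{"]
    lines += [parquet_field(f["name"], f["type"]) for f in schema["fields"]]
    lines.append("}")
    return "\n".join(lines)


# Hypothetical example schema, not taken from any real data set.
EXAMPLE = """
{"type": "record", "name": "User", "fields": [
  {"name": "id", "type": "long"},
  {"name": "name", "type": ["null", "string"]}
]}
"""

if __name__ == "__main__":
    print(avro_record_to_parquet(EXAMPLE))
```

Nested records, enums, arrays, maps, and fixed types are deliberately left out, since their group layout depends on details of the converter that the table only summarizes.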


Pig type    Parquet type
--------    ------------
null        no type (the field is not encoded in Parquet)
boolean     boolean
int         int32
long        int64
float       float
double      double
bytearray   binary
chararray   binary (with original type UTF8)
tuple       an optional group containing one repeated group field
bag         an optional group containing one repeated group field, to preserve the distinction between an empty bag and null
map         an optional group containing one repeated group field of (key, value)

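It seems that Pig and Avro data with the same output schema are converted into the same Parquet types for the primitive types. The complex types (tuple, bag, and map in Pig; record, array, and map in Avro) are all converted into groups, but the tables describe those groups differently: for example, the LIST and MAP original types are listed only on the Avro side.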
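The primitive rows of the two tables can be compared mechanically. The sketch below copies those rows into two Python dictionaries and reports any Pig type whose Parquet type differs from its Avro counterpart. The pairing of Pig's bytearray with Avro's bytes and of chararray with string is my assumption about which types correspond, and the names used here are illustrative only.

```python
# Primitive rows copied from the two tables; "(UTF8)" marks the original type.
AVRO_TO_PARQUET = {
    "boolean": "boolean",
    "int": "int32",
    "long": "int64",
    "float": "float",
    "double": "double",
    "bytes": "binary",
    "string": "binary (UTF8)",
}

PIG_TO_PARQUET = {
    "boolean": "boolean",
    "int": "int32",
    "long": "int64",
    "float": "float",
    "double": "double",
    "bytearray": "binary",  # the raw-bytes row of the Pig table
    "chararray": "binary (UTF8)",
}

# Assumed correspondence between Pig and Avro type names that differ.
PIG_TO_AVRO_NAME = {"bytearray": "bytes", "chararray": "string"}


def mismatches():
    """Return {pig_type: (pig_parquet, avro_parquet)} for rows that differ."""
    diffs = {}
    for pig_type, pig_parquet in PIG_TO_PARQUET.items():
        avro_type = PIG_TO_AVRO_NAME.get(pig_type, pig_type)
        avro_parquet = AVRO_TO_PARQUET.get(avro_type)
        if avro_parquet != pig_parquet:
            diffs[pig_type] = (pig_parquet, avro_parquet)
    return diffs


if __name__ == "__main__":
    print(mismatches() or "primitive mappings agree")
```

This only covers the primitive rows. The complex types are left out because the tables describe them in prose rather than as exact schemas.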


