Tuesday 17 June 2014

Store Pig Output to Parquet

REGISTER lib/parquet-pig-1.3.1.jar;
REGISTER lib/parquet-column-1.3.1.jar;
REGISTER lib/parquet-common-1.3.1.jar;
REGISTER lib/parquet-format-2.0.0.jar;
REGISTER lib/parquet-hadoop-1.3.1.jar;
REGISTER lib/parquet-encoding-1.3.1.jar;

-- store in Parquet format
SET parquet.compression gzip;
STORE table INTO 'src/main/resources/data/output/table' USING parquet.pig.ParquetStorer;
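The STORE above assumes an alias named table has already been defined. As a minimal end-to-end sketch (the input path and schema here are hypothetical, for illustration only):

-- hypothetical tab-separated input; path and field names are illustrative
table = LOAD 'src/main/resources/data/input/table'
        USING PigStorage('\t') AS (id:int, name:chararray);

SET parquet.compression gzip;
STORE table INTO 'src/main/resources/data/output/table' USING parquet.pig.ParquetStorer;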

A Loader and a Storer are provided to read and write Parquet files with Apache Pig.
Storing data into Parquet from Pig is simple:
-- options you might want to fiddle with
SET parquet.page.size 1048576;      -- default (1 MB); this is your minimum read/write unit
SET parquet.block.size 134217728;   -- default (128 MB); your memory budget for buffering data
SET parquet.compression lzo;        -- or you can use uncompressed, gzip, snappy
STORE mydata INTO '/some/path' USING parquet.pig.ParquetStorer;
Reading in Pig is also simple:
mydata = LOAD '/some/path' USING parquet.pig.ParquetLoader();
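ParquetLoader also accepts a Pig schema as an argument, so you can read only a subset of columns and let the loader push the projection down to the file. A sketch, assuming the stored data contains fields named a and b (the names and types here are illustrative):

-- load only columns a and b; other columns in the file are skipped
mydata = LOAD '/some/path' USING parquet.pig.ParquetLoader('a:int, b:chararray');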

Testing STORE Commands with PigUnit

By default, PigUnit overrides all STORE and DUMP commands so that tests run without side effects. You can tell PigUnit to keep a command and execute the script:
// run the script under test, keeping its STORE side effects
PigTest test = new PigTest(PIG_SCRIPT, args);
test.unoverride("STORE");  // re-enable STORE, which PigUnit overrides by default
test.runScript();
