Thursday 13 November 2014

Hive Load csv.gz and Skip Header

Keeping data compressed in Hive tables can, in some cases, beat uncompressed storage both in disk usage and in query performance.

You can import text files compressed with Gzip or Bzip2 directly into a table stored as TextFile. The compression is detected automatically and the file is decompressed on the fly during query execution.



CREATE TABLE csv_table (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");
 
LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE csv_table;
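
With skip.header.line.count in place, a quick query is a handy sanity check that the header row is being skipped (a minimal sketch; the LIMIT value is arbitrary):

-- The first rows returned should be log data, not the file's header line.
SELECT * FROM csv_table LIMIT 3;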

-- The target table must exist first; a minimal definition matching csv_table:
CREATE TABLE csv_table_sequence (line STRING) STORED AS SEQUENCEFILE;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK
INSERT OVERWRITE TABLE csv_table_sequence SELECT * FROM csv_table;



However, a gzipped file is not splittable, so Hadoop cannot break it into chunks/blocks and run multiple maps in parallel. If your file is large, this matters. The recommended practice is therefore to insert the data into another table stored as a SequenceFile, as shown in the INSERT above.

Please note that skip.header.line.count only takes effect in Hive 0.13 and later; Hive 0.12 and below ignore it.
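
On 0.13 and later, the property can also be applied to a table that already exists, without recreating it (a small sketch using the csv_table from above):

-- Enable header skipping on an existing table.
ALTER TABLE csv_table SET TBLPROPERTIES ("skip.header.line.count"="1");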

If the fields in the CSV file are wrapped in double quotes, the delimited row format above can be swapped for the CSVSerde, which strips the surrounding double quotes when reading:

ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
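
A minimal sketch of a table definition using that SerDe (the table name and columns are hypothetical, the csv-serde jar is assumed to be on the session classpath via ADD JAR, and the separator/quote properties shown match the SerDe's defaults):

-- Hypothetical quoted-CSV table; adjust columns and properties to your file.
CREATE TABLE csv_quoted (col1 STRING, col2 STRING, col3 STRING)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\""
)
STORED AS TEXTFILE;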


References:
https://cwiki.apache.org/confluence/display/Hive/CompressedStorage
http://stackoverflow.com/questions/15751999/hive-external-table-skip-first-row/15753145#15753145
