Keeping data compressed in Hive tables has, in some cases, been known to beat uncompressed storage in both disk usage and query performance.
You can import text files compressed with Gzip or Bzip2 directly into a table stored as TextFile. The compression is detected automatically and the file is decompressed on the fly during query execution.
-- Staging table for the raw, gzip-compressed log; the header row is skipped
-- via the table property.
CREATE TABLE csv_table (line STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
  TBLPROPERTIES ("skip.header.line.count"="1");

-- Destination table stored as a SequenceFile (referenced by the INSERT below).
CREATE TABLE csv_table_sequence (line STRING) STORED AS SEQUENCEFILE;

LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE csv_table;

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;  -- NONE/RECORD/BLOCK

INSERT OVERWRITE TABLE csv_table_sequence SELECT * FROM csv_table;
However, when the compressed text file is queried directly like this, Hadoop cannot split it into chunks/blocks and run multiple map tasks in parallel; for a large file this matters. That is why the recommended practice, shown above, is to insert the data into another table that is stored as a SequenceFile.
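Once the INSERT has run, a quick sanity check is to list the files Hive produced for the SequenceFile table from the Hive CLI. The warehouse path below is an assumption; it depends on hive.metastore.warehouse.dir and the database in use.

-- Run from the Hive CLI; lists the output files written for csv_table_sequence.
-- /user/hive/warehouse is the default warehouse location and may differ in
-- your installation.
dfs -ls /user/hive/warehouse/csv_table_sequence/;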
Please note that the skip.header.line.count table property is only effective as of Hive 0.13; Hive 0.12 and below ignore it.
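On Hive 0.12 and below, one possible workaround is to drop the header row while populating the SequenceFile table, assuming the header line can be recognised by its content. This is only a sketch; the 'host%' pattern is a placeholder for whatever your actual header line starts with.

-- Hypothetical workaround for Hive 0.12 and below: filter the header out at
-- insert time instead of relying on skip.header.line.count.
-- 'host%' is a placeholder; replace it with the start of your real header line.
INSERT OVERWRITE TABLE csv_table_sequence
SELECT * FROM csv_table
WHERE line NOT LIKE 'host%';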
For CSV input with quoted fields, the delimited row format used above can be swapped for a CSV SerDe:
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
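A minimal sketch of such a table definition, assuming the third-party csv-serde jar has been downloaded locally; the jar path and the table/column names are placeholders, not part of the original example.

-- Make the CSV SerDe available to the session; the path is a placeholder for
-- wherever the csv-serde jar actually lives.
ADD JAR /path/to/csv-serde.jar;

-- Table whose columns are parsed from quoted CSV fields by the SerDe.
CREATE TABLE weblog_csv (host STRING, request STRING, status STRING)
  ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
  STORED AS TEXTFILE;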
References:
https://cwiki.apache.org/confluence/display/Hive/CompressedStorage
http://stackoverflow.com/questions/15751999/hive-external-table-skip-first-row/15753145#15753145