Tuesday 16 December 2014

Remove the trailing tab after the key from map output

The default separator between key and value in map output is '\t'.

Some hadoop streaming jobs output only the key. However, each line still ends with a trailing '\t', which is not obvious when looking at the output.
If Distcp is used to read file names from the output file, it can't automatically remove the trailing '\t'.

1. Implement a custom output format whose default separator is the empty string instead of '\t'.

CustomOutputFormat<K, V> extends TextOutputFormat<K, V>
When creating the record writer, read the separator with an empty default instead of '\t':
String keyValueSeparator = job.get("mapred.textoutputformat.separator", "");
Add the following parameters to the command line:
-libjars lib/CustomFormats.jar
-outputformat com.mapreduce.output.CustomOutputFormat
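
To see why an empty default separator removes the trailing tab, here is a minimal stand-alone sketch. It is not the Hadoop class itself: LineWriterModel and render are hypothetical names, and the model assumes the line writer emits key, separator, value, and skips the separator only when the key or value is null (the stand-in here for NullWritable), not when the value is merely an empty string.

```java
// Simplified stand-alone model of a text line writer (hypothetical, not Hadoop code).
final class LineWriterModel {
    private final StringBuilder out = new StringBuilder();
    private final String separator;

    LineWriterModel(String separator) {
        this.separator = separator;
    }

    // null models a missing key/value; "" models an empty but present value.
    void write(String key, String value) {
        boolean nullKey = key == null;
        boolean nullValue = value == null;
        if (nullKey && nullValue) {
            return; // nothing to write
        }
        if (!nullKey) {
            out.append(key);
        }
        // Assumption: the separator is skipped only if one side is null,
        // so an empty value still gets a separator in front of it.
        if (!nullKey && !nullValue) {
            out.append(separator);
        }
        if (!nullValue) {
            out.append(value);
        }
        out.append('\n');
    }

    String contents() {
        return out.toString();
    }

    // Convenience helper: write a single record and return the resulting line.
    static String render(String separator, String key, String value) {
        LineWriterModel writer = new LineWriterModel(separator);
        writer.write(key, value);
        return writer.contents();
    }

    public static void main(String[] args) {
        // Make the tab visible when printing.
        System.out.print(render("\t", "part-00000", "").replace("\t", "\\t"));
        System.out.print(render("", "part-00000", ""));
    }
}
```

With '\t' as the separator, a key with an empty value still ends in a tab. With "" as the separator, the line contains only the key.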

For an oozie mapreduce action, add the following:

<property>
       <name>mapred.output.format.class</name>
       <value>${custom_format_class}</value>
</property>
<file>${lib_custom_format_path}</file>


If 'mapreduce.job.outputformat.class' is used instead of 'mapred.output.format.class', the job fails with:
Error: mapreduce.job.map.class is incompatible with map compatability mode.


2. Replace '\t' with another separator in the action.
However, the separator can't be set to an empty string. If the value is empty, the default '\t' is used instead.

<property>
       <name>mapreduce.output.textoutputformat.separator</name>
       <value>;</value>
</property>

Reference:
http://stackoverflow.com/questions/18133290/hadoop-streaming-remove-trailing-tab-from-reducer-output
https://github.com/zouhc/MyHadoop/blob/master/doc/hue.md

1 comment:

  1. Hello,
    I would like the map task to write only the KEY. I don't want the tab and value to be written by the map task. Every tab takes 1 byte when the map task writes its intermediate data. I want to save the 1 byte allotted for the tab.
