For some Hadoop streaming jobs, we output only the key. However, every output line still ends with a trailing '\t', which is not obvious when looking at the file.
If DistCp is used to read file names from the output file, it can't automatically remove the trailing '\t'.
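To make the problem concrete, here is a small plain-Java sketch of what a consumer of such a file list would have to do by hand. It is only an illustration: the class name and the paths are hypothetical, and it is not one of the two fixes described below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: cleans up a file list written by a key-only streaming job.
class TrailingTabCheck {

    // Removes a single trailing tab from each line, leaving other lines untouched.
    static List<String> stripTrailingTab(List<String> lines) {
        List<String> cleaned = new ArrayList<>();
        for (String line : lines) {
            cleaned.add(line.endsWith("\t") ? line.substring(0, line.length() - 1) : line);
        }
        return cleaned;
    }

    public static void main(String[] args) {
        // Made-up paths; the first one carries the invisible trailing tab.
        List<String> raw = Arrays.asList("/data/in/a.txt\t", "/data/in/b.txt");
        for (String path : stripTrailingTab(raw)) {
            // Brackets make any leftover whitespace visible.
            System.out.println("[" + path + "]");
        }
    }
}
```

The two approaches below avoid this extra step by changing what the job writes in the first place.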
1. Implement a customized output format and replace '\t' with an empty string.
CustomOutputFormat<K, V> extends TextOutputFormat<K, V>
// In getRecordWriter(), default the separator to "" instead of "\t".
String keyValueSeparator = job.get("mapred.textoutputformat.separator", "");
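The effect of that default can be sketched without any Hadoop classes. The following is a plain-Java simulation, not the actual Hadoop source: `TrailingTabDemo` and `formatRecord` are hypothetical names, and it assumes that a key-only streaming line reaches the record writer with an empty (but non-null) value, which is why the separator is still written.

```java
// Plain-Java simulation of a text record writer; no Hadoop classes are used.
class TrailingTabDemo {

    // Writes key, then the separator if both key and value are present,
    // then value, then a newline. An empty string counts as "present",
    // which is the assumed situation for key-only streaming output.
    static String formatRecord(String key, String value, String separator) {
        boolean hasKey = key != null;
        boolean hasValue = value != null;
        if (!hasKey && !hasValue) {
            return "";
        }
        StringBuilder line = new StringBuilder();
        if (hasKey) {
            line.append(key);
        }
        if (hasKey && hasValue) {
            line.append(separator);
        }
        if (hasValue) {
            line.append(value);
        }
        return line.append('\n').toString();
    }

    public static void main(String[] args) {
        // Default separator: the key-only record ends with a tab (shown as \t).
        System.out.print(formatRecord("part-00000", "", "\t").replace("\t", "\\t"));
        // Separator defaulted to "", as in the customized output format.
        System.out.print(formatRecord("part-00000", "", ""));
    }
}
```

With the separator defaulting to the empty string, the key is followed directly by the newline, so nothing downstream has to strip the tab.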
Add the parameters below to the command line.
-libjars lib/CustomFormats.jar
-outputformat com.mapreduce.output.CustomOutputFormat
For an Oozie map-reduce action, add the following.
<property>
<name>mapred.output.format.class</name>
<value>${custom_format_class}</value>
</property>
<file>${lib_custom_format_path}</file>
If 'mapreduce.job.outputformat.class' is used instead, the job fails with:
Error: mapreduce.job.map.class is incompatible with map compatability mode.
2. Replace '\t' with another separator in the action.
However, the separator can't be set to an empty value. If it is empty, the default '\t' is used instead.
<property>
<name>mapreduce.output.textoutputformat.separator</name>
<value>;</value>
</property>
Reference:
http://stackoverflow.com/questions/18133290/hadoop-streaming-remove-trailing-tab-from-reducer-output
https://github.com/zouhc/MyHadoop/blob/master/doc/hue.md
Hello,
I would like the map task to write only the KEY. I don't want the tab and value to be written by the map task. Every tab takes 1 byte when the map task writes its intermediate data. I want to save this 1 byte allotted for the tab.