Below is an example of implementing a Hadoop Streaming job with a shell mapper in Oozie.
Command line:
hadoop jar $STREAMINGJAR \
-Dmapreduce.task.timeout=14400000 \
-input input_path \
-output output_path \
-numReduceTasks 1 \
-mapper 'shell_mapper.sh' \
-file 'shell_path/shell_mapper.sh'
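Here $STREAMINGJAR points at the Hadoop Streaming jar, whose location varies by distribution, and -file ships shell_mapper.sh into each task's working directory so the -mapper command can find it. A sketch of setting the variable, assuming a CDH-style layout (the search path is an assumption, not from the original post):

# Locate the streaming jar; adjust the search root for your distribution.
STREAMINGJAR=$(find /usr/lib/hadoop-mapreduce -name 'hadoop-streaming*.jar' | head -n 1)
export STREAMINGJAR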
shell_mapper.sh:
# Each line on stdin is an HDFS path to a .tar.gz archive; derive the output
# file name from the path components, then recompress the archive contents
# as bzip2 and write the result back to HDFS.
while read split; do
    fout=$(echo "$split" | awk '{split($0,a,"/"); split(a[5],b,"-"); split(b[3],c,"."); print "hdfs://output_path/20"c[1]"-"b[1]"-"b[2]"-"a[4]".txt"}')
    hdfs dfs -cat "$split" | tar zxfO - | bzip2 | hdfs dfs -put - "$fout"
done
Note: '-' refers to stdin rather than a file name, and $fout refers to a target file name rather than a destination folder.
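For illustration, suppose an input line is the hypothetical path /data/logs/host1/12-31-15.tar.gz (four components, so the awk arrays line up as below); the real layout depends on your cluster:

# Input line:  /data/logs/host1/12-31-15.tar.gz
# a[4] = "host1"             (parent directory)
# a[5] = "12-31-15.tar.gz"   (file name)
# b[1] = "12", b[2] = "31", b[3] = "15.tar.gz"
# c[1] = "15"                (two-digit year)
# Resulting $fout: hdfs://output_path/2015-12-31-host1.txt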
Oozie Action:
<action name="shell_mr">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${combined_file_path}"/>
</prepare>
<streaming>
<mapper>${shell_mapper} ${wf:user()} ${output_path}</mapper>
</streaming>
<configuration>
<property>
<name>mapreduce.input.fileinputformat.inputdir</name>
<value>${input_path}</value>
</property>
<property>
<name>mapreduce.output.fileoutputformat.outputdir</name>
<value>${output_path}</value>
</property>
<property>
<name>mapreduce.job.reduces</name>
<value>1</value>
</property>
</configuration>
<file>${shell_path}</file>
</map-reduce>
<ok to="end"/>
<error to="send_email"/>
</action>
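For completeness, a minimal job.properties sketch that could supply the parameters referenced above; every value here is a placeholder, not taken from the original workflow:

# job.properties (hypothetical values)
nameNode=hdfs://namenode:8020
jobTracker=jobtracker:8032
input_path=/user/${user.name}/streaming/input
output_path=/user/${user.name}/streaming/output
combined_file_path=/user/${user.name}/streaming/combined
shell_mapper=shell_mapper.sh
shell_path=shell_path/shell_mapper.sh
oozie.wf.application.path=${nameNode}/user/${user.name}/workflows/shell_mr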
Note: $output_path must be deleted before the job runs; otherwise $output_path will become the target file name instead of the target folder.
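If the cleanup is done outside the workflow instead of in a <prepare> block, a manual delete before submission could look like this (a sketch, assuming $output_path holds the same value as the workflow's ${output_path} parameter):

# Remove the previous output directory so the job can recreate it;
# -skipTrash avoids filling the HDFS trash with large intermediate data.
hdfs dfs -rm -r -skipTrash "$output_path"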