Below is an example of implementing a Hadoop Streaming job with a shell mapper in Oozie.
Command line:
hadoop jar $STREAMINGJAR \
-Dmapreduce.task.timeout=14400000 \
-input input_path \
-output output_path \
-numReduceTasks 1 \
-mapper 'shell_mapper.sh' \
-file 'shell_path/shell_mapper.sh'
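Here $STREAMINGJAR points at the Hadoop Streaming jar, whose location varies by distribution, and -file ships shell_mapper.sh into each task's working directory so the -mapper command can find it. A sketch of setting the variable, assuming a CDH-style layout (the search path is an assumption, not from the original post):

# Locate the streaming jar; adjust the search root for your distribution.
STREAMINGJAR=$(find /usr/lib/hadoop-mapreduce -name 'hadoop-streaming*.jar' | head -n 1)
export STREAMINGJAR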
shell_mapper.sh:
# Each line on stdin is an HDFS path to a .tar.gz archive; derive the output
# file name from the path components, then recompress the archive contents
# as bzip2 and write the result back to HDFS.
while read split; do
    fout=$(echo "$split" | awk '{split($0,a,"/"); split(a[5],b,"-"); split(b[3],c,"."); print "hdfs://output_path/20"c[1]"-"b[1]"-"b[2]"-"a[4]".txt"}')
    hdfs dfs -cat "$split" | tar zxfO - | bzip2 | hdfs dfs -put - "$fout"
done
Note: '-' refers to stdin rather than a file name, and $fout refers to a target file name rather than a destination folder.
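For illustration, suppose an input line is the hypothetical path /data/logs/host1/12-31-15.tar.gz (four components, so the awk arrays line up as below); the real layout depends on your cluster:

# Input line:  /data/logs/host1/12-31-15.tar.gz
# a[4] = "host1"             (parent directory)
# a[5] = "12-31-15.tar.gz"   (file name)
# b[1] = "12", b[2] = "31", b[3] = "15.tar.gz"
# c[1] = "15"                (two-digit year)
# Resulting $fout: hdfs://output_path/2015-12-31-host1.txt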
Oozie Action:
<action name="shell_mr">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${combined_file_path}"/>
</prepare>
<streaming>
<mapper>${shell_mapper} ${wf:user()} ${output_path}</mapper>
</streaming>
<configuration>
<property>
<name>mapreduce.input.fileinputformat.inputdir</name>
<value>${input_path}</value>
</property>
<property>
<name>mapreduce.output.fileoutputformat.outputdir</name>
<value>${output_path}</value>
</property>
<property>
<name>mapreduce.job.reduces</name>
<value>1</value>
</property>
</configuration>
<file>${shell_path}</file>
</map-reduce>
<ok to="end"/>
<error to="send_email"/>
</action>
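For completeness, a minimal job.properties sketch that could supply the parameters referenced above; every value here is a placeholder, not taken from the original workflow:

# job.properties (hypothetical values)
nameNode=hdfs://namenode:8020
jobTracker=jobtracker:8032
input_path=/user/${user.name}/streaming/input
output_path=/user/${user.name}/streaming/output
combined_file_path=/user/${user.name}/streaming/combined
shell_mapper=shell_mapper.sh
shell_path=shell_path/shell_mapper.sh
oozie.wf.application.path=${nameNode}/user/${user.name}/workflows/shell_mr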
Note: $output_path must be deleted before the job runs; otherwise $output_path will become the target file name instead of the target folder.
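If the cleanup is done outside the workflow instead of in a <prepare> block, a manual delete before submission could look like this (a sketch, assuming $output_path holds the same value as the workflow's ${output_path} parameter):

# Remove the previous output directory so the job can recreate it;
# -skipTrash avoids filling the HDFS trash with large intermediate data.
hdfs dfs -rm -r -skipTrash "$output_path"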