Friday 19 December 2014

Hadoop Streaming by Shell in Oozie

Below is an example of implementing Hadoop Streaming with a shell-script mapper in Oozie.

Command line:

hadoop jar $STREAMINGJAR \
-Dmapreduce.task.timeout=14400000 \
-input input_path \
-output output_path \
-numReduceTasks 1 \
-mapper 'shell_mapper.sh' \
-file 'shell_path/shell_mapper.sh'
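
Here $STREAMINGJAR points at the Hadoop Streaming jar shipped with the cluster. As a minimal sketch, assuming a typical Hadoop 2.x layout (the exact path varies by distribution):

export STREAMINGJAR=/usr/lib/hadoop-mapreduce/hadoop-streaming.jar

Note that the -D generic option must come before the streaming options, and the 14400000 ms timeout (4 hours) keeps the long-running pipe in the mapper from being killed by the framework.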

shell_mapper.sh:

#!/bin/bash
# Each line on stdin is the HDFS path of a gzipped tar file.
while read split; do
    # Build the target file name out of pieces of the input path.
    fout=`echo "$split" | awk '{split($0,a,"/"); split(a[5],b,"-"); split(b[3],c,"."); print "hdfs://output_path/20"c[1]"-"b[1]"-"b[2]"-"a[4]".txt"}'`
    # Stream the archive from HDFS, untar it to stdout, recompress with bzip2, and write it back to HDFS.
    hdfs dfs -cat "$split" | tar zxfO - | bzip2 | hdfs dfs -put - "$fout"
done

Note: '-' refers to stdin/stdout rather than a file name, and $fout refers to a target file name rather than a destination folder.
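
To check what the awk line produces, you can run it locally. The sample path below is made up; it assumes input files are named MM-DD-YY.tar.gz and sit one level below a folder whose name ends up in the output file name:

echo "hdfs://nn/data/12-19-14.tar.gz" | awk '{split($0,a,"/"); split(a[5],b,"-"); split(b[3],c,"."); print "hdfs://output_path/20"c[1]"-"b[1]"-"b[2]"-"a[4]".txt"}'
# prints: hdfs://output_path/2014-12-19-data.txt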

Oozie Action:

    <action name="shell_mr">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${combined_file_path}"/>
            </prepare>
            <streaming>
                <mapper>${shell_mapper} ${wf:user()} ${output_path}</mapper>
            </streaming>
            <configuration>
                <property>
                    <name>mapreduce.input.fileinputformat.inputdir</name>
                    <value>${input_path}</value>
                </property>
                <property>
                    <name>mapreduce.output.fileoutputformat.outputdir</name>
                    <value>${output_path}</value>
                </property>
                <property>
                    <name>mapreduce.job.reduces</name>
                    <value>1</value>
                </property>
            </configuration>
            <file>${shell_path}</file>
        </map-reduce>
        <ok to="end"/>
        <error to="send_email"/>
    </action>

Note: ${output_path} must be deleted before the job runs; otherwise, ${output_path} will be treated as the target file name instead of the target folder.
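
For completeness, the parameters referenced above would typically come from a job.properties file passed at submission time. A minimal sketch with placeholder hosts and paths (every value below is an assumption; substitute your own):

nameNode=hdfs://namenode:8020
jobTracker=jobtracker:8032
input_path=/user/${user.name}/input
output_path=/user/${user.name}/output
combined_file_path=${output_path}
shell_mapper=shell_mapper.sh
shell_path=/apps/shell_wf/shell_mapper.sh
oozie.wf.application.path=${nameNode}/apps/shell_wf

The workflow is then submitted with the standard Oozie CLI:

oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run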
