Alvin's Big Data Notebook : MapReduce Python Streaming Action in Oozie

MapReduce Python streaming action in Oozie on CDH 5.1.2

<workflow-app name="streaming_wf" xmlns="uri:oozie:workflow:0.4">
  <start to="fetch_filelist_mr"/>

  <action name="fetch_filelist_mr">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <delete path="${file_list_path}"/>
      </prepare>
      <streaming>
        <mapper>python ${fetch_filelist_python_script} parameters</mapper>
      </streaming>
      <configuration>
        <property>
          <name>mapreduce.input.fileinputformat.inputdir</name>
          <value>${folder_list_path}</value>
        </property>
        <property>
          <name>mapreduce.output.fileoutputformat.outputdir</name>
          <value>${file_list_path}</value>
        </property>
        <property>
          <name>mapreduce.job.reduces</name>
          <value>20</value>
        </property>
        <property>
          <name>mapreduce.map.java.opts</name>
          <value>-Xmx2576351232</value>
        </property>
        <property>
          <name>mapreduce.output.textoutputformat.separator</name>
          <value></value>
        </property>
      </configuration>
      <file>${fetch_filelist_python_path}</file>
      <file>${lib_boto_path}</file>
    </map-reduce>
    <ok to="end"/>
    <error to="kill"/>
  </action>

  <kill name="kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>

  <end name="end"/>
</workflow-app>
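
The workflow above refers to the script twice: <file> takes its full path, while <mapper> calls it by bare file name, because files shipped with <file> show up in the task's working directory. A minimal sketch of that relationship follows. The helper mapper_command is hypothetical (it is not part of Oozie or Hadoop), and the example path simply mirrors the value listed in note 1.

```python
import posixpath


def mapper_command(script_path, *params):
    """Build the <mapper> command string for a script shipped with <file>.

    Hypothetical helper for illustration only. It assumes the shipped file
    is visible in the task's working directory under its base name.
    """
    script_name = posixpath.basename(script_path)
    return " ".join(["python", script_name] + list(params))


if __name__ == "__main__":
    # The ${appPath} placeholder is left unresolved on purpose.
    print(mapper_command("${appPath}/python/list_mapper.py", "parameters"))
```

The two workflow properties have to stay in sync by hand; deriving one from the other, as here, is just one way to avoid a mismatch.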

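1. The script is referenced through two properties. fetch_filelist_python_path
is the full path that the <file> element ships to each task, and
fetch_filelist_python_script is the bare file name used in the <mapper> command:
fetch_filelist_python_path = ${appPath}/python/list_mapper.py
fetch_filelist_python_script = list_mapper.py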

2. By default Hadoop writes a tab ('\t') as the separator between key and value.
To remove the '\t' that follows the key on each output line,
set the separator to an empty string:
mapreduce.output.textoutputformat.separator = ''

3. You must put a dummy file in the input path, even though it is empty and never used.
Otherwise, no mapper is executed.
The input path is the one set in mapreduce.input.fileinputformat.inputdir
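
Note 2 can be illustrated with a small simulation. The sketch below is not Hadoop code. It assumes that a streaming output line is split into key and value at the first tab, that a line without a tab becomes a key with an empty value, and that the text output format then writes key, separator, and value in that order. The function names and the sample line are hypothetical.

```python
def split_streaming_line(line):
    """Split one mapper output line into (key, value) at the first tab.

    A line without a tab yields the whole line as key and an empty value.
    """
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


def write_text_record(key, value, separator="\t"):
    """Join key and value the way a text output line is assumed to be built."""
    return key + separator + value


if __name__ == "__main__":
    key, value = split_streaming_line("folder/file1.txt\n")
    # Default separator: the empty value still leaves a trailing tab.
    print(repr(write_text_record(key, value)))
    # Empty separator: the line is written exactly as the mapper emitted it.
    print(repr(write_text_record(key, value, separator="")))
```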

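For note 3, one way to create the placeholder is hadoop fs -touchz, which creates a zero-length file. The sketch below mainly builds that command line. The helper names, the default file name dummy.txt, and the example directory are assumptions made for illustration. Names starting with '_' or '.' are rejected because Hadoop's file input format treats such files as hidden and skips them, so they would not trigger a mapper.

```python
import posixpath
import subprocess


def touchz_command(input_dir, file_name="dummy.txt"):
    """Return the CLI command that creates an empty file in input_dir.

    Hypothetical helper: it only assembles the argument list.
    """
    if file_name.startswith(("_", ".")):
        # Hidden files are ignored when input splits are computed.
        raise ValueError("dummy file name must not start with '_' or '.'")
    return ["hadoop", "fs", "-touchz", posixpath.join(input_dir, file_name)]


def create_dummy_input(input_dir, file_name="dummy.txt"):
    """Run the command; requires the hadoop CLI on the PATH."""
    subprocess.check_call(touchz_command(input_dir, file_name))


if __name__ == "__main__":
    # Example directory is a placeholder, not a path from the original post.
    print(" ".join(touchz_command("/path/to/folder_list")))
```
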
Reference
http://stackoverflow.com/questions/25340961/how-to-mention-a-combiner-in-oozie-while-using-streaming-jar
https://github.com/yahoo/oozie/tree/master/examples/src/main/apps/streaming
https://oozie.apache.org/docs/3.2.0-incubating/WorkflowFunctionalSpec.html#a3.2.2.2_Streaming

Sunday, 9 November 2014