MapReduce Python streaming action in Oozie on CDH 5.1.2
<workflow-app name="streaming_wf" xmlns="uri:oozie:workflow:0.4">
    <start to="fetch_filelist_mr"/>
    <action name="fetch_filelist_mr">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${file_list_path}"/>
            </prepare>
            <streaming>
                <mapper>python ${fetch_filelist_python_script} parameters</mapper>
            </streaming>
            <configuration>
                <property>
                    <name>mapreduce.input.fileinputformat.inputdir</name>
                    <value>${folder_list_path}</value>
                </property>
                <property>
                    <name>mapreduce.output.fileoutputformat.outputdir</name>
                    <value>${file_list_path}</value>
                </property>
                <property>
                    <name>mapreduce.job.reduces</name>
                    <value>20</value>
                </property>
                <property>
                    <name>mapreduce.map.java.opts</name>
                    <value>-Xmx2576351232</value>
                </property>
                <property>
                    <name>mapreduce.output.textoutputformat.separator</name>
                    <value></value>
                </property>
            </configuration>
            <file>${fetch_filelist_python_path}</file>
            <file>${lib_boto_path}</file>
        </map-reduce>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
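1. The <file> element ships the script into the task's working directory, so the mapper command refers to it by file name only:
fetch_filelist_python_path = ${appPath}/python/list_mapper.py
fetch_filelist_python_script = list_mapper.py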
2. Hadoop uses '\t' as the default separator between key and value in text output.
To remove the '\t' that follows the key on each line, set the separator to an empty string:
mapreduce.output.textoutputformat.separator = ''
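The effect of the separator setting can be illustrated with a small Python model. This is only a sketch and not Hadoop code: the function names are made up for illustration, and it assumes the default streaming behavior of splitting each mapper line at the first tab and treating a missing value as an empty string, which matches the trailing '\t' described above.

```python
def split_streaming_line(line, field_sep="\t"):
    """Split a mapper output line into (key, value) at the first tab.

    Assumption: a line without a tab becomes the key, with an empty value.
    """
    key, _, value = line.partition(field_sep)
    return key, value


def written_line(mapper_line, separator):
    """Model of the text written to the output file for one record:
    key, then the configured separator, then the value."""
    key, value = split_streaming_line(mapper_line.rstrip("\n"))
    return key + separator + value


if __name__ == "__main__":
    # Hypothetical record emitted by the mapper (no tab in the line).
    record = "folder/file_0001"
    print(repr(written_line(record, "\t")))  # default separator
    print(repr(written_line(record, "")))    # empty separator
```

With the default separator the modeled line ends in a tab; with the empty separator only the key is written.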
3. You must put a dummy file in the input path (mapreduce.input.fileinputformat.inputdir), even though it is empty and its content is never used.
Otherwise, no mapper is executed.
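For readers who want a starting point, a mapper in the spirit of list_mapper.py might look like the sketch below. The real script is not shown in this post, so everything here is an assumption: run_mapper and the example names are hypothetical, and the actual listing logic (presumably done with boto) is replaced by a fixed list. The sketch ignores whatever arrives on stdin, which is why the dummy input file can be empty, and writes one line per item without a tab.

```python
import io


def run_mapper(stdin, stdout, names):
    """Hypothetical streaming mapper body.

    Reads and discards all input, then writes one name per line.
    """
    # Drain the input so the framework can finish feeding the mapper.
    for _ in stdin:
        pass
    for name in names:
        stdout.write(name + "\n")


if __name__ == "__main__":
    # Local demonstration with in-memory streams. In a real streaming job
    # the call would be run_mapper(sys.stdin, sys.stdout, names).
    fake_input = io.StringIO("")  # stands in for the empty dummy file
    output = io.StringIO()
    run_mapper(fake_input, output, ["folder/file_0001", "folder/file_0002"])
    print(output.getvalue(), end="")
```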
Reference
http://stackoverflow.com/questions/25340961/how-to-mention-a-combiner-in-oozie-while-using-streaming-jar
https://github.com/yahoo/oozie/tree/master/examples/src/main/apps/streaming
https://oozie.apache.org/docs/3.2.0-incubating/WorkflowFunctionalSpec.html#a3.2.2.2_Streaming