Sunday, 9 November 2014

MapReduce Python Streaming Action in Oozie

A MapReduce Python streaming action in an Oozie workflow on CDH 5.1.2:


<workflow-app name="streaming_wf" xmlns="uri:oozie:workflow:0.4">
    <start to="fetch_filelist_mr"/>

    <action name="fetch_filelist_mr">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${file_list_path}"/>
            </prepare>
            <streaming>
                <mapper>python ${fetch_filelist_python_script} parameters</mapper>
            </streaming>
            <configuration>
                <property>
                    <name>mapreduce.input.fileinputformat.inputdir</name>
                    <value>${folder_list_path}</value>
                </property>
                <property>
                    <name>mapreduce.output.fileoutputformat.outputdir</name>
                    <value>${file_list_path}</value>
                </property>
                <property>
                    <name>mapreduce.job.reduces</name>
                    <value>20</value>
                </property>
                <property>
                    <name>mapreduce.map.java.opts</name>
                    <value>-Xmx2576351232</value>
                </property>
                <property>
                    <name>mapreduce.output.textoutputformat.separator</name>
                    <value></value>
                </property>
            </configuration>
            <file>${fetch_filelist_python_path}</file>
            <file>${lib_boto_path}</file>
        </map-reduce>
        <ok to="end"/>
        <error to="kill"/>
    </action>

    <kill name="kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <end name="end"/>
</workflow-app>


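1. The mapper script is referenced through two properties:
fetch_filelist_python_path = ${appPath}/python/list_mapper.py
fetch_filelist_python_script = list_mapper.py
The <file> element ships the script from HDFS into the task's working directory, so the <mapper> command refers to it by file name only.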

2. By default, Hadoop writes a tab ('\t') as the separator between key and value in text output.
To remove the trailing '\t' after the key on each output line, set the separator to an empty string:
mapreduce.output.textoutputformat.separator = ''

3. The input path (mapreduce.input.fileinputformat.inputdir) must contain at least one file, even though it can be an empty dummy file whose content is never used.
Otherwise, no mapper is executed.
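
Notes 2 and 3 can be illustrated with a small Python sketch. It is not the actual list_mapper.py: list_files() and its sample paths are hypothetical placeholders for the real listing logic, and as_text_output() is only a simplified model, assuming that streaming takes the text before the first tab as the key and that the text output format writes key + separator + value.

```python
import io


def list_files():
    """Hypothetical stand-in for the real listing logic (e.g. a boto call)."""
    return ["folder_a/file_1", "folder_a/file_2"]


def run_mapper(stdin, stdout, lister=list_files):
    """Ignore the (dummy) input records and emit one file path per line."""
    for _ in stdin:  # drain stdin so the streaming pipe is read to the end
        pass
    for path in lister():
        stdout.write(path + "\n")


def as_text_output(line, separator="\t"):
    """Simplified model of how one mapper line ends up in the output file.

    Assumption: the text before the first tab is the key, the rest is the
    value, and the output line is key + separator + value.
    """
    key, _, value = line.partition("\t")
    return key + separator + value


# Simulate one map task whose input is the empty dummy file.
# In a real streaming script, sys.stdin and sys.stdout would be passed instead.
out = io.StringIO()
run_mapper(io.StringIO(""), out)
lines = out.getvalue().splitlines()

print([as_text_output(l) for l in lines])                # default separator
print([as_text_output(l, separator="") for l in lines])  # empty separator
```

With the default separator each line in this model ends with a trailing tab, because the value is empty but still written; with the empty separator only the key remains. The mapper never uses its input, which is why an empty dummy file is enough to get a map task started.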

Reference
http://stackoverflow.com/questions/25340961/how-to-mention-a-combiner-in-oozie-while-using-streaming-jar
https://github.com/yahoo/oozie/tree/master/examples/src/main/apps/streaming
https://oozie.apache.org/docs/3.2.0-incubating/WorkflowFunctionalSpec.html#a3.2.2.2_Streaming
