Saturday, 6 September 2014

Hadoop Streaming 1: Basic


Although Hadoop is primarily designed to work with Java code, it supports other languages via Hadoop Streaming. This utility jar launches your program as a subprocess, sends it input via stdin, and gathers results via stdout. It allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

For example:
$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc

How does it work?

In the above example, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.
When an executable is specified for mappers, each mapper task will launch the executable as a separate process when the mapper is initialized. Likewise, when an executable is specified for reducers, each reducer task will launch the executable as a separate process when the reducer is initialized.
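
To make the line-oriented protocol concrete, here is a small word-count sketch in Python. It is only an illustration (the script name and the word-count logic are mine, not part of the streaming utility): the mapper emits tab-separated key/value pairs, and the reducer relies on the framework having already sorted the mapper output by key.

#!/usr/bin/env python
# wordcount_streaming.py -- a minimal sketch of the streaming protocol.
# The same file can be used as either side of the job, e.g.
#   -mapper  "python wordcount_streaming.py map"
#   -reducer "python wordcount_streaming.py reduce"
import sys

def mapper():
    # Read raw input lines from stdin; emit tab-separated key/value pairs.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

def reducer():
    # Streaming sorts mapper output by key, so identical keys arrive
    # contiguously; accumulate counts until the key changes.
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key:
            if current_key is not None:
                print("%s\t%d" % (current_key, count))
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print("%s\t%d" % (current_key, count))

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
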
Job Submission:
You can specify any executable as the mapper and/or the reducer. The executables do not need to pre-exist on the machines in the cluster; however, if they don't, you will need to use the "-file" option to tell the framework to pack your executable files as part of the job submission. For example:
$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myPythonScript.py \
    -reducer /bin/wc \
    -file myPythonScript.py
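
The post does not show what myPythonScript.py contains. Purely as a hypothetical example, the mapper below passes through only the lines containing the string "ERROR", so the /bin/wc reducer ends up reporting how many matching lines (and words/characters) there were. The script needs a shebang line and execute permission, or it can be invoked as "python myPythonScript.py" instead.

#!/usr/bin/env python
# myPythonScript.py -- hypothetical mapper contents (not shown in the
# original post): pass through only the lines containing "ERROR", so
# the /bin/wc reducer counts how many such lines exist.
import sys

for line in sys.stdin:
    if "ERROR" in line:
        # Streaming treats everything up to the first tab as the key;
        # emitting the whole line unchanged is fine for this purpose.
        sys.stdout.write(line)
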

Specifying Map-Only Jobs

Often, you may want to process input data using a map function only. To do this, simply set mapred.reduce.tasks to zero. The Map/Reduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.
    -D mapred.reduce.tasks=0
 

For example, consider the problem of zipping (compressing) a set of files across the Hadoop cluster.
1. Generate a file containing the full HDFS paths of the input files.
2. Create a mapper script which, given a filename, copies the file to the local disk, gzips it, and puts it back into the desired output directory; each map task will receive one file name as its input (see the mapper sketch below).
3. Sometimes we want to split the work evenly. We can set mapred.reduce.tasks to a non-zero value even if no reducer is specified; in that case the framework uses the default IdentityReducer, which simply passes its input through.
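
A minimal sketch of such a mapper, assuming each input line carries one full HDFS path and using a placeholder output directory (the script name, the directory, and the emitted output line are illustrative, not from the original example):

#!/usr/bin/env python
# zip_mapper.py -- a sketch of the map-only zipping step. Each input
# line is assumed to be the full HDFS path of one file; the output
# directory below is a placeholder, not taken from the original post.
import os
import subprocess
import sys

OUTPUT_DIR = "/user/output/zipped"   # hypothetical HDFS destination

for line in sys.stdin:
    hdfs_path = line.strip()
    if not hdfs_path:
        continue
    local_name = os.path.basename(hdfs_path)
    # Copy the file from HDFS to the task's local working directory.
    subprocess.check_call(["hadoop", "fs", "-get", hdfs_path, local_name])
    # Compress it locally; gzip replaces the file with <name>.gz.
    subprocess.check_call(["gzip", local_name])
    # Put the compressed file back into the desired output directory.
    subprocess.check_call(["hadoop", "fs", "-put", local_name + ".gz", OUTPUT_DIR])
    # Emit the path of the compressed file so the job output records it.
    print("%s/%s.gz" % (OUTPUT_DIR, local_name))
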



Reference:
http://hadoop.apache.org/docs/r1.2.1/streaming.html

