Tuesday 9 September 2014

Hadoop Streaming 3: Python in Map-Reduce Mode


The goal is to download files from an SFTP server. The mapper below takes a list of folder paths as input and prints the path of every file under them.

Don't forget "#!/usr/bin/python" at the beginning of the .py file.
Without it, the script is run by the shell instead of Python and fails with errors such as:
    import: command not found
    syntax error near unexpected token `('
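
A quick check before submitting the job can catch a missing shebang. The sketch below is only an illustration: has_shebang and write_temp_script are hypothetical helper names, not part of Hadoop or any library, and the two temporary files stand in for real mapper scripts.

```python
import os
import tempfile


def has_shebang(path):
    """Return True if the file's first line starts with '#!'."""
    with open(path, "rb") as f:
        return f.readline().startswith(b"#!")


def write_temp_script(text):
    """Write text to a temporary .py file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp(suffix=".py")
    with os.fdopen(fd, "w") as f:
        f.write(text)
    return path


# One script with the shebang line, one without.
good = write_temp_script("#!/usr/bin/python\nimport sys\n")
bad = write_temp_script("import sys\n")

good_ok = has_shebang(good)
bad_ok = has_shebang(bad)
print(good_ok, bad_ok)

os.remove(good)
os.remove(bad)
```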

#!/usr/bin/python
from __future__ import print_function
import os
import time
import zipimport
import sys

ftp_username='xxx'
ftp_password='xxx'
ftp_host='xxx'

# Load paramiko first: pysftp imports it, and load_module registers it in sys.modules.
importer_paramiko = zipimport.zipimporter('paramiko.mod')
paramiko = importer_paramiko.load_module('paramiko')

importer_pysftp = zipimport.zipimporter('pysftp.mod')
pysftp = importer_pysftp.load_module('pysftp')

# connect to the SFTP server; the same connection is reused for every input folder
sftp = pysftp.Connection(ftp_host, username=ftp_username, password=ftp_password)

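for line in sys.stdin:
    d_path = line.strip()
    if not d_path:
        continue  # skip blank input lines
    # create the callback collector before walktree uses it (one per folder)
    wtcb = pysftp.WTCallbacks()
    sftp.walktree(d_path, fcallback=wtcb.file_cb, dcallback=wtcb.dir_cb, ucallback=wtcb.unk_cb)
    # flist holds the paths of all files found under d_path
    for fpath in wtcb.flist:
        print(fpath)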
sftp.close()
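
Because a streaming mapper only reads lines from stdin and writes lines to stdout, the loop can be tried locally without an SFTP server. The sketch below is not the script above: it swaps the SFTP walk for os.walk on the local filesystem, and list_files and run_mapper are hypothetical names used only for illustration.

```python
import io
import os
import sys
import tempfile


def list_files(d_path):
    """Return every file path under d_path (local stand-in for the SFTP walk)."""
    found = []
    for root, _dirs, files in os.walk(d_path):
        for name in sorted(files):
            found.append(os.path.join(root, name))
    return found


def run_mapper(lines, out=sys.stdout):
    """Mimic the mapper loop: one folder path per input line, one file path per output line."""
    for line in lines:
        d_path = line.strip()
        if not d_path:
            continue  # ignore blank input lines
        for fpath in list_files(d_path):
            print(fpath, file=out)


# Demo: build a small folder tree and feed its path to the mapper loop.
demo_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_dir, "sub"))
for rel in ("a.txt", os.path.join("sub", "b.txt")):
    open(os.path.join(demo_dir, rel), "w").close()

buf = io.StringIO()
run_mapper([demo_dir + "\n"], out=buf)
demo_output = buf.getvalue().splitlines()
```

To use it as a local test script, replace the demo section with run_mapper(sys.stdin) and pipe a folder list into it.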

The -output option is required, even if the mapper produces no output.

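For CDH 5.1.2, locate hadoop-streaming.jar with:
find /opt/cloudera/parcels/ -name "*hadoop-streaming*.jar"
The command below uses the jar path from a CDH 4.7.0 install; substitute the path found on your cluster.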


$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.7.0.jar \
-D mapred.reduce.tasks=10 \
-mapper ftp_file_list_mapper.py \
-input data/folder_list \
-output data/file_list \
-file ftp_file_list_mapper.py \
-file lib/pysftp.mod \
-file lib/paramiko.mod

We can set the number of reduce tasks even for a job that only specifies a mapper. Hadoop then uses the default IdentityReducer from mapred.lib, which writes the mapper output through unchanged.
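
To see what an identity reducer means for the output, here is a minimal Python sketch of the same behaviour: every line coming from the mappers is written out as-is. It only illustrates the idea; the real IdentityReducer is a Java class inside Hadoop, identity_reduce is a hypothetical name, and the sample paths are made up.

```python
import io


def identity_reduce(lines, out):
    """Write every input line to out unchanged, as an identity reducer would."""
    for line in lines:
        out.write(line)


# Made-up mapper output: one file path per line.
mapper_output = ["/data/a.txt\n", "/data/sub/b.txt\n"]

buf = io.StringIO()
identity_reduce(mapper_output, buf)
reduced = buf.getvalue()
```

With mapred.reduce.tasks=10, each of the ten reduce tasks writes its own part file under the -output directory, so the file list ends up split across those files (some of which may be empty).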

