Tuesday 16 December 2014

Distcp Action in Oozie

1. Distcp Action

<action name="distcp-node">
    <distcp xmlns="uri:oozie:distcp-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/${outputDir}"/>
        </prepare>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/input-data/text/data.txt</arg>
        <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/${outputDir}/data.txt</arg>
    </distcp>
    <ok to="end"/>
    <error to="fail"/>
</action>
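A matching job.properties for this workflow might look like the sketch below. All host names and paths are illustrative and must be adjusted to your cluster; the distcp action also typically needs the Oozie sharelib enabled:

job.properties:

nameNode=hdfs://namenode-host:8020
jobTracker=jobtracker-host:8032
queueName=default
examplesRoot=examples
outputDir=distcp-output
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/distcp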

2. Shell Action

<action name="distcp_shell">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path='${file_path}'/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>${distcp_script}</exec>
            <file>${distcp_script_path}#${distcp_script}</file>
            <capture-output/>
        </shell>
        <ok to="end"/>
        <error to="send_email"/>
    </action>
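Because the shell action declares <capture-output/>, any key=value lines the script writes to stdout can be read by later actions through the wf:actionData EL function. As a sketch (the 'status' key is hypothetical; the script would have to echo status=ok itself):

<decision name="check_copy">
    <switch>
        <case to="end">${wf:actionData('distcp_shell')['status'] eq 'ok'}</case>
        <default to="send_email"/>
    </switch>
</decision>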


Shell Script:

#!/bin/sh
# Copy the files listed in part-00000 from S3 into HDFS.
# Replace the placeholder credentials with real AWS keys.
ACCESSKEYID=xxxx
SECRETACCESSKEY=xxxx

# Run the copy as this HDFS user rather than the Oozie launcher user.
export HADOOP_USER_NAME=current_user

hadoop distcp -D fs.s3n.awsAccessKeyId=$ACCESSKEYID \
  -D fs.s3n.awsSecretAccessKey=$SECRETACCESSKEY \
  -f file_list_path/part-00000 \
  hdfs://cdh/current_user/file_data_path/


Please Note:
When you distcp a single file (e.g. file1.txt) into a target directory folder1, make sure folder1 already exists. Otherwise the copy is written to a file named folder1 (a file instead of a folder).

hadoop fs -mkdir folder1
hadoop distcp folder0/file1.txt   folder1


3. Command lines from one HDFS to another HDFS

(1) Copy the folder "files", with all of its child folders and files, into the "target" folder.
hadoop distcp  /user/me/source/files/  hdfs://namenode/user/me/target/

This expands the namespace under /user/me/source/files on the source NameNode into a temporary file, partitions its contents among a set of map tasks, and starts a copy on each TaskTracker from the source cluster to the destination. Note that DistCp expects absolute paths.
By default, files already existing at the destination are skipped (i.e. not replaced by the source file). A count of skipped files is reported at the end of each job.

-overwrite: Overwrite the destination

-update: Overwrite only if the source and destination sizes differ
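The per-file decision DistCp makes under these flags can be sketched as a small shell function (a simplified model for illustration, not the real implementation; newer DistCp versions may also compare checksums with -update):

```shell
#!/bin/sh
# Simplified model of DistCp's per-file copy/skip decision.
# Usage: should_copy EXISTS SRC_SIZE DST_SIZE MODE
#   EXISTS: yes|no (file already present at destination)
#   MODE:   default|overwrite|update
should_copy() {
    exists=$1; src=$2; dst=$3; mode=$4
    if [ "$exists" = "no" ]; then
        echo copy                      # destination missing: always copy
        return
    fi
    case "$mode" in
        overwrite) echo copy ;;        # -overwrite: unconditionally replace
        update)                        # -update: copy only if sizes differ
            if [ "$src" -ne "$dst" ]; then echo copy; else echo skip; fi ;;
        *) echo skip ;;                # default: skip files that already exist
    esac
}

should_copy yes 100 50 default     # skip: existing file is left alone
should_copy yes 100 50 update      # copy: -update sees different sizes
should_copy yes 100 100 overwrite  # copy: -overwrite always replaces
```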

(2) Put the file paths in file_list.txt, then copy only those files into the "target" folder.
hadoop distcp -f /user/me/file_list.txt hdfs://namenode/user/me/target/
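For illustration, file_list.txt would contain one absolute source path per line (the paths below are made up):

/user/me/source/files/data1.txt
/user/me/source/files/data2.txt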


Reference:
http://oozie.apache.org/docs/4.0.0/DG_DistCpActionExtension.html
http://hadoop.apache.org/docs/r0.18.3/distcp.html
