Alvin's Big Data Notebook : Input data dependencies of Oozie workflows

You can define input data dependencies for your Coordinator Job. The job will not run until its input directory has been created.
- e.g. hdfs://localhost:9000/tmp/revenue_feed/2010/06/01/03/

$ cat coordinator.xml
<coordinator-app name="MY_APP" frequency="1440" start="2009-02-01T00:00Z" end="2009-02-07T00:00Z" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
   <datasets>
      <dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
         <uri-template>hdfs://localhost:9000/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
      </dataset>
   </datasets>
   <input-events>
      <data-in name="coordInput1" dataset="input1">
          <start-instance>${coord:current(-23)}</start-instance>
          <end-instance>${coord:current(0)}</end-instance>
      </data-in>
   </input-events>
   <action>
      <workflow>
         <app-path>hdfs://localhost:9000/tmp/workflows</app-path>
      </workflow>
   </action>     
</coordinator-app>

This Coordinator Job runs every 1440 minutes (24 hours).
It will start on 2009-02-01T00:00Z and end on 2009-02-07T00:00Z, a span of 6 days. The end time itself does not produce a job, so the Coordinator jobs will be materialized at these six times:
- 2009-02-01T00:00Z
- 2009-02-02T00:00Z
- 2009-02-03T00:00Z
- 2009-02-04T00:00Z
- 2009-02-05T00:00Z
- 2009-02-06T00:00Z
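
The schedule above can be reproduced with a few lines of date arithmetic. This is only a minimal sketch in Python, not Oozie code: the function name `materialization_times` is made up for illustration, and it assumes the end time is treated as exclusive, which matches the list above.

```python
from datetime import datetime, timedelta, timezone


def materialization_times(start, end, frequency_minutes):
    """Yield nominal times from start up to, but not including, end.

    Hypothetical helper for illustration; it is not part of Oozie.
    """
    step = timedelta(minutes=frequency_minutes)
    current = start
    while current < end:
        yield current
        current += step


# Values taken from the coordinator-app element in coordinator.xml.
start = datetime(2009, 2, 1, tzinfo=timezone.utc)
end = datetime(2009, 2, 7, tzinfo=timezone.utc)
times = list(materialization_times(start, end, 1440))

for t in times:
    print(t.strftime("%Y-%m-%dT%H:%MZ"))
```

Running it prints the six timestamps from 2009-02-01T00:00Z through 2009-02-06T00:00Z.
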
However, these jobs may not run at the specified times because we added input data dependencies for each job. When each job is materialized, Oozie will check if the specified input data is available.
- If the data is available, the job will run.
- If the data is not available, the job will wait in the Oozie queue until the input data is created.
Each of these daily jobs depends on the last 24 hours of hourly data from the input1 feed. Within the input-events section, the data-in block specifies the start and end instances for the input data dependencies.
- ${coord:current(0)} is a function that returns the current instance of the input1 dataset, i.e. the hourly instance at the job's materialization time
- ${coord:current(-23)} is a function that returns the instance 23 positions before the current one; since input1 is hourly, that is the instance from 23 hours earlier
- For the Coordinator Job that is materialized on 2009-02-01T00:00Z, the start-instance will be 2009-01-31T01:00Z (23 hours earlier) and the end-instance will be 2009-02-01T00:00Z.

   <input-events>
      <data-in name="coordInput1" dataset="input1">
          <start-instance>${coord:current(-23)}</start-instance>
          <end-instance>${coord:current(0)}</end-instance>
      </data-in>
   </input-events>
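
To make the instance arithmetic concrete, here is a rough Python sketch of how a call like ${coord:current(n)} could be resolved for the job materialized on 2009-02-01T00:00Z. The function `resolve_current` is a hypothetical stand-in written for this post, not Oozie's implementation; it assumes instances are counted in whole dataset periods from the initial-instance.

```python
from datetime import datetime, timedelta, timezone


def resolve_current(n, nominal_time, initial_instance, frequency_minutes):
    """Return the dataset instance n periods away from the current one.

    Hypothetical helper: the "current" instance is taken to be the latest
    instance that is not after nominal_time. Instances older than
    initial_instance are ignored, so None is returned for them.
    """
    step = timedelta(minutes=frequency_minutes)
    # Whole dataset periods between the first instance and the nominal time.
    periods = (nominal_time - initial_instance) // step
    instance = initial_instance + (periods + n) * step
    return instance if instance >= initial_instance else None


# Values taken from the dataset and coordinator-app definitions.
initial = datetime(2009, 1, 1, tzinfo=timezone.utc)
nominal = datetime(2009, 2, 1, tzinfo=timezone.utc)

start_instance = resolve_current(-23, nominal, initial, 60)
end_instance = resolve_current(0, nominal, initial, 60)
instances = [resolve_current(n, nominal, initial, 60) for n in range(-23, 1)]

print(start_instance.strftime("%Y-%m-%dT%H:%MZ"))  # 2009-01-31T01:00Z
print(end_instance.strftime("%Y-%m-%dT%H:%MZ"))    # 2009-02-01T00:00Z
print(len(instances))                               # 24
```

The range from current(-23) to current(0) is inclusive on both ends, which is why it covers 24 hourly instances rather than 23.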

The datasets section defines metadata for all of the input datasets:
- name = logical name for the dataset
- frequency = how often data is written to this dataset, in minutes (60 = hourly)
- initial-instance = timestamp for the first instance of this dataset. Older instances will be ignored.
- uri-template = HDFS directory structure for the dataset
In this example, the HDFS directory structure for the input1 dataset is as follows:

/tmp/revenue_feed/2009/01/01/00/
/tmp/revenue_feed/2009/01/01/01/
/tmp/revenue_feed/2009/01/01/02/

   <datasets>
      <dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
         <uri-template>hdfs://localhost:9000/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
      </dataset>
   </datasets>
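
As an illustration of how the uri-template maps a dataset instance to a directory, the following Python sketch substitutes the ${YEAR}, ${MONTH}, ${DAY} and ${HOUR} placeholders with zero-padded values. The helper `expand_uri` is hypothetical and only mimics the substitution; Oozie performs this step itself.

```python
from datetime import datetime, timedelta, timezone

# Template copied from the input1 dataset definition.
TEMPLATE = "hdfs://localhost:9000/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/${HOUR}"


def expand_uri(template, instance):
    """Fill the date placeholders of a uri-template for one instance.

    Hypothetical helper for illustration; it is not part of Oozie.
    """
    values = {
        "YEAR": f"{instance.year:04d}",
        "MONTH": f"{instance.month:02d}",
        "DAY": f"{instance.day:02d}",
        "HOUR": f"{instance.hour:02d}",
    }
    for name, value in values.items():
        template = template.replace("${" + name + "}", value)
    return template


# First three hourly instances, starting at the initial-instance.
initial = datetime(2009, 1, 1, tzinfo=timezone.utc)
paths = [expand_uri(TEMPLATE, initial + timedelta(hours=h)) for h in range(3)]

for p in paths:
    print(p)
```

The three printed URIs correspond to the /tmp/revenue_feed/2009/01/01/00, 01 and 02 directories listed above.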

Reference:

https://github.com/yahoo/oozie/wiki/Oozie-Coord-Use-Cases
http://www.jilles.net/perma/2014/05/28/setting-up-an-hadoop-oozie-coordinator-and-workflow/

Friday, 14 November 2014
