Thursday, 3 July 2014

Shell and Java Action Caveats in Oozie

Shell Action Caveats

The Shell action has the following caveats:
  • Interactive commands are not supported.
  • In an unsecured cluster, everything runs as the user who started the TaskTracker on which our shell script is running (the mapred user in CDH4); in a “Kerberized” cluster, it runs as the UNIX user of whoever submitted the workflow. This is in contrast to MapReduce-based actions, which, for the purposes of interaction with Hadoop, run as the user who submitted the workflow, although the UNIX process for the task still runs as mapred.
  • The Shell action is executed on an arbitrary node in the cluster.
  • Different operating systems may have different versions of the same shell commands.
The implications of that third caveat are very important. Oozie executes the Shell action in the same way it executes any of the other actions: as a MapReduce job. In this case, it’s a 1-mapper-0-reducer job, which is why it can be executed on any node in the cluster. This means that any command or script we want to execute has to be available on that node; because we don’t know which node the Shell action will be executed on, the command or script has to be available on all nodes! This is fine for typical built-in shell commands like echo or grep, but can be more problematic for programs such as matlab, which must not only be installed but may also require a license. For our own script, we’ll sidestep the problem by putting it in the same directory as the workflow.xml and using the <file> tag to have Oozie copy it to the proper node for us.
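For programs that can't be shipped alongside the workflow, a script can at least fail fast with a clear message when the node it lands on doesn't have them. The following is a minimal sketch of that idea, not something from the workflow itself; require_cmd is a hypothetical helper name, and it relies only on the POSIX command -v and uname -n utilities.

```shell
#!/bin/sh
# Hypothetical helper (not part of Oozie): succeed only if every named
# command can be found on the PATH of the node running this script.
require_cmd() {
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      # Report which node is missing the command, then signal failure.
      echo "missing command on $(uname -n): $cmd" >&2
      return 1
    fi
  done
  return 0
}
```

A script would call it near the top, for example require_cmd tail grep || exit 1, so that a missing dependency shows up as a non-zero exit code with a readable message in the action's log rather than as a confusing failure halfway through.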
Even though two operating systems, or even two different versions of the same operating system, may have the same built-in commands or programs, they may behave differently or accept different arguments. For example, we’ll be using the tail command later; on Mac OS 10.7.5 we can specify the number of lines with the following arguments, but this won’t work properly on CentOS 6.2:
tail +2 hour.txt

This doesn’t mean that we can’t use the tail command though; it just means that we have to be careful to ensure that all of the machines on which our Shell action could possibly run have compatible versions of the built-in commands. For example, the following arguments work correctly on both Mac OS 10.7.5 and CentOS 6.2:
tail -n +2 hour.txt

That said, the script has been tested on Mac OS 10.7.5 and CentOS 6.2, but may require minor tweaking on other operating systems or versions.
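
One way to guard against this kind of incompatibility is to probe the command's behavior on the node before relying on it. The sketch below assumes a script that needs to drop the first line of a file; tail_supports_n_plus and skip_header are hypothetical helper names introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if this node's tail accepts "-n +K",
# meaning "start output at line K".
tail_supports_n_plus() {
  probe=$(printf 'a\nb\n' | tail -n +2 2>/dev/null) || return 1
  [ "$probe" = "b" ]
}

# Hypothetical helper: print a file without its first (header) line,
# refusing to run if the probe above fails.
skip_header() {
  tail_supports_n_plus || {
    echo "tail -n +K is not supported on this node" >&2
    return 1
  }
  tail -n +2 "$1"
}
```

Calling skip_header hour.txt then behaves like tail -n +2 hour.txt on nodes where that form works, and fails with an explicit message on nodes where it does not.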

Java Action Caveats

The Java action has the following caveats:
  • The Java action is executed on an arbitrary node in the cluster.
  • Calling System.exit(int n) will always make the Java action do an “error to” transition.
The Java action is executed on an arbitrary node in the cluster for the same reason the Shell action is; however, this is typically less problematic for the Java action because its external resources are usually JAR files that we’d already be including anyway.
It is also important that our Java code not call System.exit(int n), as this will make the Java action do an “error to” transition even if the exit code is 0. Instead, an “ok to” transition is indicated by main finishing normally, and an “error to” transition is indicated by throwing an exception.
