Introduction To Oozie
Oozie is a Java web application that schedules Hadoop jobs and stores the details of their workflows. It reports the status of a workflow at any point in time and can run several workflows concurrently.
An Oozie job submission involves the following files:
1. coordinator.xml
2. job.coordinator.properties
3. workflow.xml
1. coordinator.xml
The coordinator.xml triggers a workflow when a particular event occurs or at regular intervals of time.
For example:
a. A job needs to be triggered on the arrival of a particular data set.
b. A job needs to be triggered on a schedule (daily, hourly, and so on).
<coordinator-app name="ReadWrite" start="${start}"
    end="${end}" frequency="${coord:days(1)}" timezone="UTC"
    xmlns="uri:oozie:coordinator:0.1">
  <action>
    <workflow>
      <app-path>${oozie_app_path}</app-path>
      <configuration>
        <property>
          <name>filename</name>
          <value>${filename}</value>
        </property>
        <property>
          <name>Read_Write_Classname</name>
          <value>${Read_Write_Classname}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
Parameters
1. start - Start time of the job.
2. end - End time of the job.
3. frequency - The interval between successive executions of the job. Possible values include:
3.1 ${coord:days(1)} – triggers the job daily
3.2 60 – triggers the job hourly (every 60 minutes)
4. app-path - HDFS path of the workflow.xml.
5. property - Properties that the workflow requires.
The values of these parameters are specified in the job.coordinator.properties file.
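To make the interplay of start, end and frequency concrete, here is a minimal Java sketch that lists the nominal run times a daily or hourly frequency would produce. It is only an illustration, not Oozie's scheduler: the class and method names are hypothetical, the start and end values are taken from the sample property file in the next section (with seconds added so that Instant.parse accepts them), and the sketch assumes the end time is exclusive and that one day is exactly 24 hours, which holds for the UTC timezone used here.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

class NominalTimes {

    // Lists run times from start (inclusive) to end (exclusive, an assumption of this sketch).
    static List<Instant> nominalTimes(Instant start, Instant end, Duration frequency) {
        List<Instant> times = new ArrayList<>();
        for (Instant t = start; t.isBefore(end); t = t.plus(frequency)) {
            times.add(t);
        }
        return times;
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2013-02-19T00:00:00Z");
        Instant end = Instant.parse("2013-03-19T00:00:00Z");

        // Equivalent of frequency="${coord:days(1)}" in UTC.
        List<Instant> daily = nominalTimes(start, end, Duration.ofDays(1));
        System.out.println(daily.size() + " daily runs, first " + daily.get(0)
                + ", last " + daily.get(daily.size() - 1));

        // Equivalent of frequency="60" (minutes).
        List<Instant> hourly = nominalTimes(start, end, Duration.ofMinutes(60));
        System.out.println(hourly.size() + " hourly runs");
    }
}
```

Under these assumptions the sample window yields 28 daily runs or 672 hourly runs.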
2. job.coordinator.properties
The parameters required to run the workflow are specified in this property file.
For example:
a. The path of the workflow.xml file.
b. The run-time arguments for the Java class (if a Java action is used).
The property file for the coordinator.xml above is shown below. The properties that the coordinator passes to the workflow, filename and Read_Write_Classname, appear under the application-specific settings.
##############################
# Hadoop settings
##############################
nameNode=hdfs://<Name Node IP>:<Name Node Port>
jobTracker=<Job tracker IP>:<Job Tracker Port>
queueName=default
##############################
# oozie settings
##############################
appName=ReadWrite
#Workflow path
oozie_app_path=${nameNode}/user/${user.name}/${appName}
oozie.use.system.libpath=true
###############################
# oozie coordinator settings
###############################
oozie.coord.application.path=${oozie_app_path}
start=2013-02-19T00:00Z
end=2013-03-19T00:00Z
initial_instance=2013-02-19T00:00Z
timeOut=-1
conCur=1
# execution order can be FIFO, LIFO, or LAST_ONLY
execOrder=FIFO
throttle=7
flag=_done
################################
# application specific settings
################################
filename=WriteFile.txt
Read_Write_Classname=com.test.ReadWrite
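The property file relies on ${name} references, for example oozie_app_path is built from nameNode, user.name and appName. The following Java sketch shows the substitution idea only; it is not Oozie's own resolver. The class name, the example host namenode.example:8020 and the lookup order (file entries first, then JVM system properties such as user.name) are assumptions made for the illustration, and cyclic references are not handled.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PropertyResolver {

    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    // Expands ${name} references using the file's own entries first, then JVM system properties.
    static String resolve(String value, Properties props) {
        Matcher m = VAR.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String raw = props.getProperty(name, System.getProperty(name));
            // Unknown references are left untouched; known ones are expanded recursively.
            String replacement = (raw == null) ? m.group() : resolve(raw, props);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(
                "nameNode=hdfs://namenode.example:8020\n"   // hypothetical host and port
              + "appName=ReadWrite\n"
              + "oozie_app_path=${nameNode}/user/${user.name}/${appName}\n"
              + "oozie.coord.application.path=${oozie_app_path}\n"));
        System.out.println(resolve(props.getProperty("oozie.coord.application.path"), props));
    }
}
```

Running it prints the fully expanded application path for the current user, which is a quick way to check a property file for misspelled references before submitting the job.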
3. workflow.xml
A workflow is a collection of actions that are executed either sequentially or in parallel. The flow of actions is specified in the workflow.xml.
• The Java code to be run by the workflow is exported as a jar file and specified in the <file></file> tag.
• The arguments passed to the Java class are specified in <arg></arg> tags.
• The class that contains the main method is specified in the <main-class></main-class> tag.
The workflow.xml below implements a Java action:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="HDFS-workflow">
  <start to="Read-Write" />
  <action name="Read-Write">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <main-class>${Read_Write_Classname}</main-class>
      <arg>${filename}</arg>
      <file>ReadWriteHDFS.jar</file>
      <capture-output />
    </java>
    <ok to="end" />
    <error to="fail" />
  </action>
  <kill name="fail">
    <message>Java action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end" />
</workflow-app>
The Java class that is triggered through the workflow Java action is ReadWrite.java.
ReadWrite.java
Execution of the Java action begins at the main method of the Java class. Oozie runs the Java application as a map-reduce job with a single mapper task, and the workflow waits for the Java action to complete before it continues.
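The workflow above includes <capture-output />. For that element to capture anything, the Java class has to write its values in Java properties format to the file named by the oozie.action.output.properties system property, which Oozie sets for the action. The sketch below shows one way to do this; the class name and the triggerFile key are hypothetical, and the fallback to standard output when the property is missing is only there so the example can be run outside Oozie.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Properties;

class CaptureOutputExample {

    // Writes key/value pairs to the file named by the oozie.action.output.properties system property.
    static void writeOutput(Properties output) throws IOException {
        String target = System.getProperty("oozie.action.output.properties");
        if (target == null) {
            // Not running under Oozie: print instead (an assumption made for local testing).
            output.store(System.out, null);
            return;
        }
        try (OutputStream os = new FileOutputStream(target)) {
            output.store(os, null);
        }
    }

    public static void main(String[] args) throws IOException {
        Properties output = new Properties();
        // "triggerFile" is a hypothetical key chosen for this illustration.
        output.setProperty("triggerFile", args.length > 0 ? args[0] : "unknown");
        writeOutput(output);
    }
}
```

A later workflow node could then read the value with an expression such as ${wf:actionData('Read-Write')['triggerFile']}.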
Scenario
When a workflow must be triggered by the existence of a data file, this Java action can create a trigger file that signals the data file's existence and copy the contents of the data file into it. The trigger file can then be used to initiate the workflow.
Steps Involved:
1. The coordinator initiates the workflow execution
2. The workflow calls the main function of the Java class
3. The Java class performs the necessary action and returns the control to the workflow.
4. The workflow continues its execution.
package com.test;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWrite {

    public static void main(String[] args) {
        String writePath = "< path of the file to which the content has to be written>";
        String readPath = "< path of the file from which the content has to be read>";
        // The workflow passes the file name as the first <arg>.
        String fileName = args[0];
        System.out.println("calling WriteFile");
        writeFile(fileName, readPath, writePath);
    }

    private static void writeFile(String fileName, String readPaths, String writePath) {
        try {
            Configuration conf = new Configuration();
            conf.addResource(new Path("<path of core-site.xml>"));
            conf.addResource(new Path("<path of hdfs-site.xml>"));
            FileSystem fileSystem = FileSystem.get(conf);

            Path path = new Path(writePath + fileName + ".txt");
            Path readPath = new Path(readPaths);

            // Nothing to copy if the source file is missing.
            if (!fileSystem.exists(readPath)) {
                System.out.println("Source file does not exist");
                return;
            }
            // Remove a stale trigger file before writing a new one.
            if (fileSystem.exists(path)) {
                System.out.println("File already exists");
                fileSystem.delete(path, false);
            }
            // Open the source only after the existence check; both streams are closed even on failure.
            try (FSDataInputStream in = fileSystem.open(readPath);
                 FSDataOutputStream out = fileSystem.create(path)) {
                byte[] b = new byte[1024];
                int numBytes;
                while ((numBytes = in.read(b)) > 0) {
                    out.write(b, 0, numBytes);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
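The copy-to-trigger logic can also be tried without a cluster. The following sketch mirrors the scenario on the local file system with java.nio instead of the Hadoop FileSystem API, so it is an analogy rather than the HDFS code itself. The class and method names are hypothetical; the file names write.txt and done.txt are borrowed from the Output section of this post.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class LocalTriggerCopy {

    // Copies dataFile to triggerFile, replacing any old trigger; returns false if there is no data file.
    static boolean createTrigger(Path dataFile, Path triggerFile) throws IOException {
        if (!Files.exists(dataFile)) {
            return false;
        }
        Files.copy(dataFile, triggerFile, StandardCopyOption.REPLACE_EXISTING);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("trigger-demo");
        Path data = dir.resolve("write.txt");
        Path trigger = dir.resolve("done.txt");
        Files.write(data, "sample data\n".getBytes(StandardCharsets.UTF_8));

        System.out.println("created: " + createTrigger(data, trigger));
        System.out.print(new String(Files.readAllBytes(trigger), StandardCharsets.UTF_8));
    }
}
```

Returning a boolean instead of swallowing the missing-file case makes it easy for a caller to decide whether the workflow should proceed.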
Steps to Launch Oozie Workflow
1. Move coordinator.xml, workflow.xml and the jar file to an HDFS location. The property file passed with -config is read from the local file system.
2. Specify the HDFS location of the workflow in the job.coordinator.properties file (the “oozie_app_path” property).
3. Command to launch the workflow:
oozie job -oozie http://<oozie-url>:<oozie-port>/oozie -config job.coordinator.properties -run
4. Link to check the status of the workflow:
http://<oozie-url>:<oozie-port>/oozie
Output
1. The content of the file to be read.
$ hadoop dfs -cat write.txt
2. The path where the new file has to be created.
a. Before the execution of the code, the path does not contain any trigger file.
$ hadoop dfs -ls
b. Execution of the code.
$ oozie job -oozie http://localhost:11000/oozie -config job.coordinator.properties -run
c. After the execution of the code, a trigger file “done.txt” is created.
$ hadoop dfs -ls
d. Displaying the contents of the created file. The trigger file and the read file contain the same data.
$ hadoop dfs -cat done.txt
The workflow status can be checked using an Oozie web console.
Oozie Console for Workflow Job Status
http://localhost:11000/oozie
Hi, I'm working on Oozie from HUE. I have to trigger a job based on an event: if my file name starts with abc, a Java job should be triggered when the abc file is moved to HDFS, and if it starts with efg, a Pig job should be triggered.
NOTE: If I move five files that start with abc to HDFS, then five Java actions should be triggered.
Is this possible using Oozie in HUE?