Big Data - Apache Oozie

Lesson #751 - Apache Oozie versus Apache Airflow


Oozie

Apache Oozie is a workflow scheduler that uses Directed Acyclic Graphs (DAGs) to schedule MapReduce jobs (e.g. Pig, Hive, Sqoop, DistCp, Java functions). It's an open source project written in Java. When we develop Oozie jobs, we write a bundle file, a coordinator file, a workflow file, and a properties file. The workflow file is required; the others are optional.

The workflow file contains the actions needed to complete the job. Some of the common actions we use in our team are the Hive action to run Hive scripts, the ssh action, the shell action, the Pig action, and the fs action for creating, moving, and removing files and folders.

The coordinator file is used for dependency checks to execute the workflow.

The bundle file is used to launch multiple coordinators.

The properties file contains configuration parameters like start date, end date and metastore configuration information for the job.

At GoDaddy, we use Hue UI for monitoring Oozie jobs.

Pros

Uses XML, which is easy to learn

Doesn't require learning a programming language

Retry on failure is available

Alerts on failure

SLA checks can be added

Cons

Less flexibility with actions and dependencies. For example, the dependency check for partitions must be in MM, dd, YY format; if you have integer partitions in M or d, it won't work.

Actions are limited to those Oozie allows, such as the fs, Pig, Hive, ssh, and shell actions.

All the code for MapReduce jobs must be on HDFS.

Only a limited amount of data (2 KB) can be passed from one action to another.

Supports time-based triggers but not event-based triggers, so it can't automatically trigger dependent jobs. For example, if job B depends on job A, job B doesn't get triggered automatically when job A completes. The workaround is to schedule both jobs at the same time and, after job A completes, write a success flag to a directory that is added as a dependency in job B's coordinator. You must also make sure job B has a large enough timeout to prevent it from being abandoned before it runs.

Airflow

Apache Airflow is another workflow scheduler that also uses DAGs. It's an open source project written in Python. Some of the features in Airflow are:

Operators, which are job tasks similar to actions in Oozie.

Hooks to connect to various databases.

Sensors to check whether a dependency exists. For example, if your job needs to trigger when a file exists, you use a sensor that polls for the file (see the sketch below).
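
Below is a minimal sketch of a DAG that wires a sensor to an operator. The DAG id, file path, and schedule are made up for illustration, and the import paths follow Airflow 2.x (older 1.x releases used modules such as airflow.contrib.sensors.file_sensor):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator    # operator: a unit of work, like an Oozie action
    from airflow.sensors.filesystem import FileSensor  # sensor: polls until a condition is met

    with DAG(
        dag_id="example_sensor_dag",           # hypothetical name
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Wait until the (illustrative) input file shows up.
        wait_for_file = FileSensor(
            task_id="wait_for_file",
            filepath="/tmp/input/data.csv",
            poke_interval=60,                  # seconds between polls
        )
        # Then run the actual work.
        process_file = BashOperator(
            task_id="process_file",
            bash_command="echo processing /tmp/input/data.csv",
        )
        wait_for_file >> process_file

Hooks aren't shown here; they are typically used inside an operator or Python callable to talk to an external database.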

At GoDaddy, the Customer Knowledge Platform team is working on creating a Docker image for Airflow, so other teams can develop and maintain their own Airflow scheduler.

Pros

The Airflow UI is much better than Hue (the Oozie UI); for example, the Airflow UI has a Tree view to track task failures, unlike Hue, which tracks only job failures.

The Airflow UI also lets you view your workflow code, which the Hue UI does not.

More flexibility in the code; you can write your own operator plugins and import them into the job (a sketch follows).
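
As a rough sketch, a custom operator is just a subclass of BaseOperator; the class name and behavior below are hypothetical, and the import path assumes Airflow 2.x:

    from airflow.models.baseoperator import BaseOperator

    class HdfsAuditOperator(BaseOperator):
        """Hypothetical operator; the name and logic are illustrative only."""

        def __init__(self, target_path, **kwargs):
            super().__init__(**kwargs)
            self.target_path = target_path

        def execute(self, context):
            # Airflow calls execute() when the task runs; put the custom
            # logic your team needs here.
            self.log.info("Auditing %s", self.target_path)

The class can live in any importable module (for example, a plugins package) and then be used in a DAG file like any built-in operator.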

Allows dynamic pipeline generation, which means you can write code that instantiates a pipeline dynamically (see the sketch below).
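
For example, here is a minimal sketch that generates one task per table from a plain Python list (the table names and commands are made up):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    TABLES = ["orders", "customers", "payments"]   # illustrative list; could come from config

    with DAG(
        dag_id="dynamic_export_dag",               # hypothetical name
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Tasks are created in a loop instead of being written out by hand.
        for table in TABLES:
            BashOperator(
                task_id=f"export_{table}",
                bash_command=f"echo exporting {table}",
            )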

Supports both event-based and time-based triggers. Event-based triggers are easy to add in Airflow, unlike Oozie, and are particularly useful for data quality checks. Suppose you have a job that inserts records into a database and you want to verify that the insert succeeded, so you write a query to check that the record count isn't zero. In Airflow, you can add a data quality task that runs after the insert completes (see the sketch below), whereas in Oozie, since it's time-based, you can only specify a time at which to trigger the data quality job.
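
A minimal sketch of that pattern is below; the connection id, table name, and query are assumptions, and the Postgres hook import comes from the separate postgres provider package:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.postgres.hooks.postgres import PostgresHook

    def check_row_count():
        # Hypothetical data quality check: fail the task if the table is empty.
        hook = PostgresHook(postgres_conn_id="analytics_db")   # assumed connection id
        count = hook.get_first("SELECT COUNT(*) FROM daily_orders")[0]
        if count == 0:
            raise ValueError("Data quality check failed: daily_orders is empty")

    with DAG(
        dag_id="insert_with_quality_check",        # hypothetical name
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        insert_records = PythonOperator(
            task_id="insert_records",
            python_callable=lambda: None,          # placeholder for the real insert logic
        )
        data_quality_check = PythonOperator(
            task_id="data_quality_check",
            python_callable=check_row_count,
        )
        # The check runs because the insert finished, not because a clock fired.
        insert_records >> data_quality_check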

Lots of functionality, such as retries, SLA checks, and Slack notifications; everything Oozie offers and more (a sketch follows).
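
Retries and SLAs are plain task arguments, and a failure callback is where a Slack (or other) notification would typically be hooked in. The values below are illustrative:

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    def notify_failure(context):
        # Placeholder callback; a real setup might post to a Slack webhook here
        # (for example via the Slack provider package).
        print(f"Task failed: {context['task_instance'].task_id}")

    default_args = {
        "retries": 3,                          # retry on failure
        "retry_delay": timedelta(minutes=5),
        "sla": timedelta(hours=1),             # per-task SLA check
        "on_failure_callback": notify_failure,
    }

    with DAG(
        dag_id="retry_sla_example",            # hypothetical name
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
        default_args=default_args,
        catchup=False,
    ) as dag:
        BashOperator(task_id="load_data", bash_command="echo load")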

Jobs can be disabled easily with an on/off toggle in the web UI, whereas in Oozie you have to remember the job ID to pause or kill the job.

Cons

As of 2018, Airflow is still an Apache incubator project, but there's a large community working on the code.

If you change a DAG's filename, you have to manually delete the old entry from the metadata.

You need to learn the Python programming language to schedule jobs. Business analysts who don't have coding experience might find it hard to pick up writing Airflow jobs, but once you get the hang of it, it becomes easy.

When job concurrency is high, no new jobs will be scheduled. Sometimes, even though a job is running, its tasks aren't, because the number of jobs running at a time limits how many new tasks get scheduled. This also causes confusion in the Airflow UI: although your job is in the running state, its tasks aren't.
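
The limits behind this behavior are configurable. As a rough sketch, the DAG-level knobs look like this; global limits such as parallelism live in airflow.cfg, and newer releases rename concurrency to max_active_tasks:

    from datetime import datetime
    from airflow import DAG

    dag = DAG(
        dag_id="throttled_dag",               # hypothetical name
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
        max_active_runs=1,                    # at most one run of this DAG at a time
        concurrency=4,                        # at most four tasks of this DAG at once
        catchup=False,
    )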