
If you’ve ventured into big data, and by extension the data engineering space, you’ve likely come across workflow schedulers such as Apache Airflow. Though Airflow quickly rose to prominence as the ‘gold standard’ for data engineering, its code-first philosophy kept many enthusiasts at bay.

This is especially true for beginners, who are often put off by Airflow’s steep learning curve.

Out of sheer frustration, Apache DolphinScheduler was born. Its developers adopted a visual drag-and-drop interface, changing the way users interact with data workflows.

I’ve tested out Apache DolphinScheduler, and I can see why many big data engineers and analysts prefer this platform over its competitors. I’ll walk you through the advantages of DS and compare it with similar platforms.

Happy reading!

What is Apache DolphinScheduler?

Firstly, the basics.

Apache DolphinScheduler is “a distributed and extensible workflow scheduler platform with powerful DAG visual interfaces.”

Often touted as the next generation of big-data schedulers, DolphinScheduler solves complex job dependencies in the data pipeline through various out-of-the-box jobs.

Whoa, that’s a lot of technical jargon!

In a nutshell, DolphinScheduler lets data scientists and analysts author, schedule, and monitor batch data pipelines quickly without the need for heavy scripts.

Workflows in the platform are expressed as Directed Acyclic Graphs (DAGs), where each node of the graph represents a specific task.
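
For users who do still prefer code, DolphinScheduler also ships a Python SDK (pydolphinscheduler) that builds the same DAGs programmatically. The sketch below follows the SDK’s tutorial style, but class names and module paths have shifted between releases (older versions use ProcessDefinition, newer ones use Workflow), so treat it as illustrative rather than exact:

```python
# Illustrative sketch of a two-task DAG with the pydolphinscheduler SDK.
# Assumes an SDK release where the entry point is ProcessDefinition; newer
# releases rename it to Workflow, so adapt the imports to your version.
from pydolphinscheduler.core.process_definition import ProcessDefinition
from pydolphinscheduler.tasks.shell import Shell

with ProcessDefinition(name="demo_pipeline", tenant="tenant_exists") as pd:
    # Each task object becomes one node in the visual DAG.
    extract = Shell(name="extract", command="echo 'extracting data'")
    load = Shell(name="load", command="echo 'loading data'")

    # '>>' declares the dependency: load runs only after extract succeeds.
    extract >> load

    # Push the definition to the DolphinScheduler API server and trigger a run.
    pd.run()
```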

The project was started at Analysys Mason — a global TMT management consulting firm — in 2017 and quickly rose to prominence, mainly due to its visual DAG interface.

Why I Love DolphinScheduler

To understand why data engineers and scientists (including me, of course) love the platform so much, let’s take a step back in time.

Apache Airflow, another open-source workflow scheduler, was conceived at Airbnb to help it become a full-fledged data-driven company; it entered the Apache Incubator in 2016. The developers of Apache Airflow adopted a code-first philosophy, believing that data pipelines are best expressed through code.

But developers and engineers quickly became frustrated. According to users, “scientists and developers found it unbelievably hard to create workflows through code. Often, they had to wake up at night to fix the problem.”
After a few weeks of playing around with these platforms, I share the same sentiment. This is primarily because Airflow does not cope well with massive amounts of data and large numbers of concurrent workflows.

This led to the birth of DolphinScheduler, which reduced the need for code by using a visual DAG structure. Users can now drag and drop to create complex data workflows quickly, drastically reducing errors. In addition, DolphinScheduler remains stable even in deployments with multiple masters and multiple workers.

DS’s error handling and suspension features won me over; being able to pause and recover a workflow is something I couldn’t do with Airflow.

The platform has also gained Top-Level Project status at the Apache Software Foundation (ASF), which shows that the project’s products and community are well-governed under ASF’s meritocratic principles and processes.


DolphinScheduler is used by various global conglomerates, including Lenovo, Dell, IBM China, and more. This is a testament to its merit and growth.

Key Features of DolphinScheduler

So why do you need DolphinScheduler?

DolphinScheduler uses a master/worker design with a decentralized, distributed approach. This approach favors expansibility, as more nodes can be added easily, and the failure of one node does not bring down the entire system.

Here are the key features that make it stand out:

  • High Reliability: DS offers a decentralized multi-master, multi-worker design with self-supported high availability and overload processing.

  • User-Friendly: Since all process definitions are visualized, DS flattens the learning curve by letting people without prior coding knowledge, such as data analysts, create complex workflows. This is a personal favorite of mine.

  • Rich Scenarios: DS offers multiple task types, including Spark, Hive, Python, Flink, and MapReduce. In addition, the platform’s multitenancy promotes efficiency and provides high levels of scalability.

  • High Expansibility: DS is capable of linearly increasing its overall scheduling capability depending on the scale of the cluster.

Users can also predefine handling for various error codes, automating the workflow and mitigating problems. DS also offers sub-workflows to support complex deployments.

5 Similar Platforms for You To Try Out

I’ve also compared DolphinScheduler with other workflow scheduling platforms and shared the pros and cons of each. Let’s look at five of the best ones in the industry:

  1. Apache Airflow
  2. AWS Step Functions
  3. Google Workflows
  4. Apache Azkaban
  5. Kubeflow

Apache Airflow

Apache Airflow is an open-source platform that helps users programmatically author, schedule, and monitor workflows. Here, users author workflows in the form of DAGs, or Directed Acyclic Graphs.

It’s an amazing platform for data engineers and analysts as they can visualize data pipelines in production, monitor stats, locate issues, and troubleshoot them.
Airflow follows a code-first philosophy with the idea that complex data pipelines are best expressed through code. The platform mitigated issues that arose in previous workflow schedulers, such as Oozie, which had limitations around jobs in end-to-end workflows.
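
To make the code-first approach concrete, here is a minimal sketch of the kind of DAG an Airflow user writes by hand (import paths follow Airflow 2.x; the task names and commands are purely illustrative):

```python
# A minimal Airflow DAG: the whole workflow, including dependencies, lives in Python code.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="demo_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # run once per day
    catchup=False,               # skip backfilling runs before today
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extracting data'")
    load = BashOperator(task_id="load", bash_command="echo 'loading data'")

    # The edge of the DAG: load runs only after extract succeeds.
    extract >> load
```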

Companies that use Apache Airflow: Airbnb, Walmart, Trustpilot, Slack, and Robinhood.

Use Cases

Airflow is perfect for building jobs with complex dependencies in external systems. That said, the platform is best suited to data pipelines that are pre-scheduled, run at specific time intervals, and change slowly.
Here are some specific Airflow use cases:

  • Machine learning (ML) model training
  • Backup tasks and DevOps
  • Automated report generation
  • ETL pipelines with data extraction from multiple points
  • Tackling product upgrades with minimal downtime

Cons

Though Airflow is an excellent product for data engineers and scientists, it has its own disadvantages:

  • No versioning of data pipelines
  • Code-first approach has a steeper learning curve; new users may not find the platform intuitive
  • Setting up an Airflow architecture for production is hard
  • Difficult to use locally, especially on Windows systems
  • The scheduler introduces latency before a particular task actually runs

AWS Step Functions

AWS Step Functions is “a low-code, visual workflow service” used by developers to automate IT processes, build distributed applications, and design machine learning pipelines through AWS services.


The service offers a drag-and-drop visual editor to help you design individual microservices into workflows.

Step Functions manages input, output, error handling, and retries at each step of a workflow. This means users can focus on the higher-value business processes in their projects.

Step Functions offers two types of workflows: Standard and Express.
While Standard workflows are used for long-running workflows, Express workflows support high-volume event processing workloads.
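
The workflows themselves are written in Amazon States Language, but starting one from application code is a single API call. Here is a small sketch using boto3; the state machine ARN and input payload are placeholders you would substitute with your own values:

```python
# Sketch: start a Standard Step Functions execution from Python using boto3.
# Express workflows can be started the same way, or synchronously via
# start_sync_execution.
import json

import boto3

sfn = boto3.client("stepfunctions")

response = sfn.start_execution(
    # Placeholder ARN: substitute the ARN of your own state machine.
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:demo",
    name="demo-run-001",                 # optional, must be unique per execution
    input=json.dumps({"order_id": 42}),  # JSON payload handed to the first state
)

# The execution ARN lets you poll or describe the run later.
print(response["executionArn"])
```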

Companies that use AWS Step Functions: Zendesk, Coinbase, Yelp, The Coca-Cola Company, and Home24.

Use Cases

The service is excellent for processes and workflows that need coordination from multiple points to achieve higher-level tasks.

Here are some key use cases:

  • Automation of Extract, Transform, and Load (ETL) processes
  • Preparation of data for machine learning — Step Functions streamlines the sequential steps required to automate ML pipelines
  • Step Functions can be used to combine multiple AWS Lambda functions into responsive serverless microservices and applications
  • Invoking business processes in response to events through Express Workflows
  • Building data processing pipelines for streaming data
  • Splitting and transcoding videos using massive parallelization

Cons

  • Workflow configuration requires the proprietary Amazon States Language, which is used only in Step Functions

  • Decoupling business logic from task sequences makes the code harder for developers to comprehend

  • Creates vendor lock-in, because the state machines that define workflows can only run on the Step Functions platform

Google Workflows

Google Workflows combines Google’s cloud services and APIs to help developers build reliable large-scale applications, automate processes, and deploy machine learning and data pipelines.


In short, Workflows is a “fully managed orchestration platform that executes services in an order that you define.”

The workflows can combine various services, including Cloud Vision AI, HTTP-based APIs, Cloud Run, and Cloud Functions.

Google is a leader in big data and analytics, and it shows in the breadth of services Workflows can orchestrate.

The developers can make service dependencies explicit and observable end-to-end by incorporating Workflows into their solutions. A Workflow can retry, hold state, poll, and even wait for up to one year.

The platform offers the first 5,000 internal steps for free and charges $0.01 for every 1,000 steps. For external HTTP calls, the first 2,000 calls are free, and Google charges $0.025 for every 1,000 calls.

Companies that use Google Workflows: Verizon, SAP, Twitch Interactive, and Intel.

Use Cases

This list shows some key use cases of Google Workflows:

  • Offers service orchestration to help developers create solutions by combining services. It can also be event-driven

  • It can operate on a set of items or batch data and is often scheduled. Examples include sending emails to customers daily, preparing and running machine learning jobs, and generating reports

  • Scripting sequences of Google Cloud service operations, like turning down resources on a schedule or provisioning new tenant projects

  • Encoding steps of a business process, including actions, human-in-the-loop events, and conditions. Tracking an order from request to fulfillment is an example

Cons

  • Hefty support fees
  • Not open-sourced
  • Google Cloud only offers 5,000 steps for free
  • Web interface is a bit clunky
  • Expensive to download data from Google Cloud Storage

Apache Azkaban

Apache Azkaban is a batch workflow job scheduler to help developers run Hadoop jobs. The open-sourced platform “resolves ordering through job dependencies” and offers an intuitive web interface to help users maintain and track workflows.


Though it was created at LinkedIn to run Hadoop jobs, it is extensible enough to fit any project that requires pluggable task types and scheduling. It consists of an AzkabanWebServer, an AzkabanExecutorServer, and a MySQL database.

The platform is compatible with any version of Hadoop and offers a distributed multiple-executor. Features of Apache Azkaban include project workspaces, authentication, user action tracking, SLA alerts, and scheduling of workflows.

Azkaban has one of the most intuitive and simple interfaces, making it easy for newbie data scientists and engineers to deploy projects quickly.

Companies that use Apache Azkaban: Apple, Doordash, Numerator, and Applied Materials.

Use Cases

Here are some of the use cases of Apache Azkaban:

  • Handles project management, authentication, monitoring, and scheduling executions
  • Three modes for various scenarios: trial mode for a single server, a two-server mode for production environments, and a multiple-executor distributed mode
  • Mainly used for time-based dependency scheduling of Hadoop batch jobs

Cons

  • When Azkaban fails, all running workflows are lost
  • No visual drag-and-drop functionality
  • Does not offer quick deployment
  • Does not have adequate overload processing capabilities

Kubeflow

Kubeflow is an open-source toolkit “dedicated to making deployments of machine learning workflows on Kubernetes simple, portable, and scalable.” It focuses on detailed project management, monitoring, and in-depth analysis of complex projects.


The platform converts steps in your workflows into jobs on Kubernetes by offering a cloud-native interface for your machine learning libraries, pipelines, notebooks, and frameworks.

With Kubeflow, data scientists and engineers can build full-fledged data pipelines with segmented steps.

Kubeflow’s mission is to help developers deploy and manage loosely-coupled microservices, while also making it easy to deploy on various infrastructures.
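
Pipelines are usually authored with the Kubeflow Pipelines SDK (kfp) and compiled into a spec the cluster can run. The sketch below uses the kfp v1-style API; v2 reworks the component and compiler interfaces, so adjust for the version you install:

```python
# Illustrative two-step Kubeflow pipeline using the kfp SDK (v1-style API).
import kfp
from kfp import dsl


@dsl.pipeline(name="demo-pipeline", description="Two containerized steps on Kubernetes")
def demo_pipeline():
    # Each ContainerOp becomes a pod scheduled by the pipeline engine.
    extract = dsl.ContainerOp(
        name="extract",
        image="alpine:3.18",
        command=["sh", "-c", "echo 'extracting data'"],
    )
    load = dsl.ContainerOp(
        name="load",
        image="alpine:3.18",
        command=["sh", "-c", "echo 'loading data'"],
    )
    load.after(extract)  # load depends on extract


if __name__ == "__main__":
    # Compile to a workflow spec that can be uploaded through the Kubeflow UI.
    kfp.compiler.Compiler().compile(demo_pipeline, "demo_pipeline.yaml")
```
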
Companies that use Kubeflow: CERN, Uber, Shopify, Intel, Lyft, PayPal, and Bloomberg.

Use Cases

Let’s take a look at the core use cases of Kubeflow:

  • Deploying large-scale complex machine learning systems and managing them
  • R&D using various machine learning models
  • Data loading, verification, splitting, and processing
  • Automated hyperparameter optimization and tuning through Katib
  • Multi-cloud and hybrid ML workloads through the standardized environment
  • Model training, validation, and serving

Cons

  • Difficult to implement and set up
  • It is not designed to handle big data explicitly
  • Incomplete documentation makes implementation and setup even harder
  • Data scientists may need the help of Ops to troubleshoot issues
  • Some components and libraries are outdated
  • Not optimized for running triggers and setting dependencies
  • Orchestrating Spark and Hadoop jobs is not easy with Kubeflow
  • Problems may arise while integrating components — incompatible versions of various components can break the system, and the only way to recover might be to reinstall Kubeflow

Wrapping Up

I love how easy it is to schedule workflows with DolphinScheduler. The visual DAG interface meant I didn’t have to scratch my head over writing perfectly correct lines of Python code.

From a single window, I could visualize critical information, including task status, type, retry times, visual variables, and more.

But what frustrates me the most is that the majority of platforms do not have a suspension feature — you have to kill the workflow before re-running it. With DS, I could pause and even recover operations through its error handling tools.

Also, because scheduling is distributed, the overall scheduling capability increases linearly with the scale of the cluster. DolphinScheduler also supports both traditional shell tasks and big data engines such as Spark, Hive, Python, and MapReduce, with its multi-tenant feature keeping workloads isolated.

The platform made processing big data that much easier with one-click deployment and a flattened learning curve, making it a disruptive platform in the data engineering sphere.

This ease-of-use made me choose DolphinScheduler over the likes of Airflow, Azkaban, and Kubeflow.

This curated article covered the features, use cases, and cons of five of the best workflow schedulers in the industry. I hope this article was helpful and motivated you to go out and get started!

Join the Community

There are many ways to participate in and contribute to the DolphinScheduler community, including documentation, translation, Q&A, testing, code, articles, keynote speeches, and more.
We suggest that your first PR (documentation or code) be a simple one, used to familiarize yourself with the submission process and the community’s collaboration style.
The community has compiled the following list of issues suitable for newcomers: https://github.com/apache/dolphinscheduler/issues/5689

List of non-newbie issues: https://github.com/apache/dolphinscheduler/issues?q=is%3Aopen+is%3Aissue+label%3A%22volunteer+wanted%22

How to participate in the contribution: https://dolphinscheduler.apache.org/en-us/community/development/contribute.html

GitHub Code Repository: https://github.com/apache/dolphinscheduler

Official Website: https://dolphinscheduler.apache.org/

Mailing List: dev@dolphinscheduler.apache.org

Twitter: @dolphinschedule

YouTube: https://www.youtube.com/channel/UCmrPmeE7dVqo8DYhSLHa0vA

Slack: https://s.apache.org/dolphinscheduler-slack

Contributor Guide: https://dolphinscheduler.apache.org/en-us/community/index.html

Your Star for the project is important, so don’t hesitate to give Apache DolphinScheduler a Star ❤️
