
If you’ve ventured into big data, and by extension the data engineering space, you’ve likely come across workflow schedulers such as Apache Airflow. Though Airflow quickly rose to prominence as the ‘gold standard’ for data engineering, its code-first philosophy kept many enthusiasts at bay.

This is especially true for beginners, who are often put off by Airflow’s steep learning curve.

Out of sheer frustration, Apache DolphinScheduler was born. Its developers adopted a visual drag-and-drop interface, changing the way users interact with data workflows.

I’ve tested out Apache DolphinScheduler, and I can see why many big data engineers and analysts prefer this platform over its competitors. I’ll walk you through the advantages of DS and compare it with similar platforms.

Happy reading!

What is Apache DolphinScheduler?

Firstly, the basics.

Apache DolphinScheduler is “a distributed and extensible workflow scheduler platform with powerful DAG visual interfaces.”

Often touted as the next generation of big-data schedulers, DolphinScheduler solves complex job dependencies in the data pipeline through various out-of-the-box jobs.

Whoa, that’s a lot of technical jargon!

In a nutshell, DolphinScheduler lets data scientists and analysts author, schedule, and monitor batch data pipelines quickly without the need for heavy scripts.

Workflows in the platform are expressed as Directed Acyclic Graphs (DAGs), where each node of the graph represents a specific task.
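
For users who do still prefer code, DolphinScheduler also ships a Python SDK (pydolphinscheduler) that builds the same DAGs programmatically. The sketch below follows the SDK’s tutorial style, but class names and module paths have shifted between releases (older versions use ProcessDefinition, newer ones use Workflow), so treat it as illustrative rather than exact:

```python
# Illustrative sketch of a two-task DAG with the pydolphinscheduler SDK.
# Assumes an SDK release where the entry point is ProcessDefinition; newer
# releases rename it to Workflow, so adapt the imports to your version.
from pydolphinscheduler.core.process_definition import ProcessDefinition
from pydolphinscheduler.tasks.shell import Shell

with ProcessDefinition(name="demo_pipeline", tenant="tenant_exists") as pd:
    # Each task object becomes one node in the visual DAG.
    extract = Shell(name="extract", command="echo 'extracting data'")
    load = Shell(name="load", command="echo 'loading data'")

    # '>>' declares the dependency: load runs only after extract succeeds.
    extract >> load

    # Push the definition to the DolphinScheduler API server and trigger a run.
    pd.run()
```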

The project was started at Analysys Mason — a global TMT management consulting firm — in 2017 and quickly rose to prominence, mainly due to its visual DAG interface.

Why I Love DolphinScheduler

To understand why data engineers and scientists (including me, of course) love the platform so much, let’s take a step back in time.

Apache Airflow, another open-source workflow scheduler, was conceived at Airbnb to help it become a full-fledged data-driven company; it entered the Apache Incubator in 2016. The developers of Apache Airflow adopted a code-first philosophy, believing that data pipelines are best expressed through code.

But developers and engineers quickly became frustrated. According to users, “scientists and developers found it unbelievably hard to create workflows through code. Often, they had to wake up at night to fix the problem.”
After a few weeks of playing around with these platforms, I share the same sentiment. This is primarily because Airflow does not cope well with massive amounts of data and large numbers of concurrent workflows.

This led to the birth of DolphinScheduler, which reduced the need for code by using a visual DAG structure. Users can now drag and drop to create complex data workflows quickly, drastically reducing errors. In addition, DolphinScheduler remains stable even in deployments with multiple masters and multiple workers.

DS’s error handling and suspension features won me over; being able to pause and recover a workflow is something I couldn’t do with Airflow.

The platform has also gained Top-Level Project status at the Apache Software Foundation (ASF), which shows that the project’s products and community are well-governed under ASF’s meritocratic principles and processes.


DolphinScheduler is used by various global conglomerates, including Lenovo, Dell, IBM China, and more. This is a testament to its merit and growth.

Key Features of DolphinScheduler

So why do you need DolphinScheduler?

DolphinScheduler uses a master/worker design with a decentralized, distributed approach. This approach favors expansibility, as more nodes can be added easily, and the failure of one node does not bring down the entire system.

Here are the key features that make it stand out:

  • High Reliability: DS offers a decentralized multi-master, multi-worker design with self-supported high availability and overload processing.

  • User-Friendly: Since all process definitions are visualized, DS flattens the learning curve by letting people without prior coding knowledge, such as data analysts, create complex workflows. This is a personal favorite of mine.

  • Rich Scenarios: DS offers multiple task types, including Spark, Hive, Python, Flink, and MapReduce. In addition, the platform’s multitenancy promotes efficiency and provides high levels of scalability.

  • High Expansibility: DS is capable of linearly increasing its overall scheduling capability depending on the scale of the cluster.

Users can also predefine handling for various error codes, automating the workflow and mitigating problems. DS also offers sub-workflows to support complex deployments.

5 Similar Platforms for You To Try Out

I’ve also compared DolphinScheduler with other workflow scheduling platforms and shared the pros and cons of each. Let’s look at five of the best ones in the industry:

  1. Apache Airflow
  2. AWS Step Functions
  3. Google Workflows
  4. Apache Azkaban
  5. Kubeflow

Apache Airflow

Apache Airflow is an open-source platform that helps users programmatically author, schedule, and monitor workflows. Here, users author workflows in the form of DAGs, or Directed Acyclic Graphs.

It’s an amazing platform for data engineers and analysts as they can visualize data pipelines in production, monitor stats, locate issues, and troubleshoot them.
Airflow follows a code-first philosophy with the idea that complex data pipelines are best expressed through code. The platform mitigated issues that arose in previous workflow schedulers, such as Oozie, which had limitations around jobs in end-to-end workflows.
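
To make the code-first approach concrete, here is a minimal sketch of the kind of DAG an Airflow user writes by hand (import paths follow Airflow 2.x; the task names and commands are purely illustrative):

```python
# A minimal Airflow DAG: the whole workflow, including dependencies, lives in Python code.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="demo_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # run once per day
    catchup=False,               # skip backfilling runs before today
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extracting data'")
    load = BashOperator(task_id="load", bash_command="echo 'loading data'")

    # The edge of the DAG: load runs only after extract succeeds.
    extract >> load
```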

Companies that use Apache Airflow: Airbnb, Walmart, Trustpilot, Slack, and Robinhood.

Use Cases

Airflow is perfect for building jobs with complex dependencies in external systems. That said, the platform is best suited to data pipelines that are pre-scheduled, run at specific time intervals, and change slowly.
Here are some specific Airflow use cases:

  • Machine learning (ML) model training
  • Backup tasks and DevOps
  • Automated report generation
  • ETL pipelines with data extraction from multiple points
  • Tackling product upgrades with minimal downtime

Cons

Though Airflow is an excellent product for data engineers and scientists, it has its own disadvantages:

  • No versioning of data pipelines
  • Code-first approach has a steeper learning curve; new users may not find the platform intuitive
  • Setting up an Airflow architecture for production is hard
  • Difficult to use locally, especially on Windows systems
  • The scheduler introduces latency before a particular task actually runs

AWS Step Functions

AWS Step Functions is “a low-code, visual workflow service” used by developers to automate IT processes, build distributed applications, and design machine learning pipelines through AWS services.


The service offers a drag-and-drop visual editor to help you design individual microservices into workflows.

Step Functions manages input, output, error handling, and retries at each step of a workflow. This means users can focus on the higher-value business processes in their projects.

Step Functions offers two types of workflows: Standard and Express.
While Standard workflows are used for long-running workflows, Express workflows support high-volume event processing workloads.
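
The workflows themselves are written in Amazon States Language, but starting one from application code is a single API call. Here is a small sketch using boto3; the state machine ARN and input payload are placeholders you would substitute with your own values:

```python
# Sketch: start a Standard Step Functions execution from Python using boto3.
# Express workflows can be started the same way, or synchronously via
# start_sync_execution.
import json

import boto3

sfn = boto3.client("stepfunctions")

response = sfn.start_execution(
    # Placeholder ARN: substitute the ARN of your own state machine.
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:demo",
    name="demo-run-001",                 # optional, must be unique per execution
    input=json.dumps({"order_id": 42}),  # JSON payload handed to the first state
)

# The execution ARN lets you poll or describe the run later.
print(response["executionArn"])
```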

Companies that use AWS Step Functions: Zendesk, Coinbase, Yelp, The Coca-Cola Company, and Home24.

Use Cases

The service is excellent for processes and workflows that need coordination from multiple points to achieve higher-level tasks.

Here are some key use cases:

  • Automation of Extract, Transform, and Load (ETL) processes
  • Preparation of data for machine learning — Step Functions streamlines the sequential steps required to automate ML pipelines
  • Step Functions can be used to combine multiple AWS Lambda functions into responsive serverless microservices and applications
  • Invoking business processes in response to events through Express Workflows
  • Building data processing pipelines for streaming data
  • Splitting and transcoding videos using massive parallelization

Cons

  • Workflow configuration requires the proprietary Amazon States Language, which is used only in Step Functions

  • Decoupling business logic from task sequences makes the code harder for developers to comprehend

  • Creates vendor lock-in, because the state machines that define workflows can only run on the Step Functions platform

Google Workflows

Google Workflows combines Google’s cloud services and APIs to help developers build reliable large-scale applications, automate processes, and deploy machine learning and data pipelines.


In short, Workflows is a “fully managed orchestration platform that executes services in an order that you define.”

The workflows can combine various services, including Cloud Vision AI, HTTP-based APIs, Cloud Run, and Cloud Functions.

Google is a leader in big data and analytics, and it shows in the breadth of services Workflows can orchestrate.

The developers can make service dependencies explicit and observable end-to-end by incorporating Workflows into their solutions. A Workflow can retry, hold state, poll, and even wait for up to one year.

The platform offers the first 5,000 internal steps for free and charges $0.01 for every 1,000 steps. For external HTTP calls, the first 2,000 calls are free, and Google charges $0.025 for every 1,000 calls.

Companies that use Google Workflows: Verizon, SAP, Twitch Interactive, and Intel.

Use Cases

This list shows some key use cases of Google Workflows:

  • Offers service orchestration to help developers create solutions by combining services. It can also be event-driven

  • It can operate on a set of items or batch data and is often scheduled. Examples include sending emails to customers daily, preparing and running machine learning jobs, and generating reports

  • Scripting sequences of Google Cloud service operations, like turning down resources on a schedule or provisioning new tenant projects

  • Encoding steps of a business process, including actions, human-in-the-loop events, and conditions. Tracking an order from request to fulfillment is an example

Cons

  • Hefty support fees
  • Not open-sourced
  • Google Cloud only offers 5,000 steps for free
  • Web interface is a bit clunky
  • Expensive to download data from Google Cloud Storage

Apache Azkaban

Apache Azkaban is a batch workflow job scheduler to help developers run Hadoop jobs. The open-sourced platform “resolves ordering through job dependencies” and offers an intuitive web interface to help users maintain and track workflows.


Though it was created at LinkedIn to run Hadoop jobs, it is extensible enough to fit any project that requires pluggable task types and scheduling. It consists of an AzkabanWebServer, an AzkabanExecutorServer, and a MySQL database.

The platform is compatible with any version of Hadoop and offers a distributed multiple-executor. Features of Apache Azkaban include project workspaces, authentication, user action tracking, SLA alerts, and scheduling of workflows.

Azkaban has one of the most intuitive and simple interfaces, making it easy for newbie data scientists and engineers to deploy projects quickly.

Companies that use Apache Azkaban: Apple, Doordash, Numerator, and Applied Materials.

Use Cases

Here are some of the use cases of Apache Azkaban:

  • Handles project management, authentication, monitoring, and scheduling executions
  • Three modes for various scenarios: trial mode for a single server, a two-server mode for production environments, and a multiple-executor distributed mode
  • Mainly used for time-based dependency scheduling of Hadoop batch jobs

Cons

  • When Azkaban fails, all running workflows are lost
  • No visual drag-and-drop functionality
  • Does not offer quick deployment
  • Does not have adequate overload processing capabilities

Kubeflow

Kubeflow is an open-source toolkit “dedicated to making deployments of machine learning workflows on Kubernetes simple, portable, and scalable.” It focuses on detailed project management, monitoring, and in-depth analysis of complex projects.


The platform converts steps in your workflows into jobs on Kubernetes by offering a cloud-native interface for your machine learning libraries, pipelines, notebooks, and frameworks.

With Kubeflow, data scientists and engineers can build full-fledged data pipelines with segmented steps.

Kubeflow’s mission is to help developers deploy and manage loosely-coupled microservices, while also making it easy to deploy on various infrastructures.
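
Pipelines are usually authored with the Kubeflow Pipelines SDK (kfp) and compiled into a spec the cluster can run. The sketch below uses the kfp v1-style API; v2 reworks the component and compiler interfaces, so adjust for the version you install:

```python
# Illustrative two-step Kubeflow pipeline using the kfp SDK (v1-style API).
import kfp
from kfp import dsl


@dsl.pipeline(name="demo-pipeline", description="Two containerized steps on Kubernetes")
def demo_pipeline():
    # Each ContainerOp becomes a pod scheduled by the pipeline engine.
    extract = dsl.ContainerOp(
        name="extract",
        image="alpine:3.18",
        command=["sh", "-c", "echo 'extracting data'"],
    )
    load = dsl.ContainerOp(
        name="load",
        image="alpine:3.18",
        command=["sh", "-c", "echo 'loading data'"],
    )
    load.after(extract)  # load depends on extract


if __name__ == "__main__":
    # Compile to a workflow spec that can be uploaded through the Kubeflow UI.
    kfp.compiler.Compiler().compile(demo_pipeline, "demo_pipeline.yaml")
```
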
Companies that use Kubeflow: CERN, Uber, Shopify, Intel, Lyft, PayPal, and Bloomberg.

Use Cases

Let’s take a look at the core use cases of Kubeflow:

  • Deploying large-scale complex machine learning systems and managing them
  • R&D using various machine learning models
  • Data loading, verification, splitting, and processing
  • Automated hyperparameter optimization and tuning through Katib
  • Multi-cloud and hybrid ML workloads through the standardized environment
  • Model training, validation, and serving

Cons

  • Difficult to implement and set up
  • It is not designed to handle big data explicitly
  • Incomplete documentation makes implementation and setup even harder
  • Data scientists may need the help of Ops to troubleshoot issues
  • Some components and libraries are outdated
  • Not optimized for running triggers and setting dependencies
  • Orchestrating Spark and Hadoop jobs is not easy with Kubeflow
  • Problems may arise while integrating components — incompatible versions of various components can break the system, and the only way to recover might be to reinstall Kubeflow

Wrapping Up

I love how easy it is to schedule workflows with DolphinScheduler. The visual DAG interface meant I didn’t have to scratch my head over writing perfectly correct lines of Python code.

From a single window, I could visualize critical information, including task status, type, retry times, visual variables, and more.

But what frustrates me the most is that the majority of platforms do not have a suspension feature — you have to kill the workflow before re-running it. With DS, I could pause and even recover operations through its error handling tools.

Also, because scheduling is distributed, the overall scheduling capability increases linearly with the scale of the cluster. DolphinScheduler also supports both traditional shell tasks and big data engines such as Spark, Hive, Python, and MapReduce, with its multi-tenant feature keeping workloads isolated.

The platform made processing big data that much easier with one-click deployment and a flattened learning curve, making it a disruptive platform in the data engineering sphere.

This ease-of-use made me choose DolphinScheduler over the likes of Airflow, Azkaban, and Kubeflow.

This curated article covered the features, use cases, and cons of five of the best workflow schedulers in the industry. I hope this article was helpful and motivated you to go out and get started!

Join the Community

There are many ways to participate in and contribute to the DolphinScheduler community, including documentation, translation, Q&A, testing, code, articles, keynote speeches, and more.
We suggest that your first PR (documentation or code) be a simple one, used to familiarize yourself with the submission process and the community’s collaboration style.
The community has compiled the following list of issues suitable for newcomers: https://github.com/apache/dolphinscheduler/issues/5689

List of non-newbie issues: https://github.com/apache/dolphinscheduler/issues?q=is%3Aopen+is%3Aissue+label%3A%22volunteer+wanted%22

How to participate in the contribution: https://dolphinscheduler.apache.org/en-us/community/development/contribute.html

GitHub Code Repository: https://github.com/apache/dolphinscheduler

Official Website: https://dolphinscheduler.apache.org/

Mailing List: dev@dolphinscheduler.apache.org

Twitter: @dolphinschedule

YouTube: https://www.youtube.com/channel/UCmrPmeE7dVqo8DYhSLHa0vA

Slack: https://s.apache.org/dolphinscheduler-slack

Contributor Guide: https://dolphinscheduler.apache.org/en-us/community/index.html

Your Star for the project is important, so don’t hesitate to give Apache DolphinScheduler a Star ❤️
