Before we unravel what Apache Airflow is, let’s first explain what Data Engineers and Data Scientists do.
Data Engineers build systems that extract, collect, manage, and convert raw data to ultimately organize it into useful information which Data Scientists can interpret.
Data Scientists use that same data to create a detailed analysis to predict trends and answer questions that are relevant to the organization.
What is Apache Airflow?
Let’s dive right into Apache Airflow and start our journey as Data Engineers and Data Scientists.
What is Apache Airflow, and what is it not?
Apache Airflow is an orchestrator that allows you to execute tasks at the right time, in the right way, and in the right order. You definitely shouldn’t think of Apache Airflow as a data streaming solution or a data processing framework such as Apache Spark or Flink. Apache Airflow’s primary goal isn’t processing large amounts of data.
In previous Airflow versions (1.x), all operators (providers) were built into one giant Python package called apache-airflow. This was a huge problem when you needed to upgrade a single operator individually (i.e. to use its next version).
A common issue was that one of the operators had a problem that couldn’t be solved by the available configuration.
In that case, you couldn’t downgrade just the problematic provider’s package. The only solution was to wait for the next Airflow release that fixed it.
With Airflow version 2.0 this is over!
This issue was one of the main reasons to move forward with Airflow 2.0.
In this version, each provider is built as a separate Python package. This gives you more flexibility by allowing you to install only the packages required for your use case.
Some of the most used operators, such as the PythonOperator and the BashOperator, are still packaged as part of the base apache-airflow package.
What are the benefits of Apache Airflow?
Data pipelines are dynamic because they are defined in Python code: if you want to use Airflow, you write your data pipelines in Python.
Airflow Web UI is useful and gives you a lot of options to monitor, configure and troubleshoot your data pipelines.
Let’s also say a few words about Apache Airflow’s rivals and some of their pros and cons.
One product that has gained traction in the last few years is Argo Workflows. The main difference between Airflow and Argo Workflows is the language you use to write your data pipelines: in Airflow you write them in Python, while in Argo you write them in YAML. Here’s a comparison table with some other similar products.
What are Airflow’s core components, and how many different architectures can it have?
Let’s examine Airflow’s core components and the different types of Airflow’s architectures.
Every Airflow 2.0 deployment has the core components listed below. Depending on the architecture (single-node or multi-node) and the Executor type you choose, the core components will be placed differently across the nodes. Airflow’s main core components are:
- Web server – provides the Web UI;
- Scheduler – schedules your tasks and data pipelines;
- Metastore – stores metadata about tasks and DAG runs – usually PostgreSQL;
- Executor – defines how and where your tasks will be executed;
- Queue – defines the order in which tasks will be executed; in the single-node architecture it is part of the Executor;
- Worker – the actual process in which a task is executed; it is distinct from the Executor.
If you choose the single-node architecture, all of Airflow’s core components are combined in one node, and the Queue is part of the Executor.
If you choose the multi-node architecture (with the Celery Executor), Airflow’s core components are organized as shown in the diagram below. In this setup you use Redis or RabbitMQ for the Queue (which is not the same as the database used for the Metastore). It is also useful to deploy Flower, which provides a Web UI to manage worker nodes and see the status of each task in the Queue.
What is the main provider in Airflow and how can you install different providers?
A provider is an independent Python package that brings everything you need in order to interact with a service or a tool such as Spark or AWS. It contains different connection types, operators, hooks, etc. You can check out the official Airflow documentation page for an exhaustive list of all Airflow’s providers.
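Installing a provider is just a pip install of its package. As an illustration (the Postgres provider below is one real example; install only the providers your pipelines actually need):

```shell
# The base package includes the BashOperator, PythonOperator, etc.
pip install apache-airflow

# Add a provider only if your use case needs it, e.g. Postgres:
pip install apache-airflow-providers-postgres

# List the providers registered in your installation:
airflow providers list
```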
What are Airflow Operators – types and characteristics
Little by little, we get into the essence of the tool, and we are already very close to building our first data pipeline. Before that, we should discuss some of its most important components, called Operators.
One general rule you can keep in mind is: “Each Operator only defines a Single Task in a data pipeline”
Airflow has three types of Operators:
- Action Operator – executes a function or a command. For example, the BashOperator executes a bash command, and the PythonOperator executes a Python function;
- Transfer Operator – takes care of transferring data between a source and a destination;
- Sensor Operator – waits for an event to happen before moving to the next task.
What are Task and Task instances and what is the difference between them?
The main difference between a task and a task instance is the following: a task becomes a “task instance” when it is scheduled for execution.
- Task – an operator inside the pipeline. Always keep in mind the rule “One Operator defines One task in the data pipeline”;
- Task instance – the concrete, scheduled execution of a task at a specific point in time.
How many Executor types are there in Apache Airflow?
Depending on which architecture you use for your Airflow deployment, you can choose from the list below:
- SequentialExecutor is the simplest Executor type to execute your tasks in a sequential manner. It is recommended for debugging and testing only.
- LocalExecutor runs multiple sub-processes to execute your tasks concurrently on the same host. It scales quite well (by vertical scaling) and can be used in production.
- CeleryExecutor allows you to scale your Airflow cluster horizontally: you run your tasks on multiple nodes (Airflow workers), and each task is first queued in RabbitMQ or Redis. The distribution itself is managed by Celery. This is the recommended architecture for production because it can handle higher workloads.
- DaskExecutor – Dask is another Python distributed task processing system, similar to Celery. Choose Dask or Celery depending on which framework fits most of your needs.
- KubernetesExecutor, which is relatively new, allows you to run your tasks in Kubernetes and thus makes your Airflow cluster scale according to your workload. An additional benefit is that you’re only using the resources you need.
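The executor is selected with a single configuration setting. For example, switching to the CeleryExecutor looks roughly like this in airflow.cfg (the same option can be set via the environment variable AIRFLOW__CORE__EXECUTOR):

```ini
# airflow.cfg
[core]
executor = CeleryExecutor
```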
What is Directed Acyclic Graph or DAG?
What would Apache Airflow be without DAGs?
A DAG (Directed Acyclic Graph) is a data pipeline that contains one or more tasks without loops. The tasks can be independent or depend on each other. In the picture below, you can see what a DAG looks like and how it differs from a standard pipeline.
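The “acyclic” property is what guarantees that a valid execution order always exists. This is not Airflow code, just a sketch of the graph concept using Python’s standard-library graphlib, with hypothetical task names:

```python
from graphlib import TopologicalSorter, CycleError

# A tiny pipeline graph: each key maps a task to the set of tasks
# it depends on (extract -> transform -> load -> notify).
deps = {
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

# A valid DAG always yields a topological (execution) order.
order = list(TopologicalSorter(deps).static_order())
print(order)

# Adding a loop (extract depending on load) makes the graph cyclic,
# so no valid execution order exists any more.
try:
    list(TopologicalSorter({**deps, "extract": {"load"}}).static_order())
except CycleError:
    print("cycle detected: not a valid DAG")
```

This is exactly why Airflow forbids loops: a scheduler could never decide which task of a cycle to run first.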
Execution process overview of one task in Airflow (One Node architecture)
In the picture below you can see the end-to-end process – an overview of how a task is executed. After you add a new DAG file to the DAGs folder, the Scheduler, which continuously monitors that folder, discovers the new DAG.
What are XComs and how do you use them?
XComs stands for “cross-communication”, and it allows you to exchange small amounts of data between tasks.
The size limit depends on which database is used for the Metastore. Using:
- SQLite, the limit is 2 GB;
- PostgreSQL – 1 GB;
- MySQL – 64 KB.
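Conceptually, an XCom is a key-value record in the Metastore, scoped by DAG, task, and key. The toy model below illustrates the push/pull idea; it is a conceptual sketch, not the real Airflow API:

```python
# Toy model of XCom: a key-value store in the metastore, scoped by
# (dag_id, task_id, key). Names and logic here are illustrative only.
xcom_store = {}

def xcom_push(dag_id, task_id, key, value):
    """Simulate a task pushing a small value for downstream tasks."""
    xcom_store[(dag_id, task_id, key)] = value

def xcom_pull(dag_id, task_id, key="return_value"):
    """Simulate a downstream task pulling a value pushed earlier."""
    return xcom_store.get((dag_id, task_id, key))

# The "extract" task shares a small summary with downstream tasks:
xcom_push("etl", "extract", "return_value", {"rows_loaded": 42})
print(xcom_pull("etl", "extract"))
```

Because every XCom goes through the Metastore, it should carry metadata or small results, never the dataset itself.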
What are Airflow’s main configuration parameters?
Airflow’s main configuration parameters are:
- Catchup (if set to True) – each DAG has a schedule_interval variable that determines the interval at which it is executed. With the catchup parameter enabled, a DAG Run is scheduled for every interval that has not been run since the last regularly scheduled execution (or since the run was cleared);
- Parallelism (default is 32) – determines the max number of tasks that can run in parallel for an entire Airflow instance;
- DAG_concurrency (default is 16) – the max number of tasks that can run concurrently per DAG run;
- MAX_ACTIVE_RUNS_PER_DAG (default is 16) – the max number of active DAG runs per DAG;
You can configure these parameters to achieve better performance for your Airflow instance and data pipelines.
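In airflow.cfg the parameters above look roughly like this (option names assume Airflow 2.x defaults; some were renamed in later releases, e.g. dag_concurrency became max_active_tasks_per_dag in 2.2):

```ini
# airflow.cfg
[core]
parallelism = 32
dag_concurrency = 16
max_active_runs_per_dag = 16

[scheduler]
catchup_by_default = False
```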
If you’re considering a position as a Data Engineer or Data Scientist, you should definitely get familiar with Apache Airflow. It’s a great tool that can save you a lot of time, speed up processes, and add value to your organization.
If you want to master writing DAGs and learn all the best practices, then stay tuned for our next extensive guide on Apache Airflow in our blog.
Author: Tihomir Gramov, DevOps Engineer @ ITGix