What Is the Apache Airflow Scheduler?
Airflow is a platform for building and running workflows, and it is commonly used to implement machine learning operations (MLOps) pipelines. A workflow is represented as a directed acyclic graph (DAG) made up of individual Tasks, arranged to take into account execution order, retries, dependencies, and data flows.
A Task in Airflow describes the work to be performed, such as importing data, performing analysis, or triggering other systems.
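For illustration, here is a minimal sketch of such a workflow, assuming Airflow 2.x; the DAG ID, task IDs, and shell commands are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal DAG: two Tasks, where "analyze" runs only after "ingest" succeeds.
with DAG(
    dag_id="example_pipeline",          # placeholder DAG ID
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",         # the scheduler creates one run per day
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest", bash_command="echo 'import data'")
    analyze = BashOperator(task_id="analyze", bash_command="echo 'run analysis'")

    ingest >> analyze                   # dependency: ingest must finish first
```

With this definition, the scheduler would create one run of the DAG per day and execute analyze only after ingest completes successfully.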
The Airflow scheduler is the component that monitors all Tasks and DAGs and triggers Task instances once their dependencies are complete. Behind the scenes, the scheduler starts a subprocess that monitors all DAGs in the specified DAG directory and keeps them synchronized. By default, the scheduler collects DAG parsing results once per minute and checks whether any active Tasks can be triggered.
Related content: Read our detailed guide to Airflow
How Does Airflow Scheduler Work?
The Airflow scheduler reads and modifies entries in the Airflow metadata database, which stores configuration such as variables, connections, user information, roles, and policies. The database is the source of truth for all metadata about DAGs and also stores statistics about past runs.
Here is the general process followed by the Airflow scheduler:
- When the Airflow scheduler service is started, the scheduler first checks the DAGs folder and instantiates all DAG objects in the metadata database.
- The scheduler parses the DAG file and generates the required DAG runs based on the schedule parameters.
- The scheduler creates a TaskInstance for each Task in the DAG that needs to be performed. These TaskInstances are assigned the status "scheduled" in the metadata database.
- The scheduler searches the database for all TaskInstances in the "scheduled" state and forwards them to the executor. Their status then changes to "queued".
- Workers pull queued TaskInstances according to the run configuration. When a TaskInstance is pulled from the queue, it transitions to the "running" state.
- After the Task finishes, the worker sets the TaskInstance to its final state, typically "success" or "failed". This update is reflected in the Airflow scheduler (one way to observe these states is shown in the sketch after this list).
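As a rough illustration of the lifecycle above, the sketch below lists the task instances of one DAG run through Airflow's stable REST API and prints their current states; the host, credentials, DAG ID, and run ID are assumptions for the example.

```python
import requests

# Assumed local Airflow webserver with basic auth enabled; adjust for your deployment.
BASE_URL = "http://localhost:8080/api/v1"
AUTH = ("admin", "admin")                        # placeholder credentials
DAG_ID = "example_pipeline"                      # placeholder DAG ID
RUN_ID = "scheduled__2023-01-01T00:00:00+00:00"  # placeholder run ID

# List the task instances of one DAG run and print their current states
# (e.g. scheduled, queued, running, success, failed).
resp = requests.get(
    f"{BASE_URL}/dags/{DAG_ID}/dagRuns/{RUN_ID}/taskInstances",
    auth=AUTH,
)
resp.raise_for_status()
for ti in resp.json()["task_instances"]:
    print(ti["task_id"], ti["state"])
```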
Airflow Scheduling And Triggers
A trigger is a small piece of Python code that runs alongside other triggers within a single Python process (the triggerer). Because triggers are asynchronous, they can co-exist efficiently. Here is how this process works:
- A running operator (task instance) reaches a point where it must wait. It then defers itself with a trigger tied to the pending event. Freed of this work, the worker can run something else.
- Once the new trigger instance is registered in Airflow, it is picked up by a triggerer process. The trigger runs until it fires, and then its source task is rescheduled; the scheduler queues the task to resume on a worker node (see the sketch after this list).
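To make this concrete, below is a minimal sketch of a deferrable operator and its trigger, assuming Airflow 2.2 or later (which introduced deferral); the class names and module path are illustrative, not part of Airflow.

```python
import asyncio

from airflow.models.baseoperator import BaseOperator
from airflow.triggers.base import BaseTrigger, TriggerEvent


class WaitSecondsTrigger(BaseTrigger):
    """Hypothetical trigger that fires once after a fixed delay.

    It runs asynchronously inside the triggerer process, alongside other triggers.
    """

    def __init__(self, seconds: float):
        super().__init__()
        self.seconds = seconds

    def serialize(self):
        # The triggerer re-creates the trigger from this (classpath, kwargs) pair;
        # "my_plugin.triggers" is a placeholder module path.
        return ("my_plugin.triggers.WaitSecondsTrigger", {"seconds": self.seconds})

    async def run(self):
        # Non-blocking wait; many triggers like this can share one event loop.
        await asyncio.sleep(self.seconds)
        yield TriggerEvent({"status": "done"})


class WaitOperator(BaseOperator):
    """Hypothetical operator that defers instead of blocking a worker slot."""

    def __init__(self, seconds: float = 60, **kwargs):
        super().__init__(**kwargs)
        self.seconds = seconds

    def execute(self, context):
        # Hand the wait off to the triggerer and free the worker.
        self.defer(
            trigger=WaitSecondsTrigger(seconds=self.seconds),
            method_name="execute_complete",
        )

    def execute_complete(self, context, event=None):
        # Called when the task resumes on a worker after the trigger fires.
        self.log.info("Trigger fired with payload: %s", event)
```

While the trigger waits inside the triggerer, the worker slot that started WaitOperator is free to execute other tasks.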
Airflow also offers several trigger rules, which you can specify on a task. The scheduler uses these rules to determine whether to run a given task. The available rules include all_success (the default), all_failed, all_done, one_success, one_failed, none_failed, none_failed_min_one_success, none_skipped, and always.
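For example, here is a hedged sketch of a DAG in which a cleanup task uses the all_done rule so it runs once its upstream tasks finish, whether they succeeded or failed (the DAG ID, task IDs, and commands are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="trigger_rule_demo",          # placeholder DAG ID
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="exit 1")  # may fail

    # all_done: run once all upstream tasks have finished, whatever their state.
    cleanup = BashOperator(
        task_id="cleanup",
        bash_command="echo cleanup",
        trigger_rule=TriggerRule.ALL_DONE,
    )

    [extract, transform] >> cleanup
```

Changing the trigger_rule on a single task is often enough to express "always clean up" or "react to the first failure" semantics without restructuring the DAG.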
Fine-Tuning Scheduler Performance
The Airflow scheduler continuously parses DAG files to keep them in sync with the DAG records in the metadata database, and it schedules tasks for execution. It runs these two operations independently and simultaneously.
You must consider the following factors when fine-tuning the scheduler:
- The type of deployment—this includes the type of file system that shares the DAGs, its speed, processing memory, available CPU, and networking throughput.
- The definition and logic of the DAG structure—this includes the number of DAG files (and the number of DAGs within the files), their size and complexity, and whether you must import many libraries to parse a DAG file.
- The configuration of the scheduler—this includes the number of schedulers and parsing processes, the interval after which the scheduler re-parses a DAG file, the number of TaskInstances and DAG runs processed per loop, and the frequency of checks and cleanups (a sample of these settings appears after this list).
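As an illustration, here is a sketch of a few such settings expressed through Airflow's AIRFLOW__SECTION__KEY environment-variable convention; the values are arbitrary examples, and option names and defaults can vary between Airflow versions.

```python
import os

# Example overrides for the [scheduler] section, set before the scheduler starts.
# In practice these usually live in the deployment environment (e.g. a container
# spec) or in airflow.cfg rather than in Python code.
os.environ.update({
    # Minimum interval (seconds) before a DAG file is re-parsed.
    "AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL": "30",
    # Number of parallel DAG-parsing processes.
    "AIRFLOW__SCHEDULER__PARSING_PROCESSES": "2",
    # Upper bound on TaskInstances examined in a single scheduler query.
    "AIRFLOW__SCHEDULER__MAX_TIS_PER_QUERY": "512",
    # How many new DAG runs may be created per scheduler loop.
    "AIRFLOW__SCHEDULER__MAX_DAGRUNS_TO_CREATE_PER_LOOP": "10",
})
```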
Airflow offers many ways to fine-tune the scheduler’s performance, but it is your responsibility to decide what to optimize for. For example, you might tolerate a 30-second delay when a new DAG is parsed, or you might require near-instant parsing, which requires higher CPU usage.
Choose which performance consideration is most important and fine-tune accordingly. Monitor the system to capture the relevant data (you can use your existing monitoring tools), prioritize the performance aspect you want to improve, and check the system for bottlenecks. Performance monitoring and improvement should be a continuous process.
Once you’ve built an idea of your resource usage, you might consider the following improvements:
- Improve parsing efficiency—you can reduce your DAG code’s complexity to make it easier to parse continuously.
- Improve resource utilization—identify any underutilized CPU, memory, or networking capacity and add more scheduling and parsing processes to use it.
- Expand your hardware capacity—if the existing hardware cannot keep up, you might add resources such as another machine running an additional scheduler.
- Fine-tune the scheduler’s values—you might experiment with the tunable settings or trade one performance aspect off against another. Performance optimization usually requires balancing various aspects.
- Tweak the scheduler’s behavior—you might change settings such as the DAG file parsing order to achieve better results for a specific deployment.
Machine Learning Workflow Automation with Run:ai
If you are using Airflow to automate machine learning workflows, Run:ai can help automate resource management and orchestration. With Run:ai, you can automatically run as many compute-intensive experiments as needed.
Here are some of the capabilities you gain when using Run:ai:
- Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
- No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
- A higher level of control—Run:ai enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.
Run:ai simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.
Learn more about the Run:ai GPU virtualization platform.