Mastering Airflow Variables | Towards Data Science

The way you retrieve variables from Airflow can impact the performance of your DAGs

Photo by Daniele Franchi on Unsplash

What happens if multiple data pipelines need to interact with the same API endpoint? Would you really have to declare this endpoint in every pipeline? If the endpoint ever changes, you would have to update its value in every single file.

Airflow variables are simple yet valuable constructs that prevent redundant declarations across multiple DAGs. Each one is an object consisting of a key and a JSON-serializable value, stored by default in Airflow’s metadata database.
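To make the key/value idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Airflow's API: `variable_store`, `set_variable`, `get_variable`, the two pipeline functions and the example URLs are all hypothetical names chosen for illustration. It only shows how a JSON-serializable value stored once under a key can be shared by several pipelines.

```python
import json

# Hypothetical in-memory stand-in for the metadata database: key -> JSON text.
variable_store = {}


def set_variable(key, value):
    """Serialize the value to JSON text and store it under the key."""
    variable_store[key] = json.dumps(value)


def get_variable(key):
    """Read the JSON text back and decode it into a Python object."""
    return json.loads(variable_store[key])


# The endpoint is declared once, not in every pipeline.
set_variable("api_config", {"endpoint": "https://example.com/v1", "timeout": 30})


def pipeline_a():
    return get_variable("api_config")["endpoint"]


def pipeline_b():
    return get_variable("api_config")["endpoint"]


# When the endpoint changes, a single update reaches both pipelines.
set_variable("api_config", {"endpoint": "https://example.com/v2", "timeout": 30})
print(pipeline_a(), pipeline_b())
```

Neither pipeline function had to be edited when the endpoint changed, which is the redundancy problem that variables are meant to solve.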

And what if your code uses tokens or other types of secrets? Hardcoding them in plain text is not a secure approach. Beyond reducing repetition, Airflow variables also aid in managing sensitive information. With six different ways to define variables in Airflow, selecting the appropriate method is crucial for ensuring security and portability.

An often overlooked aspect is the impact that variable retrieval has on Airflow performance. It can strain the metadata database with requests every time the Scheduler parses the DAG files (every thirty seconds by default).

It’s fairly easy to fall into this trap, unless you understand how the Scheduler parses DAGs and how Variables are retrieved from the database.
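The trap can be illustrated with a small simulation in plain Python. This is a toy model under stated assumptions, not Airflow code: `FakeMetastore`, `parse_eager_dag` and `parse_lazy_dag` are hypothetical stand-ins for the metadata database and for two styles of DAG file, and the loop stands in for the Scheduler re-parsing those files.

```python
# Toy model only: none of these names are Airflow APIs.


class FakeMetastore:
    """Holds variables and counts how many times one is looked up."""

    def __init__(self, variables):
        self._variables = dict(variables)
        self.queries = 0

    def get(self, key):
        self.queries += 1  # each lookup stands for one database request
        return self._variables[key]


def parse_eager_dag(db):
    """Mimics a DAG file that reads the variable at the top level."""
    endpoint = db.get("api_endpoint")  # runs on every parse
    return lambda: f"calling {endpoint}"


def parse_lazy_dag(db):
    """Mimics a DAG file that defers the lookup until the task runs."""
    return lambda: f"calling {db.get('api_endpoint')}"  # runs only on execution


eager_db = FakeMetastore({"api_endpoint": "https://example.com/api"})
lazy_db = FakeMetastore({"api_endpoint": "https://example.com/api"})

# Simulate the Scheduler parsing each file 10 times without running any task.
for _ in range(10):
    parse_eager_dag(eager_db)
    lazy_task = parse_lazy_dag(lazy_db)

print(eager_db.queries)  # 10: one request per parse
print(lazy_db.queries)   # 0: parsing alone never touches the store

lazy_task()              # a single task execution triggers a single lookup
print(lazy_db.queries)   # 1
```

In the eager version, the number of requests grows with the number of parses, even if no task ever runs. In the lazy version, it grows only with the number of task executions.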

Before discussing how Variables are fetched from the metastore and which best practices help optimise DAGs, it’s important to get the basics right. For now, let’s focus on how we can actually declare variables in Airflow.

As mentioned already, there are several ways to declare variables in Airflow. Some are more secure and portable than others, so let’s examine each of them and weigh their pros and cons.