Scalable Cloud Environment for Distributed Data Pipelines with Apache Airflow
Sep 23, 2020 · 21 min read
Key Takeaways
- Selecting the right data orchestration tool matters: the many options for building distributed data pipelines differ significantly in their capabilities and trade-offs.
- Apache Airflow is one of the most popular and widely adopted OSS projects for programmatic orchestration of data workflows.
- The main Airflow concepts include the Directed Acyclic Graph (DAG), scheduler, webserver, executor, and metadata database.
- You can provision a distributed Airflow setup in a cloud-agnostic environment, as well as in the cloud on Azure, AWS, and GCP.
Data pipelines, movement, workflows, transformation, and ETL have always been important for a broad range of businesses. Data pipeline tools can be grouped into several categories:
- Simple utilities, such as cron, excellent for trivial data pipelines.
- Platform-specific tools or cloud-based managed services, such as AWS Glue, AWS Data Pipeline, or Azure Data Factory, providing tight integration with specific products and moving data between them.
- Open-source projects, such as Apache Airflow, Apache Oozie, Azkaban, Dagster, and Prefect, offering flexibility, extensibility, and rich programmable control flow.
What is covered in this article
In this article, I focus on Apache Airflow, one of the most flexible, extensible, and most popular community projects for reliable and scalable data and AI pipelines. I will cover:
- Key Apache Airflow concepts and why to use it in a distributed setting
- General architecture of an Apache Airflow environment in the cloud
- A detailed guide to provisioning Apache Airflow on Kubernetes on Azure
Apache Airflow: Key Concepts
Airflow is based on the following concepts and components:
- DAG (Directed Acyclic Graph) - a set of tasks and the dependencies between them, defined using Python code.
- Scheduler - discovers the DAGs that are ready to be triggered according to their associated schedule. When the scheduler discovers task failures, it can optionally retry the failed task a configured number of times.
- Webserver - a handy UI for viewing, scheduling, or triggering DAGs, offering useful information on task success or failure status, progress, duration, retries, logs, and more.
- Executor - the component that runs a task, assigns it to a specific node, and updates other components on its progress.
- Metadata database - where Airflow stores metadata, configuration, and information on task progress.
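To make the DAG concept concrete, here is a minimal, illustrative sketch in plain Python (not Airflow's own scheduler code; the task names are hypothetical) of how dependencies between tasks constrain the order in which an executor may run them:

```python
# Illustrative sketch only (plain Python, not Airflow): how a DAG's
# "depends on" edges determine a valid execution order for its tasks.
def topological_order(dag):
    """Return tasks so that every task appears after its upstream tasks."""
    order, seen = [], set()

    def visit(task):
        if task in seen:
            return
        seen.add(task)
        # ensure all upstream dependencies are placed first
        for upstream in dag.get(task, []):
            visit(upstream)
        order.append(task)

    for task in dag:
        visit(task)
    return order

# extract must run before transform, which must run before load
dag = {"extract": [], "transform": ["extract"], "load": ["transform"]}
print(topological_order(dag))  # → ['extract', 'transform', 'load']
```

Airflow expresses the same idea declaratively: each task is a node, and statements like `t1 >> t2` declare the edges between them.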
Scalable data workflows with Airflow on Kubernetes
Airflow can run on a single machine or in distributed mode, to keep up with the scale of your data pipelines.
Airflow can be distributed with Celery, a component that uses task queues to distribute work across machines. Celery requires a number of worker nodes to be running up front so that tasks can be scheduled across them. It is deployed as an extra component in your system and requires a message transport to send and receive messages, such as Redis or RabbitMQ.
Airflow also supports Kubernetes as a distributed executor. It doesn't require any additional components, such as Redis. The Kubernetes executor doesn't need to keep a fixed number of workers alive, because it creates a new pod for every task.
General architecture of an Apache Airflow environment in the cloud

Figure 1. Cloud-agnostic architecture.

Figure 2. Azure architecture.

Figure 3. GCP and AWS architectures.
Provisioning Apache Airflow on Kubernetes in the cloud
To get started, you will need access to a cloud subscription, such as Azure, AWS, or Google Cloud. The example in this article is based on Azure; however, you should be able to successfully follow the same steps on AWS or GCP with minor changes.
To follow the example on Azure, feel free to create an account, and install the Azure CLI or use the Azure Portal to create the necessary resources. If using the Azure CLI, don't forget to log in and initialize your session with your subscription ID:
az login
az account set --subscription <subscription-id>

To make sure you can easily delete all the resources at once after giving it a try, create an Azure Resource Group that will serve as a grouping unit:
RESOURCE_GROUP_NAME="airflow"
REGION="East US"
az group create --name $RESOURCE_GROUP_NAME --location "$REGION"

After you are done with the resources, feel free to delete the entire resource group:
az group delete --name $RESOURCE_GROUP_NAME

PostgreSQL
Apache Airflow requires a database to store metadata about the status of tasks. Airflow works with the metadata database through the SQLAlchemy abstraction layer. SQLAlchemy is a Python SQL toolkit and Object Relational Mapper. Any database supported by SQLAlchemy should work with Airflow; MySQL and PostgreSQL are among the most common choices.
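As an illustration, the connection URL that Airflow's sql_alchemy_conn setting expects can be composed in Python. The helper below is a sketch with placeholder credentials (not a real server), using quote_plus to escape characters like @ that Azure usernames contain:

```python
# A sketch (placeholder credentials, not a real server) of composing the
# SQLAlchemy connection URL that Airflow's sql_alchemy_conn setting expects.
from urllib.parse import quote_plus

def metadata_db_url(user, password, host, db, port=5432):
    # quote_plus escapes characters such as '@' that would otherwise break
    # URL parsing; Azure usernames contain '@', which becomes %40
    return (f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}:{port}/{db}")

print(metadata_db_url("airflowuser@airflowhost", "p@ss",
                      "airflowhost.postgres.database.azure.com", "airflow"))
# → postgresql+psycopg2://airflowuser%40airflowhost:p%40ss@airflowhost.postgres.database.azure.com:5432/airflow
```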
To create and successfully connect to an instance of PostgreSQL on Azure, please follow the detailed instructions here.
Feel free to use the same resource group name and location for the PostgreSQL instance. Make sure to specify a unique server name. I chose GP_Gen5_2 as the instance size, as 2 vCores and 100 GB of storage are more than enough for this example, but feel free to pick the size that fits your own requirements.
It is important to remember your PostgreSQL server name, fully qualified domain name, username, and password. This information will be required during the next steps. Note: you can get the fully qualified domain name by looking at the fullyQualifiedDomainName field after executing the command:
POSTGRESQL_SERVER_NAME="<your-server-name>"
az postgres server show --resource-group $RESOURCE_GROUP_NAME --name $POSTGRESQL_SERVER_NAME
# example value of fullyQualifiedDomainName on Azure: airflowazuredb.postgres.database.azure.com

Check the connection to your database using psql:
P_USER_NAME="your-postgres-username"
psql --host=$POSTGRESQL_SERVER_NAME.postgres.database.azure.com --port=5432 --username=$P_USER_NAME@$POSTGRESQL_SERVER_NAME --dbname=postgres

For AWS or GCP, feel free to use Amazon RDS or Google Cloud SQL for PostgreSQL, respectively.
File Share
When running data management workflows, we need to store Apache Airflow Directed Acyclic Graph (DAG) definitions somewhere. When running Apache Airflow locally, we can store them in a local filesystem directory and point to it through the configuration file. When running Airflow in a Docker container (either locally or in the cloud), we have several options.
Storing data pipeline DAGs directly within the container image. The downside of this approach shows when DAGs change frequently: the image has to be rebuilt each time they do.
Storing DAG definitions in a remote Git repository. When DAG definitions change, a Git-Sync sidecar can automatically synchronize the repository with a volume in your container.
Storing DAGs in a shared remote location, such as a remote filesystem. As with a remote Git repository, we can mount a remote filesystem to a volume in our container and mirror DAG changes automatically. Kubernetes supports a variety of CSI drivers for many remote filesystems, including Azure Files, AWS Elastic File System, and Google Cloud Filestore. This approach is also great if you want to store logs in a remote location.
I am using Azure Files for storing DAG definitions. To create an Azure fileshare, execute the following commands:
STORAGE_ACCOUNT="storage-account-name"
FILESHARE_NAME="fileshare-name"
az storage account create \
  --resource-group $RESOURCE_GROUP_NAME \
  --name $STORAGE_ACCOUNT \
  --kind StorageV2 \
  --sku Standard_ZRS \
  --output none
STORAGE_ACCOUNT_KEY=$(az storage account keys list \
  --resource-group $RESOURCE_GROUP_NAME \
  --account-name $STORAGE_ACCOUNT \
  --query "[0].value" | tr -d '"')
az storage share create \
  --account-name $STORAGE_ACCOUNT \
  --account-key $STORAGE_ACCOUNT_KEY \
  --name $FILESHARE_NAME \
  --quota 1024 \
  --output none

For Azure Files, detailed creation instructions are located here. Feel free to create a fileshare on any of the other platforms to follow along.
After the fileshare is created, you can copy one or several DAGs you might have to your newly created storage. If you don't have any DAGs yet, you can use one of those available online, for example, the DAG called airflow_tutorial_v01 from one of the GitHub repositories, which you can also find here.
To copy files to an Azure Files share, you can use the Azure Portal, or the AzCopy utility for programmatic operations.
Kubernetes cluster
For the Apache Airflow scheduler, UI, and executor workers, we need to create a cluster. For Azure Kubernetes Service, detailed cluster creation instructions are here. Make sure to indicate that you'd like the cluster to be provisioned in a Virtual Network (the default installation doesn't include one). I created the cluster through the Azure Portal with 3 nodes of size Standard DS2 v2 in East US, with RBAC enabled, and the following network configuration:

Figure 4. Kubernetes Cluster Network Configuration.
Azure Portal has an amazing feature to help you connect to your cluster. On the AKS cluster resource, click Connect at the top, and follow the instructions.
Allow cluster to access database
To make sure the AKS cluster can communicate with the PostgreSQL on Azure database, we need to add a service endpoint on the AKS Virtual Network side and a Virtual Network rule on the PostgreSQL side.
Go to the AKS virtual network resource in the Azure Portal; it's located in the same resource group as the cluster. Under the Service Endpoints settings menu, select Add, and choose Microsoft.SQL from the dropdown:

Figure 5. Add a Service Endpoint to the Virtual Network.
Go to the PostgreSQL on Azure resource, and under the Connection Security settings menu, in the VNET rules section, select Add existing virtual network. Specify a name for the Virtual Network rule, and select your subscription and the AKS Virtual Network. These actions make sure Apache Airflow pods on AKS can communicate with the database successfully.

Figure 6. Add a Virtual Network Rule.
Prepare the fileshare to be used within Kubernetes
Install a CSI driver corresponding to your platform. For Azure, follow the instructions to install the Azure Files CSI driver.
Create a secret to store the Azure storage account name and key (make sure STORAGE_ACCOUNT and STORAGE_ACCOUNT_KEY contain your own values for the storage account name and key). This secret will be used later in the Persistent Volume definitions for DAGs and logs.
kubectl create secret generic azure-secret --from-literal accountname=$STORAGE_ACCOUNT --from-literal accountkey=$STORAGE_ACCOUNT_KEY --type=Opaque

Prepare resource files to run Airflow components on Kubernetes

Figure 7. List of Resource Definition Files.
You can clone the GitHub repository to get these files. Before you apply them to your own cluster, make sure to review them and read through the notes in this article, as there are quite a few values that might need to be customized.
Note: I initially got the files from the official Airflow GitHub repository here. I ran the helm template command to generate the YAML files and deleted those that weren't relevant for this use case. My GitHub repository now contains all the necessary files adjusted for this scenario, so you can skip the helm template step if you'd like to use and modify the files in my repository.
Namespace
Having a namespace is a good way to identify resources that belong to the same logical group. In this case, we can create a namespace to assign to all resources relevant for our Apache Airflow setup. This is especially convenient for distinguishing groups of resources when there are multiple components living within the same cluster.
Definition of the namespace for all Airflow resources is in the airflow-namespace.yaml file on GitHub.
Create the namespace:
kubectl create -f airflow-namespace.yaml

Make sure to remember to use the right namespace name when creating other resources for Airflow. We will be using airflow-on-k8s in this guide.
Persistent Volumes for logs and DAGs
As one of the ways to work with storage within the cluster, there is a resource called a Persistent Volume. Its goal is to abstract away the details of the underlying storage provider, be it a shared network filesystem, an external volume, or another cloud-specific storage type. A Persistent Volume resource is configured with connection information or settings that define how the cluster will connect to and work with the storage. A Persistent Volume's lifecycle is independent and doesn't have to be tied to the lifecycle of a pod that uses it.
In this case, we need an abstraction to represent storage for DAG definitions and for log files. Definition of a Persistent Volume resource for DAGs is in the pv-azurefile-csi.yaml file on GitHub. Definition of a Persistent Volume resource for logs is in the pv-azurefile-csi-logs.yaml file on GitHub.
Open the pv-azurefile-csi.yaml and pv-azurefile-csi-logs.yaml files and edit them to include your own fileshare name and storage account name. If you followed the steps above, set the shareName parameter to the value of the $FILESHARE_NAME variable, and the server parameter to $STORAGE_ACCOUNT.file.core.windows.net.
If you are not using Azure, make sure to change the CSI driver settings to correspond to AWS, GCP, or any other platform. Don’t forget to create any necessary secrets to store sensitive data corresponding to the fileshare system you are using.
Create persistent volumes:
kubectl create -f pv/pv-azurefile-csi.yaml
kubectl create -f pv/pv-azurefile-csi-logs.yaml

Persistent Volume Claims for logs and DAGs
In addition to Persistent Volumes as a storage abstraction, we also have an abstraction that represents a request for storage, called a Persistent Volume Claim. Its goal is to represent a specific pod's request for storage, with specific details such as the exact amount of storage and the access mode.
We need to create Persistent Volume Claims for DAGs and logs, to make sure pods that interact with storage can use these claims to request access to the storage (Azure Files in this case).
Definition of a Persistent Volume Claim resource for DAGs is in the pvc-azurefile-csi-static.yaml file on GitHub. Definition of a Persistent Volume Claim resource for logs is in the pvc-azurefile-csi-static-logs.yaml file on GitHub.
Create persistent volume claims:
kubectl create -f pv/pvc-azurefile-csi-static.yaml
kubectl create -f pv/pvc-azurefile-csi-static-logs.yaml
After a few minutes, make sure your Persistent Volume Claims are in Bound state:
$ kubectl get pvc -n airflow-on-k8s
NAME       STATUS   VOLUME    CAPACITY   ACCESS MODES
dags-pvc   Bound    dags-pv   10Gi       RWX
logs-pvc   Bound    logs-pv   10Gi       RWX

Service Accounts for scheduler and workers
To allow certain processes to perform certain tasks, there is the notion of Service Accounts. For example, we'd like to make sure the Airflow scheduler can programmatically create, view, and manage pod resources for workers. To make this possible, we need to create a Service Account for each component we want to grant certain privileges to. Later, we can associate the Service Accounts with Cluster Roles by using Cluster Role Bindings.
Definition of the Scheduler Service Account is in the scheduler-serviceaccount.yaml file on GitHub. Definition of the Worker Service Account is in the worker-serviceaccount.yaml file on GitHub.
Create the service accounts:
kubectl create -f scheduler/scheduler-serviceaccount.yaml
kubectl create -f workers/worker-serviceaccount.yaml

Cluster Role for scheduler and workers to dynamically operate pods
A Cluster Role represents a set of rules or permissions.
Definition of the Cluster Role we want to create is in the pod-launcher-role.yaml file on GitHub.
Create the cluster role:
kubectl create -f rbac/pod-launcher-role.yaml

Cluster Role Binding for scheduler and workers
A Cluster Role Binding is a connection between a Cluster Role and the accounts that need it.
Definition of the Cluster Role Binding is in the pod-launcher-rolebinding.yaml file on GitHub.
Create the cluster role binding:
kubectl create -f rbac/pod-launcher-rolebinding.yaml

Secrets
Secrets are a mechanism for managing sensitive data, such as tokens, passwords, or keys, that other resources in the cluster may require.
Note: if you configure the secret through a manifest (JSON or YAML) file which has the secret data encoded as base64, sharing this file or checking it into a source repository means the secret is compromised. Base64 encoding is not an encryption method and is considered the same as plain text.
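The warning above can be demonstrated in a few lines of Python: recovering the original value from its base64 form needs no key at all. (The password below is a made-up placeholder.)

```python
# Base64 is a reversible encoding, not encryption: anyone who can read the
# manifest can recover the secret with a single function call.
import base64

encoded = base64.b64encode(b"supersecretpassword").decode()
print(encoded)                             # → c3VwZXJzZWNyZXRwYXNzd29yZA==
print(base64.b64decode(encoded).decode())  # → supersecretpassword
```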
Apache Airflow needs to know the fernet key and how to connect to the metadata database. We will use secrets to represent this information.
Fernet key
Apache Airflow requires a fernet key so it can secure connection information that isn't protected by default.
Start Python shell:
$ python
Python 3.7.5 (default, Nov 1 2019, 02:16:32)
[Clang 11.0.0 (clang-1100.0.33.8)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

Generate a fernet key from the Python shell:
from cryptography.fernet import Fernet
fernet_key = Fernet.generate_key()
print('Generated fernet key: ', fernet_key.decode())

Example output of this command is:
Generated fernet key: ESqYUmi27Udn6wxY83KoM9kuvt9rDcelghHbAgGZW9g=
Convert the value to the base64 encoded value:
$ echo -n "<your-generated-fernet-key-value>" | base64

Example output (this will be the value we use for fernet-key within the fernet-secret.yaml file in this example):
Slk0OThJbHE4R0xRNEFuRlJWT3FUR3lBeDg3bG5BWWhEdWx1ekhHX2RJQT0=

Definition of the fernet key secret is in the fernet-secret.yaml file on GitHub.
Replace the value of the fernet-key parameter in the file with your generated fernet-key value.
Create the fernet secret:
kubectl create -f secrets/fernet-secret.yaml
Database connection information
Prepare your PostgreSQL database connection string. Generally, it follows the format:
postgresql+psycopg2://user:password@hostname:5432/dbname

Note: when using Azure Database for PostgreSQL, the connection string requires the user to be in the format user@host, where the @ sign should be escaped as %40 (more details):
postgresql+psycopg2://user%40host:password@host.postgres.database.azure.com:5432/dbname

Encode your connection string using base64 (replace airflowuser with your username, airflowpassword with your password, airflowhost with your host, and airflow with your database name):
# For general PostgreSQL
echo -n "postgresql+psycopg2://airflowuser:airflowpassword@airflowhost.postgres.database.azure.com:5432/airflow" | base64
# For Azure PostgreSQL
echo -n "postgresql+psycopg2://airflowuser%40airflowhost:airflowpassword@airflowhost.postgres.database.azure.com:5432/airflow" | base64

Example output (this will be the value we use for connection within the metadata-connection-secret.yaml file in this example):
cG9zdGdyZXNxbCtwc3ljb3BnMjovL2xlbmElNDBhaXJmbG93YXp1cmVkYjpQYXNzd29yZDEhQGFpcmZsb3dhenVyZWRiLnBvc3RncmVzLmRhdGFiYXNlLmF6dXJlLmNvbTo1NDMyL3Bvc3RncmVz

Definition of the connection secret is in the metadata-connection-secret.yaml file on GitHub.
Replace the value of the connection parameter in the file with your base64-encoded connection value.
Create the Airflow connection metadata secret:
kubectl create -f secrets/metadata-connection-secret.yaml
Config Map
A Config Map is a resource that allows storing non-sensitive data in a key-value format. This is convenient for any type of configuration setting. Since Apache Airflow generally requires a configuration file called airflow.cfg, we can use a Config Map to populate it with important parameters. For example:
* dags_folder points to the folder within a pod where DAGs mounted from remote storage can be found.
* dags_in_image can be set to True or False. If False, Airflow will look at mounted volumes or a Git repository to find DAGs.
* dags_volume_claim is the name of the Persistent Volume Claim for DAGs.
* Settings that start with git_ are relevant if you're planning to use a Git repository to sync DAGs.
* worker_service_account_name can be used to set the name of the worker service account, and so on.
Definition of the Config Map is in the configmap.yaml file on GitHub.
Note: there are quite a few key-value pairs that can be adjusted in the Config Map, so if you're doing a lot of experimentation, feel free to tweak some of them. However, be sure to modify the corresponding resource files as well, as many of the values in the Config Map affect other resources.
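To visualize how such parameters end up in airflow.cfg, here is a short sketch using Python's standard configparser. The section layout reflects how Airflow 1.10 groups these keys, but the concrete values (claim and service account names) are assumptions for illustration:

```python
# Sketch of the key-value shape that Config Map entries take once rendered
# into airflow.cfg; the values below are illustrative, not a complete file.
import configparser

cfg_text = """
[core]
dags_folder = /opt/airflow/dags

[kubernetes]
dags_in_image = False
dags_volume_claim = dags-pvc
worker_service_account_name = worker-serviceaccount
"""

cfg = configparser.ConfigParser()
cfg.read_string(cfg_text)
print(cfg["kubernetes"]["dags_volume_claim"])         # → dags-pvc
print(cfg.getboolean("kubernetes", "dags_in_image"))  # → False
```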
Create config map:
kubectl create -f configmap.yaml
StatsD
StatsD is a process that listens for various statistics, such as counters and timers. Apache Airflow has built-in support for StatsD and uses its Python client to expose metrics. StatsD receives information from Airflow about the number of job successes or failures, the number of jobs waiting in line for execution, and more. If you're interested in specific types of metrics, take a look at this page.
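For intuition about what "exposing metrics" means here: the StatsD protocol is just short text datagrams sent over UDP. The sketch below (metric names are illustrative, not an exhaustive list of Airflow's metrics) formats metrics the way a StatsD client would:

```python
# Minimal sketch of the text-based StatsD wire protocol: each metric is a
# '<name>:<value>|<type>' line sent over UDP ('c' = counter, 'ms' = timer,
# 'g' = gauge). The metric names below are illustrative.
def statsd_metric(name, value, metric_type="c", prefix="airflow"):
    return f"{prefix}.{name}:{value}|{metric_type}"

print(statsd_metric("ti_successes", 1))               # → airflow.ti_successes:1|c
print(statsd_metric("executor.queued_tasks", 3, "g")) # → airflow.executor.queued_tasks:3|g
```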
Definition of the StatsD deployment is in the statsd-deployment.yaml file on GitHub. Definition of the StatsD service is in the statsd-service.yaml file on GitHub.
Create StatsD resources:
kubectl create -f statsd/statsd-deployment.yaml
kubectl create -f statsd/statsd-service.yaml

Check the status of the StatsD instance and get its TCP port:
kubectl get services -n airflow-on-k8s
NAME     TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
statsd   ClusterIP   10.0.9.199   <none>        9125/UDP,9102/TCP   24h

Map the StatsD instance to a local port (for example 7000):
kubectl port-forward service/statsd 7000:9102 -n airflow-on-k8s
Output:
Forwarding from 127.0.0.1:7000 -> 9102
Forwarding from [::1]:7000 -> 9102
Open the 127.0.0.1:7000 page in your browser to see the StatsD main page or metrics page:

Figure 8. StatsD metrics page
Scheduler
The scheduler is one of the main components behind Apache Airflow.
From the documentation:
> The Airflow scheduler monitors all tasks and all DAGs and triggers the task instances whose dependencies have been met. Behind the scenes, it spins up a subprocess, which monitors and stays in sync with a folder for all DAG objects it may contain, and periodically (every minute or so) collects DAG parsing results and inspects active tasks to see whether they can be triggered.
Definition of the Scheduler deployment is in the scheduler-deployment.yaml file on GitHub.
Create scheduler deployment:
kubectl create -f scheduler/scheduler-deployment.yaml
Webserver
The webserver and UI component of Apache Airflow lets us kickstart, schedule, monitor, and troubleshoot our data pipelines, along with many other convenient functions.
Definition of the Webserver deployment is in the webserver-deployment.yaml file on GitHub. Definition of the Webserver service is in the webserver-service.yaml file on GitHub.
If you'd like the Webserver to have an external IP, replace ClusterIP with LoadBalancer in webserver-service.yaml, and you will be able to access it from outside the cluster without proxies or port forwarding.
Create webserver deployment and service:
kubectl create -f webserver/webserver-deployment.yaml
kubectl create -f webserver/webserver-service.yaml
Check the status of the Airflow UI instance and get its TCP port:
kubectl get services -n airflow-on-k8s
NAME        TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
statsd      ClusterIP   10.0.9.199   <none>        9125/UDP,9102/TCP   24h
webserver   ClusterIP   10.0.9.175   <none>        8080:31003/TCP      19h

Map the Airflow UI instance to a local port (for example 8080):
kubectl port-forward service/webserver 8080:8080 -n airflow-on-k8s
Output:
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
Open the 127.0.0.1:8080 page in your browser to see the Airflow UI page.
If you’d like to create a new user for Airflow Webserver, you can connect to the webserver pod:
$ kubectl exec --stdin --tty webserver-647fdcb7c-4qkx9 -n airflow-on-k8s -- /bin/bash
airflow@webserver-647fdcb7c-4qkx9:/opt/airflow$

And create a user from the Airflow CLI. Replace USERNAME, PASSWORD, FIRSTNAME, LASTNAME, EMAIL, and ROLE with your own values. Note: existing Airflow roles can be one of the following: Admin, User, Op, Viewer, and Public:
airflow create_user -u USERNAME -p PASSWORD -f FIRSTNAME -l LASTNAME -e EMAIL -r ROLE

Example output:
[2020-08-08 00:00:40,140] {__init__.py:51} INFO - Using executor KubernetesExecutor
[2020-08-08 00:00:40,143] {dagbag.py:396} INFO - Filling up the DagBag from /opt/airflow/dags
[2020-08-08 00:00:41,834] {security.py:475} INFO - Start syncing user roles.
[2020-08-08 00:00:42,458] {security.py:385} INFO - Fetching a set of all permission, view_menu from FAB meta-table
[2020-08-08 00:00:42,833] {security.py:328} INFO - Cleaning faulty perms
Viewer user newuser created.

Afterward, you can log in to the Airflow UI with the credentials of any of the users you have provisioned. You should see a page displaying Airflow DAGs:

Figure 9. Airflow UI showing DAGs
You can further explore the Graph View, Tree View, logs, and other details of any particular DAG by clicking on it.

Figure 10. Airflow UI - Logs
Checking health of resources provisioned
Check if scheduler, webserver, and statsd deployments are in a healthy state:
$ kubectl get deployments -n airflow-on-k8s
NAME        READY   UP-TO-DATE   AVAILABLE   AGE
scheduler   1/1     1            1           123m
statsd      1/1     1            1           24h
webserver   1/1     1            1           122m

Check if all corresponding pods are healthy:
$ kubectl get po -n airflow-on-k8s
NAME                         READY   STATUS    RESTARTS   AGE
scheduler-7584f4b4b7-5zhvw   2/2     Running   0          125m
statsd-d6d5bcd7c-dg26n       1/1     Running   0          24h
webserver-647fdcb7c-4qkx9    1/1     Running   0          124m

Check the status/events of any particular pod (the scheduler-7584f4b4b7-5zhvw pod in this example):
$ kubectl describe pod scheduler-7584f4b4b7-5zhvw -n airflow-on-k8s
Check pod logs (where the -c parameter refers to the name of the container we want to check on, scheduler in this case):
kubectl logs scheduler-7584f4b4b7-5zhvw -n airflow-on-k8s -c scheduler
Check logs from an init container of a pod (where the -c parameter refers to the name of the init container we want to check on, run-airflow-migrations in this case):
$ kubectl logs scheduler-7584f4b4b7-5zhvw -n airflow-on-k8s -c run-airflow-migrations
Connect to a pod to execute commands from within (webserver-647fdcb7c-4qkx9 pod in this example):
kubectl exec --stdin --tty webserver-647fdcb7c-4qkx9 -n airflow-on-k8s -- /bin/bash
After getting connected, we can execute commands (for example check the dags directory):
airflow@webserver-647fdcb7c-4qkx9:/opt/airflow$ cd dags
airflow@webserver-647fdcb7c-4qkx9:/opt/airflow/dags$ ls
__pycache__  dag_processor_manager  dags  first-dag.py  hello_dag.py  hello_world.py  outfile  scheduler

Overview of resources created
We can see which resources are running in the cluster by running the following command:
kubectl get all -n airflow-on-k8s
NAME                             READY   STATUS    RESTARTS   AGE
pod/scheduler-7584f4b4b7-jdfzl   2/2     Running   0          13m
pod/statsd-d6d5bcd7c-mjdm7       1/1     Running   0          17m
pod/webserver-647fdcb7c-ft72t    1/1     Running   0          8m26s

NAME                TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)             AGE
service/statsd      ClusterIP   10.0.47.229   <none>        9125/UDP,9102/TCP   16m
service/webserver   ClusterIP   10.0.197.80   <none>        8080/TCP            6s

NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/scheduler   1/1     1            1           13m
deployment.apps/statsd      1/1     1            1           17m
deployment.apps/webserver   1/1     1            1           8m26s

NAME                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/scheduler-7584f4b4b7   1         1         1       13m
replicaset.apps/statsd-d6d5bcd7c       1         1         1       17m
replicaset.apps/webserver-647fdcb7c    1         1         1       8m26s

Note: this doesn't show the secrets, persistent volumes, persistent volume claims, service accounts, cluster roles, or cluster role bindings that we also created.
In Azure Portal, you can see all the resources within the main resource group:

Figure 11. Cloud Resources
To clean up your environment, just run:
az group delete --name $RESOURCE_GROUP_NAME

Next Steps
As a next step, experiment and take a look at some of the DAG definitions and integrations available!
About the Author
Lena Hall is a Director of Engineering at Microsoft working on Azure, where she focuses on large-scale distributed systems and modern architectures. She is leading a team and technical strategy for product improvement efforts across Big Data services at Microsoft. Lena is the driver behind engineering initiatives and strategies to advance, facilitate, and push forward further acceleration of cloud services. Lena has 10 years of experience in solution architecture and software engineering with a focus on distributed cloud programming, real-time system design, highly scalable and performant systems, big data analysis, data science, functional programming, and machine learning. Previously, she was a Senior Software Engineer at Microsoft Research. She co-organizes a conference called ML4ALL, and is often an invited member of program committees for conferences like Kafka Summit, Lambda World, and others. Lena holds a master's degree in computer science. Twitter: @lenadroid. LinkedIn: Lena Hall.