Scalable Cloud Environment for Distributed Data Pipelines with Apache Airflow
Sep 23, 2020 · 21 min read
Key Takeaways
- Selecting the right data orchestration tool matters: the many options for building distributed data pipelines differ significantly in their capabilities and trade-offs.
- Apache Airflow is one of the most popular and widely adopted OSS projects for programmatic orchestration of data workflows.
- The main Airflow concepts include the Directed Acyclic Graph (DAG), scheduler, webserver, executor, and metadata database.
- You can provision a distributed Airflow setup in a cloud-agnostic environment, as well as in the cloud on Azure, AWS, and GCP.
Data pipelines, movement, workflows, transformation, and ETL have always been important for a broad range of businesses. Data pipeline tools can be grouped into several categories:
- Simple utilities, such as cron, excellent for trivial data pipelines.
- Platform-specific tools or cloud-based managed services, such as AWS Glue, AWS Data Pipeline, or Azure Data Factory, providing tight integration with specific products and moving data between them.
- Open-source projects, such as Apache Airflow, Apache Oozie, Azkaban, Dagster, and Prefect, offering flexibility, extensibility, and rich programmable control flow.
What is covered in this article
In this article, I focus on Apache Airflow, one of the most flexible, extensible, and most popular community projects for reliable and scalable data and AI pipelines. I will cover:
- Key Apache Airflow concepts and why to use it in a distributed setting
- General architecture of an Apache Airflow environment in the cloud
- A detailed guide to provisioning Apache Airflow on Kubernetes on Azure
Apache Airflow: Key Concepts
Airflow is based on the following concepts and components:
- DAG (Directed Acyclic Graph) - a set of tasks and the dependencies between them, defined using Python code.
- Scheduler - discovers the DAGs that are ready to be triggered according to their associated schedule. When the scheduler discovers task failures, it can optionally retry the failed task a configured number of times.
- Webserver - a handy UI for viewing, scheduling, or triggering DAGs, offering useful information on task success or failure status, progress, duration, retries, logs, and more.
- Executor - the component that runs a task, assigns it to a specific node, and updates other components on its progress.
- Metadata database - where Airflow stores metadata, configuration, and information on task progress.
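To make the DAG concept concrete, here is a minimal, illustrative sketch in plain Python (not Airflow's own scheduler code; the task names are hypothetical) of how dependencies between tasks constrain the order in which an executor may run them:

```python
# Illustrative sketch only (plain Python, not Airflow): how a DAG's
# "depends on" edges determine a valid execution order for its tasks.
def topological_order(dag):
    """Return tasks so that every task appears after its upstream tasks."""
    order, seen = [], set()

    def visit(task):
        if task in seen:
            return
        seen.add(task)
        # ensure all upstream dependencies are placed first
        for upstream in dag.get(task, []):
            visit(upstream)
        order.append(task)

    for task in dag:
        visit(task)
    return order

# extract must run before transform, which must run before load
dag = {"extract": [], "transform": ["extract"], "load": ["transform"]}
print(topological_order(dag))  # → ['extract', 'transform', 'load']
```

Airflow expresses the same idea declaratively: each task is a node, and statements like `t1 >> t2` declare the edges between them.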
Scalable data workflows with Airflow on Kubernetes
Airflow can run on a single machine or in distributed mode, to keep up with the scale of your data pipelines.
Airflow can be distributed with Celery, a component that uses task queues to distribute work across machines. Celery requires a number of worker nodes to be running up front so that tasks can be scheduled across them. It is deployed as an extra component in your system and requires a message transport to send and receive messages, such as Redis or RabbitMQ.
Airflow also supports Kubernetes as a distributed executor. It doesn't require any additional components, such as Redis. The Kubernetes executor doesn't need to keep a fixed number of workers alive, because it creates a new pod for every task.
General architecture of an Apache Airflow environment in the cloud

Figure 1. Cloud-agnostic architecture.

Figure 2. Azure architecture.

Figure 3. GCP and AWS architectures.
Provisioning Apache Airflow on Kubernetes in the cloud
To get started, you will need access to a cloud subscription, such as Azure, AWS, or Google Cloud. The example in this article is based on Azure; however, you should be able to successfully follow the same steps on AWS or GCP with minor changes.
To follow the example on Azure, feel free to create an account, and install the Azure CLI or use the Azure Portal to create the necessary resources. If using the Azure CLI, don't forget to log in and initialize your session with your subscription ID:
az login
az account set --subscription <subscription-id>

To make sure you can easily delete all the resources at once after giving it a try, create an Azure Resource Group that will serve as a grouping unit:
RESOURCE_GROUP_NAME="airflow"
REGION="East US"
az group create --name $RESOURCE_GROUP_NAME --location "$REGION"

After you are done with the resources, feel free to delete the entire resource group:
az group delete --name $RESOURCE_GROUP_NAME

PostgreSQL
Apache Airflow requires a database to store metadata about the status of tasks. Airflow works with the metadata database through the SQLAlchemy abstraction layer. SQLAlchemy is a Python SQL toolkit and Object Relational Mapper. Any database supported by SQLAlchemy should work with Airflow; MySQL and PostgreSQL are among the most common choices.
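As an illustration, the connection URL that Airflow's sql_alchemy_conn setting expects can be composed in Python. The helper below is a sketch with placeholder credentials (not a real server), using quote_plus to escape characters like @ that Azure usernames contain:

```python
# A sketch (placeholder credentials, not a real server) of composing the
# SQLAlchemy connection URL that Airflow's sql_alchemy_conn setting expects.
from urllib.parse import quote_plus

def metadata_db_url(user, password, host, db, port=5432):
    # quote_plus escapes characters such as '@' that would otherwise break
    # URL parsing; Azure usernames contain '@', which becomes %40
    return (f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}:{port}/{db}")

print(metadata_db_url("airflowuser@airflowhost", "p@ss",
                      "airflowhost.postgres.database.azure.com", "airflow"))
# → postgresql+psycopg2://airflowuser%40airflowhost:p%40ss@airflowhost.postgres.database.azure.com:5432/airflow
```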
To create and successfully connect to an instance of PostgreSQL on Azure, please follow the detailed instructions here.
Feel free to use the same resource group name and location for the PostgreSQL instance. Make sure to specify a unique server name. I chose GP_Gen5_2 as the instance size, as 2 vCores and 100 GB of storage are more than enough for this example, but feel free to pick the size that fits your own requirements.
It is important to remember your PostgreSQL server name, fully qualified domain name, username, and password. This information will be required during the next steps. Note: you can get the fully qualified domain name by looking at the fullyQualifiedDomainName field after executing the command:
POSTGRESQL_SERVER_NAME="<your-server-name>"
az postgres server show --resource-group $RESOURCE_GROUP_NAME --name $POSTGRESQL_SERVER_NAME
# example value of fullyQualifiedDomainName on Azure: airflowazuredb.postgres.database.azure.com

Check the connection to your database using psql:
P_USER_NAME="your-postgres-username"
psql --host=$POSTGRESQL_SERVER_NAME.postgres.database.azure.com --port=5432 --username=$P_USER_NAME@$POSTGRESQL_SERVER_NAME --dbname=postgres

For AWS or GCP, feel free to use Amazon RDS or Google Cloud SQL for PostgreSQL, respectively.
File Share
When running data management workflows, we need to store Apache Airflow Directed Acyclic Graph (DAG) definitions somewhere. When running Apache Airflow locally, we can store them in a local filesystem directory and point to it through the configuration file. When running Airflow in a Docker container (either locally or in the cloud), we have several options.
Storing data pipeline DAGs directly within the container image. The downside of this approach shows when DAGs change frequently: the image has to be rebuilt each time they do.
Storing DAG definitions in a remote Git repository. When DAG definitions change, a Git-Sync sidecar can automatically synchronize the repository with a volume in your container.
Storing DAGs in a shared remote location, such as a remote filesystem. As with a remote Git repository, we can mount a remote filesystem to a volume in our container and mirror DAG changes automatically. Kubernetes supports a variety of CSI drivers for many remote filesystems, including Azure Files, AWS Elastic File System, and Google Cloud Filestore. This approach is also great if you want to store logs in a remote location.
I am using Azure Files for storing DAG definitions. To create an Azure fileshare, execute the following commands:
STORAGE_ACCOUNT="storage-account-name"
FILESHARE_NAME="fileshare-name"
az storage account create \
  --resource-group $RESOURCE_GROUP_NAME \
  --name $STORAGE_ACCOUNT \
  --kind StorageV2 \
  --sku Standard_ZRS \
  --output none
STORAGE_ACCOUNT_KEY=$(az storage account keys list \
  --resource-group $RESOURCE_GROUP_NAME \
  --account-name $STORAGE_ACCOUNT \
  --query "[0].value" | tr -d '"')
az storage share create \
  --account-name $STORAGE_ACCOUNT \
  --account-key $STORAGE_ACCOUNT_KEY \
  --name $FILESHARE_NAME \
  --quota 1024 \
  --output none

For Azure Files, detailed creation instructions are located here. Feel free to create a fileshare on any of the other platforms to follow along.
After the fileshare is created, you can copy one or several DAGs you might have to your newly created storage. If you don't have any DAGs yet, you can use one of those available online, for example, the DAG called airflow_tutorial_v01 from one of the GitHub repositories, which you can also find here.
To copy files to an Azure Files share, you can use the Azure Portal, or the AzCopy utility for programmatic operations.
Kubernetes cluster
For the Apache Airflow scheduler, UI, and executor workers, we need to create a cluster. For Azure Kubernetes Service, detailed cluster creation instructions are here. Make sure to indicate that you'd like the cluster to be provisioned in a Virtual Network (the default installation doesn't include one). I created the cluster through the Azure Portal with 3 nodes of size Standard DS2 v2 in East US, with RBAC enabled, and the following network configuration:

Figure 4. Kubernetes Cluster Network Configuration.
Azure Portal has an amazing feature to help you connect to your cluster. On the AKS cluster resource, click Connect at the top, and follow the instructions.
Allow cluster to access database
To make sure the AKS cluster can communicate with the PostgreSQL on Azure database, we need to add a service endpoint on the AKS Virtual Network side and a Virtual Network rule on the PostgreSQL side.
Go to the AKS virtual network resource in the Azure Portal; it's located in the same resource group as the cluster. Under the Service Endpoints settings menu, select Add, and choose Microsoft.SQL from the dropdown:

Figure 5. Add a Service Endpoint to the Virtual Network.
Go to the PostgreSQL on Azure resource, and under the Connection Security settings menu, in the VNET rules section, select Add existing virtual network. Specify a name for the Virtual Network rule, and select your subscription and the AKS Virtual Network. These actions make sure Apache Airflow pods on AKS can communicate with the database successfully.

Figure 6. Add a Virtual Network Rule.
Prepare the fileshare to be used within Kubernetes
Install a CSI driver corresponding to your platform. For Azure, follow the instructions to install the Azure Files CSI driver.
Create a secret to store the Azure storage account name and key (make sure STORAGE_ACCOUNT and STORAGE_ACCOUNT_KEY contain your own values for the storage account name and key). This secret will be used later in the Persistent Volume definitions for DAGs and logs.
kubectl create secret generic azure-secret --from-literal accountname=$STORAGE_ACCOUNT --from-literal accountkey=$STORAGE_ACCOUNT_KEY --type=Opaque

Prepare resource files to run Airflow components on Kubernetes

Figure 7. List of Resource Definition Files.
You can clone the GitHub repository to get these files. Before you apply them to your own cluster, make sure to review them and read through the notes in this article, as there are quite a few values that might need to be customized.
Note: I initially got the files from the official Airflow GitHub repository here. I ran the helm template command to generate the YAML files and deleted those that weren't relevant for this use case. My GitHub repository now contains all the necessary files adjusted for this scenario, so you can skip the helm template step if you'd like to use and modify the files in my repository.
Namespace
Having a namespace is a good way to identify resources that belong to the same logical group. In this case, we can create a namespace to assign to all resources relevant for our Apache Airflow setup. This is especially convenient for distinguishing groups of resources when there are multiple components living within the same cluster.
Definition of the namespace for all Airflow resources is in the airflow-namespace.yaml file on GitHub.
Create the namespace:
kubectl create -f airflow-namespace.yaml

Make sure to remember to use the right namespace name when creating other resources for Airflow. We will be using airflow-on-k8s in this guide.
Persistent Volumes for logs and DAGs
As one of the ways to work with storage within the cluster, there is a resource called a Persistent Volume. Its goal is to abstract away the details of the underlying storage provider, be it a shared network filesystem, an external volume, or another cloud-specific storage type. A Persistent Volume resource is configured with connection information or settings that define how the cluster will connect to and work with the storage. A Persistent Volume's lifecycle is independent and doesn't have to be tied to the lifecycle of a pod that uses it.
In this case, we need an abstraction to represent storage for DAG definitions and for log files. Definition of a Persistent Volume resource for DAGs is in the pv-azurefile-csi.yaml file on GitHub. Definition of a Persistent Volume resource for logs is in the pv-azurefile-csi-logs.yaml file on GitHub.
Open the pv-azurefile-csi.yaml and pv-azurefile-csi-logs.yaml files and edit them to include your own fileshare name and storage account name. If you followed the steps above, set the shareName parameter to the value of the $FILESHARE_NAME variable, and the server parameter to $STORAGE_ACCOUNT.file.core.windows.net.
If you are not using Azure, make sure to change the CSI driver settings to correspond to AWS, GCP, or any other platform. Don’t forget to create any necessary secrets to store sensitive data corresponding to the fileshare system you are using.
Create persistent volumes:
kubectl create -f pv/pv-azurefile-csi.yaml
kubectl create -f pv/pv-azurefile-csi-logs.yaml

Persistent Volume Claims for logs and DAGs
In addition to Persistent Volumes as a storage abstraction, we also have an abstraction that represents a request for storage, called a Persistent Volume Claim. Its goal is to represent a specific pod's request for storage, with specific details such as the exact amount of storage and the access mode.
We need to create Persistent Volume Claims for DAGs and logs, to make sure pods that interact with storage can use these claims to request access to the storage (Azure Files in this case).
Definition of a Persistent Volume Claim resource for DAGs is in the pvc-azurefile-csi-static.yaml file on GitHub. Definition of a Persistent Volume Claim resource for logs is in the pvc-azurefile-csi-static-logs.yaml file on GitHub.
Create persistent volume claims:
kubectl create -f pv/pvc-azurefile-csi-static.yaml
kubectl create -f pv/pvc-azurefile-csi-static-logs.yaml
After a few minutes, make sure your Persistent Volume Claims are in Bound state:
$ kubectl get pvc -n airflow-on-k8s
NAME       STATUS   VOLUME    CAPACITY   ACCESS MODES
dags-pvc   Bound    dags-pv   10Gi       RWX
logs-pvc   Bound    logs-pv   10Gi       RWX

Service Accounts for scheduler and workers
To allow certain processes to perform certain tasks, there is the notion of Service Accounts. For example, we'd like to make sure the Airflow scheduler can programmatically create, view, and manage pod resources for workers. To make this possible, we need to create a Service Account for each component we want to grant certain privileges to. Later, we can associate the Service Accounts with Cluster Roles by using Cluster Role Bindings.
Definition of the Scheduler Service Account is in the scheduler-serviceaccount.yaml file on GitHub. Definition of the Worker Service Account is in the worker-serviceaccount.yaml file on GitHub.
Create the service accounts:
kubectl create -f scheduler/scheduler-serviceaccount.yaml
kubectl create -f workers/worker-serviceaccount.yaml

Cluster Role for scheduler and workers to dynamically operate pods
A Cluster Role represents a set of rules or permissions.
Definition of the Cluster Role we want to create is in the pod-launcher-role.yaml file on GitHub.
Create the cluster role:
kubectl create -f rbac/pod-launcher-role.yaml

Cluster Role Binding for scheduler and workers
A Cluster Role Binding is a connection between a Cluster Role and the accounts that need it.
Definition of the Cluster Role Binding is in the pod-launcher-rolebinding.yaml file on GitHub.
Create the cluster role binding:
kubectl create -f rbac/pod-launcher-rolebinding.yaml

Secrets
Secrets are a mechanism for managing sensitive data, such as tokens, passwords, or keys, that other resources in the cluster may require.
Note: if you configure the secret through a manifest (JSON or YAML) file which has the secret data encoded as base64, sharing this file or checking it into a source repository means the secret is compromised. Base64 encoding is not an encryption method and is considered the same as plain text.
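The warning above can be demonstrated in a few lines of Python: recovering the original value from its base64 form needs no key at all. (The password below is a made-up placeholder.)

```python
# Base64 is a reversible encoding, not encryption: anyone who can read the
# manifest can recover the secret with a single function call.
import base64

encoded = base64.b64encode(b"supersecretpassword").decode()
print(encoded)                             # → c3VwZXJzZWNyZXRwYXNzd29yZA==
print(base64.b64decode(encoded).decode())  # → supersecretpassword
```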
Apache Airflow needs to know the fernet key and how to connect to the metadata database. We will use secrets to represent this information.
Fernet key
Apache Airflow requires a fernet key so it can secure connection information that isn't protected by default.
Start Python shell:
$ python
Python 3.7.5 (default, Nov 1 2019, 02:16:32)
[Clang 11.0.0 (clang-1100.0.33.8)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

Generate a fernet key from the Python shell:
from cryptography.fernet import Fernet
fernet_key = Fernet.generate_key()
print('Generated fernet key: ', fernet_key.decode())

Example output of this command is:
Generated fernet key: ESqYUmi27Udn6wxY83KoM9kuvt9rDcelghHbAgGZW9g=
Convert the value to the base64 encoded value:
$ echo -n "<your-generated-fernet-key-value>" | base64

Example output (this will be the value we use for fernet-key within the fernet-secret.yaml file in this example):
Slk0OThJbHE4R0xRNEFuRlJWT3FUR3lBeDg3bG5BWWhEdWx1ekhHX2RJQT0=

Definition of the fernet key secret is in the fernet-secret.yaml file on GitHub.
Replace the value of the fernet-key parameter in the file with your generated fernet-key value.
Create the fernet secret:
kubectl create -f secrets/fernet-secret.yaml
Database connection information
Prepare your PostgreSQL database connection string. Generally, it follows the format:
postgresql+psycopg2://user:password@hostname:5432/dbname

Note: when using Azure Database for PostgreSQL, the connection string requires the user to be in the format user@host, where the @ sign should be escaped as %40 (more details):
postgresql+psycopg2://user%40host:password@host.postgres.database.azure.com:5432/dbname

Encode your connection string using base64 (replace airflowuser with your username, airflowpassword with your password, airflowhost with your host, and airflow with your database name):
# For general PostgreSQL
echo -n "postgresql+psycopg2://airflowuser:airflowpassword@airflowhost.postgres.database.azure.com:5432/airflow" | base64
# For Azure PostgreSQL
echo -n "postgresql+psycopg2://airflowuser%40airflowhost:airflowpassword@airflowhost.postgres.database.azure.com:5432/airflow" | base64

Example output (this will be the value we use for connection within the metadata-connection-secret.yaml file in this example):
cG9zdGdyZXNxbCtwc3ljb3BnMjovL2xlbmElNDBhaXJmbG93YXp1cmVkYjpQYXNzd29yZDEhQGFpcmZsb3dhenVyZWRiLnBvc3RncmVzLmRhdGFiYXNlLmF6dXJlLmNvbTo1NDMyL3Bvc3RncmVz

Definition of the connection secret is in the metadata-connection-secret.yaml file on GitHub.
Replace the value of the connection parameter in the file with your base64-encoded connection value.
Create the Airflow connection metadata secret:
kubectl create -f secrets/metadata-connection-secret.yaml
Config Map
A Config Map is a resource that allows storing non-sensitive data in a key-value format. This is convenient for any type of configuration setting. Since Apache Airflow generally requires a configuration file called airflow.cfg, we can use a Config Map to populate it with important parameters. For example:
* dags_folder points to the folder within a pod where DAGs mounted from remote storage can be found.
* dags_in_image can be set to True or False. If False, Airflow will look at mounted volumes or a Git repository to find DAGs.
* dags_volume_claim is the name of the Persistent Volume Claim for DAGs.
* Settings that start with git_ are relevant if you're planning to use a Git repository to sync DAGs.
* worker_service_account_name can be used to set the name of the worker service account, and so on.
Definition of the Config Map is in the configmap.yaml file on GitHub.
Note: there are quite a few key-value pairs that can be adjusted in the Config Map, so if you're doing a lot of experimentation, feel free to tweak some of them. However, be sure to modify the corresponding resource files as well, as many of the values in the Config Map affect other resources.
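To visualize how such parameters end up in airflow.cfg, here is a short sketch using Python's standard configparser. The section layout reflects how Airflow 1.10 groups these keys, but the concrete values (claim and service account names) are assumptions for illustration:

```python
# Sketch of the key-value shape that Config Map entries take once rendered
# into airflow.cfg; the values below are illustrative, not a complete file.
import configparser

cfg_text = """
[core]
dags_folder = /opt/airflow/dags

[kubernetes]
dags_in_image = False
dags_volume_claim = dags-pvc
worker_service_account_name = worker-serviceaccount
"""

cfg = configparser.ConfigParser()
cfg.read_string(cfg_text)
print(cfg["kubernetes"]["dags_volume_claim"])         # → dags-pvc
print(cfg.getboolean("kubernetes", "dags_in_image"))  # → False
```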
Create config map:
kubectl create -f configmap.yaml
StatsD
StatsD is a process that listens for various statistics, such as counters and timers. Apache Airflow has built-in support for StatsD and uses its Python client to expose metrics. StatsD receives information from Airflow about the number of job successes or failures, the number of jobs waiting in line for execution, and more. If you're interested in specific types of metrics, take a look at this page.
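For intuition about what "exposing metrics" means here: the StatsD protocol is just short text datagrams sent over UDP. The sketch below (metric names are illustrative, not an exhaustive list of Airflow's metrics) formats metrics the way a StatsD client would:

```python
# Minimal sketch of the text-based StatsD wire protocol: each metric is a
# '<name>:<value>|<type>' line sent over UDP ('c' = counter, 'ms' = timer,
# 'g' = gauge). The metric names below are illustrative.
def statsd_metric(name, value, metric_type="c", prefix="airflow"):
    return f"{prefix}.{name}:{value}|{metric_type}"

print(statsd_metric("ti_successes", 1))               # → airflow.ti_successes:1|c
print(statsd_metric("executor.queued_tasks", 3, "g")) # → airflow.executor.queued_tasks:3|g
```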
Definition of the StatsD deployment is in the statsd-deployment.yaml file on GitHub. Definition of the StatsD service is in the statsd-service.yaml file on GitHub.
Create StatsD resources:
kubectl create -f statsd/statsd-deployment.yaml
kubectl create -f statsd/statsd-service.yaml

Check the status of the StatsD instance and get its TCP port:
kubectl get services -n airflow-on-k8s
NAME     TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
statsd   ClusterIP   10.0.9.199   <none>        9125/UDP,9102/TCP   24h

Map the StatsD instance to a local port (for example 7000):
kubectl port-forward service/statsd 7000:9102 -n airflow-on-k8s
Output:
Forwarding from 127.0.0.1:7000 -> 9102
Forwarding from [::1]:7000 -> 9102
Open the 127.0.0.1:7000 page in your browser to see the StatsD main page or metrics page:

Figure 8. StatsD metrics page
Scheduler
The scheduler is one of the main components behind Apache Airflow.
From the documentation:
> The Airflow scheduler monitors all tasks and all DAGs and triggers the task instances whose dependencies have been met. Behind the scenes, it spins up a subprocess, which monitors and stays in sync with a folder for all DAG objects it may contain, and periodically (every minute or so) collects DAG parsing results and inspects active tasks to see whether they can be triggered.
Definition of the Scheduler deployment is in the scheduler-deployment.yaml file on GitHub.
Create scheduler deployment:
kubectl create -f scheduler/scheduler-deployment.yaml
Webserver
The webserver and UI component of Apache Airflow lets us kickstart, schedule, monitor, and troubleshoot our data pipelines, along with many other convenient functions.
Definition of the Webserver deployment is in the webserver-deployment.yaml file on GitHub. Definition of the Webserver service is in the webserver-service.yaml file on GitHub.
If you'd like the Webserver to have an external IP, replace ClusterIP with LoadBalancer in webserver-service.yaml, and you will be able to access it from outside the cluster without proxies or port forwarding.
Create webserver deployment and service:
kubectl create -f webserver/webserver-deployment.yaml
kubectl create -f webserver/webserver-service.yaml
Check the status of the Airflow UI instance and get its TCP port:
kubectl get services -n airflow-on-k8s
NAME        TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
statsd      ClusterIP   10.0.9.199   <none>        9125/UDP,9102/TCP   24h
webserver   ClusterIP   10.0.9.175   <none>        8080:31003/TCP      19h

Map the Airflow UI instance to a local port (for example 8080):
kubectl port-forward service/webserver 8080:8080 -n airflow-on-k8s
Output:
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
Open the 127.0.0.1:8080 page in your browser to see the Airflow UI page.
If you’d like to create a new user for Airflow Webserver, you can connect to the webserver pod:
$ kubectl exec --stdin --tty webserver-647fdcb7c-4qkx9 -n airflow-on-k8s -- /bin/bash
airflow@webserver-647fdcb7c-4qkx9:/opt/airflow$

And create a user from the Airflow CLI. Replace USERNAME, PASSWORD, FIRSTNAME, LASTNAME, EMAIL, and ROLE with your own values. Note: existing Airflow roles can be one of the following: Admin, User, Op, Viewer, and Public:
airflow create_user -u USERNAME -p PASSWORD -f FIRSTNAME -l LASTNAME -e EMAIL -r ROLE

Example output:
[2020-08-08 00:00:40,140] {__init__.py:51} INFO - Using executor KubernetesExecutor
[2020-08-08 00:00:40,143] {dagbag.py:396} INFO - Filling up the DagBag from /opt/airflow/dags
[2020-08-08 00:00:41,834] {security.py:475} INFO - Start syncing user roles.
[2020-08-08 00:00:42,458] {security.py:385} INFO - Fetching a set of all permission, view_menu from FAB meta-table
[2020-08-08 00:00:42,833] {security.py:328} INFO - Cleaning faulty perms
Viewer user newuser created.

Afterward, you can log in to the Airflow UI with the credentials of any of the users you have provisioned. You should see a page displaying Airflow DAGs:

Figure 9. Airflow UI showing DAGs
You can further explore the Graph View, Tree View, logs, and other details of any particular DAG by clicking on it.

Figure 10. Airflow UI - Logs
Checking health of resources provisioned
Check if scheduler, webserver, and statsd deployments are in a healthy state:
$ kubectl get deployments -n airflow-on-k8s
NAME        READY   UP-TO-DATE   AVAILABLE   AGE
scheduler   1/1     1            1           123m
statsd      1/1     1            1           24h
webserver   1/1     1            1           122m

Check if all corresponding pods are healthy:
$ kubectl get po -n airflow-on-k8s
NAME                         READY   STATUS    RESTARTS   AGE
scheduler-7584f4b4b7-5zhvw   2/2     Running   0          125m
statsd-d6d5bcd7c-dg26n       1/1     Running   0          24h
webserver-647fdcb7c-4qkx9    1/1     Running   0          124m

Check the status/events of any particular pod (the scheduler-7584f4b4b7-5zhvw pod in this example):
$ kubectl describe pod scheduler-7584f4b4b7-5zhvw -n airflow-on-k8s
Check pod logs (where the -c parameter refers to the name of the container we want to check on, scheduler in this case):
kubectl logs scheduler-7584f4b4b7-5zhvw -n airflow-on-k8s -c scheduler
Check logs from an init container of a pod (where the -c parameter refers to the name of the init container we want to check on, run-airflow-migrations in this case):
$ kubectl logs scheduler-7584f4b4b7-5zhvw -n airflow-on-k8s -c run-airflow-migrations
Connect to a pod to execute commands from within (webserver-647fdcb7c-4qkx9 pod in this example):
kubectl exec --stdin --tty webserver-647fdcb7c-4qkx9 -n airflow-on-k8s -- /bin/bash
After getting connected, we can execute commands (for example check the dags directory):
airflow@webserver-647fdcb7c-4qkx9:/opt/airflow$ cd dags
airflow@webserver-647fdcb7c-4qkx9:/opt/airflow/dags$ ls
__pycache__  dag_processor_manager  dags  first-dag.py  hello_dag.py  hello_world.py  outfile  scheduler

Overview of resources created
We can see which resources are running in the cluster by running the following command:
kubectl get all -n airflow-on-k8s
NAME                             READY   STATUS    RESTARTS   AGE
pod/scheduler-7584f4b4b7-jdfzl   2/2     Running   0          13m
pod/statsd-d6d5bcd7c-mjdm7       1/1     Running   0          17m
pod/webserver-647fdcb7c-ft72t    1/1     Running   0          8m26s

NAME                TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)             AGE
service/statsd      ClusterIP   10.0.47.229   <none>        9125/UDP,9102/TCP   16m
service/webserver   ClusterIP   10.0.197.80   <none>        8080/TCP            6s

NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/scheduler   1/1     1            1           13m
deployment.apps/statsd      1/1     1            1           17m
deployment.apps/webserver   1/1     1            1           8m26s

NAME                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/scheduler-7584f4b4b7   1         1         1       13m
replicaset.apps/statsd-d6d5bcd7c       1         1         1       17m
replicaset.apps/webserver-647fdcb7c    1         1         1       8m26s

Note: this doesn't show the secrets, persistent volumes, persistent volume claims, service accounts, cluster roles, or cluster role bindings that we also created.
In Azure Portal, you can see all the resources within the main resource group:

Figure 11. Cloud Resources
To clean up your environment, just run:
az group delete --name $RESOURCE_GROUP_NAME

Next Steps
As a next step, experiment and take a look at some of the DAG definitions and integrations available!
About the Author
Lena Hall is a Director of Engineering at Microsoft working on Azure, where she focuses on large-scale distributed systems and modern architectures. She is leading a team and technical strategy for product improvement efforts across Big Data services at Microsoft. Lena is the driver behind engineering initiatives and strategies to advance, facilitate, and push forward further acceleration of cloud services. Lena has 10 years of experience in solution architecture and software engineering with a focus on distributed cloud programming, real-time system design, highly scalable and performant systems, big data analysis, data science, functional programming, and machine learning. Previously, she was a Senior Software Engineer at Microsoft Research. She co-organizes a conference called ML4ALL, and is often an invited member of program committees for conferences like Kafka Summit, Lambda World, and others. Lena holds a master's degree in computer science. Twitter: @lenadroid. LinkedIn: Lena Hall.