Josh Holbrook

How To Run Airflow on Windows (with Docker)

A problem I've noticed a lot of aspiring data engineers running into recently is trying to run Airflow on Windows. This is harder than it sounds.

For many (most?) Python codebases, running on Windows is reasonable enough. For data, Anaconda even makes it easy - create an environment, install your library and go. Unfortunately, Airbnb handed us a pathologically non-portable codebase. I was flabbergasted to find that casually trying to run Airflow on Windows resulted in a bad shim script, a really chintzy pathing bug, a symlinking issue* and an attempt to use the Unix-only passwords database.

Hilarity!

So running Airflow natively on Windows is dead in the water, unless you want to spend a bunch of months rewriting a bunch of the logic and arguing with the maintainers**. Luckily, there are two fairly sensible alternate approaches to consider which will let you run Airflow on a Windows machine: WSL and Docker.

WSL

WSL stands for the "Windows Subsystem for Linux", and it's actually really cool. Basically, you install WSL, install a Linux distribution such as Ubuntu, and end up with something like this:

[Screenshot: Ubuntu running on Windows]

I have WSL 2 installed, which is faster and better in many ways, but which (until recently? unclear) needs an Insider build of Windows.

Given that this is a fully operational Ubuntu environment, any tutorial that you follow for Ubuntu should also work in this environment.
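If you're on a recent enough Windows build, getting that environment can be a one-liner from an elevated PowerShell prompt. This is a minimal sketch, and an assumption - on older builds you instead enable the WSL feature and install Ubuntu from the Microsoft Store:

# Installs WSL and the Ubuntu distribution in one go (newer Windows builds only):
wsl --install -d Ubuntu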

Docker

The alternative, and the one I'm going to demo in this post, is to use Docker.

Docker is a tool for managing Linux containers, which are a little like virtual machines without the virtualization: they act like self-contained machines but are much more lightweight than a full VM. Surprisingly, it works on Windows - casually, even.

Brief sidebar: Docker isn't a silver bullet, and honestly it's kind of a pain in the butt. I personally find it tough to debug and its aggressive caching makes both cache busting and resource clearing difficult. Even so, the alternatives - such as Vagrant - are generally worse. Docker is also a pseudo-standard, and Kubernetes - the heinously confusing thing your DevOps team makes you deploy to - works with Docker images, so it's overall a useful tool to reach for, especially for problems like this one.
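If you've installed Docker Desktop for Windows, it's worth sanity-checking that the daemon is actually reachable from PowerShell before going any further - a couple of harmless commands will tell you:

# Prints client and server versions; if the server half errors out, Docker Desktop isn't running:
docker version

# Pulls and runs a tiny test image, then exits:
docker run --rm hello-world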

Setting up Docker Compose

Docker containers can be run in two ways: either in a bespoke capacity via the command line, or using a tool called Docker Compose, which takes a YAML file specifying which containers to run and how, and then does what's needed. For a single container the command line is often the thing you want - and we use it later on - but for a collection of services that need to talk to each other, Docker Compose is what we need.
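For a taste of the bespoke, command-line flavor (which we'll lean on later for the wrapper scripts), here's a minimal sketch of a one-off container run - the image and command here are arbitrary examples, not anything this setup depends on:

# Run a throwaway container, print something, and clean it up afterwards (--rm):
docker run --rm python:3.8 python -c "print('hello from a container')"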

So to get started, create a directory somewhere - mine's in ~\software\jfhbrook\airflow-docker-windows but yours can be anywhere - and create a docker-compose.yml file that looks like this:

version: '3.8'

services:
  metadb:
    image: postgres
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    networks:
      - airflow
    restart: unless-stopped
    volumes:
      - ./data:/var/lib/postgresql/data

  scheduler:
    image: apache/airflow
    command: scheduler
    depends_on:
      - metadb
    networks:
      - airflow
    restart: unless-stopped
    volumes:
      - ./airflow:/opt/airflow

  webserver:
    image: apache/airflow
    command: webserver
    depends_on:
      - metadb
    networks:
      - airflow
    ports:
      - 8080:8080
    restart: unless-stopped
    volumes:
      - ./airflow:/opt/airflow

networks:
  airflow:

There's a lot going on here. I'll try to go over the highlights, but I recommend referring to the file format reference docs.

First of all, we create three services: a metadb, a scheduler and a webserver. Architecturally, Airflow stores its state in a database (the metadb), the scheduler process connects to that database to figure out what to run when, and the webserver process puts a web UI in front of the whole thing. Individual jobs can connect to other databases, such as Redshift, to do actual ETL.

Docker containers are created based on Docker images, which hold the starting state for a container. We use two images here: apache/airflow, the official Airflow image, and postgres, the official PostgreSQL image.

Airflow also reads configuration, DAG files and so on out of a directory specified by an environment variable called AIRFLOW_HOME. The default if installed on your MacBook is ~/airflow, but in the Docker image it's set to /opt/airflow.

We use Docker's volumes functionality to mount the directory ./airflow under /opt/airflow. We'll revisit the contents of this directory before trying to start the cluster.

The metadb implementation is pluggable and supports most SQL databases via SQLAlchemy. Airflow uses SQLite by default, but in practice most people either use MySQL or PostgreSQL. I'm partial to the latter, so I chose to set it up here.

On the PostgreSQL side: you need to configure it to have a user and database that Airflow can connect to. The Docker image supports this via environment variables. There are many variables that are supported, but the ones I used are POSTGRES_USER, POSTGRES_PASSWORD and POSTGRES_DB. By setting all of these to airflow, I ensured that there was a superuser named airflow, with a password of airflow and a default database of airflow.

Note that you'll definitely want to think about this harder before you go to production. Database security is out of scope of this post, but you'll probably want to create a regular user for Airflow, set up secrets management with your deploy system, and possibly change the authentication backend. Your DevOps team, if you have one, can probably help you here.

PostgreSQL stores all of its data in a volume as well. The location in the container is at /var/lib/postgresql/data, and I put it in ./data on my machine.

Docker has containers connect over virtual networks. Practically speaking, this means that you have to make sure that any containers that need to talk to each other are all connected to the same network (named "airflow" in this example), and that any containers that you need to talk to from outside have their ports explicitly exposed. You'll definitely want to expose port 8080 of the webserver to your host so that you can visit the UI in your browser. You may want to expose PostgreSQL as well, though I haven't done that here.

Finally, by default Docker Compose won't bother to restart a container if it crashes. This may be desired behavior, but in my case I wanted them to restart unless I told them to stop, and so set it to unless-stopped.
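Before moving on, it doesn't hurt to let Compose parse the file and echo back the configuration it resolved - if you've fat-fingered the YAML, this is where you'll find out:

# Validates docker-compose.yml and prints the resolved configuration:
docker-compose config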

Setting Up Your Filesystem

As mentioned, a number of directories need to exist and be populated in order for Airflow to do something useful.

First, let's create the data directory, so that PostgreSQL has somewhere to put its data:

mkdir ./data

Next, let's create the airflow directory, which will contain the files inside Airflow's AIRFLOW_HOME:

mkdir ./airflow

When Airflow starts it looks for a file called airflow.cfg inside of the AIRFLOW_HOME directory, which is ini-formatted and which is used to configure Airflow. This file supports a number of options, but the only one we need for now is core.sql_alchemy_conn. This field contains a SQLAlchemy connection string for connecting to PostgreSQL.

Crack open ./airflow/airflow.cfg in your favorite text editor and make it look like this:

[core]
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@metadb:5432/airflow

Some highlights:

  • The protocol is "postgresql+psycopg2", which tells SQLAlchemy to use the psycopg2 library when making the connection
  • The username is airflow, the password is airflow, the port is 5432 and the database is airflow.
  • The hostname is metadb. This is unintuitive and tripped me up - what's important here is that when Docker Compose sets up all of the networking stuff, it sets the hostnames for the containers to be the same as the names of the services as typed into the docker-compose.yml file. This service was called "metadb", so the hostname is likewise "metadb".

Initializing the Database

Once you have those pieces together, you can let 'er rip:

docker-compose up

However, you'll notice that the Airflow services start crash-looping immediately, complaining that various tables don't exist. (If it complains that the db isn't up, shrug, ctrl-c and try again. Computers amirite?)
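One way to take the shrugging out of that is to ask PostgreSQL directly whether it's accepting connections. The postgres image ships a pg_isready utility, and Compose names the network after the project directory plus "_airflow", so a one-off container on that network can check for you. A minimal sketch - run it from the project directory:

# Build the network name the same way Compose does (project directory name + network name):
$Network = "{0}_airflow" -f @(Split-Path (Get-Location) -Leaf)

# Ask the metadb container whether it's ready to accept connections:
docker run --rm --network $Network postgres pg_isready -h metadb -U airflow -d airflow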

This is because we need to initialize the metadb to have all of the tables that Airflow expects. Airflow ships with a CLI command that will do this - unfortunately, our compose file doesn't handle it.

Keep the Airflow containers crash-looping in the background; we can use the Docker CLI to connect to the PostgreSQL instance running in our compose setup and ninja in a fix.

Create a file called ./Invoke-Airflow.ps1 with the following contents:

# Derive the Compose network name from this directory's name
$Network = "{0}_airflow" -f @(Split-Path $PSScriptRoot -Leaf)

# Run a one-off Airflow container on that network, with our AIRFLOW_HOME mounted,
# passing any arguments through to the airflow CLI
docker run --rm --network $Network --volume "${PSScriptRoot}\airflow:/opt/airflow" apache/airflow @Args

The --rm flag removes the container after it's done running so it doesn't clutter things up. The --network flag tells Docker to connect to the virtual network you created in your docker-compose.yml file, and the --volume flag tells Docker how to mount your AIRFLOW_HOME. Finally, @Args uses a feature of PowerShell called splatting to pass arguments to your script through to Airflow.

Once that's saved, we can run initdb against our Airflow install:

.\Invoke-Airflow.ps1 initdb

You should notice that Airflow is suddenly a lot happier. You should also be able to connect to Airflow by visiting localhost:8080 in your browser:

[Screenshot: the Airflow web UI at localhost:8080]
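If the page doesn't come up, you can also poke the webserver from PowerShell. Recent Airflow 1.10.x releases expose a /health endpoint, so something like this (a sketch - adjust if your image's Airflow version predates the endpoint) should return a small blob of JSON:

# Hit Airflow's health endpoint and print the response body:
Invoke-WebRequest http://localhost:8080/health -UseBasicParsing | Select-Object -ExpandProperty Content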

For bonus points, we can use the postgres container to connect to the database with the psql CLI, using a very similar trick. Put this in Invoke-Psql.ps1:

# Same network-name trick as Invoke-Airflow.ps1
$Network = "{0}_airflow" -f @(Split-Path $PSScriptRoot -Leaf)

# Run an interactive psql session against the metadb container
docker run -it --rm --network $Network postgres psql -h metadb -U airflow --db airflow @Args

and then run .\Invoke-Psql.ps1 in the terminal.

Now you should be able to run \dt at the psql prompt and see all of the tables that airflow initdb created:

psql (12.3 (Debian 12.3-1.pgdg100+1))
Type "help" for help.

airflow=# \dt
List of relations
Schema | Name | Type | Owner
--------+-------------------------------+-------+---------
public | alembic_version | table | airflow
public | chart | table | airflow
public | connection | table | airflow
public | dag | table | airflow
public | dag_code | table | airflow
public | dag_pickle | table | airflow
public | dag_run | table | airflow
public | dag_tag | table | airflow
public | import_error | table | airflow
public | job | table | airflow
public | known_event | table | airflow
public | known_event_type | table | airflow
public | kube_resource_version | table | airflow
public | kube_worker_uuid | table | airflow
public | log | table | airflow
public | rendered_task_instance_fields | table | airflow
public | serialized_dag | table | airflow
public | sla_miss | table | airflow
public | slot_pool | table | airflow
public | task_fail | table | airflow
public | task_instance | table | airflow
public | task_reschedule | table | airflow
public | users | table | airflow
public | variable | table | airflow
public | xcom | table | airflow
(25 rows)





Conclusions

Now we have a working Airflow install that we can mess with. You'll notice that I didn't really go into how to write a DAG - there are other tutorials for that which should now be follow-able - whenever they say to run the airflow CLI tool, run Invoke-Airflow.ps1 instead.
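For example, where a 1.10-era tutorial tells you to run airflow list_dags or airflow test, the wrapped versions look something like this (a sketch - the exact subcommands depend on the Airflow version baked into the apache/airflow image you pulled, and the DAG and task names are placeholders):

# Where a tutorial says `airflow list_dags`:
.\Invoke-Airflow.ps1 list_dags

# Where it says `airflow test <dag> <task> <date>` - "my_dag" and "my_task" are hypothetical names:
.\Invoke-Airflow.ps1 test my_dag my_task 2020-01-01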

Using Docker, Docker Compose and a few wrapper PowerShell scripts, we were able to get Airflow running on Windows, a platform that's otherwise unsupported. In addition, we were able to build tooling to run multiple services in a nice, self-contained way, including a PostgreSQL database. Finally, by using a little PowerShell, we were able to make using these tools easy.

Cheers!

* Symbolic links in Windows are a very long story. Windows traditionally has had no support for them at all - however, recent versions of NTFS technically allow symlinks but require Administrator privileges to create them, and none of the tooling works with them.

** I'm not saying that the Airflow maintainers would be hostile towards Windows support - for one, I don't know them, but also I have to assume they would be stoked. However, I also have to assume that they would have opinions. Big changes require a lot of discussion.

Top comments

Josh Holbrook
Addendum: Running in Production

I had someone ask me today about using this process to run Airflow in production. It should be noted that Docker doesn't work on all Windows installs. In particular, this reportedly won't work with server instances on Azure.

That said, if you're trying to run Airflow in production, you should probably deploy to Linux - or, if using Docker, to a managed Kubernetes product such as AKS on Azure or GKE on Google Cloud. Luckily, the only Windows-specific aspects of the procedure laid out here are the PowerShell snippets, and even PowerShell can run on Linux/MacOS if you install it.

Ovo Okpubuluku

I think Airflow now comes with an authentication requirement too...

Josh Holbrook

I don't have time to run through this tutorial to update the directions, but if someone tells me what changed and what they did I'm happy to post an update (with a /ht!)
