This is meant as a template for getting up and running with Apache Airflow quickly, using docker-compose for local development and docker swarm on Google Cloud for deployment.
It is intended to help you establish a baseline deployment/development environment with sane defaults.
There are many things that could be improved, but this should get you up and running quickly with some good patterns.
- invoke for orchestration and configuration
- traefik as an edge proxy
- grafana as a metrics front-end for your cluster
- a pip installable flow_toolz package for library code
- a recipe for creating new dags (via inv new-dag) that can easily be extended
- docker (brew cask install docker-edge)
- python3 (brew install python)
# create a virtual environment
python3 -m venv venv
# activate virtual environment
. venv/bin/activate
# install the flow_toolz package
pip install 'airflow/[dev]'
# generate self-signed tls cert and other filestubs
inv bootstrap
# bring up the server for local development
docker-compose up
You'll need to create two files at the project root for the purposes of authentication. They can be empty at first, just to get the server running, since docker-compose will expect them to exist.
aws-credentials.ini
default-service-account.json
aws-credentials.ini
[default]
aws_access_key_id = <your access key>
aws_secret_access_key = <your secret key>
default-service-account.json
The default-service-account.json service account key at the project root will be used to authenticate with Google Cloud by default.
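If you don't already have a key file, one way to generate it is with the gcloud CLI (the service account name below is a hypothetical placeholder; use your own):
# create a JSON key for an existing service account
gcloud iam service-accounts keys create default-service-account.json \
    --iam-account my-airflow-sa@myproject.iam.gserviceaccount.com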
In the reverse-proxy folder, you will need a certificate.crt and a key.key file, which you can generate with the inv create-certificate command.
This is really here just to get you started; you'll want to configure traefik to use Let's Encrypt or another mechanism to establish HTTPS on your production deployment.
For other string-based secrets, you'll need a .secrets.env file [./airflow/.secrets.env], e.g.:
AIRFLOW_CONN_POSTGRES_MASTER={{password}}
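For reference, Airflow parses AIRFLOW_CONN_* variables as connection URIs, so a filled-in value would look something like this (host and credentials are placeholders):
AIRFLOW_CONN_POSTGRES_MASTER=postgres://airflow:supersecret@postgres-master:5432/airflow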
In general:
- Authentication strings should go in the secrets file
- Authentication files should be set as a docker secret in the docker-compose file (see the sketch below)
- Secrets SHOULD NOT be checked into version control.
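As a rough illustration of the docker secret pattern (service and secret names here are assumptions, not necessarily what this repo's docker-compose.yaml uses):
# docker-compose.yaml (sketch)
services:
  webserver:
    # ... image, ports, etc.
    secrets:
      - default-service-account
secrets:
  default-service-account:
    file: ./default-service-account.json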
Initialize the development server (once you have the authentication files described earlier)
docker-compose up
Note: it may take some time for the docker images to build at first
The airflow UI will now be served at localhost.
The reverse proxy admin panel will be at localhost:8080.
The Grafana dashboard will be at localhost:3000 (user: admin, pw: admin).
DAGs and libraries in the airflow folder will automatically be mounted onto the services in your local deployment and updated on the running containers in real time.
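Under the hood this is just a bind mount; a sketch of the relevant compose stanza (the container paths are assumptions and may not match this repo exactly):
services:
  scheduler:
    # ... image, environment, etc.
    volumes:
      - ./airflow/dags:/usr/local/airflow/dags
      - ./airflow/flow_toolz:/usr/local/airflow/flow_toolz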
There exists a handy dag template for new dags.
You can use this template to quickly write new dags by using the task runner:
# invoke the new-dag task
# you will be prompted to provide parameters
# such as `dag_id` and `owner`
inv new-dag
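The exact contents of the template live in this repo, but the generated file will be a standard DAG module; a minimal sketch of the shape to expect (imports and defaults here are illustrative, not the template's actual contents):
# dags/my_new_dag.py (illustrative sketch)
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {"owner": "me", "start_date": datetime(2019, 1, 1)}

with DAG(dag_id="my_new_dag", default_args=default_args, schedule_interval="@daily") as dag:
    start = DummyOperator(task_id="start")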
In the airflow folder, there is a flow_toolz directory.
That directory is a Python package, meaning it can be pip installed.
Code that is shared between dags, or that you want to use outside of airflow (for testing/development purposes), should be put there.
. venv/bin/activate
pip install -e './airflow'
# in python, I can now
import flow_toolz
...
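Since the quickstart installs the package as 'airflow/[dev]', the package metadata presumably declares a dev extra; a hypothetical sketch of the kind of setup.py that supports this (package name and dependencies are assumptions):
# airflow/setup.py (hypothetical sketch)
from setuptools import find_packages, setup

setup(
    name="flow_toolz",
    packages=find_packages(),
    extras_require={"dev": ["pytest", "invoke"]},
)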
The infrastructure -- services and how they'll communicate -- is all described in docker-compose.yaml.
Cross-service configuration -- environment variables that will exist across different services/machines -- will be in either a .env file or .secrets.env, the latter for sensitive information that should not exist in version control.
You'll notice some of these environment variables follow the pattern AIRFLOW__{foo}__{bar}.
That tells airflow to configure itself with those variables instead of their analogs in its default config file. More information on how Airflow reads configuration can be found at this link.
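For example, the following variable overrides the executor option in the [core] section of airflow.cfg:
AIRFLOW__CORE__EXECUTOR=CeleryExecutor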
Configuration related to automated cli tasks executed via invoke lives in invoke.yaml files and can be overridden by environment variables as well.
For more information on how invoke configuration works, follow this link.
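invoke also reads environment variables prefixed with INVOKE_, so a value declared in invoke.yaml (for example, the gcp.project setting shown below) can typically be overridden without editing the file:
export INVOKE_GCP_PROJECT=myproject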
inv swarm-up
inv deploy --prod
- you'll want to change the names of the images in the docker-compose file for your own deployment
- invoke tasks that make use of Google Cloud, e.g. inv deploy, will expect a project element in the configuration. I have this set in my /etc/invoke.yaml.
Here's an example:
gcp:
  project: myproject
You'll likely also want to change your default host bind IP in Docker for Mac.