Migrate all analytics pipelines to Airflow #84

Open
momegas opened this issue Jan 3, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@momegas
Member

momegas commented Jan 3, 2023

As Whitebox workflows grow more complex, we need a way to orchestrate them. We can use Airflow's Python API to define and run workflows. In Airflow, a workflow is defined as a directed acyclic graph (DAG): each node represents a task, and each edge represents a dependency between two tasks.
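To make the DAG idea concrete, here is a minimal sketch that uses only the Python standard library (`graphlib`, Python 3.9+). It is not Airflow's API. The task names `extract`, `analyze` and `store` are assumptions for illustration; the sketch only shows how dependencies determine a valid run order.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
# Task names are illustrative, not taken from the Whitebox codebase.
dependencies = {
    "extract": set(),
    "analyze": {"extract"},
    "store": {"analyze"},
}


def execution_order(deps):
    """Return one run order in which every task follows its upstream tasks."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

In an Airflow DAG file the same chain is usually written with the `>>` operator between task objects, for example `extract >> analyze >> store`. A cycle between tasks makes the graph invalid; in this sketch `graphlib` raises `CycleError` in that case.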

The architecture should become roughly as follows:

sequenceDiagram

participant API as API
participant Database as Database
participant Airflow as Airflow

API ->> Database: Store data

loop Cron Workflows
    Database ->> Airflow: Extract data
    Airflow ->> Airflow: Analyze data
    Airflow ->> Database: Store result
end
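One iteration of the loop in the diagram could look roughly like the sketch below. It is only an illustration under stated assumptions: an in-memory SQLite database stands in for the shared SQL instance, the `inferences` and `metrics` tables are hypothetical, and the mean is a placeholder for the real metrics calculations. In Airflow, each function would typically become one task in a scheduled DAG.

```python
import sqlite3
from statistics import mean


def extract(conn):
    """Read the raw values written by the API (hypothetical `inferences` table)."""
    rows = conn.execute("SELECT value FROM inferences ORDER BY id").fetchall()
    return [row[0] for row in rows]


def analyze(values):
    """Compute a placeholder metric; the real metrics logic would go here."""
    return {"count": len(values), "mean": mean(values) if values else None}


def store(conn, metrics):
    """Write the result back to the database (hypothetical `metrics` table)."""
    conn.execute(
        "INSERT INTO metrics (row_count, mean_value) VALUES (?, ?)",
        (metrics["count"], metrics["mean"]),
    )
    conn.commit()


def run_once(conn):
    """One pass of the cron loop: extract, analyze, store."""
    metrics = analyze(extract(conn))
    store(conn, metrics)
    return metrics


# In-memory SQLite stands in for the shared SQL instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inferences (id INTEGER PRIMARY KEY, value REAL)")
conn.execute(
    "CREATE TABLE metrics (id INTEGER PRIMARY KEY, row_count INTEGER, mean_value REAL)"
)
conn.executemany("INSERT INTO inferences (value) VALUES (?)", [(1.0,), (2.0,), (3.0,)])

result = run_once(conn)
print(result)
```

Keeping the three steps as plain functions means they can be unit-tested without a running scheduler.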

Some implementation notes:

  • Airflow should be a separate deployment from the API. Treat it as a separate service that runs all the workflows.
  • It should use the same SQL instance as the API, but with its own database as the Airflow metadata backend.
  • All current (and future) metrics calculations should happen inside Airflow.
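As a rough sketch of the second note, the two services could point at the same SQL host while using different database names. Everything below is an assumption for illustration: the host `postgres`, the `whitebox` user, password and database, and the helper `postgres_url` are hypothetical. In Airflow 1.10.x the metadata database URL is read from `sql_alchemy_conn` under `[core]`, which can also be set through the `AIRFLOW__CORE__SQL_ALCHEMY_CONN` environment variable.

```python
from urllib.parse import quote


def postgres_url(user, password, host, port, database):
    """Build a SQLAlchemy-style PostgreSQL URL, escaping the credentials."""
    return (
        f"postgresql+psycopg2://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# Same SQL instance (host and port) for both services; values are hypothetical.
SQL_HOST = "postgres"
SQL_PORT = 5432

# The API's own database (hypothetical credentials and name).
api_url = postgres_url("whitebox", "secret", SQL_HOST, SQL_PORT, "whitebox")

# A separate database on the same instance for Airflow's metadata.
airflow_url = postgres_url("airflow", "airflow", SQL_HOST, SQL_PORT, "airflow")

print(api_url)
print(airflow_url)
```

The only difference between the two URLs is the credentials and the database name, which is what keeps Airflow's metadata isolated from the API's tables.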
@momegas momegas added the needs analysis This issue needs analysis label Jan 3, 2023
@momegas momegas added this to the Whitebox Roadmap milestone Jan 3, 2023
@momegas momegas added enhancement New feature or request and removed needs analysis This issue needs analysis labels Jan 4, 2023
@momegas momegas changed the title Airflow integration [Roadmap] Airflow integration Jan 4, 2023
@momegas momegas changed the title [Roadmap] Airflow integration [Roadmap] Migrate all analytics pipelines to Airflow Jan 4, 2023
@momegas momegas modified the milestones: 🐻‍❄️ Whitebox Roadmap, 😻 Q2 2023 Jan 18, 2023
@momegas momegas modified the milestones: 😻 Q2 2023, 🐶 Q1 2023 Feb 7, 2023
@momegas
Member Author

momegas commented Feb 9, 2023

To start development, you can use this Compose file and Dockerfile that I put together for local testing.
Just place your DAGs in a dags folder in the root of the project (or change the path in the webserver volumes to ./whitebox/dags:/usr/local/airflow/dags).

Dockerfile:

FROM puckel/docker-airflow:1.10.9
RUN pip install evidently==0.2.0  

Compose:

version: "3.7"

services:
  postgres:
    image: postgres:9.6
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
    logging:
      options:
        max-size: 10m
        max-file: "3"

  webserver:
    build: .
    restart: always
    depends_on:
      - postgres
    environment:
      - LOAD_EX=n
      - EXECUTOR=Local
    logging:
      options:
        max-size: 10m
        max-file: "3"
    volumes:
      - ./dags:/usr/local/airflow/dags
      - ./evidently_reports:/usr/local/airflow/reports
      - ./data:/usr/local/airflow/data
      - ./logs:/usr/local/airflow/logs
    ports:
      - "8080:8080"
    command: webserver
    healthcheck:
      test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
      interval: 30s
      timeout: 30s
      retries: 3
volumes:
  evidently_reports:

@momegas momegas changed the title [Roadmap] Migrate all analytics pipelines to Airflow Migrate all analytics pipelines to Airflow Feb 15, 2023