Airflow, EMR, and Snowflake Data Pipeline Project

Overview

This project demonstrates the integration of Apache Airflow, Amazon EMR, and Snowflake to process and manage data workflows. Airflow orchestrates the workflow, EMR handles big data processing, and Snowflake is used as the data warehouse.

(Screenshot: Airflow dashboard showing the DAG status.)
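
To make the orchestration concrete, the sketch below shows roughly what such a DAG can look like. It is illustrative only: the connection IDs, EMR cluster ID, S3 paths, and table names are placeholders rather than the exact contents of dags/, and it assumes the Amazon and Snowflake provider packages are installed (normally via requirements.txt).

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
    from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor
    from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

    # Placeholder EMR step: run a PySpark script stored in S3 via spark-submit.
    SPARK_STEPS = [
        {
            "Name": "process_raw_data",
            "ActionOnFailure": "CANCEL_AND_WAIT",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/scripts/process.py"],
            },
        }
    ]

    # Placeholder load statement: ingest the processed files into a Snowflake table.
    COPY_INTO_SQL = """
        COPY INTO analytics.processed_events
        FROM @processed_stage/processed/
        FILE_FORMAT = (TYPE = PARQUET)
    """

    with DAG(
        dag_id="emr_snowflake_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,   # triggered manually from the UI
        catchup=False,
    ) as dag:
        # Submit the Spark step to an already-running EMR cluster.
        add_step = EmrAddStepsOperator(
            task_id="process_on_emr",
            job_flow_id="j-XXXXXXXXXXXXX",
            aws_conn_id="aws_default",
            steps=SPARK_STEPS,
        )

        # Wait for that step to finish before moving on.
        wait_for_step = EmrStepSensor(
            task_id="wait_for_emr_step",
            job_flow_id="j-XXXXXXXXXXXXX",
            step_id="{{ task_instance.xcom_pull(task_ids='process_on_emr')[0] }}",
            aws_conn_id="aws_default",
        )

        # Load the processed output into Snowflake.
        load_to_snowflake = SnowflakeOperator(
            task_id="load_to_snowflake",
            snowflake_conn_id="snowflake_default",
            sql=COPY_INTO_SQL,
        )

        add_step >> wait_for_step >> load_to_snowflake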

Project Structure

  • config/: Contains configuration files for Airflow and other services.
  • dags/: Contains DAG files that define the workflows executed by Airflow.
  • docker-compose.yaml: Defines the Docker Compose services used to set up and run the environment.
  • Dockerfile: Defines the image that Docker builds for this project.
  • logs/: Contains log files generated by Airflow and related processes.
  • plugins/: Stores custom Airflow plugins, if any are used in the project.
  • requirements.txt: Lists the Python dependencies needed to run the project.
  • scripts/: Contains custom scripts for interacting with Snowflake, EMR, and other services.

Prerequisites

  • Docker and Docker Compose installed on your machine.
  • AWS credentials configured for accessing EMR.
  • Snowflake account credentials configured.

Setup Instructions

  1. Clone or Download the Project
    Clone the repository or download the zip file and extract it to your working directory.

  2. Install Docker and Docker Compose
    Ensure Docker and Docker Compose are installed on your machine.

  3. Configure AWS and Snowflake Credentials
    Update the relevant configuration files inside the config/ directory with your AWS and Snowflake credentials (a hypothetical credential sanity check is sketched after these steps).

  4. Install Required Dependencies
    The project dependencies are listed in requirements.txt. These will be automatically installed when running with Docker.

  5. Run the Project with Docker Compose
    Use the following command to start the environment:

    docker-compose up

    This will spin up the Airflow web server, scheduler, and other necessary services.

  6. Access the Airflow UI
    Once the containers are up and running, you can access the Airflow UI at http://localhost:8080.

  7. Trigger DAGs
    Navigate to the "DAGs" tab in the Airflow UI and trigger the available workflows manually.
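
As noted in step 3, the DAGs and scripts can only reach EMR and Snowflake once credentials are in place. The snippet below is a hypothetical sanity check, not a file in this repository: it assumes the credentials are exposed as environment variables and that boto3 and snowflake-connector-python are available (e.g. via requirements.txt). Adapt the variable names to however the config/ files actually wire things up.

    # check_credentials.py (hypothetical) - verify AWS and Snowflake access
    # before triggering the pipeline.
    import os

    import boto3
    import snowflake.connector

    # AWS: list EMR clusters to confirm the credentials and region are valid.
    emr = boto3.client(
        "emr",
        region_name=os.environ.get("AWS_DEFAULT_REGION", "us-east-1"),
    )
    clusters = emr.list_clusters(ClusterStates=["RUNNING", "WAITING"])
    print(f"Reachable EMR clusters: {len(clusters['Clusters'])}")

    # Snowflake: open a session and run a trivial query.
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse=os.environ.get("SNOWFLAKE_WAREHOUSE"),
        database=os.environ.get("SNOWFLAKE_DATABASE"),
    )
    try:
        cur = conn.cursor()
        cur.execute("SELECT CURRENT_VERSION()")
        print(f"Connected to Snowflake {cur.fetchone()[0]}")
    finally:
        conn.close()

If the project's DAGs use Airflow connections instead (for example aws_default and snowflake_default), the same credentials can be supplied through the Airflow UI under Admin → Connections or as AIRFLOW_CONN_* environment variables in docker-compose.yaml.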

File Processing

  1. Data Ingestion with EMR
    The workflow ingests raw data files from Amazon S3 and processes them on Amazon EMR, which handles the large datasets (see the processing sketch after this list).
    (Screenshots: Amazon S3 files; Amazon EMR running a data processing job.)

  2. Data Warehousing with Snowflake
    After processing, the data is loaded into Snowflake for storage and querying.
    (Screenshots: Snowflake tables; querying processed data in Snowflake.)
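
The processing job itself is typically a PySpark script submitted to EMR as a step. The sketch below is a hypothetical stand-in for whatever lives in scripts/; the S3 paths and column names are placeholders. The subsequent Snowflake load then amounts to a COPY INTO from the processed S3 location, as in the Overview sketch above.

    # Hypothetical PySpark job submitted to EMR as a step (e.g. via spark-submit).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("process_raw_data").getOrCreate()

    # Read the raw files landed in S3 (bucket and prefix are placeholders).
    raw = spark.read.json("s3://my-bucket/raw/")

    # Example transformation: drop records without an ID and stamp the processing time.
    processed = (
        raw.filter(F.col("event_id").isNotNull())
           .withColumn("processed_at", F.current_timestamp())
    )

    # Write Parquet back to S3, where Snowflake can pick it up via COPY INTO.
    processed.write.mode("overwrite").parquet("s3://my-bucket/processed/")

    spark.stop()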

Logs and Monitoring

Logs generated by Airflow, including per-task logs, are written to the logs/ directory; monitor them to debug failures and check workflow status. Task logs can also be viewed per task instance in the Airflow UI.

License

This project is licensed under the MIT License.
