This project demonstrates the integration of Apache Airflow, Amazon EMR, and Snowflake to process and manage data workflows. Airflow orchestrates the workflow, EMR handles big data processing, and Snowflake is used as the data warehouse.
- config/: Contains configuration files for Airflow and other services.
- dags/: Contains DAG files that define the workflows executed by Airflow (a minimal example DAG is sketched after this list).
- docker-compose.yaml: Used to set up and run the environment using Docker.
- Dockerfile: Defines the image that Docker builds for this project.
- logs/: Contains log files generated by Airflow and related processes.
- plugins/: Stores custom Airflow plugins, if any are used in the project.
- requirements.txt: Lists the Python dependencies needed to run the project.
- scripts/: Contains custom scripts for interacting with Snowflake, EMR, and other services.
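
For illustration, a minimal DAG of the kind that could live in dags/ is sketched below. It chains an EMR Spark step with a Snowflake load using operators from the apache-airflow-providers-amazon and apache-airflow-providers-snowflake packages. The DAG id, cluster id, S3 path, connection ids, and SQL are placeholder assumptions rather than values taken from this project, and the exact import paths can vary with provider versions (Airflow 2.4+ is assumed here).

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

# Placeholder Spark step submitted to an already-running EMR cluster.
SPARK_STEP = [
    {
        "Name": "process_raw_data",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/jobs/process.py"],  # placeholder job script
        },
    }
]

with DAG(
    dag_id="emr_snowflake_pipeline",   # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,                     # triggered manually from the Airflow UI
    catchup=False,
) as dag:
    # Submit the Spark step to the EMR cluster.
    add_step = EmrAddStepsOperator(
        task_id="submit_spark_step",
        job_flow_id="j-XXXXXXXXXXXXX",  # placeholder EMR cluster id
        steps=SPARK_STEP,
        aws_conn_id="aws_default",
    )

    # Wait for the submitted step to finish before loading the warehouse.
    wait_for_step = EmrStepSensor(
        task_id="wait_for_spark_step",
        job_flow_id="j-XXXXXXXXXXXXX",
        step_id="{{ task_instance.xcom_pull(task_ids='submit_spark_step')[0] }}",
        aws_conn_id="aws_default",
    )

    # Load the processed output into Snowflake (table and stage are placeholders).
    load_to_snowflake = SnowflakeOperator(
        task_id="load_into_snowflake",
        snowflake_conn_id="snowflake_default",
        sql="COPY INTO processed_events FROM @emr_output_stage",
    )

    add_step >> wait_for_step >> load_to_snowflake
```

Placing the sensor between the EMR step and the Snowflake task keeps the warehouse load from starting until EMR reports the step as complete.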
- Docker and Docker Compose installed on your machine.
- AWS credentials configured for accessing EMR.
- Snowflake account credentials configured.
- Clone or Download the Project: Clone the repository, or download the zip file and extract it to your working directory.
- Install Docker and Docker Compose: Ensure both Docker and Docker Compose are installed on your machine.
- Configure AWS and Snowflake Credentials: Update the relevant configuration files inside the config/ directory with your AWS and Snowflake credentials.
- Install Required Dependencies: The project dependencies are listed in requirements.txt and are installed automatically when the project is run with Docker.
- Run the Project with Docker Compose: Start the environment with docker-compose up. This spins up the Airflow web server, scheduler, and the other required services.
- Access the Airflow UI: Once the containers are up and running, the Airflow UI is available at http://localhost:8080.
- Trigger DAGs: Navigate to the "DAGs" tab in the Airflow UI and trigger the available workflows manually.
- Data Ingestion with EMR: The workflow ingests raw data files and processes large datasets on Amazon EMR (a sketch of submitting such a processing step follows this list).
- Data Warehousing with Snowflake: After processing, the data is loaded into Snowflake for storage and querying (a sketch of such a load also follows this list).
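
As a rough sketch of the kind of helper that might sit in scripts/ for the EMR ingestion step, the snippet below submits a Spark step to an existing EMR cluster with boto3. The region, cluster id, bucket, and job script are placeholders, not values from this project.

```python
"""Hypothetical scripts/ helper: submit a Spark step to an existing EMR cluster."""
import boto3

# Region, cluster id, bucket, and job script below are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[
        {
            "Name": "ingest_raw_files",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-bucket/jobs/process.py"],
            },
        }
    ],
)

# add_job_flow_steps returns the ids of the submitted steps.
print(response["StepIds"])
```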
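
Similarly, a Snowflake load script in scripts/ might use snowflake-connector-python to copy processed files into a table. The connection values, stage, table, and file format below are assumptions for illustration, not this project's actual objects.

```python
"""Hypothetical scripts/ helper: load processed data into Snowflake."""
import snowflake.connector

# All connection values are placeholders; in this project they would come from
# the credentials configured under config/.
conn = snowflake.connector.connect(
    account="xy12345.us-east-1",
    user="AIRFLOW_USER",
    password="********",
    warehouse="COMPUTE_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)
cur = conn.cursor()
try:
    # Copy processed files from an (assumed) external stage into a target table.
    cur.execute(
        "COPY INTO processed_events "
        "FROM @emr_output_stage "
        "FILE_FORMAT = (TYPE = CSV)"
    )
finally:
    cur.close()
    conn.close()
```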
The logs for the Airflow processes can be found in the logs/ directory. Monitor the logs for debugging and for checking workflow statuses.
This project is licensed under the MIT License.