Sparkify, a music streaming company, has embarked on an initiative to automate and monitor their data warehouse ETL pipelines using Apache Airflow.
This project involves creating dynamic, reusable data pipelines that are monitorable and facilitate easy backfilling. Ensuring data quality is crucial, as analyses performed on the data warehouse rely on the integrity of the data. The project will use custom Airflow operators for tasks like staging data, filling the data warehouse, and conducting data quality checks.
The source data resides in S3 and needs to be processed into Sparkify's data warehouse in Amazon Redshift. The source datasets consist of JSON logs that capture user activity in the application and JSON metadata about the songs the users listen to.
- Clone the Repository: Clone this repository to get the project files.
- Set Up Airflow: Ensure Apache Airflow is properly installed and configured.
- Configure Airflow Connections: Set up the necessary connections in Airflow for Amazon Redshift and AWS S3.
- Deploy the DAG: Place the DAG file in the Airflow DAGs folder and start the Airflow scheduler.
- Monitor the Pipeline: Use Airflow's UI to monitor the pipeline's execution and troubleshoot as necessary.
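If you prefer scripting the "Configure Airflow Connections" step instead of using the Airflow UI, a minimal sketch is shown below. The connection IDs (`aws_credentials`, `redshift`), connection types, host, port, and all credential values are placeholders chosen for illustration, not values mandated by this project; adjust them to match your own setup.

```python
# Hedged sketch: one way to create the Airflow connections programmatically.
# All IDs, hosts, and credentials below are placeholders -- the same
# connections can also be created through the Airflow UI or CLI.
from airflow import settings
from airflow.models import Connection

aws_conn = Connection(
    conn_id="aws_credentials",
    conn_type="aws",
    login="YOUR_AWS_ACCESS_KEY_ID",
    password="YOUR_AWS_SECRET_ACCESS_KEY",
)

redshift_conn = Connection(
    conn_id="redshift",
    conn_type="postgres",
    host="your-cluster.xxxxxxxx.us-west-2.redshift.amazonaws.com",
    schema="dev",
    login="awsuser",
    password="YOUR_REDSHIFT_PASSWORD",
    port=5439,
)

session = settings.Session()
session.add(aws_conn)
session.add(redshift_conn)
session.commit()
```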
- `dags/sparkify_dag.py` contains the main and only DAG for this project.
- `dags/operators` is a folder containing all the custom operators:
  - Data Stage Operator (`stage_redshift.py`): Loads JSON-formatted files from S3 into Amazon Redshift.
  - Fact Operator (`load_fact.py`): Utilizes the provided SQL helper class to run data transformations for the fact table.
  - Dimension Operator (`load_dimension.py`): Utilizes the provided SQL helper class to run data transformations for the dimension tables.
  - Data Quality Operator (`data_quality.py`): Runs checks on the data to identify discrepancies post-ETL (see the sketch after this list).
- `dags/create_tables.sql` contains the SQL statements for creating the fact and dimension tables.
- `dags/helpers/sql_queries.py` contains all the SQL statements for inserting data.
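As an illustration of the custom-operator pattern used here, below is a minimal sketch of what a data quality operator along these lines could look like. The class name, constructor parameters, and the simple row-count check are assumptions for illustration, not the project's actual implementation (that lives in `dags/operators/data_quality.py`).

```python
# Minimal sketch of a data quality operator (not the project's actual code).
# Assumes Airflow 1.10-style imports; on Airflow 2.x the PostgresHook lives in
# airflow.providers.postgres.hooks.postgres instead.
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class DataQualityOperator(BaseOperator):
    """Fails the task if any of the given tables comes back empty."""

    @apply_defaults
    def __init__(self, redshift_conn_id="redshift", tables=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables or []

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            records = redshift.get_records(f"SELECT COUNT(*) FROM {table}")
            if not records or not records[0] or records[0][0] < 1:
                raise ValueError(f"Data quality check failed: {table} is empty")
            self.log.info("Data quality check passed for table %s", table)
```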
The data resides in Amazon S3 and includes:
- Log Data: Details about user activity in the application (`s3://udacity-dend/log_data`)
- Song Data: Metadata about the songs listened to by users (`s3://udacity-dend/song_data`)
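To make the staging step concrete, the sketch below shows the kind of Redshift COPY statement the stage operator might render for the log data. The target table name `staging_events`, the region, and the `FORMAT AS JSON 'auto'` option are assumptions; in the project, the credentials would come from the Airflow connection at runtime rather than being hard-coded.

```python
# Hedged sketch of the COPY statement a staging task might issue.
# Table name, region, and JSON format option are assumptions; credentials
# would normally be pulled from the "aws_credentials" Airflow connection.
COPY_SQL = """
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    ACCESS_KEY_ID '{access_key}'
    SECRET_ACCESS_KEY '{secret_key}'
    REGION 'us-west-2'
    FORMAT AS JSON 'auto'
"""

formatted_sql = COPY_SQL.format(
    access_key="YOUR_AWS_ACCESS_KEY_ID",      # placeholder
    secret_key="YOUR_AWS_SECRET_ACCESS_KEY",  # placeholder
)
```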
- No dependencies on past runs
- On failure, tasks are retried 3 times
- Retries happen every 5 minutes
- Catchup is turned off
- Do not email on retry
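These settings translate roughly into the DAG configuration sketched below. The `default_args` values mirror the list above; the DAG id, `start_date`, and `schedule_interval` shown here are assumptions rather than the project's actual values.

```python
from datetime import datetime, timedelta
from airflow import DAG

# Sketch of default_args reflecting the behaviour listed above.
default_args = {
    "depends_on_past": False,             # no dependencies on past runs
    "retries": 3,                         # retry failed tasks 3 times
    "retry_delay": timedelta(minutes=5),  # retries happen every 5 minutes
    "email_on_retry": False,              # do not email on retry
}

dag = DAG(
    "sparkify_dag",                       # assumed DAG id
    default_args=default_args,
    start_date=datetime(2019, 1, 12),     # assumed start date
    schedule_interval="@hourly",          # assumed schedule
    catchup=False,                        # catchup is turned off
)
```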
The data and project information were provided by Udacity as part of their Data Engineering Nanodegree Program.