This project creates a data pipeline for stock market data. The pipeline has the following stages:
- Verify whether the raw data processing paths exist, and create them if necessary.
- Read the CSV files into dataframes, merge the data with a metadata file, and write the resulting dataset into a structured format such as Parquet.
- Verify whether the feature engineering paths exist, and create them if necessary.
- Calculate the rolling average of the trading volume and the rolling median of the Adjusted Close, and write the resulting dataset into a staging Parquet file.
- Train a RandomForestRegressor model on the feature-engineered data, and calculate the model's performance metrics.
Each stage is implemented as an Airflow task or task group (illustrative sketches of the main tasks follow this list):
- verify_raw_data_path_task: Verify whether the raw data processing paths exist, and create them if necessary.
- raw_data_processing_task_group: Read the CSV files into Pandas dataframes, merge the data with a metadata file, and write the resulting dataset into a structured format such as Parquet.
- verify_feature_data_path_task: Verify whether the feature engineering paths exist, and create them if necessary.
- feature_engineering_task_group: Calculate the rolling average of the trading volume and the rolling median of the Adjusted Close, and write the resulting dataset into a structured format such as Parquet.
- train_model_task: Train a RandomForestRegressor model on the feature-engineered data, and calculate the model's performance metrics.
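As a rough illustration of the raw data processing stage, the sketch below reads the per-symbol CSV files, joins them with the Kaggle metadata file, and writes a single Parquet dataset. The file names, column names, and paths are assumptions, not the exact code used in the task group.

```python
# Minimal sketch of the raw data processing stage (paths, file names and
# column names are assumptions; adjust them to the actual dataset layout).
from pathlib import Path
import pandas as pd

DATA_DIR = Path("./data")
STAGING_DIR = Path("./staging")

def process_raw_data() -> None:
    # Ensure the output path exists (the verify_*_path tasks do this in the DAG).
    STAGING_DIR.mkdir(parents=True, exist_ok=True)

    # Metadata file assumed to ship with the Kaggle dataset.
    meta = pd.read_csv(DATA_DIR / "symbols_valid_meta.csv")

    frames = []
    for csv_file in (DATA_DIR / "stocks").glob("*.csv"):
        df = pd.read_csv(csv_file)
        df["Symbol"] = csv_file.stem  # file name encodes the ticker symbol
        frames.append(df)

    combined = pd.concat(frames, ignore_index=True)
    # Attach the security metadata to every price row.
    combined = combined.merge(meta, on="Symbol", how="left")
    combined.to_parquet(STAGING_DIR / "raw_data.parquet", index=False)

if __name__ == "__main__":
    process_raw_data()
```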
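The feature engineering stage boils down to two rolling-window computations per symbol. A minimal Pandas sketch is shown below; the 30-day window, column names, and file paths are assumptions and should be matched to the actual DAG.

```python
# Minimal sketch of the feature engineering stage.
# Window size, column names, and paths are assumptions.
import pandas as pd

def build_features(input_path: str = "./staging/raw_data.parquet",
                   output_path: str = "./staging/features.parquet") -> None:
    df = pd.read_parquet(input_path)
    df = df.sort_values(["Symbol", "Date"])

    grouped = df.groupby("Symbol")
    # Rolling average of the trading volume.
    df["vol_moving_avg"] = grouped["Volume"].transform(
        lambda s: s.rolling(window=30, min_periods=1).mean()
    )
    # Rolling median of the adjusted close price.
    df["adj_close_rolling_med"] = grouped["Adj Close"].transform(
        lambda s: s.rolling(window=30, min_periods=1).median()
    )

    df.to_parquet(output_path, index=False)

if __name__ == "__main__":
    build_features()
```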
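For the training task, a sketch along these lines fits the description: fit a RandomForestRegressor on the two engineered features to predict the trading volume and report error metrics. The target column, train/test split, and hyperparameters are assumptions.

```python
# Minimal sketch of the model training stage.
# Feature/target columns, split, and hyperparameters are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

def train_model(features_path: str = "./staging/features.parquet"):
    df = pd.read_parquet(features_path).dropna(
        subset=["vol_moving_avg", "adj_close_rolling_med", "Volume"]
    )

    X = df[["vol_moving_avg", "adj_close_rolling_med"]]
    y = df["Volume"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Performance metrics on the held-out split.
    preds = model.predict(X_test)
    print("MAE:", mean_absolute_error(y_test, preds))
    print("MSE:", mean_squared_error(y_test, preds))
    return model
```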
Before running this program, you will need to follow the steps below:
Please clone or download this repository to your local machine.
You can follow this link.
mkdir ./data ./staging
Download the ETF and stock datasets from the primary dataset available on Kaggle and move them to the data directory.
docker build -f Dockerfile.Spark . -t airflow-spark
docker build -f Dockerfile.Flask . -t flask-app
docker-compose up airflow-init
docker-compose up
localhost:8080
By default, both the username and the password are airflow; enter them and hit ‘Sign in’.
Create a new Spark connection with the details shown in the image.
For the Jupyter notebook, copy the URL with the token that is generated when the container starts and paste it into your browser. The URL with the token can be found in the container logs using:
docker logs -f de-rt-jupyter-spark-1
http://127.0.0.1:8008/ The default link returns a 'Welcome!' response. You can use the /predict API endpoint, which takes two values, vol_moving_avg and adj_close_rolling_med, and returns an integer value that represents the predicted trading volume.
http://127.0.0.1:8008/predict?vol_moving_avg=12345&adj_close_rolling_med=55
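For reference, the /predict endpoint described above could look roughly like the sketch below. This is only an illustration; the model file name, loading mechanism, and app structure are assumptions, not the actual Flask app shipped in this repository.

```python
# Minimal sketch of the /predict endpoint; the model file name is an assumption.
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/")
def index():
    return "Welcome!"

@app.route("/predict")
def predict():
    # Read the two engineered features from the query string.
    vol_moving_avg = float(request.args["vol_moving_avg"])
    adj_close_rolling_med = float(request.args["adj_close_rolling_med"])
    prediction = model.predict([[vol_moving_avg, adj_close_rolling_med]])
    # Return the predicted trading volume as an integer.
    return jsonify(int(prediction[0]))
```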
This data pipeline was created by Lakpa Sherpa.