This project creates a data pipeline for stock market data. The pipeline has the following stages:
- Verify whether the raw data processing paths exist, and create them if necessary.
- Read the CSV files into dataframes, merge the data with a metadata file, and write the resulting dataset into a structured format such as Parquet.
- Verify whether the feature engineering paths exist, and create them if necessary.
- Calculate the rolling average of the trading volume and the rolling median of the Adjusted Close, and write the resulting dataset into a staging Parquet file.
- Train a RandomForestRegressor model on the feature-engineered data, and calculate the model's performance metrics.
Each stage is implemented as an Airflow task or task group (illustrative sketches of the main tasks follow this list):
- verify_raw_data_path_task: Verify whether the raw data processing paths exist, and create them if necessary.
- raw_data_processing_task_group: Read the CSV files into Pandas dataframes, merge the data with a metadata file, and write the resulting dataset into a structured format such as Parquet.
- verify_feature_data_path_task: Verify whether the feature engineering paths exist, and create them if necessary.
- feature_engineering_task_group: Calculate the rolling average of the trading volume and the rolling median of the Adjusted Close, and write the resulting dataset into a structured format such as Parquet.
- train_model_task: Train a RandomForestRegressor model on the feature-engineered data, and calculate the model's performance metrics.
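As a rough illustration of the raw data processing stage, the sketch below reads the per-symbol CSV files, joins them with the Kaggle metadata file, and writes a single Parquet dataset. The file names, column names, and paths are assumptions, not the exact code used in the task group.

```python
# Minimal sketch of the raw data processing stage (paths, file names and
# column names are assumptions; adjust them to the actual dataset layout).
from pathlib import Path
import pandas as pd

DATA_DIR = Path("./data")
STAGING_DIR = Path("./staging")

def process_raw_data() -> None:
    # Ensure the output path exists (the verify_*_path tasks do this in the DAG).
    STAGING_DIR.mkdir(parents=True, exist_ok=True)

    # Metadata file assumed to ship with the Kaggle dataset.
    meta = pd.read_csv(DATA_DIR / "symbols_valid_meta.csv")

    frames = []
    for csv_file in (DATA_DIR / "stocks").glob("*.csv"):
        df = pd.read_csv(csv_file)
        df["Symbol"] = csv_file.stem  # file name encodes the ticker symbol
        frames.append(df)

    combined = pd.concat(frames, ignore_index=True)
    # Attach the security metadata to every price row.
    combined = combined.merge(meta, on="Symbol", how="left")
    combined.to_parquet(STAGING_DIR / "raw_data.parquet", index=False)

if __name__ == "__main__":
    process_raw_data()
```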
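The feature engineering stage boils down to two rolling-window computations per symbol. A minimal Pandas sketch is shown below; the 30-day window, column names, and file paths are assumptions and should be matched to the actual DAG.

```python
# Minimal sketch of the feature engineering stage.
# Window size, column names, and paths are assumptions.
import pandas as pd

def build_features(input_path: str = "./staging/raw_data.parquet",
                   output_path: str = "./staging/features.parquet") -> None:
    df = pd.read_parquet(input_path)
    df = df.sort_values(["Symbol", "Date"])

    grouped = df.groupby("Symbol")
    # Rolling average of the trading volume.
    df["vol_moving_avg"] = grouped["Volume"].transform(
        lambda s: s.rolling(window=30, min_periods=1).mean()
    )
    # Rolling median of the adjusted close price.
    df["adj_close_rolling_med"] = grouped["Adj Close"].transform(
        lambda s: s.rolling(window=30, min_periods=1).median()
    )

    df.to_parquet(output_path, index=False)

if __name__ == "__main__":
    build_features()
```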
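For the training task, a sketch along these lines fits the description: fit a RandomForestRegressor on the two engineered features to predict the trading volume and report error metrics. The target column, train/test split, and hyperparameters are assumptions.

```python
# Minimal sketch of the model training stage.
# Feature/target columns, split, and hyperparameters are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

def train_model(features_path: str = "./staging/features.parquet"):
    df = pd.read_parquet(features_path).dropna(
        subset=["vol_moving_avg", "adj_close_rolling_med", "Volume"]
    )

    X = df[["vol_moving_avg", "adj_close_rolling_med"]]
    y = df["Volume"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Performance metrics on the held-out split.
    preds = model.predict(X_test)
    print("MAE:", mean_absolute_error(y_test, preds))
    print("MSE:", mean_squared_error(y_test, preds))
    return model
```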
Before running this program, you will need to follow the steps below:
Please clone or download this repository to your local machine.
You can follow this link.
mkdir ./data ./staging
Download the ETF and stock datasets from the primary dataset available on Kaggle and move them to the data directory.
docker build -f Dockerfile.Spark . -t airflow-spark
docker build -f Dockerfile.Flask . -t flask-app
docker-compose up airflow-init
docker-compose up
localhost:8080
By default, both the username and the password are airflow; enter them and hit ‘Sign in’.
Create a new Spark connection with the details shown in the image.
For the Jupyter notebook, copy the URL with the token that is generated when the container starts and paste it into your browser. The URL with the token can be found in the container logs using:
docker logs -f de-rt-jupyter-spark-1
http://127.0.0.1:8008/ The default link returns a 'Welcome!' response. You can use the /predict API endpoint, which takes two values, vol_moving_avg and adj_close_rolling_med, and returns an integer value that represents the predicted trading volume.
http://127.0.0.1:8008/predict?vol_moving_avg=12345&adj_close_rolling_med=55
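For reference, the /predict endpoint described above could look roughly like the sketch below. This is only an illustration; the model file name, loading mechanism, and app structure are assumptions, not the actual Flask app shipped in this repository.

```python
# Minimal sketch of the /predict endpoint; the model file name is an assumption.
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/")
def index():
    return "Welcome!"

@app.route("/predict")
def predict():
    # Read the two engineered features from the query string.
    vol_moving_avg = float(request.args["vol_moving_avg"])
    adj_close_rolling_med = float(request.args["adj_close_rolling_med"])
    prediction = model.predict([[vol_moving_avg, adj_close_rolling_med]])
    # Return the predicted trading volume as an integer.
    return jsonify(int(prediction[0]))
```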
This data pipeline was created by Lakpa Sherpa.