
Update Documentation
FranceBrescia committed Jan 7, 2024
1 parent b0c7f24 commit a9bd6f9
Showing 113 changed files with 91 additions and 56 deletions.
78 changes: 33 additions & 45 deletions README.md
@@ -82,13 +82,15 @@ Project Organization
│ ├── index.html          <- Frontend html
│ ├── logo.png            <- Web Page logo
│ ├── nginx.conf          <- Configuration file for nginx.
│ ├── script.js           <- Frontend script
│ └── README.md
│
├── models                <- Trained and serialized models, model predictions, or model summaries
│ ├── validation_a.pkl
│ ├── validation_b.pkl
│ ├── train_a.pkl
│ ├── train_b.pkl
│ └── README.md
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
│ the creator's initials, and a short `-` delimited description, e.g.
@@ -98,52 +100,50 @@ Project Organization
│ │
│ ├── deploy_doc
│ │ └── README.md
│ ├── docker_doc
│ │ └── README.md
│ │
│ ├── dvc_mlflow_doc
│ │ └── README.md
│ ├── great_expectations_doc
│ │ ├── expectations
│ │ ├── static
│ │ └── index.html
│ │
│ ├── monitoring_doc
│ │ └── README.md
│ └── images_doc
│   └── README.md
├── reports               <- Generated analysis as HTML, PDF, LaTeX, etc.
│ ├── alibi_detect_logs   <- Logs generated after data drift analysis.
│ │ ├── model_category.txt
│ │ └── model_sexsism.txt
│ │
│ ├── locust              <- Logs generated after locust analysis.
│ │ ├── report_exceptions.csv
│ │ ├── report_stats_history.csv
│ │ ├── report_stats.csv
│ │ └── report_failures.csv
│ │
│ ├── output_codecarbon   <- Logs generated after code carbon analysis.
│ │ ├── output_train_a.csv
│ │ ├── output_train_a.csv.bak
│ │ ├── output_train_b.csv
│ │ └── output_train_b.csv.bak
│ ├── mlruns              <- Logs generated after mlflow runs.
│ └── figures             <- Generated graphics and figures to be used in reporting
├── src                   <- Source code for use in this project.
│ ├── __init__.py         <- Makes src a Python module
│ ├── README.md
│ ├── api                 <- Scripts to create the API using FastAPI
│ │ ├── corpus_endpoint.py
│ │ ├── prometheus_monitoring.py
│ │ ├── README.md
│ │ ├── server_api.py
│ │ └── dashboards
│ │   └── grafana.json
│ │
│ ├── data                <- Scripts to download or generate data
│ │ └── make_dataset.py
│ │
│ ├── features            <- Scripts to turn raw data into features for modeling
│ │ ├── drift_detection.py
│ │ ├── build_features.py
│ │ └── README.md
│ │
│ ├── models              <- Scripts to train models and then use trained models to make
│ │ │                        predictions
│ │ ├── test_a.py
@@ -152,50 +152,38 @@ Project Organization
│ │ ├── train_b.py
│ │ ├── validation_a.py
│ │ ├── validation_b.py
│ │ ├── mlruns
│ │ ├── output_codecarbon
│ │ │ ├── output_train_a.csv
│ │ │ ├── output_train_a.csv.bak
│ │ │ ├── output_train_b.csv
│ │ │ ├── output_train_b.csv.bak
│ │ │ └── README.md
│ │ │
│ │ ├── .codecarbon.config
│ │ └── MLflow
│ │   ├── mlflow_test_a.py
│ │   ├── mlflow_test_b.py
│ │   ├── mlflow_train_a.py
│ │   ├── mlflow_train_b.py
│ │   ├── mlflow_validation_a.py
│ │   └── mlflow_validation_b.py
│ │
│ └── visualization       <- Scripts to create exploratory and results oriented visualizations
│   └── visualize.py
│
├── tests                 <- Scripts to test using Pytest
│ ├── api_testing
│ │ └── test_api.py
│ │
│ ├── dataset_testing
│ │ ├── test_dataset_model_a.py
│ │ └── test_dataset_model_b.py
│ │
│ ├── model_training_testing
│ │ └── test_overfit.py
│ │
│ ├── preprocessing_testing
│ │ └── test_preprocessing.py
│ │
│ ├── behavioral_testing
│ │ ├── test_directional_model_a.py
│ │ ├── test_directional_model_b.py
│ │ ├── test_invariance_model_a.py
│ │ ├── test_invariance_model_b.py
│ │ ├── test_minimum_funcionality_model_a.py
│ │ └── test_minimum_funcionality_model_b.py
│ └── README.md
│
├── .dockerignore         <- Docker ignore file.
├── .dvcignore            <- Data Version Control ignore file.
├── .flake8               <- Flake8 ignore file.
├── .gitignore            <- Specifications of files to be ignored by Git.
├── docker-compose.yaml   <- Docker Compose configuration.
├── Dockerfile            <- Docker file for the backend.
47 changes: 37 additions & 10 deletions references/dvc_mlflow_doc/README.md
@@ -1,20 +1,25 @@
# DVC, MLflow, and DagsHub Integration for Machine Learning Projects

This project employs DVC (Data Version Control), MLflow, and DagsHub to manage and track the machine learning lifecycle. DVC is an open-source version control system tailored for data science and machine learning projects. MLflow is an open-source platform that handles the end-to-end machine learning lifecycle. DagsHub complements these tools by providing a platform for collaboration on data science projects.

## Overview

The integration of DVC, MLflow, and DagsHub provides a comprehensive solution for dataset management, versioning, experiment tracking, and model deployment. This synergy enhances the reproducibility, monitoring, and collaboration of machine learning projects.

## Features

- **Data Versioning with DVC**: Manages and version-controls large datasets and machine learning models, facilitating data sharing and collaboration.
- **Experiment Tracking with MLflow**: Records and compares experiments, parameters, and results, streamlining the model development process.
- **Model Deployment**: Leverages MLflow's model registry for consistent and organized deployment across various environments.
- **Collaboration with DagsHub**: Integrates with DVC and MLflow, offering a collaborative platform for team members to share, discuss, and track progress.
- **Reproducibility**: Ensures experiments are reproducible with version-controlled data and models.

## Installation

Before starting, ensure Python is installed. Then, install DVC, MLflow, and the necessary dependencies:

```bash
pip install dvc mlflow
```
@@ -30,14 +35,13 @@
1. **Initialize DVC**:
   ```bash
   dvc init
   git status
   git commit -m "Initialize DVC"
   ```

2. **Add Data to DVC**:
Track large datasets or models with DVC:
```bash
dvc add data/Raw/dataset.csv
git add data/.gitignore data/Raw/dataset.csv.dvc
git commit -m "Add dataset to DVC"
```

@@ -61,27 +65,50 @@

```python
mlflow.log_artifact("path/to/artifact")
```

## Integrating with DagsHub

DagsHub seamlessly integrates with DVC and MLflow, offering a platform for hosting and visualizing DVC-tracked datasets and MLflow experiments. Create a DagsHub repository to push and share your DVC and MLflow configurations and results: [DagsHub Repository](https://dagshub.com/se4ai2324-uniba/DetectionOfOnlineSexism).

1. **Set Up a DagsHub Repository**:
Create a repository on DagsHub and link it with your project.

2. **Push Changes to DagsHub**:
Commit and push your changes to the DagsHub repository to share your progress (a remote-configuration sketch follows below).
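
A possible way to wire the project to the DagsHub remotes is sketched below. The URL pattern follows DagsHub's usual convention and the credentials are placeholders; verify both against the repository's settings page.

```bash
# Point DVC at the DagsHub storage for this repository (placeholder credentials)
dvc remote add origin https://dagshub.com/se4ai2324-uniba/DetectionOfOnlineSexism.dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user <your-dagshub-username>
dvc remote modify origin --local password <your-dagshub-token>

# Point MLflow at the DagsHub tracking server
export MLFLOW_TRACKING_URI=https://dagshub.com/se4ai2324-uniba/DetectionOfOnlineSexism.mlflow
export MLFLOW_TRACKING_USERNAME=<your-dagshub-username>
export MLFLOW_TRACKING_PASSWORD=<your-dagshub-token>
```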

## Combining DVC, MLflow, and DagsHub

Use DVC for data and model management, MLflow for experiment tracking, and DagsHub for collaboration:

```bash
dvc pull data/Raw/dataset.csv.dvc
python mlflow_experiment.py
git add .
git commit -m "Update experiment"
git push origin main
```
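
The `mlflow_experiment.py` script referenced above is not shown in this commit; the following is a minimal, hypothetical sketch of what such a tracking script could look like (model choice, column names, and paths are illustrative assumptions, not the project's actual code):

```python
# mlflow_experiment.py -- illustrative sketch only
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

mlflow.set_experiment("sexism-detection")

# Data previously synchronized with `dvc pull`
df = pd.read_csv("data/Raw/dataset.csv")
X_train, X_val, y_train, y_val = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

with mlflow.start_run():
    vectorizer = TfidfVectorizer(max_features=5000)
    model = LogisticRegression(max_iter=1000)

    model.fit(vectorizer.fit_transform(X_train), y_train)
    preds = model.predict(vectorizer.transform(X_val))

    # Log what the run needs to be reproducible and comparable
    mlflow.log_param("max_features", 5000)
    mlflow.log_metric("f1_macro", f1_score(y_val, preds, average="macro"))
    mlflow.sklearn.log_model(model, "model")
```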

![image](../images_doc/PipelineA.png)

![image](../images_doc/PipelineB.png)

## Versioning Data and Models

DVC tracks changes in your data and models. Use `dvc push` and `dvc pull` commands to synchronize your large files with remote storage, ensuring consistency across environments.
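
As a quick illustration (paths and the remote are whatever the project has configured), a typical round trip looks like:

```bash
# After modifying the dataset locally
dvc add data/Raw/dataset.csv
git add data/Raw/dataset.csv.dvc
git commit -m "Update dataset"
dvc push        # upload the new data version to remote storage

# On another machine or in CI
git pull        # fetch the updated .dvc pointer file
dvc pull        # fetch the matching data version
```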

![image](../images_doc/RegisteredModels.png)

## Experiment Tracking

MLflow tracks each experiment's parameters, metrics, and output models, making it easy to compare different runs and select the best model for deployment.

![image](../images_doc/Mlflow.png)

## Model Deployment

Utilize MLflow's model registry for deploying models to various production environments, ensuring a smooth transition from experimentation to deployment.
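
As a rough, hypothetical illustration (the registered-model name and run ID are placeholders, not taken from this repository), registering a model and loading it elsewhere could look like:

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register the model logged by a finished run (run ID is a placeholder)
result = mlflow.register_model("runs:/<run_id>/model", "SexismDetector")

# Promote the new version and load it from the registry for serving
client = MlflowClient()
client.transition_model_version_stage(
    name="SexismDetector", version=result.version, stage="Production"
)
model = mlflow.pyfunc.load_model("models:/SexismDetector/Production")
```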

## Collaboration and Sharing

DagsHub provides a platform for sharing experiments, data, and progress with team members, enhancing collaboration and transparency in the project.

## Best Practices

- Regularly commit changes in data and code to ensure reproducibility.
Binary file added references/images_doc/Mlflow.png
Binary file added references/images_doc/RegisteredModels.png
File renamed without changes.
22 changes: 21 additions & 1 deletion tests/README.md
@@ -14,7 +14,27 @@ The tests conducted are categorized as follows:

- **Preprocessing Testing**: These tests are aimed at the preprocessing steps of our data pipeline. We validate the methods used for cleaning, normalizing, and transforming data to ensure they are correctly implemented and contribute positively to the performance of our models.

## Tools

In our project, we place a strong emphasis on the reliability and quality of our software and data. To achieve this, we utilize two key tools: `Pytest` and `Great Expectations`. These tools form the backbone of our testing and validation framework, ensuring that our project meets high standards of functionality, dependability, and efficiency.

### Pytest

`Pytest` is a powerful and flexible testing framework for Python, used for everything from simple unit tests to complex functional tests; a small illustrative example follows the list below. It offers features such as:

* A simple syntax for writing tests.
* The ability to run tests in parallel (via plugins such as `pytest-xdist`), significantly improving test execution time.
* Extensive support for fixtures, allowing for reusable test configurations.
* Easy integration with other tools and services for enhanced testing capabilities.
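
As a small, purely illustrative sketch (not one of the project's actual test modules), a pytest test with a fixture and parametrization looks like this:

```python
import pytest


@pytest.fixture
def sample_texts():
    """Reusable test data provided through a pytest fixture."""
    return ["You are great", "Some offensive remark"]


@pytest.mark.parametrize("threshold", [0.3, 0.5, 0.7])
def test_threshold_is_a_valid_probability(threshold):
    # The same test body runs once for each parameter value
    assert 0.0 <= threshold <= 1.0


def test_fixture_provides_two_examples(sample_texts):
    assert len(sample_texts) == 2
```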

### Great Expectations

`Great Expectations` is an advanced tool that plays a crucial role in validating, documenting, and profiling our data quality; a minimal example follows the list below. It helps us by:

* Validating data against a predefined set of rules and criteria, ensuring that it meets the quality standards required for accurate analysis and modeling.
* Creating clear and understandable documentation of our data.
* Profiling data to provide insights into its characteristics, distribution, and structure.
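
As a minimal sketch, assuming the classic pandas-based API of older Great Expectations releases (newer "GX" versions use a different API) and illustrative column names:

```python
import great_expectations as ge
import pandas as pd

df = pd.read_csv("data/Raw/dataset.csv")   # path is illustrative
gdf = ge.from_pandas(df)

# Simple data-quality checks; column names and label values are assumptions
not_null = gdf.expect_column_values_to_not_be_null("text")
labels_ok = gdf.expect_column_values_to_be_in_set("label", ["sexist", "not sexist"])

assert not_null.success and labels_ok.success
```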


## Behavioral tests
### Directional Test
