In this hands-on workshop, we’ll take a prototype in a Jupyter Notebook and transform it into a DVC pipeline. We’ll then use that pipeline locally to run and compare a few experiments. Lastly, we’ll explore how CML will allow us to take our model training online. We’ll use it in conjunction with GitHub Actions to trigger our model training every time we push changes to our repository.
As an example project we'll use a Jupyter Notebook that trains a CNN to classify images of Pokémon. It will predict whether a Pokémon is of a predetermined type (default: water). It is a starting point that shows how a notebook might look before it is transformed into a DVC pipeline.
It is a fork of this example project: https://github.com/iterative/example-pokemon-classifier
Note: due to the limited size of the dataset, the evaluation data is the same data used for training and testing. Take the model's results with a grain of salt.
- Fork the repository and clone it to your local environment
- Create a new virtual environment with `virtualenv -p python3 .venv`
- Activate the virtual environment with `source .venv/bin/activate`
- Install the dependencies with `pip install -r requirements.txt`
- Download the datasets from Kaggle into the `data/external/` directory
- Launch the notebook with `jupyter-notebook` and open `pokemon_classifier.ipynb`
The requirements specify `tensorflow-macos` and `tensorflow-metal`, which are the appropriate packages when you are using a Mac with Apple silicon (M1 or later). If you are using a different system, replace these with `tensorflow`.
Now that we have the notebook up and running, go through the cells to see if everything works. If it does, you should get a model that generates predictions for all Pokémon images. Although admittedly the model performance isn't great...
This point may be familiar to you: a working prototype in a notebook. Now, how do we transform it into a reproducible DVC pipeline?
- Initialize DVC with `dvc init`
- Start tracking the `data/external` directory with DVC (`dvc add data/external`)
- Poke around with `git status` and see what DVC did in the background. Take a look at `data/external.dvc` to see the metadata file that DVC created
- Commit the changes to Git (`git commit -m "Start tracking data directory with DVC"`)
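The `.dvc` file is a small, human-readable metadata file that points DVC to the cached copy of the data. Its exact contents depend on your DVC version and dataset, but it should look roughly like this sketch (the hash, size, and file count below are placeholders):

```yaml
# data/external.dvc (placeholder values)
outs:
- md5: <directory-hash>.dir
  size: <total size in bytes>
  nfiles: <number of files>
  path: external
```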
Now that the data is part of the DVC cache, we can set up a remote for duplicating it. Just like we `git push` our local Git repository to GitHub, GitLab, etc., we can then `dvc push` our cache to the remote.
- Use `dvc remote add` to add your remote of choice (docs)
- Push the DVC cache to your remote with `dvc push`
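For example, assuming you have an S3 bucket available and credentials configured locally, something like `dvc remote add -d storage s3://<your-bucket>/pokemon-classifier` registers the bucket as the default remote, after which `dvc push` uploads the cached data to it.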
Once we start experimenting, we want to change parameters on the fly. For this, we define a `params.yaml` file. Create this in the root directory of the project. For example:
```yaml
base:
  seed: 42
  pokemon_type_train: "Water"

data_preprocess:
  source_directory: 'data/external'
  destination_directory: 'data/processed'
  dataset_labels: 'stats/pokemon-gen-1-8.csv'
  dataset_images: 'images'

train:
  test_size: 0.2
  learning_rate: 0.001
  epochs: 15
  batch_size: 120
```
Now it is time to move out of our familiar notebook environment. We will split up the notebook into units that each make sense as a step in a pipeline. In this case, we will create four stages: `data_preprocess`, `data_load`, `train`, and `evaluate`.
- Create an `src` directory for the modules
- Create a `.py` file in the `src` directory for every pipeline step (e.g. `train.py`)
- For convenience, also create `src/utils/find_project_root.py` (like so).
- Copy the relevant code over to each module. Make sure to also include the imports needed in each section.
- Create a `main` function so that we can call the module using a command. We'll use `argparse` so that we can pass our parameters:
```python
import argparse

import yaml

from utils.find_project_root import find_project_root  # adjust to your layout

...

if __name__ == '__main__':
    args_parser = argparse.ArgumentParser()
    args_parser.add_argument('--params', dest='params', required=True)
    args = args_parser.parse_args()
    # Load the parameters for this stage from params.yaml
    with open(args.params) as param_file:
        params = yaml.safe_load(param_file)
    PROJECT_ROOT = find_project_root()
```
Once we're done, we should be able to run the module from the command line: `python3 src/train.py --params params.yaml`. If you'd like an example, check my implementation for `train.py` here.
Just like we could run the cells in our notebook one by one, we can now run the modules successively from the command line. But we can also create a `dvc.yaml` file that defines a pipeline for us. We can then run the entire pipeline with a single command. Your `dvc.yaml` should look something like this:
```yaml
stages:
  data_preprocess:
    cmd: python3 src/data_preprocess.py --params params.yaml
    deps:
      - [dependency 1]
      - [dependency 2]
      - ...
    outs:
      - [output 1]
      - [output 2]
      - ...
    params:
      - base
      - [params section]
  data_load:
    ...
  train:
    ...
  evaluate:
    ...
```
- Create a `dvc.yaml` file and set up the stages, their dependencies, and outputs (docs)
- Check the pipeline DAG with `dvc dag`
- Reproduce the pipeline with `dvc repro`
- Add `outputs/metrics.yaml` as metrics so that DVC can easily compare them across experiments in the next step.

If you'd like an example, check my implementation for `dvc.yaml` here.
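To make the placeholders a bit more concrete, here is a minimal sketch of how the first and last stages could be filled in. The exact dependencies and outputs depend on how you split up your notebook, so treat the paths below (such as `data/processed` and the bracketed model file) as assumptions to adapt:

```yaml
stages:
  data_preprocess:
    cmd: python3 src/data_preprocess.py --params params.yaml
    deps:
      - src/data_preprocess.py
      - data/external
    outs:
      - data/processed
    params:
      - base
      - data_preprocess
  # ... data_load and train stages omitted ...
  evaluate:
    cmd: python3 src/evaluate.py --params params.yaml
    deps:
      - src/evaluate.py
      - [trained model produced by the train stage]
    params:
      - base
    metrics:
      - outputs/metrics.yaml:
          cache: false
```

Listing `outputs/metrics.yaml` under `metrics:` (with `cache: false` so the small file stays in Git rather than the DVC cache) is what lets DVC compare runs in the next step.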
With our pipeline in place, we can not only reproduce a pipeline run with a single command; we can also run entirely new experiments. Let's explore two ways:
- Update a parameter in `params.yaml` (for example: `pokemon_type_train: 'Bug'`) and use `dvc repro` to trigger a new pipeline run.
- Run a new experiment with `dvc exp run` and use the `-S` option to set a parameter (for example: `dvc exp run -S 'base.pokemon_type_train="Dragon"'`).
- Compare the experiments with `dvc exp show`.
As you can see, only the second method actually generates a new experiment. Using `dvc repro` overwrites the active workspace. Therefore, it's recommended to use `dvc exp run`. Once you're happy with the results of an experiment, you can use `dvc exp apply` to apply it to the workspace.
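Each experiment gets an auto-generated name in the `dvc exp show` table; once you've picked a winner, `dvc exp apply <experiment-name>` restores its parameters and outputs to your workspace, ready to commit.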
If you want to move beyond the command line for your experiments, take a look at the DVC extension for Visual Studio Code.
Now that we can run experiments with our pipeline, let's take our model training to the cloud! For this second part, we'll be using CML, which utilizes GitHub Actions (GitLab and Bitbucket equivalents also work).
- Navigate to your repository on GitHub and enable Actions from the settings
- Create a `.github/workflows` directory in your project root
- Create a `workflow.yaml` in the newly created directory and start with a basic template:

  ```yaml
  name: CML
  on: [push, workflow_dispatch]
  jobs:
    train-and-report:
      runs-on: ubuntu-latest
      container: docker://ghcr.io/iterative/cml:0-dvc2-base1
      steps:
        - uses: actions/checkout@v3
        - run: |
            echo "The workflow is working!"
  ```
- Create a personal access token for the GitHub repository and add it as an environment variable to your secrets (docs), then pass it to the job:

  ```yaml
  env:
    repo_token: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
  ```

- Add any other environment variables CML will need to access the DVC remote to your GitHub secrets (such as `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` for an S3 remote).
- Adapt the workflow to provision a remote runner (e.g. an AWS instance) to run the model training on. Find a guide here.
- Adapt the workflow to run `dvc repro` and publish the results as a PR. Find a guide here.
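To give a rough idea of where this ends up, the training job might eventually look something like the sketch below. The reporting commands and secret names are assumptions based on common CML setups, so adapt them to the guides linked above (and note that the CI runner needs plain `tensorflow` rather than the macOS packages):

```yaml
jobs:
  train-and-report:
    runs-on: ubuntu-latest
    container: docker://ghcr.io/iterative/cml:0-dvc2-base1
    steps:
      - uses: actions/checkout@v3
      - run: |
          pip install -r requirements.txt
          dvc pull                        # fetch the data from the DVC remote
          dvc repro                       # run the full pipeline
          echo "## Metrics" > report.md
          dvc metrics show --show-md >> report.md
          cml comment create report.md    # post the metrics as a comment on the commit/PR
        env:
          repo_token: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```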