
Tutorial for a new versioning Machine Learning pipeline


Machine Learning Pipeline Versioning Tutorial

The aim of this repository is to show a way to handle pipelining and versioning of a Machine Learning project.

Processes exposed during this tutorial are based on 3 tools:

Use cases are based on a text classification task on the 20newsgroup dataset. A dummy tutorial is also available to show how the tools work.

Prerequisites

For this tutorial, you must be familiar with the following tools:

virtualenv or condaenv
make
git
python

Tools Overview

DVC is an open-source version control system for Machine Learning projects. It is used for versioning and sharing Machine Learning data, and reproducing Machine Learning experiments and pipeline stages.

mlvtools provides tools to generate Python scripts and DVC commands from Jupyter Notebooks.

Please have a look at the presentation.

Our main features

Notebook parametrized conversion (mlvtools)
Pipelining (DVC and mlvtools)
Data x Code x Hyperparameters versioning (DVC and mlvtools)

Standard Versioning Process Establishment

Goal: find a way to version code, data and pipelines.

Initial project

Starting from an existing project composed of multiple Python modules and a set of Jupyter notebooks, we want to create an automated pipeline in order to version, share and reproduce experiments.

│── classifier
│   ├── aggregate_classif.py
│   ├── __init__.py
│   ├── extract.py
│   └── ...
│── notebooks
│   ├── Augment train data.ipynb
│   ├── Check data and split and train.ipynb
│   ├── Extract data.ipynb
│   ├── Learn text classifier.ipynb
│   ├── Learn aggregated model.ipynb
│   ├── Preprocess image data.ipynb
│   └── Train CNN classifier on image data.ipynb
│── README.md
│── requirements.yml
│── setup.cfg
│── setup.py

The data flow is processed by applying a sequence of steps, and intermediate results are versioned using metadata files. These steps are defined in Jupyter notebooks, which are then converted to Python scripts.

Keep in mind that:

The reference for the code of the step remains in the Jupyter notebook
Pipelines are structured according to their inputs and outputs
Hyperparameters are pipeline inputs
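
To illustrate what "hyperparameters are pipeline inputs" can mean in practice, here is a minimal sketch of a step written as a plain function. The function name `split_step`, its arguments and the JSON file format are hypothetical and do not come from this repository; the point is only that the hyperparameter (`test_ratio`) sits next to the input file in the signature, so changing it changes the step's inputs.

```python
import json
from pathlib import Path


def split_step(input_path: Path, train_path: Path, test_path: Path,
               test_ratio: float) -> None:
    """Hypothetical pipeline step: split a JSON list into train and test files.

    Inputs: input_path and the hyperparameter test_ratio.
    Outputs: train_path and test_path.
    """
    records = json.loads(input_path.read_text())
    n_test = int(len(records) * test_ratio)
    # The last n_test records form the test set, the rest the train set.
    split_at = len(records) - n_test
    train_path.write_text(json.dumps(records[:split_at]))
    test_path.write_text(json.dumps(records[split_at:]))
```

Because every input and output is an explicit argument, such a step can be declared to a pipeline tool with its dependencies and results, and rerun when either the data or the hyperparameter changes.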

Project after refactoring

│── classifier
│   ├── aggregate_classif.py
│   ├── __init__.py
│   ├── extract.py
│   └── ...
│── notebooks
│   ├── Augment train data.ipynb
│   ├── Check data and split and train.ipynb
│   ├── Extract data.ipynb
│   ├── Learn text classifier.ipynb
│   ├── Learn aggregated model.ipynb
│   ├── Preprocess image data.ipynb
│   └── Train CNN classifier on image data.ipynb
│── pipeline
│   ├── dvc                  ** DVC pipeline steps
│   │   ├─ mlvtools_augment_train_data_dvc
│   │   ├─ ..
│   ├── scripts              ** Notebooks converted into Python configurable scripts
│   │   ├─ mlvtools_augment_train_data.py
│   │   ├─ ..
│── README.md
│── requirements.yml
│── setup.cfg
│── setup.py
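
The tree above suggests a naming pattern: the notebook `Augment train data.ipynb` corresponds to `mlvtools_augment_train_data.py` and `mlvtools_augment_train_data_dvc`. The sketch below reproduces that pattern. It is a guess inferred from this single example, not the actual mlvtools implementation, and the function names are hypothetical.

```python
import re
from pathlib import Path


def normalize_notebook_name(notebook_file: str) -> str:
    """Illustrative guess at the naming pattern, not the actual mlvtools code."""
    stem = Path(notebook_file).stem.lower()
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", stem).strip("_")
    return f"mlvtools_{slug}"


def generated_paths(notebook_file: str) -> dict:
    """Return the file locations shown in the refactored project tree."""
    name = normalize_notebook_name(notebook_file)
    return {
        "script": f"pipeline/scripts/{name}.py",
        "dvc_step": f"pipeline/dvc/{name}_dvc",
    }
```

For names containing punctuation or accents the real tool may behave differently; check the generated files rather than relying on this sketch.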

Applying the process

For each Jupyter notebook, a parameterizable and executable Python script is generated. This script makes it easier to version code and automate pipeline executions.

Pipelines are composed of DVC steps. These steps can be generated directly from the Jupyter notebook based on parameters described in the docstring (notebook -> Python script -> DVC command).

Each time a DVC step is run, a DVC meta file ([normalized_notebook_name].dvc) is created. This metadata file represents a pipeline step: it is the DVC result of a step execution. These files must be tracked using Git. They are used to reproduce a pipeline.

Application:

For each step in the tutorial, the process remains the same.

Write a Jupyter notebook which corresponds to a pipeline step. (See the Jupyter notebook syntax section in the mlvtools documentation.)
Test your Jupyter notebook.
Add it under git.
Convert the Jupyter notebook into a configurable and executable Python script using ipynb_to_python.
```
ipynb_to_python -n ./pipeline/notebooks/[notebook_name] -o ./pipeline/steps/[python_script_name]
```
Ensure the executable and configurable Python script was created in ./pipeline/steps/[python_script_name].
```
./pipeline/steps/[python_script_name] -h
```

Create a DVC command to run the Python script using DVC.

gen_dvc -i ./pipeline/steps/[python_script_name] \
        --out-dvc-cmd ./scripts/cmd/[dvc_cmd_name]

Ensure the DVC command was created correctly.
Add generated command and Python script under git.
Add step inputs under DVC.
Run the DVC command ./scripts/cmd/[dvc_cmd_name].
Check that the DVC meta file ./[normalized_notebook_name].dvc is created.
Add DVC meta file under git
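
The tool invocations in the list above can be collected into a small helper, which is convenient when the same process is repeated for several notebooks. This is only a sketch: the helper name `conversion_commands` is hypothetical, it builds the command lines without running them, and it uses only the flags and paths quoted in the steps above. The Git and DVC tracking steps are left out.

```python
import shlex


def conversion_commands(notebook_name: str, script_name: str,
                        dvc_cmd_name: str) -> list:
    """Build the argument lists for one pipeline step, in execution order."""
    script_path = f"./pipeline/steps/{script_name}"
    dvc_cmd_path = f"./scripts/cmd/{dvc_cmd_name}"
    return [
        # Notebook -> configurable Python script.
        ["ipynb_to_python", "-n", f"./pipeline/notebooks/{notebook_name}",
         "-o", script_path],
        # Check that the generated script runs and shows its parameters.
        [script_path, "-h"],
        # Python script -> DVC command.
        ["gen_dvc", "-i", script_path, "--out-dvc-cmd", dvc_cmd_path],
        # Run the generated DVC command.
        [dvc_cmd_path],
    ]


if __name__ == "__main__":
    # The file names below are placeholders chosen for illustration.
    for argv in conversion_commands("Extract data.ipynb",
                                    "mlvtools_extract_data.py",
                                    "mlvtools_extract_data_dvc"):
        print(shlex.join(argv))
```

Keeping each command as a list of arguments means notebook names containing spaces are passed safely, and each list can be handed to `subprocess.run` once the printed commands look right.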

Key Features

| Need | Feature |
| --- | --- |
| Ignore notebook cell | `# No effect` |
| DVC inputs and outputs | `:dvc-in`, `:dvc-out` |
| Add extra parameters | `:dvc-extra` |
| Write whole DVC command | `:dvc-cmd` |
| Convert Jupyter Notebook to Python script | `ipynb_to_python` |
| Generate DVC command | `gen_dvc` |
| Create a pipeline step from a Jupyter Notebook | `ipynb_to_python`, `gen_dvc` |
| Add a pipeline step with different IO | Copy DVC step then edit inputs, outputs and meta file name |
| Reproduce a pipeline | `dvc repro [metafile]` |
| Reproduce a pipeline with no cache | `dvc repro -f [metafile]` |
| Reproduce a pipeline after an algo change | `dvc repro -f [metafile]` or run impacted step individually then complete the pipeline. |
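
To make the docstring keywords from the table more concrete, the sketch below scans a docstring for lines starting with `:dvc-in`, `:dvc-out`, `:dvc-extra` or `:dvc-cmd`. Only the keywords themselves come from this document. The line layout in `EXAMPLE_DOCSTRING` (keyword, optional parameter name, colon, value), the parameter names and the paths are assumptions made for illustration, and the parser is not how mlvtools reads notebooks; refer to the mlvtools documentation for the exact syntax.

```python
import re

# Assumed layout, for illustration only: ":keyword [param_name]: value".
EXAMPLE_DOCSTRING = """
:param str input_file: Path to the input data
:param str output_file: Path to the result
:dvc-in input_file: ./data/input.csv
:dvc-out output_file: ./data/output.csv
:dvc-extra: --rate 12
"""

# Groups: keyword, optional parameter name, value.
_DVC_LINE = re.compile(
    r"^:(dvc-in|dvc-out|dvc-extra|dvc-cmd)(?:\s+(\w+))?:\s*(.*)$"
)


def collect_dvc_keywords(docstring: str) -> dict:
    """Group (param_name, value) pairs by DVC keyword."""
    found = {}
    for line in docstring.splitlines():
        match = _DVC_LINE.match(line.strip())
        if match:
            keyword, param, value = match.groups()
            found.setdefault(keyword, []).append((param, value))
    return found
```

Calling `collect_dvc_keywords(EXAMPLE_DOCSTRING)` groups the declared inputs, outputs and extra parameters, which gives a quick way to review what a step declares before generating its DVC command.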

You may modify or duplicate a DVC command to change a hyperparameter or to run the same step twice with different parameters.

Do not modify generated Python scripts. They are generated from Jupyter notebooks, so changes should be made in the notebooks and the scripts should then be regenerated.

Tutorial

Environment

To complete this tutorial, clone this repository:

git clone https://github.com/peopledoc/mlvtools-tutorial

Create a Python virtual environment and activate it:

virtualenv --python python3 venv
source venv/bin/activate

Install requirements:

make develop

All other steps are explained in each use case.

Cases

How DVC works
mlvtools pipeline features (on simple cases)
Going further with more realistic use cases:

Talks
