# Get started DVC project
This is an auto-generated repository for use in DVC Get Started. It is a step-by-step quick introduction into basic DVC concepts.
The project is a natural language processing (NLP) binary classifier problem of predicting tags for a given StackOverflow question. For example, we want one classifier which can predict a post that is about the Python language by tagging it `python`.
🐛 Please report any issues found in this project here - example-repos-dev.
Python 3.6+ is required to run code from this repo.
```
$ git clone https://github.com/iterative/example-get-started
$ cd example-get-started
```
Now let's install the requirements. But before we do that, we strongly recommend creating a virtual environment with a tool such as `virtualenv`:
```
$ virtualenv -p python3 .env
$ source .env/bin/activate
$ pip install -r src/requirements.txt
```
This instruction assumes that DVC is already installed, as it is frequently used as a global tool like Git. If DVC is not installed, see the DVC installation guide on how to install DVC.
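For example, if you'd rather keep DVC inside the same virtual environment as the project, one common option (among the several described in the installation guide) is to install it with pip:

```
$ pip install dvc
```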
This DVC project comes with a preconfigured DVC remote storage that holds raw data (input), intermediate, and final results that are produced. This is a read-only HTTP remote.
```
$ dvc remote list
storage  https://remote.dvc.org/get-started
```
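Under the hood, that remote is just an entry in the project's DVC config file. A rough sketch of how it typically looks (the actual `.dvc/config` in this repo may contain additional settings):

```
$ cat .dvc/config
[core]
    remote = storage
['remote "storage"']
    url = https://remote.dvc.org/get-started
```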
You can run `dvc pull` to download the data:

```
$ dvc pull
```
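If the pull succeeds, the DVC-tracked files appear in the workspace. Listing the data directory is a quick sanity check (the layout shown here assumes the final state of the project, described in full at the end of this README):

```
$ ls data/
data.xml  data.xml.dvc  features  prepared
```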
Run `dvc repro` to reproduce the pipeline:

```
$ dvc repro
Data and pipelines are up to date.
```
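To see `dvc repro` actually re-execute something, you can tweak a value in `params.yaml` and reproduce again; only the stages that depend on the changed parameter should re-run (parameter names vary, so check the file first):

```
$ vim params.yaml     # change any parameter value
$ dvc repro           # affected stages re-run, unchanged ones come from cache
$ git diff dvc.lock   # the lock file records the new dependency/output hashes
```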
If you'd like to test commands like `dvc push` that require write access to the remote storage, the easiest way would be to set up a "local remote" on your file system:

This kind of remote is located in the local file system, but is external to the DVC project.

```
$ mkdir -p /tmp/dvc-storage
$ dvc remote add local /tmp/dvc-storage
```
You should now be able to run:
```
$ dvc push -r local
```
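To confirm the push worked, you can inspect the local remote directory, or (optionally) drop the local cache and pull everything back from it:

```
$ ls /tmp/dvc-storage   # content-addressed files uploaded by DVC
$ rm -rf .dvc/cache     # optional: remove the local cache...
$ dvc pull -r local     # ...and restore it from the local remote
```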
This project, with the help of Git tags, reflects the sequence of actions that are run in the DVC get started guide. Feel free to check out one of them and play with the DVC commands, having the playground ready (see the example after the tag list below).
- `0-git-init`: Empty Git repository initialized.
- `1-dvc-init`: DVC has been initialized. `.dvc/` with the cache directory created.
- `2-track-data`: Raw data file `data.xml` downloaded and tracked with DVC using `dvc add`. First `.dvc` file created.
- `3-config-remote`: Remote HTTP storage initialized. It's a shared read-only storage that contains all data artifacts produced during next steps.
- `4-import-data`: Use `dvc import` to get the same `data.xml` from the DVC data registry.
- `5-source-code`: Source code downloaded and put into Git.
- `6-prepare-stage`: Create `dvc.yaml` and the first pipeline stage with `dvc run`. It transforms XML data into TSV.
- `7-ml-pipeline`: Feature extraction and train stages created. It takes data in TSV format and produces two `.pkl` files that contain serialized feature matrices. Train runs random forest classifier and creates the `model.pkl` file.
- `8-evaluation`: Evaluation stage. Runs the model on a test dataset to produce its performance AUC value. The result is dumped into a DVC metric file so that we can compare it with other experiments later.
- `9-bigrams-model`: Bigrams experiment, code has been modified to extract more features. We run `dvc repro` for the first time to illustrate how DVC can reuse cached files and detect changes along the computational graph, regenerating the model with the updated data.
- `10-bigrams-experiment`: Reproduce the evaluation stage with the bigrams based model.
- `11-random-forest-experiments`: Reproduce experiments to tune the random forest classifier parameters and select the best experiment.
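For example, to rewind the playground to one of these steps (tag names come from the list above; note that the earliest tags predate the remote configuration, so `dvc pull` only makes sense from `3-config-remote` onward):

```
$ git checkout 3-config-remote   # detached HEAD at that tag
$ dvc pull                       # fetch the data tracked at this point
# ...explore, run DVC commands...
$ git checkout -                 # return to the previous revision
$ dvc checkout                   # sync DVC-tracked files back to it
```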
There are three additional tags:
- `baseline-experiment`: First end-to-end result that we have performance metrics for.
- `bigrams-experiment`: Second experiment (model trained using bigrams features).
- `random-forest-experiments`: Best of additional experiments tuning random forest parameters.
These tags can be used to illustrate `-a` or `-T` options across different DVC commands.
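For instance, `-T` (`--all-tags`) makes several DVC commands operate on every tag instead of just the current workspace; for example (exact output depends on your DVC version):

```
$ dvc metrics show -T     # show the tracked metrics for every tag
$ dvc push -T -r local    # push data referenced by all tags to the local remote
```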
The data files, DVC files, and results change as stages are created one by one. After cloning and using `dvc pull` to download data tracked by DVC, the workspace should look like this:
```
$ tree
.
├── README.md
├── data                  # <-- Directory with raw and intermediate data
│   ├── data.xml          # <-- Initial XML StackOverflow dataset (raw data)
│   ├── data.xml.dvc      # <-- .dvc file - a placeholder/pointer to raw data
│   ├── features          # <-- Extracted feature matrices
│   │   ├── test.pkl
│   │   └── train.pkl
│   └── prepared          # <-- Processed dataset (split and TSV formatted)
│       ├── test.tsv
│       └── train.tsv
├── dvc.lock
├── dvc.yaml              # <-- DVC pipeline file
├── model.pkl             # <-- Trained model file
├── params.yaml           # <-- Parameters file
├── prc.json              # <-- Precision-recall curve data points
├── roc.json              # <-- ROC curve data points
├── scores.json           # <-- Binary classifier final metrics (e.g. AUC)
└── src                   # <-- Source code to run the pipeline stages
    ├── evaluate.py
    ├── featurization.py
    ├── prepare.py
    ├── requirements.txt  # <-- Python dependencies needed in the project
    └── train.py
```
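With the workspace in this state, the pipeline itself can be inspected directly, for example:

```
$ dvc dag            # visualize the stage graph defined in dvc.yaml
$ dvc metrics show   # display the metrics from scores.json
```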