Posted on Sep 9, 2021 • Edited on Sep 10, 2021 • Originally published at p3r.one

DVC (Git For Data): A Complete Intro

#datascience #deeplearning #machinelearning #monitoring

As a data scientist or ML engineer, have you ever found experimenting with a model inconvenient? Training a model produces a model file. To try different parameters or data, people usually rename the existing model file, train the model again, and repeat this for every experiment.

The final directory looks something as shown in the below figure 😁. The same thing goes with data as well.

Fig 1: Typical models dir after experimentation

Code does not have this problem, because Git versions it: I can create a separate branch to change some code and see how it behaves without altering the previous version. What if we had a Git-like version control system for data and models? Wouldn't that be amazing? This is exactly the problem that DVC (Data Version Control) solves.

DVC is much more than just data and model versioning, though. It also helps create end-to-end pipelines, capture metrics related to those pipelines and experiment with ML models. In this blog, however, we will mainly look at the data versioning aspect of DVC.

Installing DVC

DVC can be used as a typical Python library and can be installed using package managers like pip or conda.

To install DVC using pip, execute the command below in a terminal where Python is available.

$ pip install dvc

To install DVC using conda, execute the commands below in a conda terminal.

$ conda install -c conda-forge mamba # installs much faster than conda
$ mamba install -c conda-forge dvc

Apart from these conventional installation techniques, DVC can be installed in several other ways, as described here on the official documentation website.

Getting Started

Code versioning is nothing new for us because we use Git every day. The problem with Git is that it becomes extremely slow when the tracked files are huge, say 100 GB. That is where DVC comes in: Git tracks the changes in code, whereas DVC tracks the changes in data as well as models.

The foundation of DVC consists of a few commands you can run along with Git to track large files, directories, or ML model files. In short, you can call DVC "Git for data".

Let us understand data versioning with DVC using a simple demo project. Assume that we want to create an end-to-end pipeline that prepares the MNIST dataset, then trains and evaluates a model on it, and that the code and data are available and structured as shown in the figure below.

Fig 2 MNIST pipeline file structure.

By observing the file structure, we can see that the MNIST dataset is stored in the data directory as a data.xml file and the code corresponding to every stage of the pipeline is stored in the src directory. As we now understand, tracking the code (src) is the responsibility of Git and tracking the data (data) is the responsibility of DVC.

Before we version the data, we need to initialise the repository as a DVC repository. Just like we run the command git init to initialise the current directory as a Git repository, we need to execute the command below to initialise the current directory as a DVC repository.

$ dvc init

This command creates a new directory .dvc, just like git init creates a .git directory. Two files, .gitignore and config, are generated in the .dvc directory. We have now successfully initialised this directory as a DVC repository.

Start Tracking

In Git, to start tracking changes in the code, we run the git add command, which stages the code changes in the local repository. DVC has a similar command, shown below, to start tracking data files.

$ dvc add data/data.xml

Note that the dvc add command accepts as arguments the names of the data or model files that you want DVC to track. Multiple filenames must be space-separated.

Executing this command makes DVC create two files: data/data.xml.dvc and data/.gitignore.

The file data/.gitignore looks something like this:

/data.xml

This tells Git not to track the original data.xml file, which is expected because tracking data.xml is now the responsibility of DVC and not Git.

All files ending with the extension .dvc are special files which contain the information about where the actual data files are stored and cached. If you take a look at data/data.xml.dvc, it stores an MD5 signature as shown below.

md5: a7cd139231cc35ed63541ce3829b96db

Looking at the complete picture, Git will not track data/data.xml because it is listed in data/.gitignore, but Git will track data/data.xml.dvc, which holds the information about where the actual data.xml is stored.

The two Git-tracked files mentioned above are stored in a Git remote repository on GitHub, Bitbucket or GitLab, while the DVC-tracked data file is stored in a DVC remote repository on a file system of our choice, as described in the next section.
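To make the pointer idea concrete, here is a minimal Python sketch of computing the MD5 digest of a data file, the kind of value the .dvc file records. It uses only the standard library. The function name file_md5 and the sample file content are made up for illustration, and this is not DVC's own implementation.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use flat even for very large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical stand-in for data/data.xml, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "data.xml"
    sample.write_bytes(b"<rows></rows>\n")
    first = file_md5(sample)

    # Changing the content changes the digest, so a new version is detectable.
    sample.write_bytes(b"<rows><row/></rows>\n")
    second = file_md5(sample)

print(len(first), first != second)
```

Because the digest depends only on the file content, a small text file holding it is enough for Git to record which version of a large file belongs to a commit.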

Once we do this, the directory structure for MNIST pipeline project looks something like this.

Fig 3 File structure after dvc add

Storing & Sharing

The actual data files (which are not tracked by Git) are stored in a file system of your choice, which can be called the DVC remote repository. It can be an AWS S3 bucket, Google Drive, a Google Cloud Storage bucket, Azure storage, Object Storage Service or any custom file system. Depending on where you store the data, you need to install extra dependencies like dvc-s3, dvc-azure, dvc-gdrive, dvc-gs, dvc-oss or dvc-ssh. You can learn more about their installation here on the documentation website.

To push the data files to the DVC remote repository, we first need to configure the remote where the data will be stored. We can do that using the commands below.

$ dvc remote add -d storage s3://mybucket/dvcstore
$ git add .dvc/config
$ git commit -m "Configure remote storage"

The dvc remote add command registers the remote origin of the DVC repository where the data will be stored. Executing it writes this origin information to the .dvc/config file, which Git tracks. That is why we also need to commit the changes made to .dvc/config.

Adding the remote origin doesn't automatically push the data to the DVC remote repository. We can push the data files using the command below:

$ dvc push

It pushes all the data and model files to which the dvc add command has been applied.

Retrieving

There is a reason why dvc remote add stores the remote origin information in .dvc/config. The next time somebody clones this Git repository, the clone will not contain any data. It will only have the MD5 signature of the data in the corresponding .dvc file and the location of the DVC remote repository stored in .dvc/config. So, the complete data can be pulled using the command below:

$ dvc pull

It pulls the data and model files referenced by the .dvc files from the DVC remote repository.
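The push and pull round trip can be pictured with a toy content-addressed store in Python. This is a sketch under stated assumptions, not how DVC is implemented: the push and pull functions are hypothetical helpers, a local temporary directory stands in for the remote bucket, and the two-character prefix layout is only an illustrative choice.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def md5_of(path):
    """Hex MD5 digest of a (small) file."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def push(data_file, remote_dir):
    """Copy a file into the 'remote' under its hash and return the pointer."""
    digest = md5_of(data_file)
    target = Path(remote_dir) / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, target)
    return digest


def pull(digest, remote_dir, destination):
    """Restore a file from the 'remote' using only the recorded hash."""
    source = Path(remote_dir) / digest[:2] / digest[2:]
    Path(destination).parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    remote = root / "remote"  # stands in for s3://mybucket/dvcstore
    original = root / "repo" / "data" / "data.xml"
    original.parent.mkdir(parents=True)
    original.write_bytes(b"<rows><row id='1'/></rows>\n")

    pointer = push(original, remote)  # conceptually what dvc push achieves

    # A fresh clone has the pointer (via Git) but not the data itself.
    clone_copy = root / "clone" / "data" / "data.xml"
    pull(pointer, remote, clone_copy)  # conceptually what dvc pull achieves

    restored_matches = clone_copy.read_bytes() == original.read_bytes()

print(restored_matches)
```

The point of the sketch is that the clone needs only two small pieces of information, the hash and the remote location, to recover the full data file.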

ML Pipeline & Versioning

While working with DVC pipelines, we have to create two files: dvc.yaml, which contains information about the stages of the ML pipeline, and params.yaml, which contains the parameters that the different stages use. You can take a look at the structure of both files in the example repository.

DVC is flexible enough that you can run the complete pipeline with the single command dvc exp run. The command can do more than that, though.

If you want to run your experiments with different parameters, you can pass values for the parameters defined in params.yaml as shown below.

$ dvc exp run -S prepare.test_split=0.2 -S train.learning_rate=0.0001

Note that whenever you run experiments using the above command, DVC caches the outputs, so it only reruns those stages which are affected by the changed parameters. Moreover, if one of the stages in the pipeline is very time-consuming and you want to checkpoint it so that a crashed run can resume from where it stopped, you can define checkpoint: true in the output of that stage in the dvc.yaml file and DVC will keep checkpointing your results for you.
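The idea of parameter overrides and selective reruns can be sketched in plain Python. Everything here is an assumption made for illustration: the default parameter values, the stage layout, and the helper names apply_overrides and stages_to_rerun are hypothetical, and the rerun rule is a simplification rather than DVC's actual caching logic.

```python
import copy

# Hypothetical contents of params.yaml, written as a nested dict.
params = {
    "prepare": {"test_split": 0.3},
    "train": {"learning_rate": 0.001},
}

# Hypothetical pipeline: each stage lists the parameter sections it reads
# and the stages whose outputs it consumes.
stages = {
    "prepare": {"params": ["prepare"], "deps": []},
    "train": {"params": ["train"], "deps": ["prepare"]},
    "evaluate": {"params": [], "deps": ["train"]},
}


def apply_overrides(base, overrides):
    """Apply 'section.key' -> value overrides to a copy of the params."""
    updated = copy.deepcopy(base)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = updated
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return updated


def stages_to_rerun(old, new, pipeline):
    """Return stages whose params changed or whose upstream stage reruns."""
    rerun = []
    for name, spec in pipeline.items():  # dict order is pipeline order here
        changed = any(old.get(s) != new.get(s) for s in spec["params"])
        upstream = any(dep in rerun for dep in spec["deps"])
        if changed or upstream:
            rerun.append(name)
    return rerun


only_train = apply_overrides(params, {"train.learning_rate": 0.0001})
print(stages_to_rerun(params, only_train, stages))

both = apply_overrides(
    params, {"prepare.test_split": 0.2, "train.learning_rate": 0.0001}
)
print(stages_to_rerun(params, both, stages))
```

In this toy model, changing only the learning rate leaves the prepare stage untouched, which is the kind of saving the caching behaviour is meant to provide.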

DVC keeps track of all the experiments that you perform, and you can see the history of previously run experiments using dvc exp show. You can narrow down what this history displays as well. For example, dvc exp show --include-params=train limits the parameter columns of the table to those of the train stage. Not just that, you can also see the difference between the parameters and resulting metrics of two experiments using dvc exp diff exp-1dad0 exp-1df77, where exp-1dad0 and exp-1df77 are the names of two experiments as shown in the history given by dvc exp show. Moreover, you can navigate to any previously run experiment using dvc exp apply. For example, if you run the command dvc exp apply exp-1dad0, the files will be changed as per the parameter changes specified in that particular experiment.

DVC gives us the flexibility to commit an experiment into a separate Git branch. As shown in the command below, using the dvc exp branch command you can commit the experiment exp-1dad0 to a separate branch my_branch.

$ dvc exp branch exp-1dad0 my_branch
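Comparing two experiments boils down to a key-by-key diff of their parameters and metrics. The Python sketch below shows that idea only. The function diff_experiments is a hypothetical helper, and the parameter and metric values are made up for illustration, not results from any real run.

```python
def diff_experiments(left, right):
    """Return {key: (left_value, right_value)} for every key that differs."""
    changes = {}
    for key in sorted(set(left) | set(right)):
        if left.get(key) != right.get(key):
            changes[key] = (left.get(key), right.get(key))
    return changes


# Made-up parameters and metrics, for illustration only.
exp_a = {"train.learning_rate": 0.001, "prepare.test_split": 0.2, "accuracy": 0.91}
exp_b = {"train.learning_rate": 0.0001, "prepare.test_split": 0.2, "accuracy": 0.93}

for key, (before, after) in diff_experiments(exp_a, exp_b).items():
    print(f"{key}: {before} -> {after}")
```

Only the keys whose values differ are reported, so the unchanged test split does not appear in the output.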

Alternatively, you can also commit different experiments in different branches just by switching the branches as described below.

We already know how to switch between different versions of code by changing the Git branch. Similarly, we can change the version of the data.

$ git checkout experiment_v2
$ dvc checkout data/data.xml

We can observe that changing the version of data is now as simple as changing the Git branch. Note that to switch to another version of data, we first need to switch to the Git branch where that version of data resides. Also note that you must apply dvc checkout to the files whose data version you want to switch, since git checkout only changes the code version.

How does data versioning simplify experimentation?

As described at the beginning of this blog, we no longer need to keep multiple copies of model files and keep renaming them. We can have a single model file which is versioned using DVC during experimentation. The image below explains this.

Fig 4 Model versioning for experimentation

Let us suppose we are working in the Git branch v1, where the hyper-parameter n_estimators used to train the model has the value 50, and we have trained a model which is stored as model.pkl. Now, we want to experiment with the value 100 for the same hyper-parameter. We don't need to rename the old model file anymore. We switch to another Git branch, v2, using the git checkout v2 command and switch to the corresponding model file version using dvc checkout model.pkl. We then change the value of n_estimators to 100 in the code in the v2 branch and train the model again. The new model.pkl file generated in the v2 branch corresponds to n_estimators = 100, and the model.pkl file in the v1 branch corresponds to n_estimators = 50.

Although this is a very small experiment, the same approach can be applied to a complex experimentation process. With data versioning, the experimentation process becomes very flexible, and we don't need to worry about tracking data and model files by hand.
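The v1 and v2 scenario can be modelled in a few lines of Python: each branch records only a hash, and a cache maps hashes to file contents. This is a toy sketch, not DVC's implementation. The helpers commit_model and checkout are hypothetical, and placeholder byte strings stand in for real trained models.

```python
import hashlib

cache = {}     # hash -> file content, stands in for the DVC cache
branches = {}  # branch name -> recorded hyper-parameter and model hash


def commit_model(branch, n_estimators, model_bytes):
    """Store the model once in the cache and record its hash on the branch."""
    digest = hashlib.md5(model_bytes).hexdigest()
    cache[digest] = model_bytes
    branches[branch] = {"n_estimators": n_estimators, "model_md5": digest}


def checkout(branch):
    """Return the model.pkl content that belongs to the given branch."""
    return cache[branches[branch]["model_md5"]]


# Placeholder bytes stand in for real trained models.
commit_model("v1", 50, b"model trained with n_estimators=50")
commit_model("v2", 100, b"model trained with n_estimators=100")

workspace_model = checkout("v1")
print(workspace_model.decode())

workspace_model = checkout("v2")
print(workspace_model.decode())
```

There is a single model.pkl in the workspace at any time, and which content it holds depends on the branch that is checked out, so no renamed copies are needed.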

Large datasets versioning

When the dataset is too large, we need a mechanism that is very efficient in terms of space and performance to share different versions of data. That is why it is suggested to store the data on a shared volume or on an external system. DVC supports both of these mechanisms, as described below.

A shared cache can be set up to store, version and access a lot of data on a large shared volume efficiently.
As we described earlier, the more advanced approach is to store and version the data directly in remote storage (e.g. an S3 bucket, Google Drive, a GCP storage bucket etc.). You can look here to know more about how to configure external data storage for DVC.

Conclusion

With this, I hope you were able to get a quick overview of how DVC does data versioning. DVC also has features to streamline ML reproducibility and support cloud provisioning through cloud providers. This should come in handy for the entire MLOps community, as teams can now collaborate without depending on monolithic tools.

If you would like to know more and get your hands dirty, the documentation is a great place to start.

If you prefer videos, they also have a YouTube channel with detailed tutorials!

Happy Versioning!