Posted on Jun 23, 2018

From Subversion to Git: Snapshots

What does it mean that we talk about snapshots of our Git repository, while in Subversion we think in terms of file changes? For me at least, the key to understanding Git is that every commit is, in fact, a snapshot of the entire project. Not a list of patches. Not a diff against the previous commit. Just a snapshot of the whole thing.

Git snapshots everything, they said. Coming from Subversion, this is hard to believe. How would a version control system scale if it stored the entire project state again and again, with each and every commit?

First, let's do a little experiment on how one might approach version control intuitively, considering neither Git nor Subversion.

Poor Man's Version Control

Let's say we have a project called myapp stored in a directory of that name. All it contains is a main.c file:

    myapp
      main.c

Without version control, how would you track changes in order to be able to restore a particular state later? Easy enough, you might say: just create a copy of the entire myapp directory and call it something like myapp-<version>. After a while, you would end up with a bunch of backup directories:

    myapp-01
      main.c
    myapp-<...>
      main.c
    myapp-<N>
      main.c

To step back to a previous state in history, you might replace the entire myapp directory with one of the snapshots created previously.
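The copy-and-restore cycle described above can be sketched in a few lines of shell. This is only an illustration under assumptions: the temporary working directory, the contents of main.c and the snapshot names myapp-01 and myapp-02 are made up for the example.

```shell
set -e
# Work in a throwaway directory so nothing real is touched.
work=$(mktemp -d)
cd "$work"

# The project: a single source file.
mkdir myapp
printf 'int main(void) { return 0; }\n' > myapp/main.c

# Snapshot 1: copy the entire project directory.
cp -r myapp myapp-01

# Change the project, then take snapshot 2.
printf 'int main(void) { return 1; }\n' > myapp/main.c
cp -r myapp myapp-02

# Step back in history: replace the project with snapshot 1.
rm -r myapp
cp -r myapp-01 myapp

restored=$(cat myapp/main.c)
echo "$restored"
```

Note that every snapshot holds a full copy of main.c, whether or not the file changed in between.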

In order to avoid wasting space by keeping so much redundant information, you might consider putting everything into a gzipped archive and deleting all the backup directories:

    tar -cvzf myapp.tgz myapp-*
    rm -r myapp-*

Interestingly, this naive approach is not completely different from the way Git actually works.

How Git does it

Every time you create a commit, Git takes the content of each added or modified file, compresses it and stores it in an internal object database together with a commit object that holds some meta information¹. This approach is easy to reason about, as it's no more difficult than what we've done in the simple attempt above.
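To make the object database a bit more tangible, here is a minimal sketch of how Git names a file's content. It goes slightly beyond the text above: Git (in its default SHA-1 mode) identifies a blob by the SHA-1 of a short header, "blob <size>" followed by a NUL byte, plus the raw content. The helper name blob_id and the file names are hypothetical, and the sketch assumes sha1sum from GNU coreutils is installed.

```shell
set -e
# blob_id FILE: print the ID Git would assign to FILE's content as a blob.
# Hypothetical helper; assumes sha1sum (GNU coreutils) is available.
blob_id() {
    size=$(wc -c < "$1" | tr -d ' ')
    { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d ' ' -f 1
}

work=$(mktemp -d)
printf 'test content\n' > "$work/a.txt"
printf 'test content\n' > "$work/b.txt"    # same content, different name
printf 'other content\n' > "$work/c.txt"   # different content

id_a=$(blob_id "$work/a.txt")
id_b=$(blob_id "$work/b.txt")
id_c=$(blob_id "$work/c.txt")

echo "a.txt: $id_a"
echo "b.txt: $id_b"
echo "c.txt: $id_c"
```

Because the ID depends only on the content, a.txt and b.txt map to the same object, so an unchanged file costs nothing extra in the next snapshot. In a real repository, git hash-object should report the same ID for the same content.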

You may notice a big drawback here, though: although individual file contents are compressed, which is fine, even a small change to a file causes its entire content to be stored again. Over many commits, this leads to massive duplication inside the object database.

In Git, this scalability problem is simply ignored at the first stage and solved later on. In a process called packing, all the objects are delta compressed and moved into one or more packfiles. Git does this automatically on several occasions; you can also trigger it manually using git gc --aggressive.
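The benefit of delta compression can be illustrated without Git at all. The sketch below is a stand-in, not Git's actual packfile format: it uses plain diff output as the "delta" and gzip as the compressor, and the file names and sizes are invented for the example.

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Version 1 has 2000 distinct lines; version 2 changes exactly one of them.
seq 1 2000 | sed 's/^/line number /' > v1.txt
sed '1000s/.*/this line was changed/' v1.txt > v2.txt

# First stage: every version is compressed and stored in full.
gzip -c v1.txt > v1.gz
gzip -c v2.txt > v2.gz

# Packed stage (stand-in): keep v1 in full and store v2 as a delta against it.
# diff exits with status 1 when the files differ, hence the "|| true".
diff v1.txt v2.txt > v2.delta || true
gzip -c v2.delta > v2.delta.gz

full_size=$(wc -c < v2.gz | tr -d ' ')
delta_size=$(wc -c < v2.delta.gz | tr -d ' ')
echo "v2 stored in full: $full_size bytes"
echo "v2 stored as delta: $delta_size bytes"
```

The delta for a one-line change is far smaller than a second full copy, which is the kind of saving packing aims for. In a real repository, comparing git count-objects -v before and after git gc gives a rough picture of the same effect.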

The drawing below shows a simplified illustration of the storage of compressed file content into blob objects as well as the packfile generation.

Conclusion

Even for everyday Git usage, it is vitally important to understand a bit of its inner workings. It's good to see that the basic idea is not inherently complex, but more or less identical to what we might come up with ourselves.

This understanding gives us the power to get a grasp of all the more advanced features like branching, merging and rebasing.

This post was originally published on steffen.ronalter.de

References

¹ For all the details, please refer to the section Git Objects of the excellent Pro Git book.

Top comments (2)

Andreas Schnapp

• Jun 25 '18 • Edited on Jun 25

In my opinion, repacking is just a small optimization to save some disk space and not a fundamental concept of Git.

Git would work pretty well without this function. Git never stores the exact same object twice. So if you have two commits that differ by only one file change, every file except the changed one is reused for the second commit. (Underneath, it's like a key-value store that uses the whole content to generate a unique key (SHA-1).)

The changed file will be stored a second time here as a completely new file. Git does not track file changes. So if this file is big, it could be a bit of a waste of disk space (normally not so important today). But to make such a situation more efficient, you can use the repack feature.

Steffen Ronalter

Hi, I’m Steffen. I create embedded software for a living. I always strive to improve the quality of my code by studying new methods, paradigms and technologies.