casync — Content-Addressable Data Synchronization Tool
What is this?
- A combination of the rsync algorithm and content-addressable storage
- An efficient way to store and retrieve multiple related versions of large file systems or directory trees
- An efficient way to deliver and update OS, VM, IoT and container images over the Internet in an HTTP and CDN friendly way
- An efficient backup system
See the Announcement Blog Story for a comprehensive introduction. The medium-length explanation goes something like this:
Encoding: Let's take a large linear data stream, split it into variable-sized chunks (the size of each being a function of the chunk's contents), and store these chunks in individual, compressed files in some directory, each file named after a strong hash value of its contents, so that the hash value may be used as a key for retrieving the full chunk data. Let's call this directory a "chunk store". At the same time, generate a "chunk index" file that lists these chunk hash values plus their respective chunk sizes in a simple linear array. The chunking algorithm is supposed to create variable, but similarly sized chunks from the data stream, and do so in a way that the same data results in the same chunks even if placed at varying offsets. For more information see this blog story.
Decoding: Let's take the chunk index file, and reassemble the large linear data stream by concatenating the uncompressed chunks retrieved from the chunk store, keyed by the listed chunk hash values.
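The encode/decode steps above can be sketched in a few lines of Python. This is a toy illustration, not casync's actual format: it uses fixed 64-byte chunks instead of content-defined ones, SHA-256 instead of SHA512/256, and zlib in place of zstd.

```python
import hashlib
import zlib

CHUNK_SIZE = 64  # toy fixed size; real casync cuts variable-sized, content-defined chunks

def encode(data: bytes, store: dict) -> list:
    """Split data into chunks, put each compressed chunk into the store
    under its digest, and return the chunk index (digest + size pairs)."""
    index = []
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        store[digest] = zlib.compress(chunk)  # stored once, no matter how often referenced
        index.append((digest, len(chunk)))
    return index

def decode(index: list, store: dict) -> bytes:
    """Reassemble the stream by looking up each listed digest in the store."""
    return b"".join(zlib.decompress(store[d]) for d, _size in index)

store = {}
data = bytes(range(256)) * 4       # repetitive input: most chunks recur
index = encode(data, store)
assert decode(index, store) == data
```

Note how the repetitive input produces 16 index entries but only 4 distinct chunks in the store; identical chunks are stored once and merely referenced again.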
As an extra twist, we introduce a well-defined, reproducible, random-access serialization format for directory trees (think: a more modern tar), to permit efficient, stable storage of complete directory trees in the system, simply by serializing them and then passing them into the encoding step explained above.
Finally, let's put all this on the network: for each image you want to deliver, generate a chunk index file and place it on an HTTP server. Do the same with the chunk store, and share it between the various index files you intend to deliver.
Why bother with all of this? Streams with similar contents will result in mostly the same chunk files in the chunk store. This means it is very efficient to store many related versions of a data stream in the same chunk store, thus minimizing disk usage. Moreover, when transferring linear data streams, chunks already known on the receiving side can be made use of, thus minimizing network traffic.
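A toy content-defined chunker makes this deduplication effect visible. The boundary rule below (a hash of a 16-byte sliding window, checked against a mask) is a made-up stand-in for casync's buzhash-based cutter, but it shows why an insertion near the start of a stream perturbs only the chunks around the edit:

```python
import hashlib
import random

WINDOW, MASK, MAX_CHUNK = 16, 63, 256  # toy parameters: roughly 64-byte average chunks

def chunks(data: bytes):
    """Yield content-defined chunks: a chunk ends wherever the hash of the
    trailing WINDOW bytes hits the boundary condition (or at MAX_CHUNK)."""
    start = 0
    for i in range(1, len(data) + 1):
        window = data[max(i - WINDOW, 0):i]
        h = int.from_bytes(hashlib.sha256(window).digest()[:4], "big")
        if (h & MASK) == 0 or i - start >= MAX_CHUNK or i == len(data):
            yield data[start:i]
            start = i

rnd = random.Random(0)
base = bytes(rnd.getrandbits(8) for _ in range(8192))
edited = b"inserted" + base  # same content, shifted by 8 bytes

a = {hashlib.sha256(c).hexdigest() for c in chunks(base)}
b = {hashlib.sha256(c).hexdigest() for c in chunks(edited)}
# Since boundaries depend only on local window contents, the cut points
# resynchronize shortly after the edit, and both versions share the chunks
# that follow -- a shared chunk store grows only by the few chunks near the edit.
assert b"".join(chunks(base)) == base  # chunking is lossless
assert len(a & b) >= 1                 # chunks are shared across versions
```

With fixed-size chunking the 8-byte shift would instead change every chunk digest, which is exactly why casync cuts on content rather than on offsets.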
Why is this different from rsync or OSTree, or similar tools? Well, one major difference between casync and those tools is that we remove file boundaries before chunking things up. This means that small files are lumped together with their siblings and large files are chopped into pieces, which permits us to recognize similarities in files and directories beyond file boundaries, and makes sure our chunk sizes are pretty evenly distributed, without the file boundaries affecting them.
The "chunking" algorithm is based on the buzhash rolling hash function. SHA512/256 is used as a strong hash function to generate digests of the chunks (alternatively: SHA256). zstd is used to compress the individual chunks (alternatively xz or gzip).
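A minimal buzhash can be sketched as follows. The table values and window size here are arbitrary demo choices, not the ones casync ships; the point is that sliding the window by one byte updates the hash in O(1) instead of rehashing the whole window:

```python
import random

BITS = 32
MASK32 = (1 << BITS) - 1
# Per-byte table of pseudo-random 32-bit words (casync uses a fixed table;
# this one is merely seeded deterministically for the demo).
_rnd = random.Random(0)
TABLE = [_rnd.getrandbits(BITS) for _ in range(256)]

def rotl(x: int, k: int) -> int:
    """Rotate a 32-bit word left by k bits."""
    k %= BITS
    return ((x << k) | (x >> (BITS - k))) & MASK32

def buzhash(window: bytes) -> int:
    """Hash of a full window: XOR of table entries, each rotated by its distance
    from the end of the window."""
    n = len(window)
    h = 0
    for i, byte in enumerate(window):
        h ^= rotl(TABLE[byte], n - 1 - i)
    return h

def roll(h: int, out: int, new: int, n: int) -> int:
    """Slide the window one byte: drop `out`, append `new`, in O(1)."""
    return rotl(h, 1) ^ rotl(TABLE[out], n) ^ TABLE[new]

data = bytes(random.Random(1).getrandbits(8) for _ in range(512))
n = 48
h = buzhash(data[:n])
for i in range(1, len(data) - n + 1):
    h = roll(h, data[i - 1], data[i + n - 1], n)
    assert h == buzhash(data[i:i + n])  # rolling update matches full recomputation
```

A chunker then declares a cut point wherever the rolled hash satisfies some condition (e.g. a fixed set of low bits being zero), which is what makes the boundaries depend on content rather than on offsets.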
Is this new? Conceptually, not too much. This uses well-known concepts, implemented in a variety of other projects, and puts them together in a moderately new, nice way. That's all. The primary influences are rsync and git, but there are other systems that use similar algorithms, in particular:
- BorgBackup (http://www.borgbackup.org/)
- bup (https://bup.github.io/)
- CAFS (https://github.com/indyjo/cafs)
- dedupfs (https://github.com/xolox/dedupfs)
- LBFS (https://pdos.csail.mit.edu/archive/lbfs/)
- restic (https://restic.github.io/)
- Tahoe-LAFS (https://tahoe-lafs.org/trac/tahoe-lafs)
- tarsnap (https://www.tarsnap.com/)
- Venti (https://en.wikipedia.org/wiki/Venti)
- zsync (http://zsync.moria.org.uk/)
(ordered alphabetically, not in order of relevance)
File Suffixes

- .catar → archive containing a directory tree (like "tar")
- .caidx → index file referring to a directory tree (i.e. a .catar file)
- .caibx → index file referring to a blob (i.e. any other file)
- .castr → chunk store directory (where we store chunks under their hashes)
- .cacnk → a compressed chunk in a chunk store (i.e. one of the files stored below a .castr directory)
```
# casync list /home/lennart
# casync digest /home/lennart
# casync mtree /home/lennart          (BSD mtree(5) compatible manifest)
```
```
# casync make /home/lennart.catar /home/lennart
# casync extract /home/lennart.catar /home/lennart
# casync list /home/lennart.catar
# casync digest /home/lennart.catar
# casync mtree /home/lennart.catar
# casync mount /home/lennart.catar /home/lennart
# casync verify /home/lennart.catar /home/lennart          (NOT IMPLEMENTED YET)
# casync diff /home/lennart.catar /home/lennart            (NOT IMPLEMENTED YET)
```
```
# casync make --store=/var/lib/backup.castr /home/lennart.caidx /home/lennart
# casync extract --store=/var/lib/backup.castr /home/lennart.caidx /home/lennart
# casync list --store=/var/lib/backup.castr /home/lennart.caidx
# casync digest --store=/var/lib/backup.castr /home/lennart.caidx
# casync mtree --store=/var/lib/backup.castr /home/lennart.caidx
# casync mount --store=/var/lib/backup.castr /home/lennart.caidx /home/lennart
# casync verify --store=/var/lib/backup.castr /home/lennart.caidx /home/lennart          (NOT IMPLEMENTED YET)
# casync diff --store=/var/lib/backup.castr /home/lennart.caidx /home/lennart            (NOT IMPLEMENTED YET)
```
```
# casync digest --store=/var/lib/backup.castr fedora25.caibx
# casync mkdev --store=/var/lib/backup.castr fedora25.caibx
# casync verify --store=/var/lib/backup.castr fedora25.caibx /home/lennart/Fedora25.raw          (NOT IMPLEMENTED YET)
```
```
# casync make foobar:/srv/backup/lennart.caidx /home/lennart
# casync extract foobar:/srv/backup/lennart.caidx /home/lennart2
# casync list foobar:/srv/backup/lennart.caidx
# casync digest foobar:/srv/backup/lennart.caidx
# casync mtree foobar:/srv/backup/lennart.caidx
# casync mount foobar:/srv/backup/lennart.caidx /home/lennart
```
```
# casync extract http://www.foobar.com/lennart.caidx /home/lennart
# casync list http://www.foobar.com/lennart.caidx
# casync digest http://www.foobar.com/lennart.caidx
# casync mtree http://www.foobar.com/lennart.caidx
# casync extract --seed=/home/lennart http://www.foobar.com/lennart.caidx /home/lennart2
# casync mount --seed=/home/lennart http://www.foobar.com/lennart.caidx /home/lennart2
```
```
# casync gc /home/lennart-20170101.caidx /home/lennart-20170102.caidx /home/lennart-20170103.caidx
# casync gc --backup /var/lib/backup/backup.castr /home/lennart-*.caidx
# casync make /home/lennart.catab /home/lennart          (NOT IMPLEMENTED)
```
casync uses the Meson build system. To build casync, install Meson (at least 0.47), as well as the necessary build dependencies (gcc, libzstd-dev, liblzma-dev, libacl1-dev, libfuse-dev, libudev-dev, python3-sphinx). Then run:
```
# meson build && ninja -C build && sudo ninja -C build install
```