# msrsync

Multi-stream rsync wrapper ([jbd/msrsync](https://github.com/jbd/msrsync)).


**This project is not actively developed.** Please have a look at the alternatives in the Motivation section.

msrsync (multi-stream rsync) is a Python wrapper around `rsync`. It only depends on `python >= 2.6` and `rsync`.

It splits the transfer into multiple buckets while the source is being scanned, which should help maximize the usage of the available bandwidth by running a configurable number of `rsync` processes in parallel. The main limitation is that it does not handle remote source or destination directories; they must be locally accessible (local disk, NFS/CIFS/other mountpoint). I hope to address this in the near future.
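The fan-out model described above can be sketched as follows. This is a simplified illustration, not msrsync's actual code: the bucket contents and the `run_bucket` helper are made up, and the real `rsync` invocation is replaced by a stub so the sketch stays self-contained.

```python
# Simplified sketch of the msrsync fan-out model: buckets of paths are
# handed to a pool of workers, each of which would normally spawn one
# rsync process per bucket.  The rsync call is stubbed out here.
import multiprocessing


def run_bucket(bucket):
    # A real worker would run something like:
    #   rsync -aS --numeric-ids --from0 --files-from=<bucket-file> SRC/ DEST/
    return len(bucket)  # stub: pretend len(bucket) entries were transferred


if __name__ == "__main__":
    buckets = [["a", "b"], ["c"], ["d", "e", "f"]]  # hypothetical buckets
    pool = multiprocessing.Pool(processes=4)        # like msrsync -p 4
    try:
        transferred = sum(pool.map(run_bucket, buckets))
    finally:
        pool.close()
        pool.join()
    print(transferred)
```

The crawl and the transfers overlap: in msrsync the source scan keeps producing buckets while workers are already copying earlier ones, which is where the speedup over a single `rsync` comes from.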

## Quick example

```shell
$ msrsync -p 4 /source /destination  # you can also use the -P/--progress and --stats options
```

This copies the /source directory into the /destination directory (same behaviour as `rsync` regarding trailing-slash handling) using 4 `rsync` processes (with `-aS --numeric-ids` as the default options; they can be overridden with the `--rsync` option). `msrsync` splits the file and directory list into buckets of at most 1G or 1000 files (see the `--size` and `--files` options) before feeding them to the `rsync` processes in parallel using the `--files-from` option. As long as the source and the destination can cope with the parallel I/O (think big boring "enterprise grade" NAS), it should be faster than a single `rsync`.
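The bucketing step described above can be sketched like this. It is a rough illustration of the count and size limits, not the actual msrsync implementation; the function and variable names are made up.

```python
# Split a stream of (path, size) entries into buckets capped at
# max_files entries or max_size cumulative bytes, mirroring the
# --files and --size limits described above.  Illustration only.

def make_buckets(entries, max_files=1000, max_size=1 << 30):
    bucket, count, size = [], 0, 0
    for path, fsize in entries:
        # start a new bucket when either limit would be exceeded
        if bucket and (count >= max_files or size + fsize > max_size):
            yield bucket
            bucket, count, size = [], 0, 0
        bucket.append(path)
        count += 1
        size += fsize
    if bucket:
        yield bucket  # last, partially filled bucket


if __name__ == "__main__":
    # five hypothetical 400 MiB files against the default 1 GiB size limit
    entries = [("file%d" % i, 400 * 2**20) for i in range(5)]
    for b in make_buckets(entries):
        print(b)
```

With these numbers, two 400 MiB files fit under the 1 GiB cap but a third does not, so the five files end up in three buckets.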

msrsync shares the same spirit as fpart (and its associated fpsync tool) by Ganaël Laplanche, or parsync by Harry Mangalam. These are two fantastic, much more complete tools used in the field to do real work. Please check them out; they might be what you're looking for.

You can also check fcp from the pcircle project. It looks very powerful. See the associated publication.

## Motivation

Why write msrsync if tools like fpart, parsync or pftool already exist? While reasonable, their dependencies can be a point of friction given the constraints on a given system. When you're lucky, you can use your package manager (fpart seems to be well supported across various GNU/Linux and FreeBSD distributions: FreeBSD, Debian, Ubuntu, Archlinux, OBS) to deal with the requirements, but more often than not I found myself struggling with the sad state of the machine I was working with.

That's why the only dependencies of msrsync are `python >= 2.6` and `rsync`. Why python 2.6? I'm aiming at RHEL6-like distributions as a minimum requirement here, so I'm stuck with python 2.6. I miss some cool features, but that's part of the project.

The devil is in the details. If you need a starting point to think about data migration, this overview by Jeff Layton is very informative: Moving Your Data – It’s Not Always Pleasant.

The "How to transfer large amounts of data via network" article by the parsync author is updated regularly and is worth a read as well.

If you can read French, I co-wrote an article with Ganaël Laplanche about fpart: Parallélisez vos transferts de fichiers ("Parallelize your file transfers").

You might also be interested in this Intel whitepaper on data migration: Data Migration with Intel® Enterprise Edition for Lustre* Software, which mentions all of those tools (but not msrsync).

## Requirements

`python >= 2.6` and `rsync`

## Installation

msrsync is a single python file; you just have to download it. Or, if you prefer, you can clone the repository and use the provided Makefile:

```shell
$ wget https://raw.githubusercontent.com/jbd/msrsync/master/msrsync && chmod +x msrsync
```

or

```shell
$ git clone https://github.com/jbd/msrsync && cd msrsync && sudo make install
```

## Usage

```
$ msrsync --help
usage: msrsync [options] [--rsync "rsync-options-string"] SRCDIR [SRCDIR2...] DESTDIR
   or: msrsync --selftest

msrsync options:
    -p, --processes ...   number of rsync processes to use [1]
    -f, --files ...       limit buckets to <files> files number [1000]
    -s, --size ...        limit partitions to BYTES size (1024 suffixes: K, M, G, T, P, E, Z, Y) [1G]
    -b, --buckets ...     where to put the buckets files (default: auto temporary directory)
    -k, --keep            do not remove buckets directory at the end
    -j, --show            show bucket directory
    -P, --progress        show progress
    --stats               show additional stats
    -d, --dry-run         do not run rsync processes
    -v, --version         print version

rsync options:
    -r, --rsync ...       MUST be last option. rsync options as a quoted string ["-aS --numeric-ids"].
                          The "--from0 --files-from=... --quiet --verbose --stats --log-file=..."
                          options will ALWAYS be added, no matter what. Be aware that this will affect
                          all rsync *from/filter files if you want to use them. See rsync(1) manpage
                          for details.

self-test options:
    -t, --selftest        run the integrated unit and functional tests
    -e, --bench           run benchmarks
    -g, --benchshm        run benchmarks in /dev/shm or the directory in $SHM environment variable
```

If you want to use specific options for the rsync processes, use the `--rsync` option:

```shell
$ msrsync -p 4 --rsync "-a --numeric-ids --inplace" source destination
```

Some examples:

```
$ msrsync -p 8 /usr/share/doc/ /tmp/doc/

$ msrsync -P -p 8 /usr/share/doc/ /tmp/doc/
[33491/33491 entries] [602.1 M/602.1 M transferred] [3378 entries/s] [60.7 M/s bw] [monq 1] [jq 1]

$ msrsync -P -p 8 --stats /usr/share/doc/ /tmp/doc/
[33491/33491 entries] [602.1 M/602.1 M transferred] [3533 entries/s] [63.5 M/s bw] [monq 1] [jq 1]
Status: SUCCESS
Working directory: /home/jbdenis/Code/msrsync
Command line: ./msrsync -P -p 8 --stats /usr/share/doc/ /tmp/doc/
Total size: 602.1 M
Total entries: 33491
Buckets number: 34
Mean entries per bucket: 985
Mean size per bucket: 17.7 M
Entries per second: 3533
Speed: 63.5 M/s
Rsync workers: 8
Total rsync's processes (34) cumulative runtime: 73.0s
Crawl time: 0.4s (4.3% of total runtime)
Total time: 9.5s
```

## Performance

You can launch a benchmark using the `--bench` option or `make bench`. It is only for testing purposes: it compares the performance of vanilla `rsync` and `msrsync` with various options. Since the benchmark just creates a huge fake file tree of empty files, you won't see any msrsync benefit here unless you try with many, many files. The benchmarks need to be run as root, since the disk cache is dropped between runs.

```
$ sudo make bench # or sudo msrsync --bench
Benchmarks with 100000 entries (95% of files):
rsync -a --numeric-ids took 14.05 seconds (speedup x1)
msrsync --processes 1 --files 1000 --size 1G took 18.58 seconds (speedup x0.76)
msrsync --processes 2 --files 1000 --size 1G took 10.61 seconds (speedup x1.32)
msrsync --processes 4 --files 1000 --size 1G took 6.60 seconds (speedup x2.13)
msrsync --processes 8 --files 1000 --size 1G took 6.58 seconds (speedup x2.14)
msrsync --processes 16 --files 1000 --size 1G took 6.66 seconds (speedup x2.11)
```

Please test on real data instead =). There is also a `--benchshm` option that will perform the benchmark in /dev/shm.

Here is a real test on a big NAS box (not known for handling small files well) over a 1G network (which, as you'll see, is far from being the bottleneck given the I/O overhead), with the linux 4.0.4 kernel source decompressed 21 times into different folders:

```
$ ls /mnt/nfs/linux-src/
0  1  10  11  12  13  14  15  16  17  18  19  2  20  3  4  5  6  7  8  9
$ du -s --apparent-size --bytes /mnt/nfs/linux-src
11688149821     /mnt/nfs/linux-src
$ du -s --apparent-size --human /mnt/nfs/linux-src
11G     /mnt/nfs/linux-src
$ find /mnt/nfs/linux-src -type f | wc -l
1027908
$ find /mnt/nfs/linux-src -type d | wc -l
66360
```

The source and the destination are on an nfs mount.

Let's run `rsync` and `msrsync` with various numbers of processes:

```
$ rm -rf /mnt/nfs/dest
$ echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
$ time rsync -a --numeric-ids /mnt/nfs/linux-src /mnt/nfs/dest
real    136m10.406s
user    1m54.939s
sys     7m31.188s

$ rm -rf /mnt/nfs/dest
$ echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
$ msrsync -p 1 /mnt/nfs/linux-src /mnt/nfs/dest
real    144m8.954s
user    2m20.426s
sys     8m4.127s

$ rm -rf /mnt/nfs/dest
$ echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
$ msrsync -p 2 /mnt/nfs/linux-src /mnt/nfs/dest
real    73m57.312s
user    2m27.543s
sys     7m56.484s

$ rm -rf /mnt/nfs/dest
$ echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
$ msrsync -p 4 /mnt/nfs/linux-src /mnt/nfs/dest
real    42m31.105s
user    2m24.196s
sys     7m46.568s

$ rm -rf /mnt/nfs/dest
$ echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
$ msrsync -p 8 /mnt/nfs/linux-src /mnt/nfs/dest
real    36m55.141s
user    2m27.149s
sys     7m40.392s

$ rm -rf /mnt/nfs/dest
$ echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
$ msrsync -p 16 /mnt/nfs/linux-src /mnt/nfs/dest
real    33m0.976s
user    2m35.848s
sys     7m40.623s
```

Ridiculous rates due to the size of each file and the I/O overhead (NFS + network), but that's a real use case, and we got a nice speedup without too much thinking: just use msrsync and you're good to go. That's exactly what I wanted. Here is a summary of the previous results:

| Command       | Time    | Entries per second | Bandwidth (MBytes/s) | Speedup |
|---------------|---------|--------------------|----------------------|---------|
| rsync         | 136m10s | 133                | 1.36                 | x1      |
| msrsync -p 1  | 144m9s  | 126                | 1.28                 | x0.94   |
| msrsync -p 2  | 73m57s  | 246                | 2.51                 | x1.84   |
| msrsync -p 4  | 42m31s  | 428                | 4.36                 | x3.20   |
| msrsync -p 8  | 36m55s  | 494                | 5.03                 | x3.68   |
| msrsync -p 16 | 33m0s   | 552                | 5.62                 | x4.12   |

Astute readers will notice the slight overhead of msrsync over the equivalent rsync in the single-process case. This overhead fades (but still exists) as the number of processes increases.

## Notes

  • The `rsync` processes are always run with the `--from0 --files-from=... --quiet --verbose --stats --log-file=...` options, no matter what. The `--from0` option affects `--exclude-from`, `--include-from`, `--files-from`, and any merged files specified in a `--filter` rule.

  • This may seem obvious, but if the source or the destination of the copy cannot handle parallel I/O well, you won't see any benefit (quite the opposite, in fact) from using msrsync.
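Since the bucket lists are always passed with `--from0`, each bucket file is a NUL-separated list of paths. A bucket file could be produced like this (an illustrative sketch only, not msrsync internals; the paths and the `bucket0` name are made up, and the rsync command is shown but not executed):

```python
# Write a NUL-separated files-from list, the format rsync expects
# when given --from0.  Illustration only; not msrsync internals.
import os
import tempfile

paths = ["docs/a.txt", "docs/b.txt", "src/main.c"]  # hypothetical entries

fd, bucket_file = tempfile.mkstemp(prefix="bucket0-")
with os.fdopen(fd, "wb") as f:
    # one NUL terminator after every entry
    f.write(b"".join(p.encode("utf-8") + b"\0" for p in paths))

# A worker would then run something like (not executed here):
#   rsync -aS --numeric-ids --from0 --files-from=<bucket_file> /source/ /destination/

with open(bucket_file, "rb") as f:
    data = f.read()
print(data.count(b"\0"))  # one NUL per entry
os.remove(bucket_file)
```

Because entries are NUL-separated rather than newline-separated, filenames containing newlines or other odd characters are handled safely, which is why `--from0` is forced on.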

## Development

I'm targeting python 2.6 without external dependencies besides rsync. The provided Makefile is just a helper around the embedded tests and coverage.py:

```
$ make help
Please use `make <target>' where <target> is one of
  clean         => clean all generated files
  cov           => coverage report using /usr/bin/python-coverage (use COVERAGE env to change that)
  covhtml       => coverage html report
  man           => build manpage
  test          => run embedded tests
  install       => install msrsync in /usr/bin (use DESTDIR env to change that)
  lint          => run pylint
  bench         => run benchmarks (linux only. Need root to drop buffer cache between run)
  benchshm      => run benchmarks using /dev/shm (linux only. Need root to drop buffer cache between run)
```

There is an integrated test suite (the `--selftest` option, or `make test`). Since I'm using unittest from the python 2.6 standard library, I cannot capture the output of the tests (the buffer parameter of the TestResult object appeared in 2.7).

```
$ make test # or msrsync --selftest
test_get_human_size (__main__.TestHelpers)
convert bytes to human readable string ... ok
test_get_human_size2 (__main__.TestHelpers)
convert bytes to human readable string ... ok
test_human_size (__main__.TestHelpers)
convert human readable size to bytes ... ok
...
test simple msrsync synchronisation ... ok
test_msrsync_cli_2_processes (__main__.TestSyncCLI)
test simple msrsync synchronisation ... ok
test_msrsync_cli_4_processes (__main__.TestSyncCLI)
test simple msrsync synchronisation ... ok
test_msrsync_cli_8_processes (__main__.TestSyncCLI)
test simple msrsync synchronisation ... ok
test_simple_msrsync_cli (__main__.TestSyncCLI)
test simple msrsync synchronisation ... ok
test_simple_rsync (__main__.TestSyncCLI)
test simple rsync synchronisation ... ok

----------------------------------------------------------------------
Ran 29 tests in 3.320s

OK
```
