Tools for synching and streaming files from Windows to Linux
google/cdc-file-transfer
Born from the ashes of Stadia, this repository contains tools for syncing and streaming files from Windows to Windows or Linux. The tools are based on Content Defined Chunking (CDC), in particular FastCDC, to split up files into chunks.
At Stadia, game developers had access to Linux cloud instances to run games. Most developers wrote their games on Windows, though. Therefore, they needed a way to make them available on the remote Linux instance.
As developers had SSH access to those instances, they could use `scp` to copy the game content. However, this was impractical, especially with the shift to working from home during the pandemic with sub-par internet connections. `scp` always copies full files, there is no "delta mode" to copy only the things that changed, it is slow for many small files, and there is no fast compression.
To help this situation, we developed two tools, `cdc_rsync` and `cdc_stream`, which enable developers to quickly iterate on their games without repeatedly incurring the cost of transmitting dozens of GBs.
`cdc_rsync` is a tool to sync files from a Windows machine to a Linux device, similar to the standard Linux rsync. It is basically a copy tool, but optimized for the case where there is already an old version of the files available in the target directory.
- It quickly skips files if timestamp and file size match.
- It uses fast compression for all data transfer.
- If a file changed, it determines which parts changed and only transfers the differences.
The remote diffing algorithm is based on CDC. In our tests, it is up to 30x faster than the one used in `rsync` (1500 MB/s vs 50 MB/s).
The following chart shows a comparison of `cdc_rsync` and Linux `rsync` running under Cygwin on Windows. The test data consists of 58 development builds of some game provided to us for evaluation purposes. The builds are 40-45 GB large. For this experiment, we uploaded the first build, then synced the second build with each of the two tools and measured the time. For example, syncing from build 1 to build 2 took 210 seconds with the Cygwin `rsync`, but only 75 seconds with `cdc_rsync`. The three outliers are probably feature drops from another development branch, where the delta was much higher. Overall, `cdc_rsync` syncs files about 3 times faster than Cygwin `rsync`.
We also ran the experiment with the native Linux `rsync`, i.e. syncing Linux to Linux, to rule out issues with Cygwin. Linux `rsync` performed on average 35% worse than Cygwin `rsync`, which can be attributed to CPU differences. We did not include it in the figure because of this, but you can find it here.
The standard Linux `rsync` splits a file into fixed-size chunks of typically several KB.
If the file is modified in the middle, e.g. by inserting `xxxx` after `567`, this usually means that the modified chunks as well as all subsequent chunks change.
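The effect of an insertion on fixed-size chunks can be reproduced with a small sketch. The 5-byte chunk size and the sample string are arbitrary choices for illustration and are not taken from `rsync`, which uses chunks of several KB.

```python
def fixed_chunks(data: bytes, size: int) -> list[bytes]:
    """Split data into consecutive chunks of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


old = b"0123456789abcdefghijklmnop"
new = old.replace(b"567", b"567xxxx")  # insert "xxxx" after "567"

old_chunks = fixed_chunks(old, 5)
new_chunks = fixed_chunks(new, 5)

# Compare chunks position by position: everything after the insertion is shifted.
unchanged = sum(1 for a, b in zip(old_chunks, new_chunks) if a == b)
print(unchanged, "of", len(new_chunks), "chunks still line up")
```

Only the chunk in front of the insertion keeps its content. Every later chunk has the same bytes as before, but shifted across the fixed boundaries, so a plain per-chunk comparison no longer recognizes them.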
The standard `rsync` algorithm hashes the chunks of the remote "old" file and sends the hashes to the local device. The local device then figures out which parts of the "new" file match known chunks.
This is a simplification. The actual algorithm is more complicated and uses two hashes, a weak rolling hash and a strong hash, see here for a great overview. What makes `rsync` relatively slow is the "no match" situation where the rolling hash does not match any remote hash, and the algorithm has to roll the hash forward and perform a hash map lookup for each byte. `rsync` goes to great lengths optimizing lookups.
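The per-byte work in the "no match" case can be illustrated with a simplified sketch. It assumes a two-part rolling checksum in the style of the `rsync` weak hash and uses a direct byte comparison in place of the strong hash. The function names and the 5-byte block size are hypothetical, and this is not the actual `rsync` code.

```python
M = 1 << 16


def weak_checksum(block: bytes) -> tuple[int, int]:
    """Two-part checksum: plain byte sum and position-weighted byte sum."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, n: int) -> tuple[int, int]:
    """Slide an n-byte window one byte forward in constant time."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b


def find_block_offsets(old: bytes, new: bytes, n: int) -> list[int]:
    """Offsets in `new` where an n-byte window equals a fixed block of `old`."""
    known: dict[tuple[int, int], list[bytes]] = {}
    for i in range(0, len(old) - n + 1, n):
        known.setdefault(weak_checksum(old[i:i + n]), []).append(old[i:i + n])

    matches: list[int] = []
    if len(new) < n:
        return matches
    pos = 0
    a, b = weak_checksum(new[:n])
    while True:
        # One hash map lookup per window position.
        if new[pos:pos + n] in known.get((a, b), ()):
            matches.append(pos)
            pos += n  # match: skip the whole block
            if pos + n > len(new):
                break
            a, b = weak_checksum(new[pos:pos + n])
        else:
            if pos + n >= len(new):
                break
            a, b = roll(a, b, new[pos], new[pos + n], n)  # no match: advance one byte
            pos += 1
    return matches


old = b"0123456789abcdefghij"
new = b"01234567xxxx89abcdefghij"
print(find_block_offsets(old, new, 5))
```

Between the first and second match, the loop advances one byte at a time and performs a lookup at every position, which is the cost the paragraph above refers to.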
`cdc_rsync` does not use fixed-size chunks, but instead variable-size, content-defined chunks. That means chunk boundaries are determined by the local content of the file, in practice a 64 byte sliding window. For more details, see the FastCDC paper or take a look at our implementation.
If the file is modified in the middle, only the modified chunks, but not subsequent chunks, change (unless they are less than 64 bytes away from the modifications).
Computing the chunk boundaries is cheap and involves only a left-shift, a memory lookup, an `add` and an `and` operation for each input byte. This is cheaper than the hash map lookup for the standard `rsync` algorithm.
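A minimal sketch of such a boundary computation, assuming a gear-style hash as described in the FastCDC paper. The lookup table, its seed, the mask width and the function name are assumptions for illustration. The repository's implementation may differ, for example by enforcing minimum and maximum chunk sizes.

```python
import random

MASK64 = (1 << 64) - 1
_table_rng = random.Random(42)  # fixed seed so every run uses the same table
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]


def cdc_chunks(data: bytes, mask_bits: int = 6) -> list[bytes]:
    """Cut `data` wherever the top `mask_bits` bits of the rolling hash are zero."""
    # The top bits of a 64-bit gear hash depend on the last 64 input bytes.
    mask = ((1 << mask_bits) - 1) << (64 - mask_bits)
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Left-shift, table lookup, add. The MASK64 truncation only emulates
        # 64-bit overflow, which is implicit in languages with fixed-width ints.
        h = ((h << 1) + GEAR[byte]) & MASK64
        if (h & mask) == 0:  # the "and" test that decides on a boundary
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


data_rng = random.Random(1)
old = bytes(data_rng.getrandbits(8) for _ in range(4096))
new = old[:1000] + b"xxxx" + old[1000:]  # insert 4 bytes in the middle

old_chunks = cdc_chunks(old)
new_chunks = cdc_chunks(new)
known = set(old_chunks)
changed = [c for c in new_chunks if c not in known]
print(len(changed), "of", len(new_chunks), "chunks are new")
```

Because each boundary decision depends only on the preceding 64 bytes, boundaries before the insertion are untouched and boundaries more than 64 bytes after it reappear at shifted positions, so the chunks between them are byte-identical to the old ones.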
Because of this, the `cdc_rsync` algorithm is faster than the standard `rsync`. It is also simpler. Since chunk boundaries move along with insertions or deletions, the task to match local and remote hashes is a trivial set difference operation. It does not involve a per-byte hash map lookup.
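The set difference can be sketched as follows. The choice of SHA-256 as the chunk hash and the helper names are assumptions for illustration only. The source does not state which hash the tools use.

```python
import hashlib


def chunk_id(chunk: bytes) -> str:
    """Identify a chunk by a hash of its content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(chunk).hexdigest()


def missing_chunks(local_chunks: list[bytes], remote_ids: set[str]) -> list[bytes]:
    """Return the local chunks whose hashes the remote side does not know yet."""
    return [c for c in local_chunks if chunk_id(c) not in remote_ids]


remote_chunks = [b"alpha", b"beta", b"gamma"]  # chunks of the old remote file
local_chunks = [b"alpha", b"beta2", b"gamma"]  # chunks of the new local file

remote_ids = {chunk_id(c) for c in remote_chunks}
to_send = missing_chunks(local_chunks, remote_ids)
print(to_send)  # [b'beta2']
```

Each chunk costs one set lookup, independent of its size, rather than one lookup per byte.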
`cdc_stream` is a tool to stream files and directories from a Windows machine to a Linux device. Conceptually, it is similar to sshfs, but it is optimized for read speed.
- It caches streamed data on the Linux device.
- If a file is re-read on Linux after it changed on Windows, only the differences are streamed again. The rest is read from the cache.
- Stat operations are very fast since the directory metadata (filenames, permissions etc.) is provided in a streaming-friendly way.
To efficiently determine which parts of a file changed, the tool uses the same CDC-based diffing algorithm as `cdc_rsync`. Changes to Windows files are almost immediately reflected on Linux, with a delay of roughly (0.5s + 0.7s x total size of changed files in GB).
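As a quick worked example of the delay estimate above, the following sketch evaluates the rough formula for a given amount of changed data. The function name is hypothetical, and the result is only the approximation quoted in the text, not a guarantee.

```python
def estimated_delay_seconds(changed_gb: float) -> float:
    """Rough delay until changes are visible on Linux: 0.5s + 0.7s per changed GB."""
    return 0.5 + 0.7 * changed_gb


for size_gb in (0.0, 2.0, 10.0):
    print(f"{size_gb:5.1f} GB changed -> about {estimated_delay_seconds(size_gb):.1f} s")
```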
The tool does not support writing files back from Linux to Windows; the Linux directory is read-only.
The following chart compares times from starting a game to reaching the menu. In one case, the game is streamed via `sshfs`, in the other case we use `cdc_stream`. Overall, we see a 2x to 5x speedup.
cdc_rsync | From | To |
---|---|---|
Windows x86_64 | ✓ | ✓ ¹ |
Ubuntu 22.04 x86_64 | ✗ ² | ✓ |
Ubuntu 22.04 aarch64 | ✗ | ✗ |
macOS 13 x86_64 ³ | ✗ | ✗ |
macOS 13 aarch64 ³ | ✗ | ✗ |

cdc_stream | From | To |
---|---|---|
Windows x86_64 | ✓ | ✗ |
Ubuntu 22.04 x86_64 | ✗ | ✓ |
Ubuntu 22.04 aarch64 | ✗ | ✗ |
macOS 13 x86_64 ³ | ✗ | ✗ |
macOS 13 aarch64 ³ | ✗ | ✗ |

¹ Only local syncs, e.g. `cdc_rsync C:\src\* C:\dst`. Support for remote syncs is being added, see #61.
² See #56.
³ See #62.
Download the precompiled binaries from the latest release to a Windows device and unzip them. The Linux binaries are automatically deployed to `~/.cache/cdc-file-transfer` by the Windows tools. There is no need to manually deploy them. We currently provide Linux binaries compiled on GitHub's latest Ubuntu version. If the binaries work for you, you can skip the following two sections.
Alternatively, the project can be built from source. Some binaries have to be built on Windows, some on Linux.
To build the tools from source, the following steps have to be executed on both Windows and Linux.
- Download and install Bazel from here. See workflow logs for the currently used version.
- Clone the repository.
git clone https://github.com/google/cdc-file-transfer
- Initialize submodules.
cd cdc-file-transfer
git submodule update --init --recursive
Finally, install an SSH client on the Windows machine if not present. The file transfer tools require `ssh.exe` and `sftp.exe`.
The two tools CDC RSync and CDC Stream can be built and used independently.
- On a Linux device, build the Linux components
bazel build --config linux --compilation_mode=opt --linkopt=-Wl,--strip-all --copt=-fdata-sections --copt=-ffunction-sections --linkopt=-Wl,--gc-sections //cdc_rsync_server
- On a Windows device, build the Windows components
bazel build --config windows --compilation_mode=opt --copt=/GL //cdc_rsync
- Copy the Linux build output file `cdc_rsync_server` from `bazel-bin/cdc_rsync_server` to `bazel-bin\cdc_rsync` on the Windows machine.
- On a Linux device, build the Linux components
bazel build --config linux --compilation_mode=opt --linkopt=-Wl,--strip-all --copt=-fdata-sections --copt=-ffunction-sections --linkopt=-Wl,--gc-sections //cdc_fuse_fs
- On a Windows device, build the Windows components
bazel build --config windows --compilation_mode=opt --copt=/GL //cdc_stream
- Copy the Linux build output files `cdc_fuse_fs` and `libfuse.so` from `bazel-bin/cdc_fuse_fs` to `bazel-bin\cdc_stream` on the Windows machine.
The tools require a setup where you can use SSH and SFTP from the Windows machine to the Linux device without entering a password, e.g. by using key-based authentication.
By default, the tools search for `ssh.exe` and `sftp.exe` in the path environment variable. If you can run the following commands in a Windows cmd without entering your password, you are all set:
ssh user@linux.device.com
sftp user@linux.device.com
Here, `user` is the Linux user and `linux.device.com` is the Linux host to SSH into or copy the file to.
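The password-free requirement can also be checked from a script. The sketch below assumes an OpenSSH client, whose `BatchMode=yes` option disables password prompts so that the command fails instead of waiting for input. The helper names are hypothetical and not part of the tools.

```python
import subprocess


def ssh_batch_check_command(target: str, ssh_exe: str = "ssh") -> list[str]:
    """Build an ssh call that fails rather than prompting for a password."""
    return [ssh_exe, "-o", "BatchMode=yes", target, "exit"]


def can_login_without_password(target: str, timeout_s: float = 15.0) -> bool:
    """Return True if a non-interactive SSH login to `target` succeeds."""
    try:
        result = subprocess.run(
            ssh_batch_check_command(target),
            capture_output=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh is not on the path, or the host did not answer in time.
        return False
    return result.returncode == 0


print(" ".join(ssh_batch_check_command("user@linux.device.com")))
```

Calling `can_login_without_password("user@linux.device.com")` performs the actual connection attempt. It is left out of the example run because it needs a reachable host.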
If additional arguments are required, it is recommended to provide an SSH config file. By default, both `ssh.exe` and `sftp.exe` use the file at `%USERPROFILE%\.ssh\config` on Windows, if it exists. A possible config file that sets a username, a port, an identity file and a known host file could look as follows:
Host linux_device
  HostName linux.device.com
  User user
  Port 12345
  IdentityFile C:\path\to\id_rsa
  UserKnownHostsFile C:\path\to\known_hosts
If `ssh.exe` or `sftp.exe` cannot be found, you can specify the full paths via the command line arguments `--ssh-command` and `--sftp-command` for `cdc_rsync` and `cdc_stream start` (see below), or set the environment variables `CDC_SSH_COMMAND` and `CDC_SFTP_COMMAND`, e.g.
set CDC_SSH_COMMAND="C:\path with space\to\ssh.exe"
set CDC_SFTP_COMMAND="C:\path with space\to\sftp.exe"
Note that you can also specify SSH configuration via the environment variables instead of using a config file:
set CDC_SSH_COMMAND=C:\path\to\ssh.exe -p 12345 -i C:\path\to\id_rsa -oUserKnownHostsFile=C:\path\to\known_hosts
set CDC_SFTP_COMMAND=C:\path\to\sftp.exe -P 12345 -i C:\path\to\id_rsa -oUserKnownHostsFile=C:\path\to\known_hosts
Note the small `-p` for `ssh.exe` and the capital `-P` for `sftp.exe`.
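Since the two port flags are easy to mix up, one option is to generate both values from the same inputs. The helper below is a hypothetical sketch that only composes the strings shown above. It does not quote paths, so paths containing spaces would need the quoting shown earlier.

```python
def cdc_ssh_env(ssh_exe: str, sftp_exe: str, port: int,
                identity_file: str, known_hosts: str) -> dict[str, str]:
    """Compose CDC_SSH_COMMAND and CDC_SFTP_COMMAND with matching options."""
    common = f"-i {identity_file} -oUserKnownHostsFile={known_hosts}"
    return {
        "CDC_SSH_COMMAND": f"{ssh_exe} -p {port} {common}",     # ssh: lowercase -p
        "CDC_SFTP_COMMAND": f"{sftp_exe} -P {port} {common}",   # sftp: capital -P
    }


env = cdc_ssh_env(
    r"C:\path\to\ssh.exe",
    r"C:\path\to\sftp.exe",
    12345,
    r"C:\path\to\id_rsa",
    r"C:\path\to\known_hosts",
)
for name, value in env.items():
    print(f"set {name}={value}")
```

The resulting dictionary can be merged into a copy of `os.environ` and passed to `subprocess.run(..., env=...)` when launching the tools from a script.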
For Google internal usage, set the following environment variables to enable SSH authentication using a Google security key:
set CDC_SSH_COMMAND=C:\gnubby\bin\ssh.exe
set CDC_SFTP_COMMAND=C:\gnubby\bin\sftp.exe
Note that you will have to touch the security key multiple times during the first run. Subsequent runs only require a single touch.
`cdc_rsync` is used similarly to `scp` or the Linux `rsync` command. To sync a single Windows file `C:\path\to\file.txt` to the home directory `~` on the Linux device `linux.device.com`, run
cdc_rsync C:\path\to\file.txt user@linux.device.com:~
`cdc_rsync` understands the usual Windows wildcards `*` and `?`.
cdc_rsync C:\path\to\*.txt user@linux.device.com:~
To sync the contents of the Windows directory `C:\path\to\assets` recursively to `~/assets` on the Linux device, run
cdc_rsync C:\path\to\assets\* user@linux.device.com:~/assets -r
To get per file progress, add `-v`:
cdc_rsync C:\path\to\assets\* user@linux.device.com:~/assets -vr
The tool also supports local syncs:
cdc_rsync C:\path\to\assets\* C:\path\to\destination -vr
To stream the Windows directory `C:\path\to\assets` to `~/assets` on the Linux device, run
cdc_stream start C:\path\to\assets user@linux.device.com:~/assets
This makes all files and directories in `C:\path\to\assets` available on `~/assets` immediately, as if it were a local copy. However, data is streamed from Windows to Linux as files are accessed.
To stop the streaming session, enter
cdc_stream stop user@linux.device.com:~/assets
The command also accepts wildcards. For instance,
cdc_stream stop user@*:*
stops all existing streaming sessions for the given user.
On first run, `cdc_stream` starts a background service, which does all the work. The `cdc_stream start` and `cdc_stream stop` commands are just RPC clients that talk to the service.
The service logs to `%APPDATA%\cdc-file-transfer\logs` by default. The logs are useful to investigate issues with asset streaming. To pass custom arguments, or to debug the service, create a JSON config file at `%APPDATA%\cdc-file-transfer\cdc_stream.json` with command line flags. For instance,
{ "verbosity":3 }
instructs the service to log debug messages. Try `cdc_stream start-service -h` for a list of available flags. Alternatively, run the service manually with
cdc_stream start-service
and pass the flags as command line arguments. When you run the service manually, the flag `--log-to-stdout` is particularly useful as it logs to the console instead of to the file.
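The JSON config file mentioned above can also be written from a script. This is a minimal sketch with a hypothetical helper name. It takes the base directory as a parameter so that the example run can use a temporary directory instead of the real `%APPDATA%`.

```python
import json
import tempfile
from pathlib import Path


def write_service_config(appdata_dir: str, flags: dict) -> Path:
    """Write `flags` to <appdata_dir>/cdc-file-transfer/cdc_stream.json."""
    config_path = Path(appdata_dir) / "cdc-file-transfer" / "cdc_stream.json"
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(flags, indent=2), encoding="utf-8")
    return config_path


# Demonstration against a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = write_service_config(tmp, {"verbosity": 3})
    loaded = json.loads(path.read_text(encoding="utf-8"))
    print(path.name, loaded)
```

On a Windows machine, passing `os.environ["APPDATA"]` as `appdata_dir` would place the file at the location the service reads.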
`cdc_rsync` always logs to the console. To increase log verbosity, pass `-vvv` for debug logs or `-vvvv` for verbose logs.
For both sync and stream, the debug logs contain all SSH and SFTP commands that the tools attempt to run, which is very useful for troubleshooting. If a command fails unexpectedly, copy it and run it in isolation. Pass `-vv` or `-vvv` for additional debug output.