Originally published at conda.discourse.group
How we reduced conda's index fetch bandwidth by 99%
The new conda 23.3.1 release from March 2023 includes an --experimental=jlap flag, or an experimental: ["jlap"] setting in .condarc, that can reduce repodata.json fetch bandwidth by orders of magnitude. This is how we developed conda's new incremental repodata feature.
Conda is a cross-platform, language-agnostic binary package manager that includes a constraint solver to choose compatible sets of packages. Before conda can install a package, it downloads information about all available packages. This allows the solver to make global decisions about which packages to install. The time and bandwidth spent downloading this metadata can be significant, but we have improved this in conda 23.3.1. By enabling the experimental: ["jlap"] feature in .condarc, conda users can see more than a 99% reduction in index fetch bandwidth.
Idea
Traditionally, conda tries to fetch each channel's entire repodata.json or smaller current_repodata.json every time the cache expires, typically between 30 seconds and 20 minutes after the last remote request. If the channel has not changed this is a quick 304 Not Modified, but very active channels will change several times an hour. Any change, usually only a few added packages, requires the user to re-download the entire index; most conda users will be familiar with this process. My manager said one of Anaconda's customers wanted a better way to track changes in our repository and I became interested in solving the problem.
I began to pursue a solution based on computing patches between successive versions of repodata.json to let users download only the changes, once they had an initial, complete copy of the index in their cache.
Initial Prototypes
I chose RFC 6902 JSON Patch, a generic patch format for .json. JSON Patch shows the logical difference between two .json files instead of comparing the files textually, saving space by discarding formatting.
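To make the idea of a logical patch concrete, here is a minimal sketch of applying JSON Patch operations to a repodata-like document. It is not conda's implementation: apply_patch is a hypothetical helper that handles only the add, replace and remove operations on nested objects (no arrays, no root path), and the package names are made up.

```python
import copy


def _unescape(token):
    # JSON Pointer (RFC 6901) escapes: "~1" is "/", "~0" is "~"; decode "~1" first.
    return token.replace("~1", "/").replace("~0", "~")


def apply_patch(document, patch):
    """Apply add/replace/remove operations to nested dicts and return a new document."""
    result = copy.deepcopy(document)
    for op in patch:
        *parents, last = [_unescape(t) for t in op["path"].split("/")[1:]]
        target = result
        for key in parents:
            target = target[key]
        if op["op"] in ("add", "replace"):
            target[last] = op["value"]
        elif op["op"] == "remove":
            del target[last]
        else:
            raise ValueError(f"unsupported op: {op['op']}")
    return result


# Made-up data: one existing package, one added package, one changed field.
old = {"packages": {"a-1.0.tar.bz2": {"build_number": 0}}}
patch = [
    {"op": "add", "path": "/packages/b-2.0.tar.bz2", "value": {"build_number": 0}},
    {"op": "replace", "path": "/packages/a-1.0.tar.bz2/build_number", "value": 1},
]
new = apply_patch(old, patch)
print(new)
```

The patch describes only what changed, so its size does not depend on how the full document is formatted or how large it is.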
I wrote a Rust implementation of a .json patchset format based on a single .json file with an array of patches.
The Rust implementation helped to simplify the format and show that it could be language independent. In Python, we might have included optional or multiple-typed "string or null" fields without thinking about it. In Rust, "mandatory, always a single type" fields are easiest to specify. I was surprised to find that this change simplified the Python code as well.
Experimentation showed that PyPy was faster than Rust for comparing two large repodata.json files; CPython's json.loads and json.dumps also perform shockingly well. I stopped development on the Rust implementation.
I began writing a formal specification based on the array-of-patches style.
Specification
I wrote a Conda Enhancement Proposal (CEP) for a new JSON Lines based .jlap format. The earlier array-of-patches format has the same problem as repodata.json but in miniature, since the client has to download every patch each time. In contrast, the new .jlap system appends patches to the end of a file. It is designed to fetch new patches from a growing file using HTTP Range requests. With this system, update bandwidth is proportional to the amount of change that has happened since last time.
The .jlap format allows clients to fetch the newest patches and only the newest patches with a single HTTP Range request. It consists of a leading checksum, any number of lines of patches, one per line in the JSON Lines format, and a trailing checksum. The checksums are constructed in such a way that the trailing checksum can be re-verified without re-reading (or retaining) the beginning of the file, if the client remembers an intermediate checksum. The trailing checksum is used to make sure the rest of the remote file was not changed.
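The resumable checksum property can be illustrated with a chained hash, where each line is hashed using the previous checksum as the key. This is a sketch of the general technique only: the use of keyed BLAKE2b, the all-zero starting value and the chain helper are assumptions made for illustration, and the CEP is the authority on the actual construction.

```python
import hashlib

DIGEST_SIZE = 32  # assumed digest length for this sketch


def chain(lines, start):
    """Fold each line into a running checksum, keyed by the previous checksum."""
    checksum = start
    for line in lines:
        checksum = hashlib.blake2b(
            line, key=checksum, digest_size=DIGEST_SIZE
        ).digest()
    return checksum


start = bytes(DIGEST_SIZE)  # assumed initial value: 32 zero bytes
old_lines = [b'{"patch": 1}', b'{"patch": 2}']
new_lines = [b'{"patch": 3}']

# A client that saw only old_lines remembers this intermediate checksum.
intermediate = chain(old_lines, start)

# Later it can verify the whole file using just the new lines.
resumed = chain(new_lines, intermediate)
full = chain(old_lines + new_lines, start)
print(resumed == full)
```

Because the checksum after any line depends only on that line and the checksum before it, a client that stored the intermediate value never needs the earlier lines again.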
When repodata.json changes, the server will truncate the metadata line, then append new patches, a new metadata line and a new trailing checksum.
We needed patch data to test this system. I wrote a web service that would create that data. The service checked repodata.json every five minutes, compared the current and previous versions, updated a patch file and hosted it separately from the main repository.
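The comparison step might look something like the following sketch, which walks two versions of a nested document and emits JSON Patch style operations. make_patch is a hypothetical helper, not the service's code; it only descends into objects and treats any other changed value as a replace.

```python
def _escape(key):
    # JSON Pointer (RFC 6901) escapes: "~" becomes "~0", then "/" becomes "~1".
    return key.replace("~", "~0").replace("/", "~1")


def make_patch(old, new, path=""):
    """Return a list of remove/add/replace operations that turn old into new."""
    ops = []
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(old.keys() - new.keys()):
            ops.append({"op": "remove", "path": f"{path}/{_escape(key)}"})
        for key in sorted(new.keys() - old.keys()):
            ops.append(
                {"op": "add", "path": f"{path}/{_escape(key)}", "value": new[key]}
            )
        for key in sorted(old.keys() & new.keys()):
            ops.extend(make_patch(old[key], new[key], f"{path}/{_escape(key)}"))
    elif old != new:
        ops.append({"op": "replace", "path": path, "value": new})
    return ops


# Made-up previous and current versions of a tiny index.
previous = {
    "packages": {
        "a-1.0.conda": {"build_number": 0},
        "old-0.1.conda": {"build_number": 0},
    }
}
current = {
    "packages": {
        "a-1.0.conda": {"build_number": 1},
        "b-2.0.conda": {"build_number": 0},
    }
}
patch = make_patch(previous, current)
for op in patch:
    print(op)
```

Unchanged packages produce no operations at all, which is why a patch for a large index with a few new packages stays small.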
The initial demonstration used a repodata.json proxy, fetching patches from repodata.fly.dev while forwarding package requests to the upstream server. The user points conda at the proxy server instead of the default channel. Another prototype adds a similar proxy into conda's requests-based HTTP backend, but duplicates an extra local cache on top of the existing cache.
The .jlap format is generic over JSON. The underlying checksum/range request system is generic for any growing line-based file. Consider adapting it if you have a similar problem.
Server Side Improvements
I became the conda-index maintainer, rewriting it from conda-build index to a new standalone package. We solved speed problems updating large repositories like conda-forge and defaults. This involvement made it easy to control the server-side data so that we could improve the client and server together.
Team Move
I became a full-time member of the conda team, having transferred from Anaconda's packaging team. I slowly began understanding conda's internals enough to be able to produce a solution that could integrate with conda's existing cache, instead of a local caching proxy.
Implementation of zstandard-compressed repodata.json, parallel downloads
On November 6, a community member noticed that fetching repodata.json with on-the-fly server compression was slower than fetching it uncompressed, if your connection was faster than the remote server's gzip compressor. We merged server-side repodata.json.zst support into conda-index on November 14.
November's conda release included parallel package downloads and extraction, a speed improvement that makes a difference proportional to your latency to the package server.
Ship repodata.json.zst
We shipped zstd-compressed repodata to conda-forge and defaults on December 15.
Refactor cache
January's 23.1.0 conda release included a refactor of its cache that was important for incremental repodata.json support. Instead of inlining cache metadata into a modified repodata.json, conda stores unmodified repodata.json in its cache, and stores cache metadata in a separate file. We avoid having to reserialize repodata.json in many cases and can preserve its original content and formatting.
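As a rough illustration of that layout, the sketch below writes the downloaded bytes untouched and keeps the bookkeeping in a sidecar file. The file names, the state fields and the store helper are hypothetical and do not reflect conda's actual cache format.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def store(cache_dir, name, raw, etag):
    """Write raw repodata bytes as-is, plus a separate metadata file."""
    (cache_dir / f"{name}.json").write_bytes(raw)
    state = {
        "etag": etag,  # hypothetical field: validator from the last response
        "size": len(raw),
        "sha256": hashlib.sha256(raw).hexdigest(),  # content hash of the raw bytes
    }
    (cache_dir / f"{name}.state.json").write_text(json.dumps(state))
    return state


# Odd spacing on purpose: the cached copy keeps the server's exact formatting.
raw = b'{"packages":   {}}\n'
with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    state = store(cache_dir, "example-channel", raw, etag='"abc123"')
    cached = (cache_dir / "example-channel.json").read_bytes()
    saved_state = json.loads((cache_dir / "example-channel.state.json").read_text())

print(cached == raw, saved_state["size"])
```

Keeping the bytes unmodified means a hash of the cached file can be compared directly against a hash computed from the server's copy.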
Wolf Vollprecht submitted a draft CEP to standardize the cache format between conda and mamba. We continue to converge on a shared format.
Ship repodata.jlap incremental repodata
March's 23.3.1 conda release shipped support for repodata.jlap under the --experimental=jlap flag. This feature also includes support for repodata.json.zst with a fallback to repodata.json if unavailable.
When the cache is empty, conda will try to download repodata.json.zst. This file is much faster to download and decompress compared to Content-Encoding: gzip and is slightly smaller.
When the cache is primed, conda will look for repodata.jlap. It will download the entire file, apply any relevant patches (comparing to a content hash of repodata.json) and remember the length of the patch file.
On subsequent fetches, conda will use an HTTP Range request to download any new bytes added to repodata.jlap, and apply the new patches.
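The incremental fetch can be sketched as follows. The request is built but never sent, and serve_range stands in for the server's partial response, so no network is involved. The URL and helper names are hypothetical, and a real client also has to account for the footer lines that the server replaces on each update, which this sketch ignores.

```python
import urllib.request


def build_range_request(url, offset):
    """Build (but do not send) a request for the bytes from offset onward."""
    return urllib.request.Request(url, headers={"Range": f"bytes={offset}-"})


def serve_range(remote, offset):
    """Stand-in for the body of a 206 Partial Content response."""
    return remote[offset:]


# Pretend the remote file grew by one line since the last fetch.
local = b'{"patch": 1}\n{"patch": 2}\n'
remote = local + b'{"patch": 3}\n'

request = build_range_request("https://example.invalid/repodata.jlap", len(local))
fresh = serve_range(remote, len(local))
local += fresh

print(request.get_header("Range"), len(fresh))
```

Only the bytes past the remembered length cross the network, so the cost of an update tracks the size of the change rather than the size of the file.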
Conclusion
This screen capture of a conda-forge/noarch search shows that we were able to download a single update to the channel in 1464 bytes, for what would otherwise have been a 10358799-byte download of the complete index. The patch size is proportional to the amount of change that has happened since the last time you ran conda.
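For the two byte counts quoted above, the saving works out as in this short calculation (the variable names are only for illustration):

```python
patch_bytes = 1464       # size of the incremental update
full_bytes = 10358799    # size of the complete index

reduction_percent = (1 - patch_bytes / full_bytes) * 100
ratio = full_bytes / patch_bytes

print(f"{reduction_percent:.3f}% less data, about {ratio:.0f}x smaller")
```

This single update is comfortably beyond the 99% reduction mentioned at the start, at roughly seven thousand times less data than the full download.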
When --experimental=jlap is enabled, frequent conda users will see much faster index updates, especially when bandwidth is limited.