Description
matplotlib includes its test data for image comparison tests in the git repository. Current HEAD is about 131MB of test data uncompressed. Not sure what the whole history of that data is, but it's a safe bet it's a significant fraction of the git repository.
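For anyone who wants to reproduce a size figure like the one above on their own checkout, here is a minimal sketch. The function name `tree_size_bytes` and the directory you pass to it are placeholders of mine, not anything that exists in matplotlib; point it at wherever the baseline images live in your tree.

```python
import os


def tree_size_bytes(root):
    """Sum the sizes of all regular files under `root`.

    This measures uncompressed file contents in the working tree,
    not what git stores in its (compressed) object database.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so linked content is not counted.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


# Example usage (the path is a placeholder):
# print(f"{tree_size_bytes('path/to/test/data') / 1e6:.1f} MB")
```

This says nothing about how much of the repository history the test data accounts for; that would need inspecting the git object store rather than the working tree.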
There are some real advantages to this approach: The test data and the version of matplotlib they correspond to are easily synchronized by being in the same repo. The downside, of course, is the size of the repo.
There are a few alternative solutions I've been investigating, none of which seem to be the perfect answer, so I thought I'd open this up to a wider discussion.
git submodule: The test data would move to another repo (call it the tests repo), and the main repo has a special kind of symbolic link that points to a specific revision in the tests repo. The tests repo is not cloned unless specifically asked for (git submodule update). The downside of git submodule is that a PR that requires both updating functionality in matplotlib and updating test data would have to be split into two PRs, one for each repo, and coordinated very carefully. The link in the matplotlib repo cannot point to a revision in the fork of the tests repo, so it will fail until the PR for the tests repo is merged. In short: git submodule is awfully close to what we need, but it doesn't interact very well with the GitHub PR workflow.
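Since a submodule is only populated on request, the test suite would need some way to notice that the test data is absent and skip the image comparisons. A rough sketch of such a check follows. It relies on the fact that an initialized submodule has a `.git` entry at its root, while one that was never fetched is just an empty directory. The function names and the `tests` path are hypothetical, taken from the naming in the paragraph above, and are not part of matplotlib.

```python
from pathlib import Path


def submodule_is_checked_out(path):
    """Return True if `path` looks like an initialized git submodule.

    An initialized submodule working tree contains a `.git` entry
    (a file or a directory) at its root; an uninitialized one does not.
    """
    return (Path(path) / ".git").exists()


def image_tests_enabled(tests_dir="tests"):
    # "tests" is the hypothetical submodule path from the proposal above.
    return submodule_is_checked_out(tests_dir)
```

A test runner could call `image_tests_enabled()` once at startup and mark image comparison tests as skipped when it returns False.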
git subtree: Seems to avoid the extreme separation of repos of git submodule, and merges can take place involving both repos. However, it doesn't solve the problem of only cloning the test data if requested -- git subtrees are always deeply cloned. Additionally, git submodule seems more appropriate if the two repos are separate projects usable on their own. I don't think that's the case here.
git annex: Allows checking special links into the git repo instead of files. The files these links refer to can then be fetched or cleared as requested. The actual file contents can live in a number of places, like a WebDAV server or another git repo (which probably makes the most sense for us, to use free GitHub hosting). git annex is a cool, if fairly complex, tool, and I think it's the closest to what we need.
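To make the "special links" idea concrete: in git annex's symlink-based mode, an annexed file is a symlink into the repository's annex object store, and the symlink dangles until the content has been fetched. Under that assumption (it does not cover unlocked files, which use pointer files instead), a test helper could detect missing baseline images roughly like this. `annex_content_present` is a name I made up for illustration.

```python
import os


def annex_content_present(path):
    """Return True if the file at `path` has real content available.

    Assumes annexed files are symlinks: a dangling symlink means
    the content has not been fetched. Plain files are treated as present.
    """
    if os.path.islink(path):
        # os.path.exists follows the link and is False when it dangles.
        return os.path.exists(path)
    return os.path.isfile(path)
```

An image comparison test could call this on its baseline image and skip itself, rather than fail, when the content has not been fetched.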
Of course, none of this impacts how we distribute matplotlib, and more and more of our packages for end users just don't include the tests, and this is easy enough to do. So given the added complexity of all of the options above vs. the bandwidth and data costs of the status quo, I'm not sure it's obvious we should do anything. But, as I said, there might be some good solutions that come out of discussion.