Investigate separating test data from repository #5329

Open

Labels

keep (items to be ignored by the “Stale” GitHub Action), topic: testing

Milestone

future releases

Description

mdboom opened on Oct 26, 2015

matplotlib includes its test data for image comparison tests in the git repository. The current HEAD holds about 131 MB of uncompressed test data. I'm not sure what the whole history of that data looks like, but it's a safe bet that it's a significant fraction of the git repository.
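For anyone who wants to reproduce a figure like the one above on their own checkout, here is a minimal sketch that sums the on-disk size of image files under a directory. The function name, the default suffix list, and the idea that the test data can be picked out by file extension are assumptions for illustration. It measures only the working tree, not the git history.

```python
import os


def tree_size_bytes(root, suffixes=(".png", ".pdf", ".svg")):
    """Sum the sizes of files under `root` whose names end in `suffixes`.

    The suffix list is an assumption about which files count as test data;
    adjust it to match the checkout being measured.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # str.endswith accepts a tuple of candidate suffixes.
            if name.lower().endswith(suffixes):
                total += os.path.getsize(os.path.join(dirpath, name))
    return total
```

Calling `tree_size_bytes("path/to/test/data") / 1e6` gives the size in MB for whatever directory is passed in.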

There are some real advantages to this approach: The test data and the version of matplotlib they correspond to are easily synchronized by being in the same repo. The downside, of course, is the size of the repo.

There are a few alternative solutions I've been investigating, none of which seem to be the perfect answer, so I thought I'd open this up to a wider discussion.

git submodule: The test data would move to another repo (call it the tests repo), and the main repo would have a special kind of symbolic link that points to a specific revision in the tests repo. The tests repo is not cloned unless specifically asked for (git submodule update). The downside of git submodule is that a PR that requires updating both functionality in matplotlib and test data would have to be split into two PRs, one for each repo, and coordinated very carefully. The link in the matplotlib repo cannot point to a revision in a fork of the tests repo, so it will fail until the PR for the tests repo is merged. In short: git submodule is awfully close to what we need, but it doesn't interact very well with the GitHub PR workflow.

git subtree: Seems to avoid the extreme separation of repos that comes with git submodule, and merges can take place involving both repos. However, it doesn't solve the problem of only cloning the test data if requested: git subtrees are always deeply cloned. Additionally, git submodule seems more appropriate if the two repos are separate projects usable on their own. I don't think that's the case here.

git annex: Lets you check special links into the git repo instead of the files themselves. The files these links refer to can then be fetched or cleared as requested. The actual file contents can live in a number of places, like a WebDAV server or another git repo (which probably makes the most sense for us, to use free GitHub hosting). git annex is a cool, if fairly complex, tool, but I think it's the closest to what we need.
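To make the "special links instead of files" idea concrete, here is a toy Python sketch of the general pointer-file pattern: a large file is swapped for a small text pointer holding its hash, and the real bytes live in a separate store until someone asks for them. This is not git annex's on-disk format or its API. The function names, the `sha256:` pointer text, and the store layout are all made up for illustration.

```python
import hashlib
from pathlib import Path

# Hypothetical pointer format for this sketch; not what git annex writes.
POINTER_PREFIX = b"sha256:"


def is_pointer(path: Path) -> bool:
    """Return True if `path` holds a pointer rather than real content."""
    raw = path.read_bytes().strip()
    # A pointer is the prefix followed by a 64-character hex digest.
    return raw.startswith(POINTER_PREFIX) and len(raw) == len(POINTER_PREFIX) + 64


def clear(path: Path, store: Path) -> str:
    """Copy the file's bytes into `store`, then leave only a pointer behind."""
    data = path.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    store.mkdir(parents=True, exist_ok=True)
    (store / digest).write_bytes(data)
    path.write_bytes(POINTER_PREFIX + digest.encode("ascii") + b"\n")
    return digest


def fetch(path: Path, store: Path) -> None:
    """Replace a pointer with the real bytes from `store`; no-op otherwise."""
    if not is_pointer(path):
        return
    digest = path.read_bytes().strip()[len(POINTER_PREFIX):].decode("ascii")
    path.write_bytes((store / digest).read_bytes())
```

In this toy version the store is a local directory. In the scheme described above it would be a remote such as a WebDAV server or another git repo, which is where most of the real tool's complexity comes from.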

Of course, none of this affects how we distribute matplotlib: more and more of our packages for end users just don't include the tests, which is easy enough to do. So given the added complexity of all of the options above vs. the bandwidth and data costs of the status quo, I'm not sure it's obvious we should do anything. But, as I said, there might be some good solutions that come out of discussion.
