I'm looking for a convenient and reliable way to make an R analysis reproducible, either at different times or across collaborators.
Listing the package versions or a `sessionInfo()` output is not very helpful, as it leaves the work of re-creating the environment entirely to the reader.
As I see it, for an analysis to be reproducible, the following have to be the same:
- the R environment (R, repositories (e.g. Bioconductor), packages, and maybe RStudio)
- data
- code
- the seed for random processes, set with `set.seed()`
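To illustrate the seed point: fixing the seed makes pseudo-random results repeatable across runs.

```r
set.seed(42)     # fix the RNG state
x <- rnorm(5)    # five draws from a standard normal

set.seed(42)     # same seed again...
y <- rnorm(5)
identical(x, y)  # ...gives identical draws: TRUE
```

Note that a seed only guarantees identical results on the same R version: R 3.6.0, for example, changed the default behaviour of `sample()`, so the R version itself is part of the environment you need to record.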
I'll post an answer, but would like to hear about other options.
- Comment by Konrad Rudolph (Sep 9, 2017): tl;dr: There is currently no way to accomplish this in R. packrat comes close but ultimately falls short, because it only supports loading one version per package, yet different packages might share the same dependency while requiring different versions of it. R needs something like Ruby's Bundler. I'm planning to work on it, but it won't work with packages due to fundamental architectural limitations.
4 Answers
You're right about trying to get the R environments to be as similar as possible, as easily as possible. It's not always necessary to have the exact environment to get the same results, but it definitely helps if you find yourself struggling to replicate results.
General tips for ensuring reproducibility, not specific to R
See Sandve et al., PLoS Comp Bio, 2013. They mention a number of things that you already do, like recording seed values and session info. I'll highlight a few points:
- Environment: Include a small script that lists all the necessary dependencies
- This should include all version numbers of languages and packages used
- Better implementation: make it a script that automatically installs dependencies
- Data: Share raw and/or processed data where possible
- At the very least, details on how to obtain the data you used
- This can be an accession number in a database, or a link to the source material
- Better implementation: again, make it a script that automatically downloads the raw data you used
- Code: Comment and document your code
- People (yourself included) need to understand what you've done and why
- Additionally, your work may be reproducible without being correct. If you (or anyone else) attempt to fix a problem in your code, it will be difficult to do so without understanding what the code is doing
- Code: Use version control
- Services like GitHub and BitBucket make it simple to share your exact work with collaborators
- Version control also ensures that you have a history of all the changes done to your work
- No manual manipulation of data
- If something is done to your data, write it down
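The "script that automatically installs dependencies" suggestion can be sketched in a few lines of R (the package names here are placeholders for whatever your analysis actually uses):

```r
# install_deps.R -- install the packages this analysis needs (hypothetical list)
deps <- c("dplyr", "ggplot2")

# install anything missing from the current library
missing <- setdiff(deps, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)

# print the versions actually in use, for the record
sessionInfo()
```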
R-specific tips
- The `packrat` package, as you mentioned in your answer
- Using a number of custom functions for analyzing your data? Compile those functions into an R package
  - Allows others to easily load/unload the exact functions you used through tools like CRAN, Bioconductor, or `devtools::install_github()`
  - Makes it easy to see whether these functions are working as intended, since they are separated from other parts of the analysis
- Make unit tests for your work with the `testthat` package
  - Helps ensure that your code is working as intended, and gives others confidence in your work as well
  - Write the tests using `testthat::test_that()`, and check that they pass using `R CMD check` or `devtools::check()`
  - This works especially well with R packages
  - This also works especially well with Travis CI, since you can automatically have your code tested after you change it and make a commit
- Consider Docker images for your environment
  - This bundles everything about the R environment together into one easily executable image (and it's not just limited to R)
  - These images can be easily shared through Docker Hub
  - Note this might not always be the best solution, since few people know Docker and it's not always simple to use. Docker also requires admin privileges, which may not be possible for work on computer clusters
  - There is lots of current discussion about Docker in science. See Boettiger, 2014, for an example using R
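As a small sketch of the `testthat` suggestion, assuming a hypothetical helper `normalize()` that rescales a vector to [0, 1]:

```r
library(testthat)

# hypothetical helper under test: rescale a numeric vector to [0, 1]
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

test_that("normalize() maps values onto [0, 1]", {
  out <- normalize(c(2, 4, 6))
  expect_equal(min(out), 0)
  expect_equal(max(out), 1)
  expect_equal(out, c(0, 0.5, 1))
})
```

In a package, tests like this live under `tests/testthat/` and run automatically during `R CMD check`.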
- Comment by llrs (Sep 1, 2017): If you use Docker, Bioconductor has images too.
Another option I'm trying at the moment is the combination of snakemake, conda.io and knitr.
Conda provides "package, dependency and environment management for any language" and can be used to set up the R environment as well as other software. Note that most bioinformatics software can be found in the related bioconda project.
Snakemake is a workflow management system that can be used to automate the preprocessing of your data in a reproducible manner (it also works nicely with conda, since conda environments can be specified directly in the Snakefile).
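A minimal sketch of that Snakemake/conda integration (rule, file, and environment names are made up; the `conda:` directive points a rule at an environment file):

```
# Snakefile -- hypothetical preprocessing rule with a per-rule conda environment
rule preprocess:
    input:
        "data/raw.csv"
    output:
        "data/clean.csv"
    conda:
        "envs/r.yaml"          # environment file listing e.g. r-base and R packages
    script:
        "scripts/preprocess.R"
```

Running `snakemake --use-conda` then creates and activates the environment automatically before executing the rule.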
Knitr handles dynamic report generation with R. It is well suited to documenting and reproducing your R workflow in reports, ticking your code and seed boxes.
- Comment by mgalardini (Sep 10, 2017): How does snakemake work nicely with conda, compared to more "traditional" systems like Makefiles?
- Comment by Michael Schubert (Sep 10, 2017): @mgalardini I read it as "is installable and works if installed via conda".
- Comment by Sebastian Müller (Sep 11, 2017): There are dedicated directives for conda in the Snakefile. I changed the text to make it clearer. Also have a look here for further info.
One way to achieve this is via packrat, a dependency management system developed by RStudio.
Setup:
2) Create a script in the project directory that starts RStudio with a specific R version:

       export RSTUDIO_WHICH_R=/path/to/R
       rstudio

3) Bring the project under packrat control. Install Bioconductor, install packages...
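Step 3 boils down to a few packrat calls (a sketch; the installed package is just an example):

```r
# run once inside the project directory
packrat::init()            # create a private, project-local library

install.packages("dplyr")  # example package; installs into the private library
packrat::snapshot()        # record the exact package versions in packrat.lock

# a collaborator (or future you) restores the recorded versions with:
packrat::restore()
```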
Every time you work on the project, run the above script in the project directory. See https://github.com/rstudio/packrat/issues/342#issuecomment-315059897 for details.
checkpoint is a system for ensuring you get packages from the same date/version. However, I confess that I haven't used it and thus can't vouch for it.
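In outline (the date is just an example), checkpoint pins package installation to a daily CRAN snapshot:

```r
library(checkpoint)

# install and load packages exactly as they existed on CRAN on this date
checkpoint("2017-09-01")

library(dplyr)  # now resolved against the 2017-09-01 snapshot
```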
My feeling is that the rest will have to be handled outside of this: code and data can be versioned in git (and dat). A complete solution might use something like Docker, as others have suggested.