I'm looking for a convenient and reliable way to make an R analysis reproducible, either at different times or across collaborators.
Listing the package versions or a `sessionInfo()` output is not very helpful, as it leaves the work of re-creating the environment entirely to the reader.
As I see it, for an analysis to be reproducible, the following have to be the same:
- the R environment (R, repositories (e.g. Bioconductor), packages, and maybe RStudio)
- data
- code
- the seed for random processes, set with `set.seed()`
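To illustrate the seed point: fixing the seed makes pseudo-random results repeatable across runs.

```r
set.seed(42)     # fix the RNG state
x <- rnorm(5)    # five draws from a standard normal

set.seed(42)     # same seed again...
y <- rnorm(5)
identical(x, y)  # ...gives identical draws: TRUE
```

Note that a seed only guarantees identical results on the same R version: R 3.6.0, for example, changed the default behaviour of `sample()`, so the R version itself is part of the environment you need to record.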
I'll post an answer, but would like to hear about other options.
- Comment by Konrad Rudolph (Sep 9, 2017): tl;dr: There is currently no way to accomplish this in R. packrat comes close but ultimately falls short, because it only supports loading one version per package, yet different packages might share the same dependency while requiring different versions of it. R needs something like Ruby's Bundler. I'm planning to work on it, but it won't work with packages due to fundamental architectural limitations.
4 Answers
You're right about trying to get the R environments to be as similar as possible, as easily as possible. It's not always necessary to have the exact environment to get the same results, but it definitely helps if you find yourself struggling to replicate results.
General tips for ensuring reproducibility, not specific to R
See Sandve et al., PLoS Comp Bio, 2013. They mention a number of things that you already do, like recording seed values and session info. I'll highlight a few points:
- Environment: Include a small script that lists all the necessary dependencies
- This should include all version numbers of languages and packages used
- Better implementation: make it a script that automatically installs dependencies
- Data: Share raw and/or processed data where possible
- At the very least, details on how to obtain the data you used
- This can be an accession number in a database, or a link to the source material
- Better implementation: again, make it a script that automatically downloads the raw data you used
- Code: Comment and document your code
- People (yourself included) need to understand what you've done and why
- Additionally, your work may be reproducible without being correct. If you (or anyone else) attempt to fix a problem in your code, it will be difficult to do so without understanding what the code is doing
- Code: Use version control
- Services like GitHub and BitBucket make it simple to share your exact work with collaborators
- Version control also ensures that you have a history of all the changes done to your work
- No manual manipulation of data
- If something is done to your data, write it down
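The "script that automatically installs dependencies" suggestion can be sketched in a few lines of R (the package names here are placeholders for whatever your analysis actually uses):

```r
# install_deps.R -- install the packages this analysis needs (hypothetical list)
deps <- c("dplyr", "ggplot2")

# install anything missing from the current library
missing <- setdiff(deps, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)

# print the versions actually in use, for the record
sessionInfo()
```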
R-specific tips
- The `packrat` package, as you mentioned in your answer
- Using a number of custom functions for analyzing your data? Compile those functions into an R package
  - Allows others to easily load/unload the exact functions you used through tools like CRAN, Bioconductor, or `devtools::install_github()`
  - Makes it easy to see whether these functions are working as intended, since they are separated from other parts of the analysis
- Make unit tests for your work with the `testthat` package
  - Helps ensure that your code is working as intended, and gives others confidence in your work as well
  - Write the tests using `testthat::test_that()`, and check that they pass using `R CMD check` or `devtools::check()`
  - This works especially well with R packages
  - This also works especially well with Travis CI, since you can automatically have your code tested after you change it and make a commit
- Consider Docker images for your environment
  - This bundles everything about the R environment together into one easily executable image (and it's not just limited to R)
  - These images can be easily shared through Docker Hub
  - Note this might not always be the best solution, since few people know Docker and it's not always simple to use. Docker also requires admin privileges, which may not be possible for work on computer clusters
  - There is lots of current discussion about Docker in science. See Boettiger, 2014, for an example using R
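As a small sketch of the `testthat` suggestion, assuming a hypothetical helper `normalize()` that rescales a vector to [0, 1]:

```r
library(testthat)

# hypothetical helper under test: rescale a numeric vector to [0, 1]
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

test_that("normalize() maps values onto [0, 1]", {
  out <- normalize(c(2, 4, 6))
  expect_equal(min(out), 0)
  expect_equal(max(out), 1)
  expect_equal(out, c(0, 0.5, 1))
})
```

In a package, tests like this live under `tests/testthat/` and run automatically during `R CMD check`.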
- Comment by llrs (Sep 1, 2017): If you use Docker, Bioconductor has images too.
Another option I'm trying at the moment is the combination of snakemake, conda.io and knitr.
Conda provides "package, dependency and environment management for any language" and can be used to set up the R environment as well as other software. Note that most bioinformatics software can be found in the related bioconda project.
Snakemake is a workflow management system that can be used to automate the preprocessing of your data in a reproducible manner (it also works nicely with conda, since conda environments can be specified directly in the Snakefile).
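A minimal sketch of that Snakemake/conda integration (rule, file, and environment names are made up; the `conda:` directive points a rule at an environment file):

```
# Snakefile -- hypothetical preprocessing rule with a per-rule conda environment
rule preprocess:
    input:
        "data/raw.csv"
    output:
        "data/clean.csv"
    conda:
        "envs/r.yaml"          # environment file listing e.g. r-base and R packages
    script:
        "scripts/preprocess.R"
```

Running `snakemake --use-conda` then creates and activates the environment automatically before executing the rule.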
Knitr handles dynamic report generation with R. It is well suited to documenting and reproducing your R workflow in reports, ticking your code and seed boxes.
- Comment by mgalardini (Sep 10, 2017): How does snakemake work nicely with conda, compared to more "traditional" systems like Makefiles?
- Comment by Michael Schubert (Sep 10, 2017): @mgalardini I read it as "is installable and works if installed via conda".
- Comment by Sebastian Müller (Sep 11, 2017): There are dedicated directives for conda in the Snakefile. I changed the text to make it clearer. Also have a look here for further info.
One way to achieve this is via packrat, a dependency management system developed by RStudio.
Setup:
2) Create a script in the project directory that starts RStudio with a specific R version:

       export RSTUDIO_WHICH_R=/path/to/R
       rstudio

3) Bring the project under packrat control. Install Bioconductor, install packages...
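Step 3 boils down to a few packrat calls (a sketch; the installed package is just an example):

```r
# run once inside the project directory
packrat::init()            # create a private, project-local library

install.packages("dplyr")  # example package; installs into the private library
packrat::snapshot()        # record the exact package versions in packrat.lock

# a collaborator (or future you) restores the recorded versions with:
packrat::restore()
```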
Every time you work on the project, run the above script in the project directory. See https://github.com/rstudio/packrat/issues/342#issuecomment-315059897 for details.
checkpoint is a system for ensuring you get packages from the same date/version. However, I confess that I haven't used it and thus can't vouch for it.
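In outline (the date is just an example), checkpoint pins package installation to a daily CRAN snapshot:

```r
library(checkpoint)

# install and load packages exactly as they existed on CRAN on this date
checkpoint("2017-09-01")

library(dplyr)  # now resolved against the 2017-09-01 snapshot
```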
My feeling is that the rest will have to be handled outside of this: code and data can be versioned in git (and dat). A complete solution might use something like Docker, as others have suggested.