rdtLite is an R package that collects provenance as an R script executes. The resulting provenance provides a detailed record of the execution of the script and includes information on the steps that were performed and the intermediate data values that were created. The resulting provenance can be used for a wide variety of applications that include debugging scripts, cleaning code, and reproducing results. rdtLite can also be used to collect provenance during console sessions.
The provenance is stored in PROV-JSON format (for details see JSON-format.md). For immediate use it may be retrieved from memory using the prov.json function. For later use the provenance is also written to the file prov.json. This file and associated files are written by default to the R session temporary directory. The user may change this location by (1) using the optional parameter prov.dir in the prov.run or prov.init functions, or (2) setting the prov.dir option (e.g. by using the R options command or editing the Rprofile.site or .Rprofile file). If prov.dir is set to “.”, the current working directory is used.
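As a minimal sketch of approach (2), the prov.dir option can be set with base R's options() before provenance collection starts. The directory name below is a hypothetical placeholder chosen for illustration, and the setting only lasts for the current session unless it is placed in a profile file.

```r
# Hypothetical provenance location; any writable directory would do.
prov_location <- file.path(tempdir(), "my-provenance")
dir.create(prov_location, showWarnings = FALSE, recursive = TRUE)

# Set the prov.dir option for the current R session only.
options(prov.dir = prov_location)

# Inspect the value that later prov.run or prov.init calls are expected to use.
getOption("prov.dir")
```

Placing the same options() call in .Rprofile would apply it to every new session started from that profile.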
rdtLite provides two modes of operation. In script mode, the prov.run function is used to execute a script and collect provenance as the script executes. In console mode, provenance is collected during a console session. Here the prov.init function is used to initiate provenance collection, prov.save is used to save provenance collected since the last time prov.save was used, and prov.quit is used to save and close the provenance file.
Simple data values are stored in the PROV-JSON file. “Snapshots” of complex data values (e.g. data frames) are optionally stored by adjusting the value of the parameter max.snapshot.size in prov.run or prov.init. To collect provenance for inputs/outputs only and not for individual statements, set the parameter details to FALSE in prov.run.
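The two parameters just described might be used as in the following sketch. The parameter names details, max.snapshot.size, and prov.dir are taken from the text above; the script contents, the value 10, and its unit (assumed here to be kilobytes) are illustrative assumptions that should be checked against the package help. The calls are skipped if rdtLite is not installed.

```r
# Write a tiny throwaway script so the example is self-contained.
script <- file.path(tempdir(), "my-script.R")
writeLines(c("x <- 1:10", "y <- data.frame(x = x, sq = x^2)"), script)

if (requireNamespace("rdtLite", quietly = TRUE)) {
  library(rdtLite)

  # Inputs/outputs only, without statement-level provenance.
  prov.run(script, prov.dir = tempdir(), details = FALSE)

  # Statement-level provenance, with snapshots of complex values such as
  # the data frame y. The limit of 10 is an assumed, illustrative value.
  prov.run(script, prov.dir = tempdir(), max.snapshot.size = 10)
}
```

Both calls write under the same provenance directory, so in practice only one of them would normally be run for a given script.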
rdtLite belongs to a collection of RTools developed as part of a larger project on End-to-end-provenance.
rdtLite currently requires R version 3.5.0 (or later) and the following R packages: curl, devtools, digest, ggplot2, grDevices, gtools, jsonlite, knitr, methods, stringr, tools, utils, XML.
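Before installing, these requirements can be checked with base R alone. The following is a sketch rather than part of rdtLite itself; the variable names are illustrative.

```r
# Packages listed above as requirements.
required <- c("curl", "devtools", "digest", "ggplot2", "grDevices", "gtools",
              "jsonlite", "knitr", "methods", "stringr", "tools", "utils", "XML")

# TRUE when the running R is version 3.5.0 or later.
r_ok <- getRversion() >= "3.5.0"

# Names of required packages that are not currently installed.
missing_pkgs <- setdiff(required, rownames(installed.packages()))

if (!r_ok) {
  message("R 3.5.0 or later is needed; found ", as.character(getRversion()))
}
if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
}
```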
rdtLite is easily installed from GitHub using devtools:
    library(devtools)
    install_github("End-to-end-provenance/rdtLite")

Once installed, use the R library command to load rdtLite:
    library(rdtLite)

Note that all exported rdtLite functions begin with “prov.” to avoid confusion with variable or function names in the main script or other libraries.
To capture provenance for an R script, set the working directory, load the rdtLite package (as above), and enter the following:
    prov.run("my-script.R")

where “my-script.R” is an R script in the working directory. The prov.run command will execute the script and save the provenance in a subdirectory called “prov_my-script” under the current provenance directory (as above).
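Once a script has been run, the saved prov.json file can be inspected like any other JSON document. The sketch below assumes the file is located at prov_my-script/prov.json under the provenance directory, as described above, and uses jsonlite (one of the required packages) to parse it. The directory name prov-demo and the script contents are hypothetical, and the rdtLite calls are skipped if the packages are not installed.

```r
# Throwaway script named as in the example above.
script <- file.path(tempdir(), "my-script.R")
writeLines("total <- sum(1:10)", script)

# Hypothetical provenance directory for this sketch.
prov_dir <- file.path(tempdir(), "prov-demo")
dir.create(prov_dir, showWarnings = FALSE)

# Expected location of the saved provenance, following the naming above.
prov_file <- file.path(prov_dir, "prov_my-script", "prov.json")

if (requireNamespace("rdtLite", quietly = TRUE) &&
    requireNamespace("jsonlite", quietly = TRUE)) {
  library(rdtLite)
  prov.run(script, prov.dir = prov_dir)

  if (file.exists(prov_file)) {
    # Parse the PROV-JSON file into an R list and show its top-level sections.
    prov <- jsonlite::fromJSON(prov_file)
    print(names(prov))
  }
}
```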
To capture provenance for a console session, enter the following:
    prov.init()

and enter commands at the R console. To save the provenance collected so far to a subdirectory called “prov_console” under the current provenance directory (as above), enter the following:
    prov.save()

To save the provenance and quit provenance collection, enter the following:
    prov.quit()

Note that various parameters of the prov.run and prov.init functions may be used to control where the provenance is stored and whether earlier provenance at the same location should be overwritten.
The source code for rdt and rdtLite is contained in the RDataTracker repository. A script is run nightly to create separate repos for each tool to facilitate GitHub installations. If you’d like to see the long history of code development and issues for rdt and rdtLite, or enter new issues, please see the RDataTracker repository.
The RDataTracker repository contains the necessary files to build and install rdt and rdtLite and to run extensive regression tests using Apache Ant. To do this, clone the RDataTracker repo and see the file README_Build_and_Test.md.