# contentid

📦 R package for working with Content Identifiers
contentid seeks to facilitate reproducible workflows that involve external data files through the use of content identifiers.
Install the current development version using:

```r
# install.packages("remotes")
remotes::install_github("cboettig/contentid")
```
```r
library(contentid)
```

Instead of reading in data directly from a local file or URL, use `register()` to register permanent content-based identifiers for your external data file or URL:
```r
register("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542")
#> [1] "hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37"
```
Then, `resolve()` that content-based identifier in your scripts for a more reproducible workflow. Optionally, set `store = TRUE` to enable local caching:
```r
vostok <- resolve("hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37", store = TRUE)
```
`resolve()` will download the content and cryptographically verify that it matches the identifier, returning a local file path. Use that file path in the rest of our analysis script, e.g.
```r
co2 <- read.table(vostok, col.names = c("depth", "age_ice", "age_air", "co2"), skip = 21)
```
R users frequently write scripts which must load data from an external file – a step which increases friction in reuse and creates a common failure point in the reproducibility of the analysis later on. Reading a file directly from a URL is often preferable, since we don’t have to worry about distributing the data separately ourselves. For example, an analysis might read in the famous CO2 ice core data directly from the ORNL repository:
```r
co2 <- read.table("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542",
                  col.names = c("depth", "age_ice", "age_air", "co2"),
                  skip = 21)
```
However, we know that data hosted at a given URL could change or disappear, and not all data we want to work with is available at a URL to begin with. Digital Object Identifiers (DOIs) were created to deal with these problems of ‘link rot’. Unfortunately, there is no straightforward and general way to read data directly from a DOI: a DOI almost always resolves to a human-readable webpage rather than the data itself, DOIs often apply to collections of files rather than the individual source we want to read in our script, and we must frequently work with data that does not (yet) have a DOI. Registering a DOI for a dataset has gotten easier through repositories with simple APIs like Zenodo and figshare, but this is still an involved process and still leaves us without a mechanism to directly access the data. For instance, the data referenced above has DOI https://doi.org/10.3334/CDIAC/ATG.009, but this is still not easy to work directly into our R scripts.
contentid offers a complementary approach to addressing this challenge, which will work with data that has (or will later receive) a DOI, but also with arbitrary URLs or with local files. The basic idea is quite similar to referencing data by DOI: we first “register” an identifier, and then we use that identifier to retrieve the data in our scripts:
```r
register("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542")
#> [1] "hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37"
```
Registering the data returns an identifier that we can `resolve()` in our scripts to later read in the file:
```r
co2_file <- resolve("hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37")
co2_b <- read.table(co2_file, col.names = c("depth", "age_ice", "age_air", "co2"), skip = 21)
```
Note that we have manually embedded the identifier in our script, rather than automatically passing the identifier returned by `register()` directly to `resolve()`. The call to `register()` only needs to be run once, and thus doesn’t need to be embedded in our script (though it is harmless to include it, as it will always return the same identifier unless the data file itself changes).
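That idempotence is easy to check directly: registering the same, unchanged content twice yields the same identifier. A small sketch using the same URL as above:

```r
library(contentid)

# register() returns the content hash, so re-registering unchanged content
# gives back exactly the same identifier -- safe to leave in a script.
url <- "https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542"
id1 <- register(url)
id2 <- register(url)
identical(id1, id2)
#> [1] TRUE
```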
We can confirm this is the same data:
```r
identical(co2, co2_b)
#> [1] TRUE
```
As the identifier (`hash://sha256/...`) itself suggests, this is merely the SHA-256 hash of the requested file. This means that unless the data at that URL changes, we will always get that same identifier back when we register that file. If we have a copy of that data someplace else, we can verify it is indeed precisely the same data. For instance, contentid includes a copy of this file as well. Registering the local copy verifies that it indeed has the same hash:
```r
co2_file_c <- system.file("extdata", "vostok.icecore.co2", package = "contentid")
register(co2_file_c)
#> [1] "hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37"
```
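Because the identifier is just a SHA-256 digest, it can also be computed without touching any registry at all. A sketch using `contentid::content_id()` (assuming a version of the package that exports this helper):

```r
library(contentid)

# content_id() hashes a file and returns its identifier without adding an
# entry to the registry -- handy for a quick local integrity check.
path <- system.file("extdata", "vostok.icecore.co2", package = "contentid")
content_id(path)
```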
We have now registered the same content at two locations: a URL and a local file path. `resolve()` will use this registry information to access the requested content. `resolve()` will choose a local path first, allowing us to avoid re-downloading any content we already have. `resolve()` will verify the content of any local file or file downloaded from a URL matches the requested content hash before returning the path. If the file has been altered in any way, the hash will no longer match and `resolve()` will try the next source.
We can get a better sense of this process by querying for all available sources for our requested content:
```r
sources("hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37")
#> # A tibble: 5 × 2
#>   source                                                     date
#>   <chr>                                                      <dttm>
#> 1 /usr/local/lib/R/site-library/contentid/extdata/vostok.ic… 2022-12-01 17:39:31
#> 2 /tmp/Rtmps8N03b/sha256/94/12/9412325831dab22aeebdd674b6eb… 2022-12-01 17:39:30
#> 3 https://archive.softwareheritage.org/api/1/content/sha256… 2022-12-01 17:39:31
#> 4 https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-di… 2022-12-01 17:39:31
#> 5 https://zenodo.org/api/files/5967f986-b599-4492-9a08-94ce… 2022-12-01 17:26:27
```
Note that `sources()` has found more locations than we have registered above. This is because, in addition to maintaining a local registry of sources, contentid registers online sources with the Hash Archive, https://hash-archive.org. (The Hash Archive doesn’t store content, but only a list of public links at which content matching the hash has been seen.) `sources()` has also checked for this content on the Software Heritage Archive, whose periodic crawls of all public content on GitHub have also picked up a copy of this exact file. With each URL is a date at which it was last seen – repeated calls to `register()` will update this date, or lead to a source being deprecated for this content if the content it serves no longer matches the requested hash. We can view the history of all registrations of a given source using `history_url()`.
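For example, to inspect every registration of one particular URL (a sketch using the `history_url()` function mentioned above):

```r
library(contentid)

# List every time this URL has been registered, along with the content hash
# observed on each date -- so we can see when (if ever) the content changed.
history_url("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542")
```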
This approach can also be used with local or unpublished data. `register()`ing a local file only creates an entry in contentid’s local registry, so this does not provide a backup copy of the data or a mechanism to distribute it to collaborators. But it does provide a check that the data has not accidentally changed on our disk. If we move the data or eventually publish it, we have only to register these new locations, and we never need to update a script that accesses the data using calls to `resolve()` like `read.table(resolve("hash://sha256/xxx..."))` rather than local file names.
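A minimal sketch of that workflow, assuming a hypothetical local file `data/my_measurements.csv`:

```r
library(contentid)

# Register a private local file (hypothetical path): this only records the
# path and its hash in contentid's local registry.
id <- register("data/my_measurements.csv")

# Scripts refer to the content, not the path. If the file later moves or is
# published at a URL, only a new register() call is needed; this line stays.
df <- read.csv(resolve(id))
```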
If we prefer to keep a local copy of a specific dataset around (e.g. for data that is used frequently or across multiple projects), we can instruct `resolve()` to store a persistent copy in contentid’s local storage:
```r
co2_file <- resolve("hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37", store = TRUE)
```
Any future calls to `resolve()` with this hash on this machine will then always be able to load the content from the local store. This provides a convenient way to cache downloads for future use. Because the local store is based on the content identifier, repeatedly storing the same content will have no effect, and we cannot easily overwrite or accidentally delete this content.
`register()` and `resolve()` provide a low-friction mechanism to create a permanent identifier for external files and then resolve that identifier to an appropriate source. This can be useful in scripts that are frequently re-run as a way of caching the download step, and simultaneously helps ensure the script is more reproducible. While this approach is not fail-proof (since all registered locations could fail to produce the content), if all else fails our script itself still contains a cryptographic fingerprint of the data we could use to verify whether a given file was really the one used.
contentid is largely based on the design and implementation of https://hash-archive.org, and can interface with the https://hash-archive.org API or mimic it locally. contentid also draws inspiration from Preston, a biodiversity dataset tracker, and Elton, a command-line tool to update/clone, review and index existing species interaction datasets.
This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.