# contentid

📦 R package for working with Content Identifiers
contentid seeks to facilitate reproducible workflows that involve external data files through the use of content identifiers.
Install the current development version using:

```r
# install.packages("remotes")
remotes::install_github("cboettig/contentid")
```
```r
library(contentid)
```

Instead of reading in data directly from a local file or URL, use `register()` to register permanent content-based identifiers for your external data file or URL:
```r
register("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542")
#> [1] "hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37"
```
Then, `resolve()` that content-based identifier in your scripts for a more reproducible workflow. Optionally, set `store = TRUE` to enable local caching:
```r
vostok <- resolve("hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37", store = TRUE)
```
`resolve()` will download the content and cryptographically verify that it matches the identifier, returning a local file path. Use that file path in the rest of our analysis script, e.g.
```r
co2 <- read.table(vostok, col.names = c("depth", "age_ice", "age_air", "co2"), skip = 21)
```
R users frequently write scripts which must load data from an external file – a step which increases friction in reuse and creates a common failure point in the reproducibility of the analysis later on. Reading a file directly from a URL is often preferable, since we don’t have to worry about distributing the data separately ourselves. For example, an analysis might read in the famous CO2 ice core data directly from the ORNL repository:
```r
co2 <- read.table("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542",
                  col.names = c("depth", "age_ice", "age_air", "co2"),
                  skip = 21)
```
However, we know that data hosted at a given URL could change or disappear, and not all data we want to work with is available at a URL to begin with. Digital Object Identifiers (DOIs) were created to deal with these problems of ‘link rot’. Unfortunately, there is no straightforward and general way to read data directly from a DOI: a DOI almost always resolves to a human-readable webpage rather than the data itself, DOIs often apply to collections of files rather than the individual source we want to read in our script, and we must frequently work with data that does not (yet) have a DOI. Registering a DOI for a dataset has gotten easier through repositories with simple APIs like Zenodo and figshare, but this is still an involved process and still leaves us without a mechanism to directly access the data. For instance, the data referenced above has DOI https://doi.org/10.3334/CDIAC/ATG.009, but this is still not easy to work directly into our R scripts.
contentid offers a complementary approach to addressing this challenge, which will work with data that has (or will later receive) a DOI, but also with arbitrary URLs or with local files. The basic idea is quite similar to referencing data by DOI: we first “register” an identifier, and then we use that identifier to retrieve the data in our scripts:
```r
register("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542")
#> [1] "hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37"
```
Registering the data returns an identifier that we can `resolve()` in our scripts to later read in the file:
```r
co2_file <- resolve("hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37")
co2_b <- read.table(co2_file, col.names = c("depth", "age_ice", "age_air", "co2"), skip = 21)
```
Note that we have manually embedded the identifier in our script, rather than automatically passing the identifier returned by `register()` directly to `resolve()`. The call to `register()` only needs to be run once, and thus doesn’t need to be embedded in our script (though it is harmless to include it, as it will always return the same identifier unless the data file itself changes).
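That idempotence is easy to check directly: registering the same, unchanged content twice yields the same identifier. A small sketch using the same URL as above:

```r
library(contentid)

# register() returns the content hash, so re-registering unchanged content
# gives back exactly the same identifier -- safe to leave in a script.
url <- "https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542"
id1 <- register(url)
id2 <- register(url)
identical(id1, id2)
#> [1] TRUE
```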
We can confirm this is the same data:
```r
identical(co2, co2_b)
#> [1] TRUE
```
As the identifier (`hash://sha256/...`) itself suggests, this is merely the SHA-256 hash of the requested file. This means that unless the data at that URL changes, we will always get that same identifier back when we register that file. If we have a copy of that data someplace else, we can verify it is indeed precisely the same data. For instance, contentid includes a copy of this file as well. Registering the local copy verifies that it indeed has the same hash:
```r
co2_file_c <- system.file("extdata", "vostok.icecore.co2", package = "contentid")
register(co2_file_c)
#> [1] "hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37"
```
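Because the identifier is just a SHA-256 digest, it can also be computed without touching any registry at all. A sketch using `contentid::content_id()` (assuming a version of the package that exports this helper):

```r
library(contentid)

# content_id() hashes a file and returns its identifier without adding an
# entry to the registry -- handy for a quick local integrity check.
path <- system.file("extdata", "vostok.icecore.co2", package = "contentid")
content_id(path)
```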
We have now registered the same content at two locations: a URL and a local file path. `resolve()` will use this registry information to access the requested content. `resolve()` will choose a local path first, allowing us to avoid re-downloading any content we already have. `resolve()` will verify the content of any local file or file downloaded from a URL matches the requested content hash before returning the path. If the file has been altered in any way, the hash will no longer match and `resolve()` will try the next source.
We can get a better sense of this process by querying for all available sources for our requested content:
```r
sources("hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37")
#> # A tibble: 5 × 2
#>   source                                                     date
#>   <chr>                                                      <dttm>
#> 1 /usr/local/lib/R/site-library/contentid/extdata/vostok.ic… 2022-12-01 17:39:31
#> 2 /tmp/Rtmps8N03b/sha256/94/12/9412325831dab22aeebdd674b6eb… 2022-12-01 17:39:30
#> 3 https://archive.softwareheritage.org/api/1/content/sha256… 2022-12-01 17:39:31
#> 4 https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-di… 2022-12-01 17:39:31
#> 5 https://zenodo.org/api/files/5967f986-b599-4492-9a08-94ce… 2022-12-01 17:26:27
```
Note that `sources()` has found more locations than we have registered above. This is because, in addition to maintaining a local registry of sources, contentid registers online sources with the Hash Archive, https://hash-archive.org. (The Hash Archive doesn’t store content, but only a list of public links at which content matching the hash has been seen.) `sources()` has also checked for this content on the Software Heritage Archive, whose periodic crawls of all public content on GitHub have also picked up a copy of this exact file. With each URL is a date at which it was last seen – repeated calls to `register()` will update this date, or lead to a source being deprecated for this content if the content it serves no longer matches the requested hash. We can view the history of all registrations of a given source using `history_url()`.
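For example, to inspect every registration of one particular URL (a sketch using the `history_url()` function mentioned above):

```r
library(contentid)

# List every time this URL has been registered, along with the content hash
# observed on each date -- so we can see when (if ever) the content changed.
history_url("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542")
```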
This approach can also be used with local or unpublished data. `register()`ing a local file only creates an entry in contentid’s local registry, so this does not provide a backup copy of the data or a mechanism to distribute it to collaborators. But it does provide a check that the data has not accidentally changed on our disk. If we move the data or eventually publish it, we have only to register these new locations, and we never need to update a script that accesses the data using calls to `resolve()` like `read.table(resolve("hash://sha256/xxx..."))` rather than local file names.
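A minimal sketch of that workflow, assuming a hypothetical local file `data/my_measurements.csv`:

```r
library(contentid)

# Register a private local file (hypothetical path): this only records the
# path and its hash in contentid's local registry.
id <- register("data/my_measurements.csv")

# Scripts refer to the content, not the path. If the file later moves or is
# published at a URL, only a new register() call is needed; this line stays.
df <- read.csv(resolve(id))
```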
If we prefer to keep a local copy of a specific dataset around (e.g. for data that is used frequently or across multiple projects), we can instruct `resolve()` to store a persistent copy in contentid’s local storage:
```r
co2_file <- resolve("hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37", store = TRUE)
```
Any future calls to `resolve()` with this hash on this machine will then always be able to load the content from the local store. This provides a convenient way to cache downloads for future use. Because the local store is based on the content identifier, repeatedly storing the same content will have no effect, and we cannot easily overwrite or accidentally delete this content.
`register()` and `resolve()` provide a low-friction mechanism to create a permanent identifier for external files and then resolve that identifier to an appropriate source. This can be useful in scripts that are frequently re-run as a way of caching the download step, and simultaneously helps ensure the script is more reproducible. While this approach is not fail-proof (since all registered locations could fail to produce the content), if all else fails our script itself still contains a cryptographic fingerprint of the data we could use to verify whether a given file was really the one used.
contentid is largely based on the design and implementation of https://hash-archive.org, and can interface with the https://hash-archive.org API or mimic it locally. contentid also draws inspiration from Preston, a biodiversity dataset tracker, and Elton, a command-line tool to update/clone, review and index existing species interaction datasets.
This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.