| Type: | Package |
| Title: | General Purpose 'Oai-PMH' Services Client |
| Description: | A general purpose client to work with any 'OAI-PMH' (Open Archives Initiative Protocol for 'Metadata' Harvesting) service. The 'OAI-PMH' protocol is described at http://www.openarchives.org/OAI/openarchivesprotocol.html. Functions are provided to work with the 'OAI-PMH' verbs: 'GetRecord', 'Identify', 'ListIdentifiers', 'ListMetadataFormats', 'ListRecords', and 'ListSets'. |
| Version: | 0.4.0 |
| License: | MIT + file LICENSE |
| URL: | https://docs.ropensci.org/oai/, https://github.com/ropensci/oai |
| BugReports: | https://github.com/ropensci/oai/issues |
| VignetteBuilder: | knitr |
| LazyData: | true |
| Encoding: | UTF-8 |
| Imports: | xml2 (≥ 1.0.0), httr (≥ 1.2.0), plyr (≥ 1.8.4), stringr (≥ 1.1.0), tibble (≥ 1.2) |
| Suggests: | DBI, knitr, RSQLite, testthat, markdown, covr |
| RoxygenNote: | 7.2.1 |
| NeedsCompilation: | no |
| Packaged: | 2022-11-10 11:32:01 UTC; mbojan |
| Author: | Scott Chamberlain [aut], Michal Bojanowski |
| Maintainer: | Michal Bojanowski <michal2992@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2022-11-10 16:10:02 UTC |
OAI-PMH Client
Description

oai is an R client to work with OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) services, a protocol developed by the Open Archives Initiative (https://en.wikipedia.org/wiki/Open_Archives_Initiative). OAI-PMH uses XML data format transported over HTTP.
OAI-PMH Info
See the OAI-PMH V2 specification at http://www.openarchives.org/OAI/openarchivesprotocol.html
Implementation details
oai is built on xml2 and httr. In addition, we give back data.frame's whenever possible to make data comprehension, manipulation, and visualization easier. We also have functions to fetch a large directory of OAI-PMH services - it isn't exhaustive, but does contain a lot.
Paging
Instead of paging with e.g., page and per_page parameters, OAI-PMH uses (optionally) resumptionTokens, with an optional expiration date. These tokens can be used to continue on to the next chunk of data, if the first request did not get to the end. Often, OAI-PMH services limit each request to 50 records, but the chunk size is set by each provider and may differ. The API of this package is such that we while loop for you internally until we get all records. We may in the future expose e.g., a limit parameter so you can say how many records you want, but we haven't done this yet.
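The resumption-token loop described above can be sketched in base R as follows. This is an illustration only, not the package's internal code: fetch_chunk() and harvest_all() are hypothetical names, and fetch_chunk() serves made-up identifiers from an in-memory vector instead of contacting a real OAI-PMH service.

```r
# Hypothetical stand-in for a single OAI-PMH request (not part of oai).
# It serves `chunk_size` items per call and returns a resumption token
# (here simply the next start position) until the data are exhausted.
fetch_chunk <- function(token = NULL, all_ids = sprintf("id-%03d", 1:120),
                        chunk_size = 50) {
  start <- if (is.null(token)) 1L else as.integer(token)
  end <- min(start + chunk_size - 1L, length(all_ids))
  next_token <- if (end < length(all_ids)) as.character(end + 1L) else NULL
  list(records = all_ids[start:end], token = next_token)
}

# Keep requesting chunks while the (mock) server keeps returning a token.
harvest_all <- function(fetch = fetch_chunk) {
  chunks <- list()
  token <- NULL
  done <- FALSE
  while (!done) {
    res <- fetch(token)
    chunks[[length(chunks) + 1L]] <- res$records
    token <- res$token
    done <- is.null(token)  # no token means the last chunk was reached
  }
  unlist(chunks)
}

ids <- harvest_all()
length(ids)  # 120 with the made-up data above (chunks of 50, 50 and 20)
```

A limit parameter of the kind mentioned above could be added to such a loop by stopping once enough records have been collected, but that is not something the package currently offers.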
Acknowledgements
Michal Bojanowski's contributions were supported by the (Polish) National Science Center (NCN) through grant 2012/07/D/HS6/01971.
Author(s)
Scott Chamberlain myrmecocystus@gmail.com
Michal Bojanowski michal2992@gmail.com
See Also
Useful links:
Report bugs at https://github.com/ropensci/oai/issues
Count OAI-PMH identifiers for a data provider.
Description
Count OAI-PMH identifiers for a data provider.
Usage
count_identifiers(url = "http://export.arxiv.org/oai2", prefix = "oai_dc", ...)
Arguments
url | (character) OAI-PMH base url. Defaults to the URL for arXiv's OAI-PMH server (http://export.arxiv.org/oai2) or GBIF's OAI-PMH server (http://api.gbif.org/v1/oai-pmh/registry) |
prefix | Specifies the metadata format that the records will be returned in |
... | Curl options passed on to |
Details
Note that some OAI providers do not include the entry completeListSize (http://www.openarchives.org/OAI/openarchivesprotocol.html#FlowControl) in which case we return an NA - which does not mean 0, but rather we don't know.
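How a missing completeListSize maps to NA can be illustrated with a small base-R sketch. The helper name complete_list_size() and the two XML fragments are assumptions made for illustration; the package parses responses with xml2 and may handle this differently.

```r
# Hypothetical helper (not part of oai): extract the optional
# completeListSize attribute from a resumptionToken element.
complete_list_size <- function(xml_text) {
  m <- regmatches(xml_text, regexpr('completeListSize="[0-9]+"', xml_text))
  if (length(m) == 0L) {
    return(NA_integer_)  # attribute absent: size unknown, not zero
  }
  as.integer(gsub("[^0-9]", "", m))
}

# Made-up response fragments, one with and one without the attribute
with_size <- '<resumptionToken completeListSize="6" cursor="0">abc</resumptionToken>'
without_size <- '<resumptionToken cursor="0">abc</resumptionToken>'

complete_list_size(with_size)     # 6
complete_list_size(without_size)  # NA
```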
Examples
## Not run: 
count_identifiers()

# curl options
# library("httr")
# count_identifiers(config = verbose())

## End(Not run)
Result dumpers
Description
Result dumpers are functions that handle the chunks of results from an OAI-PMH service "on the fly". Handling can include processing, writing to files or databases, etc.
Usage
dump_raw_to_txt(
  res,
  args,
  as,
  file_pattern = "oaidump",
  file_dir = ".",
  file_ext = ".xml"
)
dump_to_rds(
  res,
  args,
  as,
  file_pattern = "oaidump",
  file_dir = ".",
  file_ext = ".rds"
)
dump_raw_to_db(res, args, as, dbcon, table_name, field_name, ...)
Arguments
res | results, depends on |
args | list, query arguments, not to be specified by the user |
as | character, type of result to return, not to be specified by the user |
file_pattern,file_dir,file_ext | character respectively: initial part of the file name, directory name, and file extension used to create file names. These arguments are passed to |
dbcon | DBI-compliant database connection |
table_name | character, name of the database table to write into |
field_name | character, name of the field in database table to write into |
... | arguments passed to/from other functions |
Details
Often the result of a request to an OAI-PMH service is so large that it is split into chunks that need to be requested separately using resumptionToken. By default functions like list_identifiers() or list_records() request these chunks under the hood and return all of them concatenated in a single R object. This is convenient but insufficient when dealing with large result sets that might not fit into RAM. A result dumper is a function that is called on each result chunk. Dumper functions can write chunks to files or databases, include initial pre-processing or extraction, and so on.
A result dumper needs to be a function that accepts at least the arguments: res, args, as. Their values are supplied internally by the enclosing function. There may be additional arguments, including .... Dumpers should return NULL or a value that will be collected and returned by the function calling the dumper (e.g. list_records()).
Currently result dumpers can be used with functions: list_identifiers(), list_records(), and list_sets(). To use a dumper with one of these functions you need to:
Pass it as an additional argument dumper
Pass optional additional arguments to the dumper function in a list as the dumper_args argument
See Examples. Below we provide more details on the dumpers currently implemented.
dump_raw_to_txt writes raw XML to text files. It requires as == "raw". File names are created using tempfile(). By default they are written in the current working directory and have a format oaidump*.xml where * is a random string in hex.
dump_to_rds saves results in an .rds file via saveRDS(). Type of object being saved is determined by the as argument. File names are generated in the same way as by dump_raw_to_txt, but with default extension .rds
dump_raw_to_db writes raw XML to a single text column of a table in a database. Requires as == "raw". Database connection dbcon should be a connection object as created by DBI::dbConnect() from package DBI. As such, it can connect to any database supported by DBI. The records are written to a field field_name in a table table_name using DBI::dbWriteTable(). If the table does not exist, it is created. If it does, the records are appended. Any additional arguments are passed to DBI::dbWriteTable()
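A custom dumper following the contract described in this section might look like the sketch below. The function name dump_chunk_to_file() is hypothetical, and the sketch assumes that with as = "raw" the res argument is a character string holding one chunk of XML; it is called by hand on a made-up chunk so that it runs without a network connection.

```r
# Hypothetical custom dumper (not part of oai). It accepts the required
# arguments res, args and as, plus one extra argument of its own.
# Assumption: with as = "raw", res is a character string with one chunk.
dump_chunk_to_file <- function(res, args, as, file_dir = tempdir(), ...) {
  stopifnot(as == "raw")
  # Build a unique file name in the same spirit as the built-in dumpers
  path <- tempfile(pattern = "oaidump", tmpdir = file_dir, fileext = ".xml")
  writeLines(res, path)
  # The returned value is what the calling function would collect
  path
}

# Call the dumper by hand on a made-up chunk to check that it works
fake_chunk <- "<OAI-PMH><ListIdentifiers/></OAI-PMH>"
out <- dump_chunk_to_file(fake_chunk, args = list(), as = "raw")
file.exists(out)
chunk_back <- readLines(out)
unlink(out)  # clean-up
```

With the package loaded, such a function would be passed as the dumper argument, with file_dir supplied through dumper_args.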
Value
Dumpers should return NULL or a value that will be collected and returned by the function using the dumper.
dump_raw_to_txt returns the name of the created file.
dump_to_rds returns the name of the created file.
dump_raw_to_db returns NULL
References
OAI-PMH specification https://www.openarchives.org/OAI/openarchivesprotocol.html
See Also
Functions supporting the dumpers: list_identifiers(), list_sets(), and list_records()
Examples
## Not run: 
### Dumping raw XML to text files

# This will write a set of XML files to a temporary directory
fnames <- list_identifiers(from="2018-06-01T", until="2018-06-14T",
  as="raw", dumper=dump_raw_to_txt,
  dumper_args=list(file_dir=tempdir()))
# vector of file names created
str(fnames)
all( file.exists(fnames) )
# clean-up
unlink(fnames)

### Dumping raw XML to a database

# Connect to in-memory SQLite database
con <- DBI::dbConnect(RSQLite::SQLite(), dbname=":memory:")
# Harvest and dump the results into field "bar" of table "foo"
list_identifiers(from="2018-06-01T", until="2018-06-14T",
  as="raw", dumper=dump_raw_to_db,
  dumper_args=list(dbcon=con, table_name="foo", field_name="bar") )
# Count records, should be 101
DBI::dbGetQuery(con, "SELECT count(*) as no_records FROM foo")
DBI::dbDisconnect(con)

## End(Not run)
Get records
Description
Get records
Usage
get_records(
  ids,
  prefix = "oai_dc",
  url = "http://api.gbif.org/v1/oai-pmh/registry",
  as = "parsed",
  ...
)
Arguments
ids | The OAI-PMH identifier for the record. One or more. Required. |
prefix | specifies the metadata format that the records will be returned in. Default: |
url | (character) OAI-PMH base url. Defaults to the URL for arXiv's OAI-PMH server (http://export.arxiv.org/oai2) or GBIF's OAI-PMH server (http://api.gbif.org/v1/oai-pmh/registry) |
as | (character) What to return. One of "parsed" (default), or "raw" (raw text) |
... | Curl options passed on to |
Details
There is a finite set of result formats, determined by the OAI prefix. We will provide parsers as we have time, and as users express interest. For prefix types we have parsers for, we return a list of data.frame's: for each identifier, one data.frame for the header bits of data, and one data.frame for the metadata bits of data.
For prefixes we don't have parsers for, we fall back to returning raw XML, so you can at least parse the XML yourself.
Because some XML nodes are duplicated, we join together the values of duplicated node names, separated by a semicolon (;) with no spaces. You can separate them yourself easily.
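Separating the semicolon-joined values again takes one call to strsplit(). The sketch below uses a made-up metadata table; the column names identifier and creator and their values are placeholders, not output from a real service.

```r
# Made-up metadata with duplicated nodes joined by ";" (placeholder values)
metadata <- data.frame(
  identifier = c("rec-1", "rec-2"),
  creator = c("Smith, A;Jones, B", "Lee, C"),
  stringsAsFactors = FALSE
)

# Split each joined value back into its individual entries
creators <- strsplit(metadata$creator, ";", fixed = TRUE)
names(creators) <- metadata$identifier
creators

# Number of creators per record
lengths(creators)
```

Note that a value which itself contains a semicolon cannot be told apart from two joined values this way; for such data, as = "raw" and manual XML parsing is the safer route.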
Value
a named list of data.frame's, or lists, or raw text
Examples
## Not run: 
get_records("87832186-00ea-44dd-a6bf-c2896c4d09b4")

ids <- c("87832186-00ea-44dd-a6bf-c2896c4d09b4",
  "d981c07d-bc43-40a2-be1f-e786e25106ac")
(res <- get_records(ids))
lapply(res, "[[", "header")
lapply(res, "[[", "metadata")
do.call(rbind, lapply(res, "[[", "header"))
do.call(rbind, lapply(res, "[[", "metadata"))

# Get raw text
get_records("d981c07d-bc43-40a2-be1f-e786e25106ac", as = "raw")

# from arxiv.org
get_records("oai:arXiv.org:0704.0001", url = "http://export.arxiv.org/oai2")

## End(Not run)
Identify the OAI-PMH service for each data provider.
Description
Identify the OAI-PMH service for each data provider.
Usage
id(url, as = "parsed", ...)
Arguments
url | (character) OAI-PMH base url. Defaults to the URL for arXiv's OAI-PMH server (http://export.arxiv.org/oai2) or GBIF's OAI-PMH server (http://api.gbif.org/v1/oai-pmh/registry) |
as | (character) What to return. One of "parsed" (default), or "raw" (raw text) |
... | Curl options passed on to |
Examples
## Not run: 
# arxiv
id("http://export.arxiv.org/oai2")

# GBIF - http://www.gbif.org/
id("http://api.gbif.org/v1/oai-pmh/registry")

# get back text instead of parsed
id("http://export.arxiv.org/oai2", as = "raw")
id("http://api.gbif.org/v1/oai-pmh/registry", as = "raw")

# curl options
library("httr")
id("http://export.arxiv.org/oai2", config = verbose())

## End(Not run)
List OAI-PMH identifiers
Description
List OAI-PMH identifiers
Usage
list_identifiers(
  url = "http://api.gbif.org/v1/oai-pmh/registry",
  prefix = "oai_dc",
  from = NULL,
  until = NULL,
  set = NULL,
  token = NULL,
  as = "df",
  ...
)
Arguments
url | (character) OAI-PMH base url. Defaults to the URL for arXiv's OAI-PMH server (http://export.arxiv.org/oai2) or GBIF's OAI-PMH server (http://api.gbif.org/v1/oai-pmh/registry) |
prefix | Specifies the metadata format that the records will be returned in. |
from | specifies that records returned must have been created/updated/deleted on or after this date. |
until | specifies that records returned must have been created/updated/deleted on or before this date. |
set | specifies the set that returned records must belong to. |
token | a token previously provided by the server to resume a request where it last left off. |
as | (character) What to return. One of "df" (for data.frame; default), "list", or "raw" (raw text) |
... | Curl options passed on to |
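The from and until values are date strings. A minimal base-R sketch of building them follows; it assumes the target service accepts either day granularity (YYYY-MM-DD) or the UTC seconds granularity (YYYY-MM-DDThh:mm:ssZ) defined in the OAI-PMH specification. Which granularity a given provider supports is reported in its Identify response. The variable names are illustrative only.

```r
# Day granularity: yesterday's date as YYYY-MM-DD
from_day <- format(Sys.Date() - 1, "%Y-%m-%d")

# Seconds granularity: a fixed instant rendered in UTC as YYYY-MM-DDThh:mm:ssZ
instant <- as.POSIXct("2018-05-01 00:00:00", tz = "UTC")
from_sec <- format(instant, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
# Two days later (POSIXct arithmetic is in seconds)
until_sec <- format(instant + 2 * 24 * 3600, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

from_sec   # "2018-05-01T00:00:00Z"
until_sec  # "2018-05-03T00:00:00Z"
```

Strings built this way would be passed as the from and until arguments.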
Examples
## Not run: 
# from
recently <- format(Sys.Date() - 1, "%Y-%m-%d")
list_identifiers(from = recently)

# from and until
list_identifiers(from = '2018-06-01T', until = '2018-06-14T')

# set parameter - here, using ANDS - Australian National Data Service
list_identifiers(from = '2018-09-01T', until = '2018-09-05T',
  set = "dataset_type:CHECKLIST")

## End(Not run)
List available metadata formats from various providers.
Description
List available metadata formats from various providers.
Usage
list_metadataformats(
  url = "http://api.gbif.org/v1/oai-pmh/registry",
  id = NULL,
  ...
)
Arguments
url | (character) OAI-PMH base url. Defaults to the URL for arXiv's OAI-PMH server (http://export.arxiv.org/oai2) or GBIF's OAI-PMH server (http://api.gbif.org/v1/oai-pmh/registry) |
id | The OAI-PMH identifier for the record. Optional. |
... | Curl options passed on to |
Examples
## Not run: 
list_metadataformats()

# no metadata formats for an identifier
list_metadataformats(id = "9da8a65a-1b9b-487c-a564-d184a91a2705")

# metadata formats available for an identifier
list_metadataformats(id = "ad7295e0-3261-4028-8308-b2047d51d408")

## End(Not run)
List records
Description
List records
Usage
list_records(
  url = "http://api.gbif.org/v1/oai-pmh/registry",
  prefix = "oai_dc",
  from = NULL,
  until = NULL,
  set = NULL,
  token = NULL,
  as = "df",
  ...
)
Arguments
url | (character) OAI-PMH base url. Defaults to the URL for arXiv's OAI-PMH server (http://export.arxiv.org/oai2) or GBIF's OAI-PMH server (http://api.gbif.org/v1/oai-pmh/registry) |
prefix | specifies the metadata format that the records will be returned in. Default: |
from | specifies that records returned must have been created/updated/deleted on or after this date. |
until | specifies that records returned must have been created/updated/deleted on or before this date. |
set | specifies the set that returned records must belong to. |
token | (character) a token previously provided by the server to resume a request where it last left off. Services often return at most 50 records per request. We will loop for you internally to get all the records you asked for. |
as | (character) What to return. One of "df" (for data.frame; default), "list", or "raw" (raw text) |
... | Curl options passed on to |
Examples
## Not run: 
# By default you get back a single data.frame
list_records(from = '2018-05-01T00:00:00Z', until = '2018-05-03T00:00:00Z')
list_records(from = '2018-05-01T', until = '2018-05-04T')

# Get a list
list_records(from = '2018-05-01T', until = '2018-05-04T', as = "list")

# Get raw text
list_records(from = '2018-05-01T', until = '2018-05-04T', as = "raw")

# Use a resumption token
# list_records(token =
#  "1443799900201,2015-09-01T00:00:00Z,2015-10-01T23:59:59Z,50,null,oai_dc")

## End(Not run)
List sets
Description
List sets
Usage
list_sets(
  url = "http://api.gbif.org/v1/oai-pmh/registry",
  token = NULL,
  as = "df",
  ...
)
Arguments
url | (character) OAI-PMH base url. Defaults to the URL for arXiv's OAI-PMH server (http://export.arxiv.org/oai2) or GBIF's OAI-PMH server (http://api.gbif.org/v1/oai-pmh/registry) |
token | (character) a token previously provided by the server to resume a request where it last left off |
as | (character) What to return. One of "df" (for data.frame; default), "list", or "raw" (raw text) |
... | Curl options passed on to |
Examples
## Not run: 
# Get back a data.frame
list_sets()

# Get back a list
list_sets(as = "list")

# Get back raw text
list_sets(as = "raw")

# curl options
library("httr")
list_sets(config = verbose())

## End(Not run)
Load an updated cache
Description
Load an updated cache
Usage
load_providers(path = NULL, envir = .GlobalEnv)
Arguments
path | location where cache is located. Leaving to NULL loads the version in the installed package |
envir | R environment to load data into. |
Details
Loads the data object providers into the global workspace.
Value
loads the object providers into the working space.
See Also
Examples
## Not run: 
# By default the new providers table goes to directory ".", so just
# load from there
update_providers()
load_providers(path=".")

# Loads the version in the package
load_providers()

## End(Not run)
Test if OAI-PMH service is available
Description
Silently test if OAI-PMH service is available under the URL provided.
Usage
oai_available(u, ...)
Arguments
u | base URL to OAI-PMH service |
... | other arguments passed to |
Value
TRUE if the service is available, FALSE otherwise.
Examples
## Not run: 
url_list <- list(
  archivesic="http://archivesic.ccsd.cnrs.fr/oai/oai.php",
  datacite = "http://oai.datacite.org/oai",
  # No OAI-PMH here
  google = "http://google.com"
)
sapply(url_list, oai_available)

## End(Not run)
Metadata providers data.frame.
Description
Metadata providers data.frame.
Value
A data.frame of three columns:
repo_name - Name of the OAI repository
base_url - Base URL of the OAI repository
oai_identifier - OAI identifier for the OAI repository
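Because the providers table is an ordinary data.frame, it can be searched with standard subsetting. The sketch below works on providers_demo, a stand-in with the three documented columns and placeholder rows (not real registry data), so that it runs without the package.

```r
# Stand-in for the providers table, with placeholder rows (not real data)
providers_demo <- data.frame(
  repo_name = c("Example Archive", "Sample Repository"),
  base_url = c("http://example.org/oai", "http://example.com/oai2"),
  oai_identifier = c("example.org", NA),
  stringsAsFactors = FALSE
)

# Find repositories by name and pull out their base URLs
hits <- providers_demo[grepl("archive", providers_demo$repo_name,
                             ignore.case = TRUE), ]
hits$base_url
```

A base URL found this way is the kind of value the url argument of the other functions expects.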
Update the locally stored OAI-PMH data providers table.
Description
Data comes from http://www.openarchives.org/Register/BrowseSites. It includes the oai-identifier (if they have one) and the base URL. The website also lists the name of each data provider, but the name is not included in the data pulled down here; you can grab the name using the example below.
Usage
update_providers(path = ".", ...)
Arguments
path | Path to put data in. |
... | Curl options passed on to |
Details
This table is scraped from http://www.openarchives.org/Register/BrowseSites. I would get it from http://www.openarchives.org/pmh/registry/ListFriends, but it does not include repository names.
This function updates the table for you. Does take a while though, so go get a coffee.
See Also
Examples
## Not run: 
update_providers()
load_providers()

## End(Not run)