Type: Package
Title: General Purpose 'Oai-PMH' Services Client
Description: A general purpose client to work with any 'OAI-PMH' (Open Archives Initiative Protocol for 'Metadata' Harvesting) service. The 'OAI-PMH' protocol is described at http://www.openarchives.org/OAI/openarchivesprotocol.html. Functions are provided to work with the 'OAI-PMH' verbs: 'GetRecord', 'Identify', 'ListIdentifiers', 'ListMetadataFormats', 'ListRecords', and 'ListSets'.
Version: 0.4.0
License: MIT + file LICENSE
URL: https://docs.ropensci.org/oai/, https://github.com/ropensci/oai
BugReports: https://github.com/ropensci/oai/issues
VignetteBuilder: knitr
LazyData: true
Encoding: UTF-8
Imports: xml2 (≥ 1.0.0), httr (≥ 1.2.0), plyr (≥ 1.8.4), stringr (≥ 1.1.0), tibble (≥ 1.2)
Suggests: DBI, knitr, RSQLite, testthat, markdown, covr
RoxygenNote: 7.2.1
NeedsCompilation: no
Packaged: 2022-11-10 11:32:01 UTC; mbojan
Author: Scott Chamberlain [aut], Michal Bojanowski [aut, cre], National Science Centre [fnd] (Supported MB through grant 2012/07/D/HS6/01971, <https://ncn.gov.pl>)
Maintainer: Michal Bojanowski <michal2992@gmail.com>
Repository: CRAN
Date/Publication: 2022-11-10 16:10:02 UTC

OAI-PMH Client

Description

oai is an R client to work with OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) services, a protocol developed by the Open Archives Initiative (https://en.wikipedia.org/wiki/Open_Archives_Initiative). OAI-PMH uses XML data format transported over HTTP.

OAI-PMH Info

See the OAI-PMH V2 specification at http://www.openarchives.org/OAI/openarchivesprotocol.html

Implementation details

oai is built on xml2 and httr. In addition, we give back data.frame's whenever possible to make data comprehension, manipulation, and visualization easier. We also have functions to fetch a large directory of OAI-PMH services - it isn't exhaustive, but does contain a lot.

Paging

Instead of paging with, e.g., page and per_page parameters, OAI-PMH (optionally) uses resumptionTokens, each with an optional expiration date. A token can be used to request the next chunk of data if the first request did not reach the end. Often, OAI-PMH services limit each request to 50 records, but this may vary by provider. The API of this package is such that we run a while loop for you internally until we get all records. We may in the future expose, e.g., a limit parameter so you can say how many records you want, but we haven't done this yet.
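The token-driven loop described above can be sketched in a few lines. This is only an illustration of the idea, not the package's internal code: fetch_chunk() is a hypothetical stand-in for one HTTP request, and the chunk size of 50 and total of 120 are made-up numbers so that the sketch runs without network access.

```r
# Hypothetical stand-in for one request to an OAI-PMH service.
# It serves 120 fake identifiers in chunks of 50 and returns the token
# needed to ask for the next chunk (NULL when nothing is left).
fetch_chunk <- function(token = NULL, chunk_size = 50, total = 120) {
  start <- if (is.null(token)) 1 else as.integer(token)
  end <- min(start + chunk_size - 1, total)
  next_token <- if (end < total) as.character(end + 1) else NULL
  list(records = sprintf("id-%03d", start:end), token = next_token)
}

# Keep requesting chunks until the server stops sending a token.
harvest_all <- function(fetch = fetch_chunk) {
  out <- list()
  chunk <- fetch()
  out[[1]] <- chunk$records
  while (!is.null(chunk$token)) {
    chunk <- fetch(token = chunk$token)
    out[[length(out) + 1]] <- chunk$records
  }
  unlist(out)
}

ids <- harvest_all()
length(ids)
```

The caller never sees the tokens: three requests are made here, and only the concatenated result comes back.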

Acknowledgements

Michal Bojanowski's contributions were supported by the (Polish) National Science Centre (NCN) through grant 2012/07/D/HS6/01971.

Author(s)

Scott Chamberlain myrmecocystus@gmail.com

Michal Bojanowski michal2992@gmail.com

See Also

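Useful links:

https://docs.ropensci.org/oai/

https://github.com/ropensci/oai

Report bugs at https://github.com/ropensci/oai/issues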


Count OAI-PMH identifiers for a data provider.

Description

Count OAI-PMH identifiers for a data provider.

Usage

count_identifiers(url = "http://export.arxiv.org/oai2", prefix = "oai_dc", ...)

Arguments

url

(character) OAI-PMH base url. Defaults to the URL for arXiv's OAI-PMH server (http://export.arxiv.org/oai2) or GBIF's OAI-PMH server (http://api.gbif.org/v1/oai-pmh/registry)

prefix

Specifies the metadata format that the records will be returned in

...

Curl options passed on to GET

Details

Note that some OAI providers do not include the entry completeListSize (http://www.openarchives.org/OAI/openarchivesprotocol.html#FlowControl), in which case we return NA. This does not mean 0; it means the count is unknown.
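To make the NA case concrete, here is a minimal sketch of reading completeListSize from a response, assuming it appears as an attribute of the resumptionToken element as the specification describes. The helper complete_list_size() and both XML snippets are made up for illustration; this is not the package's own code.

```r
library(xml2)

# Return completeListSize as a number, or NA when the provider omits it.
complete_list_size <- function(xml_text) {
  doc <- read_xml(xml_text)
  # local-name() sidesteps the OAI-PMH default namespace
  node <- xml_find_first(doc, "//*[local-name()='resumptionToken']")
  if (inherits(node, "xml_missing")) return(NA_real_)
  as.numeric(xml_attr(node, "completeListSize"))
}

with_size <- '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <resumptionToken completeListSize="733" cursor="0">abc</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>'

without_size <- '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <resumptionToken>abc</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>'

n_known <- complete_list_size(with_size)
n_unknown <- complete_list_size(without_size)
```

When handling the result, test it with is.na() before doing arithmetic, since NA here carries the meaning "unknown" rather than "empty".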

Examples

## Not run: 
count_identifiers()

# curl options
# library("httr")
# count_identifiers(config = verbose())

## End(Not run)

Result dumpers

Description

Result dumpers are functions that handle the chunks of results from an OAI-PMH service "on the fly". Handling can include processing, writing to files or databases, etc.

Usage

dump_raw_to_txt(
  res,
  args,
  as,
  file_pattern = "oaidump",
  file_dir = ".",
  file_ext = ".xml"
)

dump_to_rds(
  res,
  args,
  as,
  file_pattern = "oaidump",
  file_dir = ".",
  file_ext = ".rds"
)

dump_raw_to_db(res, args, as, dbcon, table_name, field_name, ...)

Arguments

res

results, depends on as, not to be specified by the user

args

list, query arguments, not to be specified by the user

as

character, type of result to return, not to be specified by the user

file_pattern, file_dir, file_ext

character, respectively: initial part of the file name, directory name, and file extension used to create file names. These arguments are passed to the tempfile() arguments pattern, tmpdir, and fileext respectively.

dbcon

DBI-compliant database connection

table_name

character, name of the database table to write into

field_name

character, name of the field in the database table to write into

...

arguments passed to/from other functions

Details

Often the result of a request to an OAI-PMH service is so large that it is split into chunks that need to be requested separately using resumptionToken. By default, functions like list_identifiers() or list_records() request these chunks under the hood and return them all concatenated in a single R object. This is convenient but insufficient when dealing with large result sets that might not fit into RAM. A result dumper is a function that is called on each result chunk. Dumper functions can write chunks to files or databases, include initial pre-processing or extraction, and so on.

A result dumper needs to be a function that accepts at least the arguments res, args, and as. They are given values internally by the enclosing function. There may be additional arguments, including .... Dumpers should return NULL or a value that will be collected and returned by the function calling the dumper (e.g. list_records()).
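As an illustration of that contract, here is a minimal sketch of a user-written dumper. The function dump_chunk_size() is hypothetical and not part of the package; it keeps only the size of each raw chunk. It is called by hand on a made-up chunk so the sketch runs without network access.

```r
# A hypothetical dumper: it keeps only the size of each raw chunk
# instead of the chunk itself. `res`, `args` and `as` are supplied
# by the calling function; `...` absorbs anything extra.
dump_chunk_size <- function(res, args, as, ...) {
  stopifnot(as == "raw")
  nchar(res)
}

# Calling the dumper by hand on a fake chunk, to show the contract
fake_chunk <- "<OAI-PMH><ListIdentifiers/></OAI-PMH>"
size <- dump_chunk_size(res = fake_chunk, args = list(), as = "raw")
size
```

In a real harvest the function would be handed over with dumper = dump_chunk_size and as = "raw", following the pattern in the Examples, and the returned sizes would be collected by the calling function.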

Currently result dumpers can be used with the functions list_identifiers(), list_records(), and list_sets(). To use a dumper with one of these functions, pass it as the dumper argument and pass any additional arguments for the dumper in a list as the dumper_args argument.

See Examples. Below we provide more details on the dumpers currently implemented.

dump_raw_to_txt writes raw XML to text files. It requires as == "raw". File names are created using tempfile(). By default they are written in the current working directory and have the format oaidump*.xml, where * is a random string in hex.
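The naming scheme can be tried directly with base R's tempfile(), using the same three arguments the dumper forwards. In this sketch the directory is tempdir() rather than the dumper's default ".", purely to avoid cluttering the working directory.

```r
# Same three arguments the dumper forwards to tempfile()
out_dir <- tempdir()
fname <- tempfile(pattern = "oaidump", tmpdir = out_dir, fileext = ".xml")

basename(fname)                  # "oaidump" + random hex + ".xml"
file.exists(fname)               # FALSE: tempfile() only builds a name
writeLines("<record/>", fname)   # writing is what creates the file
file.exists(fname)
unlink(fname)                    # clean up
```

Because the middle part of each name is random, keep the vector of names returned by the harvesting call if the files need to be found again later.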

dump_to_rds saves results in an .rds file via saveRDS(). The type of object being saved is determined by the as argument. File names are generated in the same way as by dump_raw_to_txt, but with the default extension .rds
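Once chunks are on disk they can be read back and combined when needed. The sketch below assumes the chunks were saved as data frames (that is, as = "df") and fakes two such files with made-up names and contents, so it does not touch any OAI-PMH service.

```r
# Stand-ins for two chunks a harvest might have saved
out_dir <- file.path(tempdir(), "oai-rds-demo")
dir.create(out_dir, showWarnings = FALSE)
chunk1 <- data.frame(identifier = c("a", "b"), stringsAsFactors = FALSE)
chunk2 <- data.frame(identifier = "c", stringsAsFactors = FALSE)
saveRDS(chunk1, file.path(out_dir, "oaidump-1.rds"))
saveRDS(chunk2, file.path(out_dir, "oaidump-2.rds"))

# Read every dumped file back and stack the chunks
files <- list.files(out_dir, pattern = "^oaidump.*\\.rds$", full.names = TRUE)
combined <- do.call(rbind, lapply(files, readRDS))
nrow(combined)

unlink(out_dir, recursive = TRUE)  # clean up
```

For result sets too large for memory, the same loop over files can process one chunk at a time instead of calling rbind on all of them.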

dump_raw_to_db writes raw XML to a single text column of a table in a database. It requires as == "raw". The database connection dbcon should be a connection object as created by DBI::dbConnect() from package DBI. As such, it can connect to any database supported by DBI. The records are written to a field field_name in a table table_name using DBI::dbWriteTable(). If the table does not exist, it is created. If it does, the records are appended. Any additional arguments are passed to DBI::dbWriteTable()

Value

Dumpers should return NULL or a value that will be collected and returned by the function using the dumper.

dump_raw_to_txt returns the name of the created file.

dump_to_rds returns the name of the created file.

dump_raw_to_db returns NULL

References

OAI-PMH specification https://www.openarchives.org/OAI/openarchivesprotocol.html

See Also

Functions supporting the dumpers: list_identifiers(), list_sets(), and list_records()

Examples

## Not run: 

### Dumping raw XML to text files

# This will write a set of XML files to a temporary directory
fnames <- list_identifiers(from="2018-06-01T",
                           until="2018-06-14T",
                           as="raw",
                           dumper=dump_raw_to_txt,
                           dumper_args=list(file_dir=tempdir()))
# vector of file names created
str(fnames)
all( file.exists(fnames) )
# clean-up
unlink(fnames)


### Dumping raw XML to a database

# Connect to in-memory SQLite database
con <- DBI::dbConnect(RSQLite::SQLite(), dbname=":memory:")
# Harvest and dump the results into field "bar" of table "foo"
list_identifiers(from="2018-06-01T",
                 until="2018-06-14T",
                 as="raw",
                 dumper=dump_raw_to_db,
                 dumper_args=list(dbcon=con,
                                  table_name="foo",
                                  field_name="bar") )
# Count records, should be 101
DBI::dbGetQuery(con, "SELECT count(*) as no_records FROM foo")

DBI::dbDisconnect(con)

## End(Not run)

Get records

Description

Get records

Usage

get_records(
  ids,
  prefix = "oai_dc",
  url = "http://api.gbif.org/v1/oai-pmh/registry",
  as = "parsed",
  ...
)

Arguments

ids

The OAI-PMH identifier for the record. One or more. Required.

prefix

specifies the metadata format that the records will be returned in. Default: oai_dc

url

(character) OAI-PMH base url. Defaults to the URL for arXiv's OAI-PMH server (http://export.arxiv.org/oai2) or GBIF's OAI-PMH server (http://api.gbif.org/v1/oai-pmh/registry)

as

(character) What to return. One of "parsed" (default), or "raw" (raw text)

...

Curl options passed on to GET

Details

There is a finite set of result formats, determined by the OAI prefix. We will provide parsers as we have time, and as users express interest. For prefix types we have parsers for, we return a list of data.frame's: for each identifier, one data.frame for the header bits of data, and one data.frame for the metadata bits of data.

For prefixes we don't have parsers for, we fall back to returning raw XML, so you can at least parse the XML yourself.

Because some XML nodes are duplicated, we join the values of duplicated node names together, separated by a semicolon (;) with no spaces. You can separate them yourself easily.
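Splitting the joined values back apart takes one call to strsplit(). The data frame below is a made-up stand-in for one parsed metadata row; the column names and values are illustrative assumptions, not output from a real provider.

```r
# A made-up metadata row with two creators and three subjects
meta <- data.frame(
  title = "An example record",
  creator = "Doe, Jane;Roe, Richard",
  subject = "ecology;botany;taxonomy",
  stringsAsFactors = FALSE
)

# fixed = TRUE treats ";" literally rather than as a regular expression
creators <- strsplit(meta$creator, ";", fixed = TRUE)[[1]]
subjects <- strsplit(meta$subject, ";", fixed = TRUE)[[1]]
creators
subjects
```

One caveat with this approach: a value that itself contains a semicolon will also be split, so the raw XML is the safer source when that matters.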

Value

a named list of data.frame's, or lists, or raw text

Examples

## Not run: 
get_records("87832186-00ea-44dd-a6bf-c2896c4d09b4")

ids <- c("87832186-00ea-44dd-a6bf-c2896c4d09b4",
   "d981c07d-bc43-40a2-be1f-e786e25106ac")
(res <- get_records(ids))
lapply(res, "[[", "header")
lapply(res, "[[", "metadata")
do.call(rbind, lapply(res, "[[", "header"))
do.call(rbind, lapply(res, "[[", "metadata"))

# Get raw text
get_records("d981c07d-bc43-40a2-be1f-e786e25106ac", as = "raw")

# from arxiv.org
get_records("oai:arXiv.org:0704.0001", url = "http://export.arxiv.org/oai2")

## End(Not run)

Identify the OAI-PMH service for each data provider.

Description

Identify the OAI-PMH service for each data provider.

Usage

id(url, as = "parsed", ...)

Arguments

url

(character) OAI-PMH base url. Defaults to the URL for arXiv's OAI-PMH server (http://export.arxiv.org/oai2) or GBIF's OAI-PMH server (http://api.gbif.org/v1/oai-pmh/registry)

as

(character) What to return. One of "parsed" (default), or "raw" (raw text)

...

Curl options passed on to GET

Examples

## Not run: 
# arxiv
id("http://export.arxiv.org/oai2")

# GBIF - http://www.gbif.org/
id("http://api.gbif.org/v1/oai-pmh/registry")

# get back text instead of parsed
id("http://export.arxiv.org/oai2", as = "raw")
id("http://api.gbif.org/v1/oai-pmh/registry", as = "raw")

# curl options
library("httr")
id("http://export.arxiv.org/oai2", config = verbose())

## End(Not run)

List OAI-PMH identifiers

Description

List OAI-PMH identifiers

Usage

list_identifiers(
  url = "http://api.gbif.org/v1/oai-pmh/registry",
  prefix = "oai_dc",
  from = NULL,
  until = NULL,
  set = NULL,
  token = NULL,
  as = "df",
  ...
)

Arguments

url

(character) OAI-PMH base url. Defaults to the URL for arXiv's OAI-PMH server (http://export.arxiv.org/oai2) or GBIF's OAI-PMH server (http://api.gbif.org/v1/oai-pmh/registry)

prefix

Specifies the metadata format that the records will be returned in.

from

specifies that records returned must have been created/updated/deleted on or after this date.

until

specifies that records returned must have been created/updated/deleted on or before this date.

set

specifies the set that returned records must belong to.

token

a token previously provided by the server to resume a request where it last left off.

as

(character) What to return. One of "df" (for data.frame; default), "list", or "raw" (raw text)

...

Curl options passed on to GET

Examples

## Not run: 
# from
recently <- format(Sys.Date() - 1, "%Y-%m-%d")
list_identifiers(from = recently)

# from and until
list_identifiers(from = '2018-06-01T', until = '2018-06-14T')

# set parameter - here, using ANDS - Australian National Data Service
list_identifiers(from = '2018-09-01T', until = '2018-09-05T',
  set = "dataset_type:CHECKLIST")

## End(Not run)
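The from and until values are OAI-PMH UTC datestamps, which the protocol writes either as YYYY-MM-DD or as YYYY-MM-DDThh:mm:ssZ. The helper below is a hypothetical convenience for building such strings from R date-time objects; neither oai_datestamp() nor its granularity argument is part of the package.

```r
# Format a date or date-time as an OAI-PMH UTC datestamp.
# `granularity` is a hypothetical argument of this helper, not of the package.
oai_datestamp <- function(x, granularity = c("day", "second")) {
  granularity <- match.arg(granularity)
  fmt <- if (granularity == "day") "%Y-%m-%d" else "%Y-%m-%dT%H:%M:%SZ"
  format(as.POSIXct(x, tz = "UTC"), fmt, tz = "UTC")
}

t0 <- as.POSIXct("2018-05-01 12:30:00", tz = "UTC")
day_stamp <- oai_datestamp(t0)
second_stamp <- oai_datestamp(t0, "second")
day_stamp
second_stamp
```

Which of the two forms a given provider accepts is declared in its Identify response, so day granularity is the safer choice when in doubt.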

List available metadata formats from various providers.

Description

List available metadata formats from various providers.

Usage

list_metadataformats(
  url = "http://api.gbif.org/v1/oai-pmh/registry",
  id = NULL,
  ...
)

Arguments

url

(character) OAI-PMH base url. Defaults to the URL for arXiv's OAI-PMH server (http://export.arxiv.org/oai2) or GBIF's OAI-PMH server (http://api.gbif.org/v1/oai-pmh/registry)

id

The OAI-PMH identifier for the record. Optional.

...

Curl options passed on to GET

Examples

## Not run: 
list_metadataformats()

# no metadata formats for an identifier
list_metadataformats(id = "9da8a65a-1b9b-487c-a564-d184a91a2705")

# metadata formats available for an identifier
list_metadataformats(id = "ad7295e0-3261-4028-8308-b2047d51d408")

## End(Not run)

List records

Description

List records

Usage

list_records(
  url = "http://api.gbif.org/v1/oai-pmh/registry",
  prefix = "oai_dc",
  from = NULL,
  until = NULL,
  set = NULL,
  token = NULL,
  as = "df",
  ...
)

Arguments

url

(character) OAI-PMH base url. Defaults to the URL for arXiv's OAI-PMH server (http://export.arxiv.org/oai2) or GBIF's OAI-PMH server (http://api.gbif.org/v1/oai-pmh/registry)

prefix

specifies the metadata format that the records will be returned in. Default: oai_dc

from

specifies that records returned must have been created/updated/deleted on or after this date.

until

specifies that records returned must have been created/updated/deleted on or before this date.

set

specifies the set that returned records must belong to.

token

(character) a token previously provided by the server to resume a request where it last left off. 50 is max number of records returned. We will loop for you internally to get all the records you asked for.

as

(character) What to return. One of "df" (for data.frame; default), "list", or "raw" (raw text)

...

Curl options passed on to GET

Examples

## Not run: 
# By default you get back a single data.frame
list_records(from = '2018-05-01T00:00:00Z', until = '2018-05-03T00:00:00Z')
list_records(from = '2018-05-01T', until = '2018-05-04T')

# Get a list
list_records(from = '2018-05-01T', until = '2018-05-04T', as = "list")

# Get raw text
list_records(from = '2018-05-01T', until = '2018-05-04T', as = "raw")

# Use a resumption token
# list_records(token =
#  "1443799900201,2015-09-01T00:00:00Z,2015-10-01T23:59:59Z,50,null,oai_dc")

## End(Not run)

List sets

Description

List sets

Usage

list_sets(
  url = "http://api.gbif.org/v1/oai-pmh/registry",
  token = NULL,
  as = "df",
  ...
)

Arguments

url

(character) OAI-PMH base url. Defaults to the URL for arXiv's OAI-PMH server (http://export.arxiv.org/oai2) or GBIF's OAI-PMH server (http://api.gbif.org/v1/oai-pmh/registry)

token

(character) a token previously provided by the server to resume a request where it last left off

as

(character) What to return. One of "df" (for data.frame; default), "list", or "raw" (raw text)

...

Curl options passed on to GET

Examples

## Not run: 
# Get back a data.frame
list_sets()

# Get back a list
list_sets(as = "list")

# Get back raw text
list_sets(as = "raw")

# curl options
library("httr")
list_sets(config = verbose())

## End(Not run)

Load an updated cache

Description

Load an updated cache

Usage

load_providers(path = NULL, envir = .GlobalEnv)

Arguments

path

Location of the cache. Leaving it as NULL loads the version in the installed package.

envir

R environment to load data into.

Details

Loads the data object providers into the global workspace.

Value

loads the object providers into the working space.

See Also

update_providers()

Examples

## Not run: 
# By default the new providers table goes to directory ".", so just
# load from there
update_providers()
load_providers(path=".")

# Loads the version in the package
load_providers()

## End(Not run)

Test if OAI-PMH service is available

Description

Silently test if OAI-PMH service is available under the URL provided.

Usage

oai_available(u, ...)

Arguments

u

base URL to OAI-PMH service

...

other arguments passed to id()

Value

TRUE if the service is available, FALSE otherwise.

Examples

## Not run: 
url_list <- list(
  archivesic="http://archivesic.ccsd.cnrs.fr/oai/oai.php",
  datacite = "http://oai.datacite.org/oai",
  # No OAI-PMH here
  google = "http://google.com"
)

sapply(url_list, oai_available)

## End(Not run)
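A "silent" availability test of this kind is commonly written as a tryCatch() wrapper that turns success or failure of a probing call into TRUE or FALSE. The sketch below shows that general pattern only; it is not the package's actual implementation, and is_available() and the two stub probes are hypothetical names chosen so the code runs without network access.

```r
# Turn "does this call succeed?" into TRUE/FALSE without printing anything.
# `probe` is a hypothetical function that errors when the service is unusable.
is_available <- function(u, probe) {
  tryCatch({
    suppressWarnings(suppressMessages(probe(u)))
    TRUE
  }, error = function(e) FALSE)
}

# Stub probes so the sketch runs without network access
always_ok <- function(u) invisible(u)
always_fails <- function(u) stop("not an OAI-PMH endpoint: ", u)

ok <- is_available("http://example.org/oai", always_ok)
bad <- is_available("http://example.org/", always_fails)
ok
bad
```

With a real probe in place of the stubs, for example a function that sends an Identify request, the same wrapper can be mapped over a list of URLs with sapply().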

Metadata providers data.frame.

Description

Metadata providers data.frame.

Value

A data.frame of three columns:


Update the locally stored OAI-PMH data providers table.

Description

Data comes from http://www.openarchives.org/Register/BrowseSites. It includes the oai-identifier (if they have one) and the base URL. The website also lists the name of each data provider; the name is not included in the data pulled down here, but you can grab it using the example below.

Usage

update_providers(path = ".", ...)

Arguments

path

Path to put data in.

...

Curl options passed on to httr::GET()

Details

This table is scraped from http://www.openarchives.org/Register/BrowseSites. I would get it from http://www.openarchives.org/pmh/registry/ListFriends, but it does not include repository names.

This function updates the table for you. It does take a while though, so go get a coffee.

See Also

load_providers()

Examples

## Not run: 
update_providers()
load_providers()

## End(Not run)
