Command line OAI-PMH harvester and client with built-in cache.

miku/metha

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a low-barrier mechanism for repository interoperability. Data Providers are repositories that expose structured metadata via OAI-PMH. Service Providers then make OAI-PMH service requests to harvest that metadata. -- https://www.openarchives.org/pmh/

The metha command line tools can gather information on OAI-PMH endpoints and harvest data incrementally. The goal of metha is to make it simple to get access to data; its focus is not to manage it.

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

The metha tool has been developed for project finc at Leipzig University Library (lab).

Why yet another OAI harvester?

  • I wanted to crawl Arxiv but found that existing tools would time out.
  • Some harvesters would start to download all records anew if I interrupted a running harvest.
  • There are many OAI endpoints out there. It is a widely used protocol and somewhat worth knowing.
  • I wanted something simple for the command line; also fast and robust. metha, as it is implemented now, is relatively robust and more efficient than requesting all records one by one (there is one annoyance which will hopefully be fixed soon).

How it works

The functionality is spread across a few different executables:

  • metha-sync for harvesting
  • metha-cat for viewing
  • metha-id for gathering data about endpoints
  • metha-ls for inspecting the local cache
  • metha-files for listing the associated files for a harvest

To harvest an endpoint in the default oai_dc format:

$ metha-sync http://export.arxiv.org/oai2
...

All downloaded files are written to a directory below a base directory. The base directory is ~/.cache/metha by default and can be adjusted with the METHA_DIR environment variable.

When the -dir flag is set, only the directory corresponding to a harvest is printed.

$ metha-sync -dir http://export.arxiv.org/oai2
/home/miku/.metha/I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky
$ METHA_DIR=/tmp/harvest metha-sync -dir http://export.arxiv.org/oai2
/tmp/harvest/I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky

The harvesting can be interrupted at any time and the HTTP client will automatically retry failed requests a few times before giving up.

Currently, there is a limitation which only allows harvesting data up to the end of the previous day. Example: If the current date were Thu Apr 21 14:28:10 CEST 2016, the harvester would request all data between the repository's earliest date and 2016-04-20 23:59:59.
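The upper bound of that window can be reproduced with a little date arithmetic. The following is only a sketch, not metha's own code: the helper name harvest_until is hypothetical, and the -d option assumes GNU date.

```shell
# Hypothetical helper: print the last second of the day before the given
# date (YYYY-MM-DD), i.e. the upper bound of the harvest window.
# Assumes GNU date (the -d option); -u avoids daylight saving surprises.
harvest_until() {
    day=$(date -u -d "$1 1 day ago" +%Y-%m-%d) || return 1
    printf '%s 23:59:59\n' "$day"
}

# For a harvest started on 2016-04-21:
harvest_until 2016-04-21
```

Records with a datestamp on the current day are therefore picked up by a harvest on the following day at the earliest.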

To stream the harvested XML data to stdout run:

$ metha-cat http://export.arxiv.org/oai2

You can emit records based on datestamp as well:

$ metha-cat -from 2016-01-01 http://export.arxiv.org/oai2

This will only stream records with a datestamp equal to or after 2016-01-01.

To just stream all data really fast, use find with zcat (or unpigz, as in the example) over the harvesting directory.

$ find $(metha-sync -dir http://export.arxiv.org/oai2) -name "*gz" | xargs unpigz -c

To display basic repository information:

$ metha-id http://export.arxiv.org/oai2

To list all harvested endpoints:

$ metha-ls

Further examples can be found in the metha man page:

$ man metha

Installation

Use a deb or rpm release, or the go tool:

$ go install -v github.com/miku/metha/cmd/...@latest

Limitations

Currently the endpoint URL, the format and the set are concatenated and base64 encoded to form the target directory, e.g.:

$ echo "U291bmRzI29haV9kYyNodHRwOi8vY29wYWMuamlzYy5hYy51ay9vYWktcG1o" | base64 -d
Sounds#oai_dc#http://copac.jisc.ac.uk/oai-pmh

If you have very long set names or a very long URL and the name of the target directory exceeds the filesystem limit (e.g. 255 bytes per file name on ext4), the harvest won't work.
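To check in advance whether a given combination would hit this limit, the naming scheme can be sketched in shell. The helper names are hypothetical, the set#format#url layout is taken from the decoded example above, and plain base64 is an assumption: the encoding variant metha actually uses may differ (for instance a URL-safe alphabet without padding).

```shell
# Hypothetical helper: build the directory name for a harvest from set,
# format and endpoint URL, following the "set#format#url" layout.
# Assumption: plain base64; metha's actual variant may differ.
harvest_dir_name() {
    printf '%s#%s#%s' "$1" "$2" "$3" | base64 | tr -d '\n'
}

# Print the name, or fail with a message if it is longer than a single
# path component may be on ext4 (255 bytes; base64 output is ASCII, so
# characters and bytes coincide).
check_dir_name() {
    name=$(harvest_dir_name "$1" "$2" "$3")
    if [ "${#name}" -gt 255 ]; then
        echo "directory name too long (${#name} bytes)" >&2
        return 1
    fi
    echo "$name"
}

check_dir_name "Sounds" "oai_dc" "http://copac.jisc.ac.uk/oai-pmh"
```

Since base64 turns every 3 input bytes into 4 output characters, the concatenated set, format and URL have to stay below roughly 190 bytes to fit.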

Harvesting Roulette

$ URL=$(shuf -n 1 <(curl -Lsf https://git.io/vKXFv)); metha-sync $URL && metha-cat $URL

In 0.1.27 a metha-fortune command was added, which fetches a random article description and displays it.

$ metha-fortune
Active Networking is concerned with the rapid definition and deployment of
innovative, but reliable and robust, networking services. Towards this end we
have developed a composite protocol and networking services architecture that
encourages re-use of protocol functions, is well defined, and facilitates
automatic checking of interfaces and protocol component properties. The
architecture has been used to implement common Internet protocols and services.
We will report on this work at the workshop.

    -- http://drops.dagstuhl.de/opus/phpoai/oai2.php

$ metha-fortune
In this paper we show that the Lempert property (i.e., the equality between the
Lempert function and the Carathéodory distance) holds in the tetrablock, a
bounded hyperconvex domain which is not biholomorphic to a convex domain. The
question whether such an equality holds was posed by Abouhajar et al. in J.
Geom. Anal. 17(4), 717–750 (2007).

    -- http://ruj.uj.edu.pl/oai/request

$ metha-fortune
I argue that Gödel's incompleteness theorem is much easier to understand when
thought of in terms of computers, and describe the writing of a computer
program which generates the undecidable Gödel sentence.

    -- http://quantropy.org/cgi/oai2

$ metha-fortune
Nigeria, a country in West Africa, sits on the Atlantic coast with a land area
of approximately 90 million hectares and a population of more than 140 million
people. The southern part of the country falls within the tropical rainforest
which has now been largely depleted and is in dire need of reforestation. About
10 percent of the land area was constituted into forest reserves for purposes
of conservation but this has suffered perturbations over the years to the
extent that what remains of the constituted forest reserves currently is less
than 4 percent of the country land area. As at today about 382,000 ha have been
reforested with indigenous and exotic species representing about 4 percent of
the remaining forest estate. Regrettably, funding of the Forestry sector in
Nigeria has been critically low, rendering reforestation programme near
impossible, especially in the last two decades. To revive the forestry sector
government at all levels must re-strategize and involve the local communities
as co-managers of the forest estates in order to create mutual dependence and
interaction in resource conservation.

    -- http://journal.reforestationchallenges.org/index.php/REFOR/oai

Scrape all metadata in a best-effort way

Use an endless loop with a timeout to get out of any hanging connections (which do happen). Example scrape, converted to JSON (326M records, 60+ GB: 2023-11-01-metha-oai.ndjson.zst).

$ while true; do \
    timeout 120 metha-sync -list | \
    shuf | \
    parallel -j 64 -I {} "metha-sync -base-dir ~/.cache/metha {}"; \
done

Alternatively, use a metha.service file to run harvests continuously.

metha stores harvested data in one file per interval; to combine all XML files into a single JSON file you can use xmlstream.go (adjust the harvest directory):

$ fd . '/data/.cache/metha' -e xml.gz | parallel unpigz -c | xmlstream -D

For notes on parallel processing of XML see: Faster XML processing in Go.

Errors this harvester can somewhat handle

  • responses with resumption tokens that lead to empty responses
  • gzipped responses that are not advertised as such
  • funny (illegal) control characters in XML responses
  • repositories that won't respond unless the dates are given with the exact granularity
  • repositories with endless token loops
  • repositories that do not support selective harvesting; use the -no-intervals flag
  • limited repositories, metha will try a few times with an exponential backoff
  • repositories which throw occasional HTTP errors although most of the responses look good; use the -ignore-http-errors flag
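For endpoints that still fail after metha's own retries, a similar backoff can be applied from the outside. This is a sketch under assumptions: retry_backoff is a hypothetical wrapper, not part of metha, and it does not reproduce metha's internal retry schedule.

```shell
# Hypothetical wrapper: retry_backoff MAX_ATTEMPTS BASE_DELAY CMD [ARGS...]
# Runs CMD until it succeeds or MAX_ATTEMPTS is reached, doubling the
# pause (in seconds) after every failed attempt.
retry_backoff() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while true; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1
        fi
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}

# Usage (not executed here):
#   retry_backoff 5 2 metha-sync http://export.arxiv.org/oai2
```

With five attempts and a base delay of two seconds, the wrapper pauses 2, 4, 8 and 16 seconds between attempts before giving up. Because a harvest can be interrupted and continued, a repeated metha-sync call picks up where the previous one stopped.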

Authors

Misc

Show formats of random repository:

$ shuf -n 1 <(curl -Lsf https://git.io/vKXFv) | xargs -I {} metha-id {} | jq .formats

A snippet from a 2010 publication:

The Open Archives Protocol for Metadata Harvesting (OAI-PMH) (Lagoze and van de Sompel, 2002) is currently implemented by more than 1,700 digital library repositories world-wide and enables the exchange of metadata via HTTP. -- Interweaving OAI-PMH Data Sources with the Linked Data Cloud

Metha elsewhere

Asciicast

