Tasks around metadata.

miku/siskin

Various tasks for heterogeneous metadata handling for project finc at Leipzig University Library. Based on luigi from Spotify.

We use a couple of scripts in the repository to harvest about twenty data sources of various flavors (FTPs, OAIs, HTTPs), mix and match CSV, XML and JSON, run conversions and deduplication to create a single file that is indexable and conforms to a customized VuFind SOLR schema, running on a unified index host serving part of the data in the online catalogs of partners.

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

Luigi (and other frameworks) allow dividing complex workflows into a set of tasks, which form a DAG. The task logic is implemented in Python, but it is easy to use external tools, e.g. via ExternalProgram or shellout. Luigi is workflow glue and scales up (HDFS) and down (local scheduler).
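To illustrate what "tasks that form a DAG" means, here is a minimal sketch using only the Python standard library (Python 3.9+ for graphlib). It is not luigi's API; the task names and their dependencies are hypothetical and only show that each task runs after the tasks it requires.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it requires.
REQUIRES = {
    "Harvest": set(),
    "Convert": {"Harvest"},
    "Licensing": set(),
    "Export": {"Convert", "Licensing"},
}


def execution_order(requires):
    """Return task names so that every task comes after its requirements."""
    return list(TopologicalSorter(requires).static_order())


order = execution_order(REQUIRES)
print(order)
```

A workflow framework such as luigi additionally checks which outputs already exist and skips those tasks; the sketch only covers the ordering.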

More on Luigi:

More about the project:


Install

$ pip install -U siskin

The siskin project includes a bunch of scripts that allow you to create, inspect or remove tasks or task artifacts.

Starting 02/2020, only Python 3 is supported.

Run taskchecksetup to see what additional tools might need to be installed (this is a manually curated list, not everything is required for every task).

$ taskchecksetup
ok      7z
ok      csvcut
ok      curl
ok      filterline
ok      flux.sh
ok      groupcover
ok      iconv
ok      iconv-chunks
ok      jq
ok      metha-sync
ok      pigz
ok      solrbulk
ok      span-import
ok      unzip
ok      wget
ok      xmllint
ok      yaz-marcdump
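A check of this kind can be approximated in a few lines of Python. The following is a sketch, not the actual taskchecksetup implementation: it only tests whether an executable of the given name is on the PATH, uses a subset of the tool names from the listing above, and the "missing" label is an assumption.

```python
import shutil

# Tool names taken from the taskchecksetup listing above (a subset).
TOOLS = ["7z", "csvcut", "curl", "jq", "pigz", "unzip", "wget", "xmllint"]


def check_tools(names):
    """Map each tool name to True if an executable of that name is on PATH."""
    return {name: shutil.which(name) is not None for name in names}


for name, found in check_tools(TOOLS).items():
    # "missing" is a made-up label for tools that were not found.
    print("{:<8}{}".format("ok" if found else "missing", name))
```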

Update

For siskin updates, a

$ pip install -U siskin

should suffice. If newer versions of external programs are required, then please update those manually (e.g. via your OS's package manager).

Run

List tasks:

$ tasknames

A task is an encapsulation of a processing step and can, in theory, be anything. Typical tasks are: fetching data from an FTP server, OAI endpoint or HTTP API, format conversions, filters or reports. Many tasks are parameterized by date (with the default often being today), which allows siskin to keep track of whether an artifact is up-to-date or not.
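The idea of date-parameterized artifacts can be sketched with plain Python. This is an illustration under assumptions, not siskin's code: the directory layout, the file name pattern and the helper names are hypothetical. The point is that the date is part of the output path, so "is this artifact up-to-date?" becomes "does the file for this date exist?".

```python
import datetime
import os
import tempfile


def artifact_path(base, task_name, date):
    """Build an output path that encodes the task name and its date parameter."""
    return os.path.join(base, task_name, "date-{}.txt".format(date.isoformat()))


def is_complete(base, task_name, date):
    """A task counts as done for a date if its artifact already exists."""
    return os.path.exists(artifact_path(base, task_name, date))


def run_if_needed(base, task_name, date):
    """Create the artifact only if it is missing; return True if work was done."""
    if is_complete(base, task_name, date):
        return False
    path = artifact_path(base, task_name, date)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("harvested on {}\n".format(date))
    return True


base = tempfile.mkdtemp()
today = datetime.date.today()
first = run_if_needed(base, "DOAJHarvest", today)
second = run_if_needed(base, "DOAJHarvest", today)
other = run_if_needed(base, "DOAJHarvest", today - datetime.timedelta(days=1))
print(first, second, other)  # True False True
```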

Run simple task:

$ taskdo DOAJHarvest

Documentation:

$ taskdocs | less -R

Remove artifacts of a task:

$ taskrm DOAJHarvest

Inspect the source code of a task:

$ taskinspect AILocalData
class AILocalData(AITask):
    """
    Extract a CSV about source, id, doi and institutions for deduplication.
    """
    date = ClosestDateParameter(default=datetime.date.today())
    batchsize = luigi.IntParameter(default=25000, significant=False)

    def requires(self):
        return AILicensing(date=self.date)
    ...

Create an aggregated file for finc

There are a couple of prerequisites:

  • siskin is installed
  • most additional tools are installed (or: output of taskchecksetup is mostly green)
  • credentials are configured in /etc/siskin/siskin.ini or ~/.config/siskin/siskin.ini
  • some static data (that cannot be accessed over the net) is put into place (and configured in siskin.ini)
  • sufficient disk space is available

The update process itself consists of various updates:

  • all data sources (crossref, doaj, ...) are updated, as needed (e.g. FTP is synced, OAI is harvested, API, ...)
  • the licensing data is fetched from AMSL

The dependency graph of these operations can become complex.

However, if everything is put into place, a single command will suffice:

$ taskdo AIUpdate --workers 4

This can be a long-running command (hours, days), depending on the state of the already cached data.

Note: Currently a jour fixe (the 15th of a month) is used as default for the licensing information (another task, called AMSLFilterConfigFreeze, should be run daily for this to work). The jour fixe can be overridden with the current information by passing a parameter to the AILicensing task:

$ taskdo AIUpdate --workers 4 --AILicensing-override
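How a jour fixe default might be resolved for a given date can be sketched as follows. This is an assumption for illustration: the function name is hypothetical and it picks the most recent 15th on or before the date; the actual resolution in siskin (via ClosestDateParameter) may differ.

```python
import datetime


def jour_fixe(date, day=15):
    """Return the most recent `day` of a month that is not after `date`."""
    if date.day >= day:
        return date.replace(day=day)
    # Step back into the previous month (also handles January -> December).
    last_of_previous = date.replace(day=1) - datetime.timedelta(days=1)
    return last_of_previous.replace(day=day)


print(jour_fixe(datetime.date(2018, 5, 25)))  # 2018-05-15
print(jour_fixe(datetime.date(2018, 5, 3)))   # 2018-04-15
print(jour_fixe(datetime.date(2018, 1, 1)))   # 2017-12-15
```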

Once the task is completed, the output of the two tasks:

  • AIExport (solr)
  • AIRedact (blob, currently microblob)

can be put into their respective data stores (e.g. via solrbulk).

Configuration

The siskin package harvests all kinds of data sources, some of which might be protected. All credentials and a few other configuration options go into a siskin.ini, either in /etc/siskin/ or ~/.config/siskin/. If both files are present, the local options take precedence.
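The precedence rule can be illustrated with Python's configparser, which lets values from later files override earlier ones. This is a sketch, not siskin's configuration loader: it assumes "local" means the per-user file, uses temporary files as stand-ins for the two locations, and the override value /tmp/siskin is made up.

```python
import configparser
import os
import tempfile


def load_config(paths):
    """Read INI files in order; values from later files override earlier ones."""
    parser = configparser.ConfigParser()
    parser.read(paths, encoding="utf-8")
    return parser


# Stand-ins for /etc/siskin/siskin.ini and ~/.config/siskin/siskin.ini.
tmp = tempfile.mkdtemp()
system_ini = os.path.join(tmp, "etc-siskin.ini")
user_ini = os.path.join(tmp, "user-siskin.ini")

with open(system_ini, "w", encoding="utf-8") as handle:
    handle.write("[core]\nhome = /var/siskin\n\n[jstor]\nftp-username = abc\n")
with open(user_ini, "w", encoding="utf-8") as handle:
    handle.write("[core]\nhome = /tmp/siskin\n")

config = load_config([system_ini, user_ini])
print(config.get("core", "home"))           # /tmp/siskin
print(config.get("jstor", "ftp-username"))  # abc
```

Options that appear only in the system-wide file stay visible, while options present in both files take the per-user value.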

Luigi uses a bit of configuration as well; put it under /etc/luigi/.

Completions on task names will save you typing and time, so put siskin_completion.sh under /etc/bash_completion.d or somewhere else.

$ tree etc
etc
├── bash_completion.d
│   └── siskin_completion.sh
├── luigi
│   ├── luigi.cfg
│   └── logging.ini
└── siskin
    └── siskin.ini

All configuration values can be inspected quickly with:

$ taskconfig
[core]
home = /var/siskin

[imslp]
listings-url = https://example.org/abc

[jstor]
ftp-username = abc
ftp-password = d3f

...

Software versioning

Since siskin works mostly on data, software versioning differs a bit, but we try to adhere to the following rules:

  • major changes: You need to recreate all your data from scratch.
  • minor changes: We added, renamed or removed at least one task. You will have to recreate a subset of the tasks to see the changes. You might need to change pipelines depending on those tasks, because they might not exist any more or have been renamed.
  • revision changes: A modification within existing tasks (e.g. bugfixes). You will have to recreate a subset of the tasks to see these changes, but no new task is introduced. No pipeline is broken that wasn't broken already.

These rules apply for version 0.2.0 and later. To see the current version, use:

$ taskversion
0.43.3
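The versioning rules above can be turned into a small helper that tells an operator what an upgrade implies. The function name and the wording of the messages are hypothetical; the mapping from version component to consequence follows the rules listed in this section.

```python
def classify_change(old, new):
    """Classify the change between two "major.minor.revision" version strings."""
    old_parts = [int(part) for part in old.split(".")]
    new_parts = [int(part) for part in new.split(".")]
    if new_parts[0] != old_parts[0]:
        return "major: recreate all data from scratch"
    if new_parts[1] != old_parts[1]:
        return "minor: tasks added, renamed or removed; recreate a subset"
    if new_parts[2] != old_parts[2]:
        return "revision: existing tasks modified; recreate a subset"
    return "unchanged"


print(classify_change("0.43.3", "0.43.4"))
print(classify_change("0.43.3", "0.44.0"))
print(classify_change("0.43.3", "1.0.0"))
```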

Schema changes

To remove all files of a certain format (due to schema changes or such), it helps if naming is uniform:

$ tasknames | grep IntermediateSchema | xargs -I {} taskrm {}
...

Apart from that, all upstream tasks need to be removed manually (consult the map) as this is not automatic yet.

Task dependencies

Inspect task dependencies with:

$ taskdeps JstorIntermediateSchema
  └─ JstorIntermediateSchema(date=2018-05-25)
      └─ AMSLService(date=2018-05-25, name=outboundservices:discovery)
      └─ JstorCollectionMapping(date=2018-05-25)
      └─ JstorIntermediateSchemaGenericCollection(date=2018-05-25)

Or visually via graphviz.

$ taskdeps-dot JstorIntermediateSchema | dot -Tpng > deps.png
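For readers unfamiliar with Graphviz input, the sketch below renders a dependency mapping as a DOT digraph that dot can turn into an image. It is not the implementation of taskdeps-dot and its output format is an assumption; the task names are the ones from the taskdeps example above, with parameters omitted.

```python
# Dependencies as shown in the taskdeps output above (parameters omitted).
DEPS = {
    "JstorIntermediateSchema": [
        "AMSLService",
        "JstorCollectionMapping",
        "JstorIntermediateSchemaGenericCollection",
    ],
}


def to_dot(deps):
    """Render a mapping of task -> required tasks as a Graphviz digraph."""
    lines = ["digraph deps {"]
    for task, required in deps.items():
        for name in required:
            lines.append('    "{}" -> "{}";'.format(task, name))
    lines.append("}")
    return "\n".join(lines)


dot = to_dot(DEPS)
print(dot)
```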

Evolving workflows

Development

To converge the project on a common format, run:

$ make imports style

This will fix import order and code style in-place. Requires isort and yapf to be installed. Should be executed under Python 3 only (as Python 2 isort seems to have differing opinions).

Other tools:

Naming conventions

Some conventions are enforced by tools (e.g. imports, yapf), but the following may be considered as well.

Task names and filenames

  • task class names that produce MARC21 should have suffix MARC, e.g. ArchiveMARC
  • task class names that produce intermediate schema files should have suffix IntermediateSchema, e.g. ArchiveIntermediateSchema
  • tasks for a single source should share a prefix, e.g. ArchiveMARC, ArchiveISSNList
  • source prefix names should follow the source names (e.g. site of publisher), in German: vorlagegetreu, e.g. DOAJHarvest, GallicaMARC
  • potentially long source names can be shortened, e.g. Umweltbibliothek can become UmBi... in umbi.py
  • it is recommended that the source file name follows the source name, e.g. DOAJ tasks live in doaj.py
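Uniform suffixes make task names machine-readable. As a small illustration (the helper name is hypothetical and not part of siskin), the output format of a task can be guessed from its class name alone:

```python
def produced_format(task_name):
    """Guess the output format of a task from its class name suffix."""
    if task_name.endswith("IntermediateSchema"):
        return "intermediate schema"
    if task_name.endswith("MARC"):
        return "MARC21"
    return None


for name in ["ArchiveMARC", "ArchiveIntermediateSchema", "ArchiveISSNList", "DOAJHarvest"]:
    print(name, "->", produced_format(name))
```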

Module docstrings for tasks (and scripts)

Rough examples:

# coding: utf-8
# pylint: ...
#
# Copyright 2019 ... GPL-3.0+ snippet
# ...
#
# @license GPL-3.0+ <http://spdx.org/licenses/GPL-3.0+>

"""
Source: Gallica
SID: 20
Ticket: #14793
Origin: OAI
Updates: monthly

Config:

[vkfilm]
input = /path/to/file
password = helloadmin
"""

Quoting style

  • use double quotes, if possible

Executable

  • if a module can be used as a standalone script, then it should include the following line as its first line:
#!/usr/bin/env python

Python 2/3 considerations

  • use six, if necessary
  • use __future__ imports if necessary
  • prefer io.open to raw open, e.g. the Python 2 builtin has no encoding keyword
  • string literals should be written with the u prefix (obsolete in Python 3, but required in Python 2)
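A short sketch of the io.open and u prefix recommendations, written so that it behaves the same on Python 2 and Python 3. The file path and the sample text are made up for the example.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.txt")

# io.open accepts the encoding keyword on both Python 2 and Python 3.
with io.open(path, "w", encoding="utf-8") as handle:
    # The u prefix keeps this a text (unicode) literal under Python 2 as well.
    handle.write(u"Universit\u00e4tsbibliothek Leipzig\n")

with io.open(path, "r", encoding="utf-8") as handle:
    text = handle.read()

print(len(text))
```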

Debugging

  • prefer logging over print statements
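A minimal sketch of what preferring logging over print can look like inside a processing step. The logger name and the convert function are hypothetical; only standard library calls are used.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s %(message)s")
logger = logging.getLogger("siskin.example")  # hypothetical logger name


def convert(records):
    """Pretend conversion step that reports progress through the logger."""
    logger.info("converting %d records", len(records))
    converted = [record.upper() for record in records]
    # Debug output stays silent unless the log level is lowered.
    logger.debug("first record: %s", converted[:1])
    return converted


result = convert(["a", "b"])
```

Unlike print statements, the log level and destination can be changed in one place (for luigi, e.g. through a logging.ini) without touching the task code.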

Open for discussion

  • one suffix for data acquisition tasks, e.g. Harvest, Get, Fetch, Download, ...

Deployment

A distribution can be created via Makefile.

$ make dist
$ tree dist/
dist/
└── siskin-0.62.0.tar.gz

The tarball can be installed via pip:

$ pip install siskin-0.62.0.tar.gz

If access to PyPI is possible, one can upload the tarball there with:

$ make upload

This in turn allows installing siskin via:

$ pip install -U siskin

on the target machine.

TODO

  • The naming of the scripts is a bit unfortunate: taskdo, taskcat, .... Maybe better: siskin run, siskin cat, siskin rm and so on.
  • Investigate pytest for testing tasks, given inputs.

Misc

A short video using luigi's on_success and on_failure handlers to make the processing audible.

