miku/siskin

Tasks around metadata.

License

GPL-3.0 license

siskin

Various tasks for heterogeneous metadata handling for project finc at Leipzig University Library. Based on luigi from Spotify.

We use a couple of scripts in the repository to harvest about twenty data sources of various flavors (FTPs, OAIs, HTTPs), mix and match CSV, XML and JSON, and run conversions and deduplication to create a single file that is indexable and conforms to a customized VuFind SOLR schema. The resulting index runs on a unified index host and serves part of the data in the online catalogs of partners.

Overview in a few markdown slides.

Luigi (and other frameworks) allow you to divide complex workflows into a set of tasks, which form a DAG. The task logic is implemented in Python, but it is easy to use external tools, e.g. via ExternalProgram or shellout. Luigi is workflow glue and scales up (HDFS) and down (local scheduler).
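To make the DAG idea concrete, here is a minimal, library-free sketch of how a scheduler might resolve task dependencies. It is not luigi's implementation: the class `Task`, the function `run_all` and the task names are hypothetical and only borrow luigi's `requires`/`complete`/`run` vocabulary.

```python
# Illustrative sketch only: a tiny stand-in for a workflow scheduler.
# Not luigi's API; Task, run_all and the task names are hypothetical.


class Task:
    """A processing step that may depend on other tasks."""

    def __init__(self, name, requires=()):
        self.name = name
        self.requires = list(requires)
        self.done = False

    def complete(self):
        # A real task would check whether its output artifact exists.
        return self.done

    def run(self):
        self.done = True


def run_all(task, log):
    """Run dependencies depth-first, then the task itself; skip completed tasks."""
    if task.complete():
        return
    for dep in task.requires:
        run_all(dep, log)
    task.run()
    log.append(task.name)


harvest = Task("Harvest")
convert = Task("Convert", requires=[harvest])
dedup = Task("Deduplicate", requires=[convert])
export = Task("Export", requires=[dedup, harvest])

order = []
run_all(export, order)
print(order)
```

Because completed tasks are skipped, `Harvest` runs only once even though two tasks depend on it.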

Install

$ pip install -U siskin

The siskin project includes a bunch of scripts that allow you to create, inspect or remove tasks or task artifacts.

Starting 02/2020, only Python 3 is supported.

Run taskchecksetup to see what additional tools might need to be installed (this is a manually curated list, not everything is required for every task).

$ taskchecksetup
ok      7z
ok      csvcut
ok      curl
ok      filterline
ok      flux.sh
ok      groupcover
ok      iconv
ok      iconv-chunks
ok      jq
ok      metha-sync
ok      pigz
ok      solrbulk
ok      span-import
ok      unzip
ok      wget
ok      xmllint
ok      yaz-marcdump
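A check of this kind can be approximated with the standard library. The sketch below is not the taskchecksetup implementation: the function `check_tools`, the shortened tool list and the "missing" label are assumptions made for illustration.

```python
import shutil


def check_tools(names):
    """Map each program name to True if it is found on PATH."""
    return {name: shutil.which(name) is not None for name in names}


# A hypothetical subset of the manually curated tool list.
tools = ["7z", "curl", "jq", "unzip", "wget", "xmllint"]
status = check_tools(tools)

for name in tools:
    # The "missing" label is an assumption; the README only shows "ok" lines.
    label = "ok" if status[name] else "missing"
    print("{:<8}{}".format(label, name))
```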

Update

For siskin updates, a

$ pip install -U siskin

should suffice. If newer versions of external programs are required, then please update those manually (e.g. via your OS's package manager).

Run

List tasks:

$ tasknames

A task is an encapsulation of a processing step and can, in theory, be anything. Typical tasks are: fetching data from an FTP server, an OAI endpoint or an HTTP API, format conversions, filters or reports. Many tasks are parameterized by date (with the default often being today), which allows siskin to keep track of whether an artifact is up-to-date or not.

Run a simple task:
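One way to picture the date parameter: if the artifact path is derived from the task name and its date, then "is it up-to-date?" reduces to "does the file for this date exist?". The sketch below assumes a made-up directory layout; `artifact_path` and `is_up_to_date` are hypothetical helpers, not siskin functions.

```python
import datetime
import os
import tempfile


def artifact_path(base, task_name, date):
    """Build an output path that encodes the task name and its date parameter."""
    filename = "date-{}.txt".format(date.isoformat())
    return os.path.join(base, task_name, filename)


def is_up_to_date(base, task_name, date):
    """An artifact counts as up-to-date if the file for that date exists."""
    return os.path.exists(artifact_path(base, task_name, date))


base = tempfile.mkdtemp()
today = datetime.date.today()
yesterday = today - datetime.timedelta(days=1)

# Simulate a run that happened yesterday.
path = artifact_path(base, "DOAJHarvest", yesterday)
os.makedirs(os.path.dirname(path))
with open(path, "w") as handle:
    handle.write("harvested data\n")

print(is_up_to_date(base, "DOAJHarvest", yesterday))
print(is_up_to_date(base, "DOAJHarvest", today))
```

With a default of today, yesterday's artifact does not satisfy today's task, so the task would run again.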

$ taskdo DOAJHarvest

Documentation:

$ taskdocs | less -R

Remove artefacts of a task:

$ taskrm DOAJHarvest

Inspect the source code of a task:

$ taskinspect AILocalData
class AILocalData(AITask):
    """
    Extract a CSV about source, id, doi and institutions for deduplication.
    """
    date = ClosestDateParameter(default=datetime.date.today())
    batchsize = luigi.IntParameter(default=25000, significant=False)

    def requires(self):
        return AILicensing(date=self.date)
    ...

Create an aggregated file for finc

There are a couple of prerequisites:

siskin is installed
most additional tools are installed (or: output of taskchecksetup is mostly green)
credentials are configured in /etc/siskin/siskin.ini or ~/.config/siskin/siskin.ini
some static data (that cannot be accessed over the net) is put into place (and configured in siskin.ini)
sufficient disk space is available

The update process itself consists of various updates:

all data sources (crossref, doaj, ...) are updated, as needed (e.g. FTP is synced, OAI is harvested, API, ...)
the licensing data is fetched from AMSL

The dependency graph of these operations can become complex.

However, if everything is put into place, a single command will suffice:

$ taskdo AIUpdate --workers 4

This can be a long-running command (hours, days), depending on the state of the already cached data.

Note: Currently a jour fixe (the 15th of a month) is used as default for the licensing information (another task, called AMSLFilterConfigFreeze, should be run daily for this to work). The jour fixe can be overridden with the current information by passing a parameter to the AILicensing task:

$ taskdo AIUpdate --workers 4 --AILicensing-override
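As an illustration of the jour fixe idea, the sketch below maps any date to the most recent 15th on or before it. Whether siskin picks the most recent or the nearest 15th is not stated here, so the rule itself is an assumption, and `last_jour_fixe` is a hypothetical helper.

```python
import datetime


def last_jour_fixe(date, day=15):
    """Return the most recent `day` of a month on or before `date`.

    Assumption: "most recent" is used; siskin's actual rule may differ.
    """
    if date.day >= day:
        return date.replace(day=day)
    # Step back into the previous month via the last day of that month.
    last_of_previous = date.replace(day=1) - datetime.timedelta(days=1)
    return last_of_previous.replace(day=day)


print(last_jour_fixe(datetime.date(2018, 5, 25)))
print(last_jour_fixe(datetime.date(2018, 5, 3)))
```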

Once the task is completed, the output of the two tasks:

AIExport (solr)
AIRedact (blob, currently microblob)

can be put into their respective data stores (e.g. via solrbulk).

Configuration

The siskin package harvests all kinds of data sources, some of which might be protected. All credentials and a few other configuration options go into a siskin.ini, either in /etc/siskin/ or ~/.config/siskin/. If both files are present, the local options take precedence.
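The precedence rule can be sketched with the standard library's `configparser`, which reads a list of files in order and lets later files override earlier ones. This is not necessarily how siskin loads its configuration; the temporary file names and the `/tmp/siskin` value are made up, and `load_config` is a hypothetical helper.

```python
import configparser
import os
import tempfile


def load_config(paths):
    """Read INI files in order; values from later files override earlier ones."""
    parser = configparser.ConfigParser()
    parser.read(paths)  # files that do not exist are silently skipped
    return parser


# Stand-ins for /etc/siskin/siskin.ini and ~/.config/siskin/siskin.ini.
tmp = tempfile.mkdtemp()
system_ini = os.path.join(tmp, "system.ini")
user_ini = os.path.join(tmp, "user.ini")

with open(system_ini, "w") as handle:
    handle.write("[core]\nhome = /var/siskin\n\n[jstor]\nftp-username = abc\n")
with open(user_ini, "w") as handle:
    handle.write("[core]\nhome = /tmp/siskin\n")

config = load_config([system_ini, user_ini])
print(config.get("core", "home"))
print(config.get("jstor", "ftp-username"))
```

The local file only overrides the keys it defines; everything else falls through to the system-wide file.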

Luigi uses a bit of configuration as well; put it under /etc/luigi/.

Completions on task names will save you typing and time, so put siskin_completion.sh under /etc/bash_completion.d or somewhere else.

$ tree etc
etc
├── bash_completion.d
│   └── siskin_completion.sh
├── luigi
│   ├── luigi.cfg
│   └── logging.ini
└── siskin
    └── siskin.ini

All configuration values can be inspected quickly with:

$ taskconfig
[core]
home = /var/siskin

[imslp]
listings-url = https://example.org/abc

[jstor]
ftp-username = abc
ftp-password = d3f

...

Software versioning

Since siskin works mostly on data, software versioning differs a bit, but we try to adhere to the following rules:

major changes: You need to recreate all your data from scratch.
minor changes: We added, renamed or removed at least one task. You will have to recreate a subset of the tasks to see the changes. You might need to change pipelines depending on those tasks, because they might not exist any more or have been renamed.
revision changes: A modification within existing tasks (e.g. bugfixes). You will have to recreate a subset of the tasks to see these changes, but no new task is introduced. No pipeline is broken that wasn't already.

These rules apply for version 0.2.0 and later. To see the current version, use:

$ taskversion
0.43.3
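The versioning rules can be read as a small decision procedure. The sketch below compares two `major.minor.revision` strings and reports which rule applies; `describe_upgrade` and its return strings are hypothetical and only paraphrase the rules above.

```python
def describe_upgrade(old, new):
    """Classify an upgrade between two 'major.minor.revision' version strings."""
    old_parts = tuple(int(part) for part in old.split("."))
    new_parts = tuple(int(part) for part in new.split("."))
    if new_parts[0] != old_parts[0]:
        return "major: recreate all data from scratch"
    if new_parts[1] != old_parts[1]:
        return "minor: tasks added, renamed or removed; recreate a subset, check pipelines"
    if new_parts[2] != old_parts[2]:
        return "revision: existing tasks modified; recreate a subset"
    return "no change"


print(describe_upgrade("0.43.3", "0.43.4"))
print(describe_upgrade("0.43.3", "0.44.0"))
print(describe_upgrade("0.43.3", "1.0.0"))
```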

Schema changes

To remove all files of a certain format (due to schema changes or such), it helps if naming is uniform:

$ tasknames | grep IntermediateSchema | xargs -I {} taskrm {}
...

Apart from that, all upstream tasks need to be removed manually (consult the map) as this is not automatic yet.

Task dependencies

Inspect task dependencies with:

$ taskdeps JstorIntermediateSchema
  └─ JstorIntermediateSchema(date=2018-05-25)
      └─ AMSLService(date=2018-05-25, name=outboundservices:discovery)
      └─ JstorCollectionMapping(date=2018-05-25)
      └─ JstorIntermediateSchemaGenericCollection(date=2018-05-25)

Or visually via graphviz.

$ taskdeps-dot JstorIntermediateSchema | dot -Tpng > deps.png
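For readers unfamiliar with Graphviz input, the sketch below turns a dependency mapping into DOT text that `dot -Tpng` can render. It is not the taskdeps-dot implementation: `to_dot` is a hypothetical helper, and drawing edges from a task to its dependencies is an assumption.

```python
def to_dot(deps):
    """Render a {task: [dependencies]} mapping as a Graphviz digraph."""
    lines = ["digraph deps {"]
    for task in sorted(deps):
        for dep in sorted(deps[task]):
            # Assumption: edges point from a task to what it requires.
            lines.append('  "{}" -> "{}";'.format(task, dep))
    lines.append("}")
    return "\n".join(lines)


deps = {
    "JstorIntermediateSchema": [
        "AMSLService",
        "JstorCollectionMapping",
        "JstorIntermediateSchemaGenericCollection",
    ],
}

dot_source = to_dot(deps)
print(dot_source)
```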

Evolving workflows

Development

To converge the project on a common format run:

$ make imports style

This will fix import order and code style in-place. Requires isort and yapf to be installed. Should be executed under Python 3 only (as Python 2 isort seems to have differing opinions).

Other tools:

use pylint, currently 9.18/10 with many errors ignored, maybe with a git commit hook
use pytest, pytest-cov, coverage at 9%

Naming conventions

Some conventions are enforced by tools (e.g. imports, yapf), but the following may be considered as well.

Task names and filenames

task class names that produce MARC21 should have suffix MARC, e.g. ArchiveMARC
task class names that produce intermediate schema files should have suffix IntermediateSchema, e.g. ArchiveIntermediateSchema
tasks for a single source should share a prefix, e.g. ArchiveMARC, ArchiveISSNList
source prefix names should follow the source names (e.g. site of publisher), in German: vorlagegetreu (true to the original), e.g. DOAJHarvest, GallicaMARC
potentially long source names can be shortened, e.g. Umweltbibliothek can become UmBi... in umbi.py
it is recommended that the source file name follows the source name, e.g. DOAJ tasks live in doaj.py
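Uniform suffixes make task names machine-readable. As a small illustration, the sketch below infers the output format from a task class name; `output_kind` is a hypothetical helper, not part of siskin.

```python
def output_kind(task_name):
    """Guess the output format of a task from its class name suffix."""
    if task_name.endswith("IntermediateSchema"):
        return "intermediate schema"
    if task_name.endswith("MARC"):
        return "MARC21"
    return "other"


for name in ["ArchiveMARC", "ArchiveIntermediateSchema", "ArchiveISSNList"]:
    print(name, "->", output_kind(name))
```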

Module docstrings for tasks (and scripts)

Rough examples:

# coding: utf-8
# pylint: ...
#
# Copyright 2019 ... GPL-3.0+ snippet
# ...
# @license GPL-3.0+ <http://spdx.org/licenses/GPL-3.0+>
"""
Source: Gallica
SID: 20
Ticket: #14793
Origin: OAI
Updates: monthly

Config:

[vkfilm]
input = /path/to/file
password = helloadmin
"""

Quoting style

use double quotes, if possible

Executable

if a module can be used as standalone script, then it should include the following line as first line:

#!/usr/bin/env python

Python 2/3 considerations

use six, if necessary
use __future__ imports, if necessary
prefer io.open to raw open, e.g. the Python 2 builtin has no encoding keyword
string literals should be written with the u prefix (obsolete in Python 3, but required in Python 2)
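A short sketch of the io.open and u prefix recommendations: the same code reads and writes UTF-8 text under both Python 2 and Python 3. The file name and sample string are made up for the example.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# io.open accepts the encoding keyword on both Python 2 and Python 3.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(u"Universitätsbibliothek Leipzig\n")

with io.open(path, encoding="utf-8") as handle:
    text = handle.read()

print(text.strip())
```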

Debugging

prefer logging over print statements
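A minimal sketch of what this looks like with the standard `logging` module. The logger name `siskin.example`, the format string and the in-memory stream are assumptions chosen to keep the example self-contained.

```python
import io
import logging

# Log into an in-memory stream so the example has no side effects.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))

logger = logging.getLogger("siskin.example")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.propagate = False

# Unlike print, the message carries a level and can be filtered or redirected.
logger.debug("harvested %d records", 25000)

print(stream.getvalue(), end="")
```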

Open for discussion

one suffix for data acquisition tasks, e.g. Harvest, Get, Fetch, Download, ...

Deployment

A distribution can be created via Makefile.

$ make dist
$ tree dist/
dist/
└── siskin-0.62.0.tar.gz

The tarball can be installed via pip:

$ pip install siskin-0.62.0.tar.gz

If access to PyPI is possible, one can upload the tarball there with:

$ make upload

This in turn allows installing siskin via:

$ pip install -U siskin

on the target machine.

TODO

The naming of the scripts is a bit unfortunate: taskdo, taskcat, .... Maybe better: siskin run, siskin cat, siskin rm and so on.
Investigate pytest for testing tasks, given inputs.

Misc

A short video using luigi's on_success and on_failure handlers to make the processing audible.

Releases: 5

siskin 1.5.73 Latest

Jan 10, 2024

+ 4 releases
