miku/siskin
Various tasks for heterogeneous metadata handling for project finc at Leipzig University Library. Based on luigi from Spotify.
We use a couple of scripts in the repository to harvest about twenty data sources of various flavors (FTPs, OAIs, HTTPs), mix and match CSV, XML and JSON, and run conversions and deduplication to create a single file that is indexable and conforms to a customized VuFind SOLR schema. The file is indexed on a unified index host that serves part of the data in the online catalogs of partners.
- Overview in a few markdown slides
Luigi (and other frameworks) let you divide complex workflows into a set of tasks, which form a DAG. The task logic is implemented in Python, but it is easy to use external tools, e.g. via ExternalProgram or shellout. Luigi is workflow glue and scales up (HDFS) and down (local scheduler).
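To make the DAG idea concrete, here is a minimal sketch in plain Python. It does not use Luigi's API, and the task names (`Harvest`, `Convert`, `Deduplicate`, `Export`) are hypothetical. It only shows how a set of dependencies determines a valid execution order, using `graphlib` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical tasks; each key maps to the set of tasks it depends on.
DEPENDENCIES = {
    "Harvest": set(),
    "Convert": {"Harvest"},
    "Deduplicate": {"Convert"},
    "Export": {"Convert", "Deduplicate"},
}


def execution_order(dependencies):
    """Return the task names so that every task follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
print(order)
```

A workflow framework does more than ordering (it also skips tasks whose output already exists), but the ordering constraint is the part that makes the workflow a DAG.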
More on Luigi:
- Luigi docs
- Luigi presentation at LPUG 2015
- Luigi workshop at PyCon Balkan 2018
- Data pipelines, Luigi, Airflow: everything you need to know
More about the project:
- Blog about index [de], 2015
- Presentation at 4th VuFind Meetup [de], 2015
- Metadaten zwischen Autopsie und Automatisierung [de], 2018
Contents.
- Install
- Update
- Run
- Create an aggregated file for finc
- Configuration
- Software versioning
- Schema changes
- Task dependencies
- Evolving workflows
- Development
- Naming conventions
- Deployment
- TODO
$ pip install -U siskin
The siskin project includes a bunch of scripts that allow you to create, inspect or remove tasks or task artifacts.
Starting 02/2020, only Python 3 is supported.
Run taskchecksetup to see what additional tools might need to be installed (this is a manually curated list, not everything is required for every task).
$ taskchecksetup
ok 7z
ok csvcut
ok curl
ok filterline
ok flux.sh
ok groupcover
ok iconv
ok iconv-chunks
ok jq
ok metha-sync
ok pigz
ok solrbulk
ok span-import
ok unzip
ok wget
ok xmllint
ok yaz-marcdump
For siskin updates, a
$ pip install -U siskin
should suffice. If newer versions of external programs are required, then please update those manually (e.g. via your OS' package manager).
List tasks:
$ tasknames
A task is an encapsulation of a processing step and can be, in theory, anything. Typical tasks are: fetching data from FTP, an OAI endpoint or an HTTP API, format conversions, filters or reports. Many tasks are parameterized by date (with the default often being today), which allows siskin to keep track of whether an artifact is up to date or not.
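The following sketch illustrates how a date parameter can decide whether an artifact is up to date. It is plain Python under stated assumptions: the path layout and the helper names `artifact_path` and `is_up_to_date` are hypothetical and do not reflect how siskin actually stores its files.

```python
import datetime
import os
import tempfile


def artifact_path(base_dir, task_name, date):
    """Build a hypothetical artifact path that encodes task name and date."""
    return os.path.join(base_dir, task_name, "date-%s.out" % date.isoformat())


def is_up_to_date(base_dir, task_name, date):
    """A task counts as done for a date if its artifact already exists."""
    return os.path.exists(artifact_path(base_dir, task_name, date))


base_dir = tempfile.mkdtemp()
today = datetime.date(2020, 2, 1)
yesterday = today - datetime.timedelta(days=1)

# Simulate a run from yesterday by writing its artifact.
path = artifact_path(base_dir, "DOAJHarvest", yesterday)
os.makedirs(os.path.dirname(path))
with open(path, "w") as handle:
    handle.write("harvested data\n")

print(is_up_to_date(base_dir, "DOAJHarvest", yesterday))  # artifact exists
print(is_up_to_date(base_dir, "DOAJHarvest", today))      # needs a new run
```

Because the date is part of the output location, running the same task with a new date produces a new artifact, while a rerun with an old date finds the existing file and has nothing to do.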
Run simple task:
$ taskdo DOAJHarvest
Documentation:
$ taskdocs | less -R
Remove artefacts of a task:
$ taskrm DOAJHarvest
Inspect the source code of a task:
$ taskinspect AILocalData
class AILocalData(AITask):
    """ Extract a CSV about source, id, doi and institutions for deduplication. """
    date = ClosestDateParameter(default=datetime.date.today())
    batchsize = luigi.IntParameter(default=25000, significant=False)

    def requires(self):
        return AILicensing(date=self.date)
    ...
There are a couple of prerequisites:
- siskin is installed
- most additional tools are installed (or: output of the taskchecksetup is mostly green)
- credentials are configured in /etc/siskin/siskin.ini or ~/.config/siskin/siskin.ini
- some static data (that cannot be accessed over the net) is put into place (and configured in siskin.ini)
- sufficient disk space is available
The update process itself consists of various updates:
- all data sources (crossref, doaj, ...) are updated, as needed (e.g. FTP is synced, OAI is harvested, API, ...)
- the licensing data is fetched from AMSL
The dependency graph of these operations can become complex.
However, if everything is put into place, a single command will suffice:
$ taskdo AIUpdate --workers 4
This can be a long running (hours, days) command, depending on the state of the already cached data.
Note: Currently a jour fixe (the 15th of a month) is used as default for the licensing information (another task, called AMSLFilterConfigFreeze, should be run daily for this to work). The jour fixe can be overridden with the current information by passing a parameter to the AILicensing task:
$ taskdo AIUpdate --workers 4 --AILicensing-override
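As an illustration of the jour fixe idea, the sketch below computes the most recent 15th on or before a given date. This is an assumption about what "default" means here; the function name `last_jour_fixe` is hypothetical and siskin's actual date logic may differ.

```python
import datetime


def last_jour_fixe(date, day=15):
    """Return the most recent `day` of a month on or before `date`."""
    if date.day >= day:
        return date.replace(day=day)
    # Step back into the previous month, then pin the day.
    last_of_previous_month = date.replace(day=1) - datetime.timedelta(days=1)
    return last_of_previous_month.replace(day=day)


print(last_jour_fixe(datetime.date(2018, 5, 25)))
print(last_jour_fixe(datetime.date(2018, 5, 3)))
```

Pinning the licensing data to a fixed day means that all runs within the same period see the same licensing information, unless the override is requested.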
Once the task is completed, the output of the two tasks:
- AIExport (solr)
- AIRedact (blob, currently microblob)
can be put into their respective data stores (e.g. via solrbulk).
The siskin package harvests all kinds of data sources, some of which might be protected. All credentials and a few other configuration options go into a siskin.ini, either in /etc/siskin/ or ~/.config/siskin/. If both files are present, the local options take precedence.
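The precedence rule can be sketched with `configparser` from the standard library, which reads a list of files in order and lets later files override earlier ones. This is an illustration only, not siskin's actual loading code; the temporary files stand in for the system-wide and the local siskin.ini, and `load_config` is a hypothetical helper.

```python
import configparser
import os
import tempfile


def load_config(paths):
    """Read INI files in order; values from later files override earlier ones."""
    parser = configparser.ConfigParser()
    parser.read(paths)  # missing files are skipped silently
    return parser


tmp = tempfile.mkdtemp()
system_ini = os.path.join(tmp, "system.ini")  # stands in for /etc/siskin/siskin.ini
local_ini = os.path.join(tmp, "local.ini")    # stands in for ~/.config/siskin/siskin.ini

with open(system_ini, "w") as handle:
    handle.write("[core]\nhome = /var/siskin\n\n[jstor]\nftp-username = abc\n")
with open(local_ini, "w") as handle:
    handle.write("[core]\nhome = /tmp/siskin\n")

config = load_config([system_ini, local_ini])
print(config.get("core", "home"))           # value from the local file
print(config.get("jstor", "ftp-username"))  # only defined system-wide
```

Options that appear only in the system-wide file remain visible, so a local file needs to contain only the values it wants to change.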
Luigi uses a bit of configuration as well; put it under /etc/luigi/.
Completions on task names will save you typing and time, so put siskin_completion.sh under /etc/bash_completion.d or somewhere else.
$ tree etc
etc
├── bash_completion.d
│   └── siskin_completion.sh
├── luigi
│   ├── luigi.cfg
│   └── logging.ini
└── siskin
    └── siskin.ini
All configuration values can be inspected quickly with:
$ taskconfig
[core]
home = /var/siskin

[imslp]
listings-url = https://example.org/abc

[jstor]
ftp-username = abc
ftp-password = d3f
...
Since siskin works mostly on data, software versioning differs a bit, but we try to adhere to the following rules:
- major changes: You need to recreate all your data from scratch.
- minor changes: We added, renamed or removed at least one task. You will have to recreate a subset of the tasks to see the changes. You might need to change pipelines depending on those tasks, because they might not exist any more or have been renamed.
- revision changes: A modification within existing tasks (e.g. bugfixes). You will have to recreate a subset of the tasks to see these changes, but no new task is introduced. No pipeline is broken that wasn't already.
These rules apply for version 0.2.0 and later. To see the current version, use:
$ taskversion
0.43.3
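To decide which of the three rules applies when updating, one can compare the installed and the new version string. The sketch below assumes versions always have the form `major.minor.revision`; the helpers `parse_version` and `change_level` are hypothetical and not part of siskin.

```python
def parse_version(version):
    """Split a 'major.minor.revision' string into a tuple of integers."""
    major, minor, revision = (int(part) for part in version.split("."))
    return major, minor, revision


def change_level(old, new):
    """Name the highest version component that differs between old and new."""
    old_parts, new_parts = parse_version(old), parse_version(new)
    for name, a, b in zip(("major", "minor", "revision"), old_parts, new_parts):
        if a != b:
            return name
    return "none"


print(change_level("0.43.3", "0.43.4"))
print(change_level("0.43.3", "0.44.0"))
```

A result of `major` would mean recreating all data, while `minor` and `revision` only require recreating the affected subset of tasks.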
To remove all files of a certain format (due to schema changes or such) it helps, if naming is uniform:
$ tasknames | grep IntermediateSchema | xargs -I {} taskrm {}
...
Apart from that, all upstream tasks need to be removed manually (consult the map) as this is not automatic yet.
Inspect task dependencies with:
$ taskdeps JstorIntermediateSchema
  └─ JstorIntermediateSchema(date=2018-05-25)
      └─ AMSLService(date=2018-05-25, name=outboundservices:discovery)
      └─ JstorCollectionMapping(date=2018-05-25)
      └─ JstorIntermediateSchemaGenericCollection(date=2018-05-25)
Or visually via graphviz.
$ taskdeps-dot JstorIntermediateSchema | dot -Tpng > deps.png
To converge the project on a common format run:
$ make imports style
This will fix import order and code style in-place. Requires isort and yapf to be installed. Should be executed under Python 3 only (as Python 2 isort seems to have differing opinions).
Other tools:
- use pylint, currently 9.18/10 with many errors ignored, maybe with git commit hook
- use pytest, pytest-cov, coverage at 9%
Some conventions are enforced by tools (e.g. imports, yapf), but the following may be considered as well.
- task class names that produce MARC21 should have suffix MARC, e.g. ArchiveMARC
- task class names that produce intermediate schema files should have suffix IntermediateSchema, e.g. ArchiveIntermediateSchema
- task for a single source should share a prefix, e.g. ArchiveMARC, ArchiveISSNList
- source prefix names should follow the source names (e.g. site of publisher), in German: vorlagegetreu, e.g. DOAJHarvest, GallicaMARC
- potentially long source names can be shortened, e.g. Umweltbibliothek can become UmBi... in umbi.py
- it is recommended that the source file name follows the source name, e.g. DOAJ tasks live in doaj.py
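The suffix and prefix conventions above lend themselves to a simple check. The following sketch splits a task class name into source prefix and known suffix and derives a suggested module name. The suffix list is a small, hypothetical selection and `split_task_name` and `suggested_module` are illustrative helpers, not part of siskin.

```python
# A small, hypothetical selection of suffixes taken from the conventions.
KNOWN_SUFFIXES = ("IntermediateSchema", "MARC", "Harvest")


def split_task_name(name):
    """Return (prefix, suffix); suffix is None if no known suffix matches."""
    for suffix in KNOWN_SUFFIXES:
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[:-len(suffix)], suffix
    return name, None


def suggested_module(name):
    """Suggest a module file name from the source prefix of a task name."""
    prefix, _ = split_task_name(name)
    return prefix.lower() + ".py"


print(split_task_name("ArchiveIntermediateSchema"))
print(suggested_module("DOAJHarvest"))
```

Names without a known suffix, such as ArchiveISSNList, are returned unchanged with `None` as suffix, so the check reports them instead of guessing.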
Rough examples:
# coding: utf-8
# pylint: ...
#
# Copyright 2019 ... GPL-3.0+ snippet
# ...
# @license GPL-3.0+ <http://spdx.org/licenses/GPL-3.0+>

"""
Source: Gallica
SID: 20
Ticket: #14793
Origin: OAI
Updates: monthly

Config:

[vkfilm]
input = /path/to/file
password = helloadmin
"""
- use double quotes, if possible
- if a module can be used as a standalone script, then it should include the following line as first line:
#!/usr/bin/env python
- use six, if necessary
- use __future__ imports if necessary
- prefer io.open to raw open, as the Python 2 builtin open has no encoding keyword
- string literals should be written with the u prefix (obsolete in Python 3, but required in Python 2)
- prefer logging over print statements
- one suffix for data acquisition tasks, e.g. Harvest, Get, Fetch, Download, ...
A distribution can be created via Makefile.
$ make dist
$ tree dist/
dist/
└── siskin-0.62.0.tar.gz
The tarball can be installed via pip:
$ pip install siskin-0.62.0.tar.gz
If access to PyPI is possible, one can upload the tarball there with:
$ make upload
This in turn allows installing siskin via:
$ pip install -U siskin
on the target machine.
- The naming of the scripts is a bit unfortunate, taskdo, taskcat, .... Maybe better siskin run, siskin cat, siskin rm and so on.
- Investigate pytest for testing tasks, given inputs.
A short video using luigi's on_success and on_failure handlers to make the processing audible.