QA catalogue – a metadata quality assessment tool for library catalogue records (MARC, PICA)
QA catalogue is a set of software packages for bibliographical record quality assessment. It reads MARC or PICA files (in different formats), analyses some quality dimensions, and saves the results into CSV files. These CSV files can be used in different contexts; we provide a lightweight, web-based user interface for that. Some of the functionalities are available as a web service, so the validation can be built into a cataloguing/quality assessment workflow.
Screenshot from the web UI of the QA catalogue
- For more info
- main project page: Metadata Quality Assessment Framework
- Validating 126 million MARC records at DATeCH 2019: paper, slides, thesis chapter
- Empirical evaluation of library catalogues at SWIB 2019: slides, paper in English, paper in Spanish
- quality assessment of Gent University Library catalogue (a running instance of the dashboard): http://gent.qa-catalogue.eu/
- new: QA catalogue mailing list: https://listserv.gwdg.de/mailman/listinfo/qa-catalogue
- If you would like to play with this project but you don't have MARC21 records, please download some of the record sets mentioned in Appendix I: Where can I get MARC records? of this document.
- Quick start guide
- Build
- Download
- Usage
- helper-scripts
- Detailed instructions
- General parameters
- Validating MARC records
- Display one MARC record, or extract data elements from MARC records
- Completeness analyses
- Contextual analyses
- Field frequency distribution
- Generating cataloguing history chart
- Import tables to SQLite
- Indexing bibliographic records with Solr
- Indexing MARC JSON records with Solr
- Export mapping table
- Shacl4Bib
- Extending the functionalities
- User interface
- Appendices
See INSTALL.md for dependencies.
wget https://github.com/pkiraly/metadata-qa-marc/releases/download/v0.6.0/metadata-qa-marc-0.6.0-release.zip
unzip metadata-qa-marc-0.6.0-release.zip
cd metadata-qa-marc-0.6.0/
Either use the script `qa-catalogue` or create configuration files:

cp setdir.sh.template setdir.sh

Change the input and output base directories in `setdir.sh`. The local directories `input/` and `output/` will be used by default. Files of each catalogue are in a subdirectory of these base directories:
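For example, with the default settings and a hypothetical catalogue named `loc`, the layout would look like this:

```
input/
  loc/      # the MARC files of the catalogue
output/
  loc/      # the analysis results written by QA catalogue
```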
- Create configuration based on some existing config files:
- cp catalogues/loc.sh catalogues/[abbreviation-of-your-library].sh
- edit catalogues/[abbreviation-of-your-library].sh according to the configuration guide
More detailed instructions on how to use qa-catalogue with Docker can be found in the wiki.
A Docker image bundling qa-catalogue with all of its dependencies and the web interface qa-catalogue-web is made available:
- continuously via GitHub as `ghcr.io/pkiraly/qa-catalogue`
- for releases via Docker Hub as `pkiraly/metadata-qa-marc`
To download, configure and start an image in a new container, the file docker-compose.yml is needed in the current directory. It can be configured with the following environment variables:
- `IMAGE`: which Docker image to download and run. By default the latest image from Docker Hub is used (`pkiraly/metadata-qa-marc`). Alternatives include `IMAGE=ghcr.io/pkiraly/qa-catalogue:main` for the most recent image from GitHub packages, or `IMAGE=metadata-qa-marc` if you have locally built the Docker image.
- `CONTAINER`: the name of the Docker container. Default: `metadata-qa-marc`.
- `INPUT`: base directory to put your bibliographic record files in subdirectories of. Set to `./input` by default, so record files are expected to be in `input/$NAME`.
- `OUTPUT`: base directory to put the results of qa-catalogue in subdirectories of. Set to `./output` by default, so files are put in `output/$NAME`.
- `WEBCONFIG`: directory to expose the configuration of qa-catalogue-web. Set to `./web-config` by default. If using a non-default configuration for data analysis (for instance PICA instead of MARC), then you likely need to adjust the configuration of the web interface as well. This directory should contain a configuration file `configuration.cnf`.
- `WEBPORT`: port to expose the web interface. For instance `WEBPORT=9000` will make it available at http://localhost:9000/ instead of http://localhost/.
- `SOLRPORT`: port to expose Solr to. Default: `8983`.
Environment variables can be set on the command line or be put in a local file `.env`, e.g.:
WEBPORT=9000 docker compose up -d
or
docker compose --env-file config.env up -d
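A minimal environment file might look like this sketch (the values shown are the documented defaults or the examples mentioned above):

```
IMAGE=ghcr.io/pkiraly/qa-catalogue:main
CONTAINER=metadata-qa-marc
WEBPORT=9000
INPUT=./input
OUTPUT=./output
```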
When the application has been started this way, run analyses with the script `./docker/qa-catalogue` the same way as the script `./qa-catalogue` is called when not using Docker (see usage for details). The following example uses parameters for the Gent university library catalogue:
./docker/qa-catalogue \
  --params "--marcVersion GENT --alephseq" \
  --mask "rug01.export" \
  --catalogue gent \
  all
Now you can reach the web interface (qa-catalogue-web) at http://localhost:80/ (or at another port as configured with the environment variable `WEBPORT`). To further modify the appearance of the interface, create templates in your `WEBCONFIG` directory and/or create a file `configuration.cnf` in this directory to extend the UI configuration without having to restart the Docker container.
This example works under Linux. Windows users should consult the Docker on Windows wiki page. Other useful Docker commands can be found in QA catalogue's wiki.
Everything else should work the same way as in other environments, so follow the next sections.
catalogues/[abbreviation-of-your-library].sh all-analyses
catalogues/[abbreviation-of-your-library].sh all-solr
For a catalogue with around 1 million records the first command will take 5-10 minutes, the latter 1-2 hours.
Prerequisites: Java 11 (I use OpenJDK), and Maven 3
- Optional step: clone and build the parent library, metadata-qa-api project:
git clone https://github.com/pkiraly/metadata-qa-api.git
cd metadata-qa-api
mvn clean install
cd ..
- Mandatory step: clone and build the current metadata-qa-marc project
git clone https://github.com/pkiraly/metadata-qa-marc.git
cd metadata-qa-marc
mvn clean install
The released versions of the software are available from the Maven Central repository. The stable release (currently 0.6.0) is available from all Maven repos, while the developer version (*-SNAPSHOT) is available only from the [Sonatype Maven repository](https://oss.sonatype.org/content/repositories/snapshots/de/gwdg/metadataqa/metadata-qa-marc/0.5.0/). What you need to select is the file metadata-qa-marc-0.6.0-jar-with-dependencies.jar.
Be aware that no automation exists for creating a current developer version as a nightly build, so there is a chance that the latest features are not available in this version. If you want to use the latest version, do build it.
Since the jar file doesn't contain the helper scripts, you might also consider downloading them from this GitHub repository:
wget https://raw.githubusercontent.com/pkiraly/metadata-qa-marc/master/common-script
wget https://raw.githubusercontent.com/pkiraly/metadata-qa-marc/master/validator
wget https://raw.githubusercontent.com/pkiraly/metadata-qa-marc/master/formatter
wget https://raw.githubusercontent.com/pkiraly/metadata-qa-marc/master/tt-completeness
You should adjust `common-script` to point to the jar file you just downloaded.
The tool comes with some bash helper scripts to run all these with default values. The generic scripts are located in the root directory, and library-specific configuration scripts exist in the `catalogues` directory. You can find predefined scripts for several library catalogues (if you want to run one, first you have to configure it). All these scripts mainly contain configuration, and then call the central `common-script`, which contains the functions.
To run the analyses, call either
catalogues/[your script] [command(s)]
or
./qa-catalogue --params="[options]" [command(s)]
The following commands are supported:
- `validate` -- runs validation
- `completeness` -- runs completeness analysis
- `classifications` -- runs classification analysis
- `authorities` -- runs authorities analysis
- `tt-completeness` -- runs Thompson-Traill completeness analysis
- `shelf-ready-completeness` -- runs shelf-ready completeness analysis
- `serial-score` -- calculates the serial scores
- `format` -- runs formatting of records
- `functional-analysis` -- runs functional analysis
- `pareto` -- runs pareto analysis
- `marc-history` -- generates cataloguing history chart
- `prepare-solr` -- prepares the Solr index (you should already have Solr running, and the index created)
- `index` -- runs indexing with Solr
- `sqlite` -- imports tables to SQLite
- `export-schema-files` -- exports schema files
- `all-analyses` -- runs all default analysis tasks
- `all-solr` -- runs all indexing tasks
- `all` -- runs all tasks
- `config` -- shows the configuration of the selected catalogue
You can find information about these functionalities below in this document.
- create the configuration file (setdir.sh)
cp setdir.sh.template setdir.sh
- edit the configuration file. Three lines are important here:

BASE_INPUT_DIR=your/path
BASE_OUTPUT_DIR=your/path
BASE_LOG_DIR=your/path
- `BASE_INPUT_DIR` is the parent directory where your MARC records exist
- `BASE_OUTPUT_DIR` is where the analysis results will be stored
- `BASE_LOG_DIR` is where the analysis logs will be stored
- edit the library specific file
Here is an example file for analysing Library of Congress' MARC records
#!/usr/bin/env bash
. ./setdir.sh
NAME=loc
MARC_DIR=${BASE_INPUT_DIR}/loc/marc
MASK=*.mrc
. ./common-script
Three variables are important here:
- `NAME` is a name for the output directory. The analysis result will land under the $BASE_OUTPUT_DIR/$NAME directory
- `MARC_DIR` is the location of the MARC files. All the files should be in the same directory
- `MASK` is a file mask, such as `*.mrc`, `*.marc` or `*.dat.gz`. Files ending with `.gz` are uncompressed automatically.
You can add here any other parameters this document mentions at the description of the individual commands, wrapped in the TYPE_PARAMS variable; e.g. in the Deutsche Nationalbibliothek's config file one can find this:
TYPE_PARAMS="--marcVersion DNB --marcxml"
This line sets the DNB's MARC version (to cover fields defined within DNB's MARC version), and XML as the input format.
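Putting the pieces together, a library-specific configuration might look like the following sketch (the name `dnb` and the paths are hypothetical):

```bash
#!/usr/bin/env bash
. ./setdir.sh
NAME=dnb
TYPE_PARAMS="--marcVersion DNB --marcxml"
MARC_DIR=${BASE_INPUT_DIR}/dnb
MASK=*.xml.gz
. ./common-script
```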
The following table summarizes the configuration variables. The script `qa-catalogue` can be used to set variables and execute the analysis without a library-specific configuration file:
variable | qa-catalogue | description | default |
---|---|---|---|
ANALYSES | -a / --analyses | which tasks to run with all-analyses | validate, validate_sqlite, completeness, completeness_sqlite, classifications, authorities, tt_completeness, shelf_ready_completeness, serial_score, functional_analysis, pareto, marc_history |
  | -c / --catalogue | display name of the catalogue | $NAME |
NAME | -n / --name | name of the catalogue | qa-catalogue |
BASE_INPUT_DIR | -d / --input | parent directory of input file directories | ./input |
INPUT_DIR | -d / --input-dir | subdirectory of the input directory to read files from | |
BASE_OUTPUT_DIR | -o / --output | parent output directory | ./output |
BASE_LOG_DIR | -l / --logs | directory of log files | ./logs |
MASK | -m / --mask | a file mask selecting which input files to process, e.g. *.mrc | * |
TYPE_PARAMS | -p / --params | parameters to pass to individual tasks (see below) | |
SCHEMA | -s / --schema | record schema | MARC21 |
UPDATE | -u / --update | optional date of input files | |
VERSION | -v / --version | optional version number/date of the catalogue to compare changes | |
WEB_CONFIG | -w / --web-config | update the specified configuration file of qa-catalogue-web | |
  | -f / --env-file | configuration file to load environment variables from (default: .env) | |
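For example, the following run uses the long option names from the table (the catalogue name `loc` and the mask are placeholders):

```bash
./qa-catalogue \
  --name loc \
  --input ./input \
  --output ./output \
  --mask '*.mrc' \
  --params '--marcVersion MARC21 --marcFormat ISO' \
  all-analyses
```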
We will use the same jar file in every command, so we save its path into a variable.
export JAR=target/metadata-qa-marc-0.7.0-jar-with-dependencies.jar
Most of the analyses use the following general parameters:
- `--schemaType <type>`: metadata schema type. The supported types are:
  - `MARC21`
  - `PICA`
  - `UNIMARC` (assessment of UNIMARC records is not yet supported; this parameter value is only reserved for future usage)
- `-m <version>`, `--marcVersion <version>`: specifies a MARC version. Currently, the supported versions are:
  - `MARC21`, Library of Congress MARC21
  - `DNB`, the Deutsche Nationalbibliothek's MARC version
  - `OCLC`, the OCLCMARC
  - `GENT`, fields available in the catalog of Gent University (Belgium)
  - `SZTE`, fields available in the catalog of Szegedi Tudományegyetem (Hungary)
  - `FENNICA`, fields available in the Fennica catalog of the Finnish National Library
  - `NKCR`, fields available at the National Library of the Czech Republic
  - `BL`, fields available at the British Library
  - `MARC21NO`, fields available in the MARC21 profile for Norwegian public libraries
  - `UVA`, fields available at the University of Amsterdam Library
  - `B3KAT`, fields available in the B3Kat union catalogue of the Bibliotheksverbund Bayern (BVB) and the Kooperativer Bibliotheksverbund Berlin-Brandenburg (KOBV)
  - `KBR`, fields available at KBR, the national library of Belgium
  - `ZB`, fields available at Zentralbibliothek Zürich
  - `OGYK`, fields available at Országgyűlési Könyvtár, Budapest
- `-n`, `--nolog`: do not display log messages
- parameters to limit the validation:
  - `-i [record ID]`, `--id [record ID]`: validates only a single record having the specified identifier (the content of 001)
  - `-l [number]`, `--limit [number]`: validates only the given number of records
  - `-o [number]`, `--offset [number]`: starts validation at the given Nth record
  - `-z [list of tags]`, `--ignorableFields [list of tags]`: do NOT validate the selected fields. The list should contain the tags separated by commas (`,`), e.g. `--ignorableFields A02,AQN`
  - `-v [selector]`, `--ignorableRecords [selector]`: do NOT validate the records which match the condition denoted by the selector. The selector is a test MARCspec string, e.g. `--ignorableRecords STA$a=SUPPRESSED`. It ignores the records which have an `STA` field with an `a` subfield with the value `SUPPRESSED`.
- `-d [record type]`, `--defaultRecordType [record type]`: the default record type to be used if the record's type is undetectable. The record type is calculated from the combination of Leader/06 (type of record) and Leader/07 (bibliographic level); however, sometimes the combination doesn't fit the standard. In this case the tool will use the given record type. Possible values of the record type argument:
  - BOOKS
  - CONTINUING_RESOURCES
  - MUSIC
  - MAPS
  - VISUAL_MATERIALS
  - COMPUTER_FILES
  - MIXED_MATERIALS
- parameters to fix known issues before any analyses:
  - `-q`, `--fixAlephseq`: sometimes an ALEPH export contains '^' characters instead of spaces in control fields (006, 007, 008). This flag replaces them with spaces before the validation. It might occur in any input format.
  - `-a`, `--fixAlma`: sometimes an Alma export contains '#' characters instead of spaces in control fields (006, 007, 008). This flag replaces them with spaces before the validation. It might occur in any input format.
  - `-b`, `--fixKbr`: KBR's export contains '#' characters instead of spaces in control fields (006, 007, 008). This flag replaces them with spaces before the validation. It might occur in any input format.
- `-f <format>`, `--marcFormat <format>`: the input format. Possible values are:
  - `ISO`: binary (ISO 2709)
  - `XML`: MARCXML (shortcuts: `-x`, `--marcxml`)
  - `ALEPHSEQ`: Alephseq (shortcuts: `-p`, `--alephseq`)
  - `LINE_SEPARATED`: line separated binary MARC where each line contains one record (shortcuts: `-y`, `--lineSeparated`)
  - `MARC_LINE`: MARC Line is a line-separated format, i.e. it is a text file where each line is a distinct field, the same way as MARC records are usually displayed in the MARC21 standard documentation.
  - `MARCMAKER`: MARCMaker format
  - `PICA_PLAIN`: PICA plain (https://format.gbv.de/pica/plain) is a serialization format that contains each field in a distinct row.
  - `PICA_NORMALIZED`: normalized PICA (https://format.gbv.de/pica/normalized) is a serialization format where each line is a separate record (separated by bytecode `0A`). Fields are terminated by bytecode `1E`, and subfields are introduced by bytecode `1F`.
- `-t <directory>`, `--outputDir <directory>`: specifies the output directory where the files will be created
- `-r`, `--trimId`: remove spaces from the end of record IDs in the output files (some library systems add padding spaces around the value of field 001 in exported files)
- `-g <encoding>`, `--defaultEncoding <encoding>`: specify a default encoding of the records. Possible values:
  - `ISO-8859-1` or `ISO8859_1` or `ISO_8859_1`
  - `UTF8` or `UTF-8`
  - `MARC-8` or `MARC8`
- `-s <datasource>`, `--dataSource <datasource>`: specify the type of data source. Possible values:
  - `FILE`: reading from a file
  - `STREAM`: reading from a Java data stream. It is not usable if you use the tool from the command line, only if you use it with its API.
- `-c <configuration>`, `--allowableRecords <configuration>`: if set, criteria which allow the analysis of records. If the record does not meet the criteria, it will be excluded. An individual criterium should be formed as a MarcSpec (for MARC21 records) or PicaFilter (for PICA records). Multiple criteria might be concatenated with logical operations: `&&` for AND, `||` for OR and `!` for NOT. One can use parentheses to group logical expressions. An example: `'002@.0 !~ "^L" && 002@.0 !~ "^..[iktN]" && (002@.0 !~ "^.v" || 021A.a?)'`. Since the criteria might form a complex phrase containing spaces, the passing of which is problematic among multiple scripts, one can apply Base64 encoding. In this case add the `base64:` prefix to the parameter, such as `base64:"$(echo '002@.0 !~ "^L" && 002@.0 !~ "^..[iktN]" && (002@.0 !~ "^.v" || 021A.a?)' | base64 -w 0)"`.
- `-1 <type>`, `--alephseqLineType <type>`: Alephseq line type. The type could be:
  - `WITH_L`: the records' Alephseq lines contain an `L` string (e.g. `000000002 008 L 780804s1977^^^^enk||||||b||||001^0|eng||`)
  - `WITHOUT_L`: the records' Alephseq lines do not contain an `L` string (e.g. `000000002 008 780804s1977^^^^enk||||||b||||001^0|eng||`)
- PICA related parameters
  - `-2 <path>`, `--picaIdField <path>`: the record identifier subfield of PICA records. Default is `003@$0`.
  - `-u <char>`, `--picaSubfieldSeparator <char>`: the PICA subfield separator. Default is `$`.
  - `-j <file>`, `--picaSchemaFile <file>`: an Avram schema file, which describes the structure of PICA records
  - `-k <path>`, `--picaRecordType <path>`: the PICA subfield which stores the record type information. Default is `002@$0`.
- Parameters for grouping analyses
  - `-e <path>`, `--groupBy <path>`: group the results by the value of this data element (e.g. the ILN of libraries holding the item). An example: `--groupBy 001@$0`, where `001@$0` is the subfield containing the comma separated list of library ILN codes.
  - `-3 <file>`, `--groupListFile <file>`: the file which contains a list of ILN codes
The last argument of the commands is a list of files. It might contain any wildcard the operating system supports ('*', '?', etc.).
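As an illustration, the following sketch combines several of these general parameters in a validation run (the paths are placeholders):

```bash
java -cp $JAR de.gwdg.metadataqa.marc.cli.Validator \
  --marcVersion DNB --marcxml \
  --defaultRecordType BOOKS \
  --limit 10000 \
  --outputDir ./output/dnb \
  /path/to/dnb/records-*.xml.gz
```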
It validates each record against the MARC21 standard, including those locally defined fields which are selected by the MARC version parameter.
The issues are classified into the following categories: record, control field, data field, indicator, subfield, and their subtypes.
There is an uncertainty in the issue detection. Almost all library catalogues have fields which are not part of the MARC standard, nor of their documentation about the locally defined fields (these documents are rarely available publicly, and even if they are available, sometimes they do not cover all fields). So if the tool meets a field which is undefined, it is impossible to decide whether it is valid or invalid in a particular context. In some places the tool reflects this uncertainty and provides two calculations: one which handles these fields as errors, and another which handles them as valid fields.
The tool detects the following issues:
machine name | explanation |
---|---|
record level issues | |
undetectableType | the document type is not detectable |
invalidLinkage | the linkage in field 880 is invalid |
ambiguousLinkage | the linkage in field 880 is ambiguous |
control field position issues | |
obsoleteControlPosition | the code in the position is obsolete (it was valid in a previous version of MARC, but it is not valid now) |
controlValueContainsInvalidCode | the code in the position is invalid |
invalidValue | the position value is invalid |
data field issues | |
missingSubfield | missing reference subfield (880$6) |
nonrepeatableField | repetition of a non-repeatable field |
undefinedField | the field is not defined in the specified MARC version(s) |
indicator issues | |
obsoleteIndicator | the indicator value is obsolete (it was valid in a previous version of MARC, but not in the current version) |
nonEmptyIndicator | indicator that should be empty is non-empty |
invalidValue | the indicator value is invalid |
subfield issues | |
undefinedSubfield | the subfield is undefined in the specified MARC version(s) |
invalidLength | the length of the value is invalid |
invalidReference | the reference to the classification vocabulary is invalid |
patternMismatch | content does not match the patterns specified by the standard |
nonrepeatableSubfield | repetition of a non-repeatable subfield |
invalidISBN | invalid ISBN value |
invalidISSN | invalid ISSN value |
unparsableContent | the value of the subfield is not well-formed according to its specification |
nullCode | null subfield code |
invalidValue | invalid subfield value |
Usage:
java -cp $JAR de.gwdg.metadataqa.marc.cli.Validator [options] <file>
or with a bash script
./validator [options] <file>
or
catalogues/<catalogue>.sh validate
or
./qa-catalogue --params="[options]" validate
options:
- general parameters
- granularity of the report
  - `-S`, `--summary`: creating a summary report instead of record level reports
  - `-H`, `--details`: provides record level details of the issues
- output parameters (an example combining several of them follows this list):
  - `-G <file>`, `--summaryFileName <file>`: the name of the summary report the program produces. The file provides a summary of issues, such as the number of instances and the number of records having the particular issue.
  - `-F <file>`, `--detailsFileName <file>`: the name of the report the program produces. Default is `validation-report.txt`. If you use "stdout", it won't create a file, but puts the results into the standard output.
  - `-R <format>`, `--format <format>`: format specification of the output. Possible values: `text` (default), `tab-separated` or `tsv`, `comma-separated` or `csv`
  - `-W`, `--emptyLargeCollectors`: the output files are created during the process and not only at the end of it. It helps memory management if the input is large and has lots of errors; on the other hand the output file will be segmented, which should be handled after the process.
  - `-T`, `--collectAllErrors`: collect all errors (useful only for validating a small number of records). Default is turned off.
  - `-I <types>`, `--ignorableIssueTypes <types>`: comma separated list of issue types not to collect. The valid values are (for details see the issue types table):
    - `undetectableType`: undetectable type
    - `invalidLinkage`: invalid linkage
    - `ambiguousLinkage`: ambiguous linkage
    - `obsoleteControlPosition`: obsolete code
    - `controlValueContainsInvalidCode`: invalid code
    - `invalidValue`: invalid value
    - `missingSubfield`: missing reference subfield (880$6)
    - `nonrepeatableField`: repetition of non-repeatable field
    - `undefinedField`: undefined field
    - `obsoleteIndicator`: obsolete value
    - `nonEmptyIndicator`: non-empty indicator
    - `invalidValue`: invalid value
    - `undefinedSubfield`: undefined subfield
    - `invalidLength`: invalid length
    - `invalidReference`: invalid classification reference
    - `patternMismatch`: content does not match any patterns
    - `nonrepeatableSubfield`: repetition of non-repeatable subfield
    - `invalidISBN`: invalid ISBN
    - `invalidISSN`: invalid ISSN
    - `unparsableContent`: content is not well-formatted
    - `nullCode`: null subfield code
    - `invalidValue`: invalid value
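For example, a run producing both the summary and the record-level reports as CSV might look like this sketch (the paths are placeholders):

```bash
java -cp $JAR de.gwdg.metadataqa.marc.cli.Validator \
  --summary --details \
  --format csv \
  --summaryFileName issue-summary.csv \
  --detailsFileName issue-details.csv \
  --outputDir ./output/loc \
  /path/to/loc/*.mrc
```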
Outputs:
`count.csv`: the count of bibliographic records in the source dataset

total
1192536
`issue-by-category.csv`: the counts of issues by categories. Columns:
- `id`: the identifier of the error category
- `category`: the name of the category
- `instances`: the number of instances of errors within the category (one record might have multiple instances of the same error)
- `records`: the number of records having at least one of the errors within the category

id,category,instances,records
2,control field,994241,313960
3,data field,12,12
4,indicator,5990,5041
5,subfield,571,555
`issue-by-type.csv`: the count of issues by types (subcategories).

id,categoryId,category,type,instances,records
5,2,control field,"invalid code",951,541
6,2,control field,"invalid value",993290,313733
8,3,data field,"repetition of non-repeatable field",12,12
10,4,indicator,"obsolete value",1,1
11,4,indicator,"non-empty indicator",33,32
12,4,indicator,"invalid value",5956,5018
13,5,subfield,"undefined subfield",48,48
14,5,subfield,"invalid length",2,2
15,5,subfield,"invalid classification reference",2,2
16,5,subfield,"content does not match any patterns",286,275
17,5,subfield,"repetition of non-repeatable subfield",123,120
18,5,subfield,"invalid ISBN",5,3
19,5,subfield,"invalid ISSN",105,105
`issue-summary.csv`: details of individual issues including basic statistics

id,MarcPath,categoryId,typeId,type,message,url,instances,records
53,008/33-34 (008map33),2,5,invalid code,'b' in 'b ',https://www.loc.gov/marc/bibliographic/bd008p.html,1,1
70,008/00-05 (008all00),2,5,invalid code,Invalid content: '2023 '. Text '2023 ' could not be parsed at index 4,https://www.loc.gov/marc/bibliographic/bd008a.html,1,1
28,008/22-23 (008map22),2,6,invalid value,| ,https://www.loc.gov/marc/bibliographic/bd008p.html,12,12
19,008/31 (008book31),2,6,invalid value, ,https://www.loc.gov/marc/bibliographic/bd008b.html,1,1
17,008/29 (008book29),2,6,invalid value, ,https://www.loc.gov/marc/bibliographic/bd008b.html,1,1
`issue-details.csv`: the list of issues by record identifiers. It has two columns: the record identifier, and a complex string which contains the number of occurrences of each individual issue, concatenated by semicolons.

recordId,errors
99117335059205508,1:2;2:1;3:1
99117335059305508,1:1
99117335059405508,2:2
99117335059505508,3:1

`1:2;2:1;3:1` means that 3 different types of issues occurred in the record: the first issue, which has issue ID 1, occurred twice; issue ID 2 occurred once; and issue ID 3 occurred once. The issue IDs can be resolved from the `issue-summary.csv` file's first column.
`issue-details-normalized.csv`: the normalized version of the previous file

id,errorId,instances
99117335059205508,1,2
99117335059205508,2,1
99117335059205508,3,1
99117335059305508,1,1
99117335059405508,2,2
99117335059505508,3,1
`issue-total.csv`: the number of issue free records, and the number of records having issues

type,instances,records
0,0,251
1,1711,848
2,413,275
where types are
- 0: records without errors
- 1: records with any kinds of errors
- 2: records with errors excluding invalid field errors
`issue-collector.csv`: non-normalized file of record ids per issue. This is the "inverse" of `issue-details.csv`: it tells you in which records a particular issue occurred.

errorId,recordIds
1,99117329355705508;99117328948305508;99117334968905508;99117335067705508;99117335176005508;...
`validation.params.json`: the list of the actual parameters used during the run of the validation

An example with parameters used for analysing a PICA dataset. When the input is a complex expression it is displayed here in a parsed format. It also contains some metadata such as the versions of the MQAF API and QA catalogue.
{"args":["/path/to/input.dat"],"marcVersion":"MARC21","marcFormat":"PICA_NORMALIZED","dataSource":"FILE","limit":-1,"offset":-1,"id":null,"defaultRecordType":"BOOKS","alephseq":false,"marcxml":false,"lineSeparated":false,"trimId":true,"outputDir":"/path/to/_output/k10plus_pica","recordIgnorator":{"criteria":[],"booleanCriteria":null,"empty":true },"recordFilter":{"criteria":[],"booleanCriteria":{"op":"AND","children":[ {"op":null,"children":[],"value":{"path":{"path":"002@.0","tag":"002@","xtag":null,"occurrence":null,"subfields":{"type":"SINGLE","input":"0","codes":["0"]},"subfieldCodes":["0"] },"operator":"NOT_MATCH","value":"^L" } }, {"op":null,"children":[],"value":{"path":{"path":"002@.0","tag":"002@","xtag":null,"occurrence":null,"subfields":{"type":"SINGLE","input":"0","codes":["0"]},"subfieldCodes":["0"]},"operator":"NOT_MATCH","value":"^..[iktN]"}}, {"op":"OR","children":[{"op":null,"children":[],"value":{"path":{"path":"002@.0","tag":"002@","xtag":null,"occurrence":null,"subfields":{"type":"SINGLE","input":"0","codes":["0"]},"subfieldCodes":["0"]},"operator":"NOT_MATCH","value":"^.v"}},{"op":null,"children":[],"value":{"path":{"path":"021A.a","tag":"021A","xtag":null,"occurrence":null,"subfields":{"type":"SINGLE","input":"a","codes":["a"]},"subfieldCodes":["a"]},"operator":"EXIST","value":null}}],"value":null} ],"value":null },"empty":false },"ignorableFields":{"fields":["001@","001E","001L","001U","001U","001X","001X","002V","003C","003G","003Z","008G","017N","020F","027D","031B","037I","039V","042@","046G","046T","101@","101E","101U","102D","201E","201U","202D"],"empty":false },"stream":null,"defaultEncoding":null,"alephseqLineType":null,"picaIdField":"003@$0","picaSubfieldSeparator":"$","picaSchemaFile":null,"picaRecordTypeField":"002@$0","schemaType":"PICA","groupBy":null,"detailsFileName":"issue-details.csv","summaryFileName":"issue-summary.csv","format":"COMMA_SEPARATED","ignorableIssueTypes":["FIELD_UNDEFINED"],"pica":true,"replacementInControlFields":null,"marc21":false,"mqaf.version":"0.9.2","qa-catalogue.version":"0.7.0-SNAPSHOT"}
`id-groupid.csv`: the pairs of record identifiers - group identifiers.

id,groupId
010000011,0
010000011,77
010000011,2035
010000011,70
010000011,20
Currently, validation detects the following errors:
Leader specific errors:
- Leader/[position] has an invalid value: '[value]' (e.g. Leader/19 (leader19) has an invalid value: '4')
Control field specific errors:
- 006/[position] ([name]) contains an invalid code: '[code]' in '[value]' (e.g. 006/01-05 (tag006book01) contains an invalid code: 'n' in ' n ')
- 006/[position] ([name]) has an invalid value: '[value]' (e.g. 006/13 (tag006book13) has an invalid value: ' ')
- 007/[position] ([name]) contains an invalid code: '[code]' in '[value]'
- 007/[position] ([name]) has an invalid value: '[value]' (e.g. 007/01 (tag007microform01) has an invalid value: ' ')
- 008/[position] ([name]) contains an invalid code: '[code]' in '[value]' (e.g. 008/18-22 (tag008book18) contains an invalid code: 'u' in 'u ')
- 008/[position] ([name]) has an invalid value: '[value]' (e.g. 008/06 (tag008all06) has an invalid value: ' ')
Data field specific errors
- Unhandled tag(s): [tags] (e.g. Unhandled tag: 265)
- [tag] is not repeatable, however there are [number] instances
- [tag] has invalid subfield(s): [subfield codes] (e.g. 110 has invalid subfield: s)
- [tag]$[indicator] has invalid code: '[code]' (e.g. 110$ind1 has invalid code: '2')
- [tag]$[indicator] should be empty, it has '[code]' (e.g. 110$ind2 should be empty, it has '0')
- [tag]$[subfield code] is not repeatable, however there are [number] instances (e.g. 072$a is not repeatable, however there are 2 instances)
- [tag]$[subfield code] has an invalid value: [value] (e.g. 046$a has an invalid value: 'fb-----')
Errors of specific fields:
- 045$a error in '[value]': length is not 4 char (e.g. 045$a error in '2209668': length is not 4 char)
- 045$a error in '[value]': '[part]' does not match any patterns
- 880 should have subfield $a
- 880 refers to field [tag], which is not defined (e.g. 880 refers to field 590, which is not defined)
An example:
Error in ' 00000034 ': 110$ind1 has invalid code: '2'
Error in ' 00000056 ': 110$ind1 has invalid code: '2'
Error in ' 00000057 ': 082$ind1 has invalid code: ' '
Error in ' 00000086 ': 110$ind1 has invalid code: '2'
Error in ' 00000119 ': 700$ind1 has invalid code: '2'
Error in ' 00000234 ': 082$ind1 has invalid code: ' '
Errors in ' 00000294 ':
  050$ind2 has invalid code: ' '
  260$ind1 has invalid code: '0'
  710$ind2 has invalid code: '0'
  710$ind2 has invalid code: '0'
  710$ind2 has invalid code: '0'
  740$ind2 has invalid code: '1'
Error in ' 00000322 ': 110$ind1 has invalid code: '2'
Error in ' 00000328 ': 082$ind1 has invalid code: ' '
Error in ' 00000374 ': 082$ind1 has invalid code: ' '
Error in ' 00000395 ': 082$ind1 has invalid code: ' '
Error in ' 00000514 ': 082$ind1 has invalid code: ' '
Errors in ' 00000547 ':
  100$ind2 should be empty, it has '0'
  260$ind1 has invalid code: '0'
Errors in ' 00000571 ':
  050$ind2 has invalid code: ' '
  100$ind2 should be empty, it has '0'
  260$ind1 has invalid code: '0'
...
Usage:
catalogues/<catalogue>.sh validate-sqlite
or
./qa-catalogue --params="[options]" validate-sqlite
or
./common-script [options] validate-sqlite
[options] are the same as for validation
If the data is not grouped by libraries (no `--groupBy <path>` parameter), it creates the following SQLite3 database structure and imports some of the CSV files into it:
`issue_summary` table for the `issue-summary.csv`:

It represents a particular type of error.

id INTEGER,          -- identifier of the error
MarcPath TEXT,       -- the location of the error in the bibliographic record
categoryId INTEGER,  -- the identifier of the category of the error
typeId INTEGER,      -- the identifier of the type of the error
type TEXT,           -- the description of the type
message TEXT,        -- extra contextual information
url TEXT,            -- the url of the definition of the data element
instances INTEGER,   -- the number of instances this error occurred
records INTEGER      -- the number of records this error occurred in
`issue_details` table for the `issue-details.csv`:

Each row represents how many instances of an error occur in a particular bibliographic record.

id TEXT,             -- the record identifier
errorId INTEGER,     -- the error identifier (-> issue_summary.id)
instances INTEGER    -- the number of instances of an error in the record
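These tables can be queried directly. A sketch, assuming the `qa_catalogue.sqlite` file created by the sqlite step (the output path is a placeholder):

```bash
# the ten issue types affecting the most records
sqlite3 output/loc/qa_catalogue.sqlite \
  'SELECT s.MarcPath, s.type, SUM(d.instances) AS instances, COUNT(*) AS records
     FROM issue_details d
     JOIN issue_summary s ON s.id = d.errorId
    GROUP BY s.MarcPath, s.type
    ORDER BY records DESC
    LIMIT 10;'
```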
If the dataset is a union catalogue, and the record contains a subfield for the libraries holding the item (there is a `--groupBy <path>` parameter), it creates the following SQLite3 database structure and imports some of the CSV files into it:
`issue_summary` table for the `issue-summary.csv` (it is similar to the other issue_summary table, but it has an extra `groupId` column):

groupId INTEGER,
id INTEGER,
MarcPath TEXT,
categoryId INTEGER,
typeId INTEGER,
type TEXT,
message TEXT,
url TEXT,
instances INTEGER,
records INTEGER

`issue_details` table (same as the other `issue_details` table):

id TEXT,
errorId INTEGER,
instances INTEGER

`id_groupid` table for `id-groupid.csv`:

id TEXT,
groupId INTEGER

`issue_group_types` table contains statistics for the error types per group:

groupId INTEGER,
typeId INTEGER,
records INTEGER,
instances INTEGER

`issue_group_categories` table contains statistics for the error categories per group:

groupId INTEGER,
categoryId INTEGER,
records INTEGER,
instances INTEGER

`issue_group_paths` table contains statistics for the error types per path per group:

groupId INTEGER,
typeId INTEGER,
path TEXT,
records INTEGER,
instances INTEGER
For union catalogues it also creates an extra Solr index with the suffix `_validation`. It contains one Solr document for each bibliographic record with three fields: the record identifier, the list of group identifiers and the list of error identifiers (if any). This Solr index is needed for populating the `issue_group_types`, `issue_group_categories` and `issue_group_paths` tables. This index will be ingested into the main Solr index.
java -cp $JAR de.gwdg.metadataqa.marc.cli.Formatter [options] <file>
or with a bash script
./formatter [options] <file>
options:
- general parameters
- `-f`, `--format`: the MARC output format
  - if not set, the output format follows the examples in the MARC21 documentation (see the example below)
  - `xml`: the output will be MARCXML
- `-c <number>`, `-countNr <number>`: count number of the record (e.g. 1 means the first record)
- `-s [path=query]`, `-search [path=query]`: print records matching the query. The query part is the content of the element. The path should be one of the following types:
  - control field tag (e.g. `001`, `002`, `003`)
  - control field position (e.g. `Leader/0`, `008/1-2`)
  - data field (`655$2`, `655$ind1`)
  - named control field position (`tag006book01`)
- `-l <selector>`, `--selector <selector>`: one or more MarcSpec or PICA Filter selectors, separated by the ';' (semicolon) character
- `-w`, `--withId`: the generated CSV should contain the record ID as the first field (default is turned off)
- `-p <separator>`, `--separator <separator>`: separator between the parts (default: TAB)
- `-e <file>`, `--fileName <file>`: the name of the report the program produces (default: `extracted.csv`)
- `-A <identifiers>`, `--ids <identifiers>`: a comma separated list of record identifiers
The output of displaying a single MARC record is something like this one:
LEADER 01697pam a2200433 c 4500
001 1023012219
003 DE-101
005 20160912065830.0
007 tu
008 120604s2012 gw ||||| |||| 00||||ger
015 $a14,B04$z12,N24$2dnb
016 7 $2DE-101$a1023012219
020 $a9783860124352$cPp. : EUR 19.50 (DE), EUR 20.10 (AT)$9978-3-86012-435-2
024 3 $a9783860124352
035 $a(DE-599)DNB1023012219
035 $a(OCoLC)864553265
035 $a(OCoLC)864553328
040 $a1145$bger$cDE-101$d1140
041 $ager
044 $cXA-DE-SN
082 04$81\u$a622.0943216$qDE-101$222/ger
083 7 $a620$a660$qDE-101$222sdnb
084 $a620$a660$qDE-101$2sdnb
085 $81\u$b622
085 $81\u$z2$s43216
090 $ab
110 1 $0(DE-588)4665669-8$0http://d-nb.info/gnd/4665669-8$0(DE-101)963486896$aHalsbrücke$4aut
245 00$aHalsbrücke$bzur Geschichte von Gemeinde, Bergbau und Hütten$chrsg. von der Gemeinde Halsbrücke anlässlich des Jubliäums "400 Jahre Hüttenstandort Halsbrücke". [Hrsg.: Ulrich Thiel]
264 1$a[Freiberg]$b[Techn. Univ. Bergakad.]$c2012
300 $a151 S.$bIll., Kt.$c31 cm, 1000 g
653 $a(Produktform)Hardback
653 $aGemeinde Halsbrücke
653 $aHüttengeschichte
653 $aFreiberger Bergbau
653 $a(VLB-WN)1943: Hardcover, Softcover / Sachbücher/Geschichte/Regionalgeschichte, Ländergeschichte
700 1 $0(DE-588)1113208554$0http://d-nb.info/gnd/1113208554$0(DE-101)1113208554$aThiel, Ulrich$d1955-$4edt$eHrsg.
850 $aDE-101a$aDE-101b
856 42$mB:DE-101$qapplication/pdf$uhttp://d-nb.info/1023012219/04$3Inhaltsverzeichnis
925 r $arb
An example for extracting values:
./formatter --selector "008~7-10;008~0-5" \
  --defaultRecordType BOOKS \
  --separator "," \
  --outputDir ${OUTPUT_DIR} \
  --fileName marc-history.csv \
  ${MARC_DIR}/*.mrc
It will put the output into ${OUTPUT_DIR}/marc-history.csv.
Counts basic statistics about the data elements available in the catalogue.
Usage:
java -cp $JAR de.gwdg.metadataqa.marc.cli.Completeness [options] <file>
or with a bash script
./completeness [options] <file>
or
catalogues/<catalogue>.sh completeness
or
./qa-catalogue --params="[options]" completeness
options:
- general parameters
  - `-R <format>`, `--format <format>`: format specification of the output. Possible values are:
    - `tab-separated` or `tsv`
    - `comma-separated` or `csv`
    - `text` or `txt`
    - `json`
  - `-V`, `--advanced`: advanced mode (not yet implemented)
  - `-P`, `--onlyPackages`: only packages (not yet implemented)
Output files:
`marc-elements.csv`: a list of MARC elements (field$subfield) and their occurrences in two ways:
- `documenttype`: the document types found in the dataset. There is an extra document type, `all`, representing all records
- `path`: the notation of the data element
- `packageid` and `package`: each path belongs to one package, such as Control Fields, and each package has an internal identifier
- `tag`: the label of the tag
- `subfield`: the label of the subfield
- `number-of-record`: how many records the element is available in
- `number-of-instances`: how many instances there are in total (some records might contain more than one instance, while others don't have them at all)
- `min`, `max`, `mean`, `stddev`: the minimum, maximum, mean and standard deviation of the number of instances per record (as floating point numbers)
- `histogram`: the histogram of the instances (`1=1; 2=1` means: a single instance is available in one record, two instances are available in one record)
documenttype | path | packageid | package | tag | subfield | number-of-record | number-of-instances | min | max | mean | stddev | histogram |
---|---|---|---|---|---|---|---|---|---|---|---|---|
all | leader23 | 0 | Control Fields | Leader | Undefined | 1099 | 1099 | 1 | 1 | 1.0 | 0.0 | 1=1099 |
all | leader22 | 0 | Control Fields | Leader | Length of the implementation-defined portion | 1099 | 1099 | 1 | 1 | 1.0 | 0.0 | 1=1099 |
all | leader21 | 0 | Control Fields | Leader | Length of the starting-character-position portion | 1099 | 1099 | 1 | 1 | 1.0 | 0.0 | 1=1099 |
all | 110$a | 2 | Main Entry | Main Entry - Corporate Name | Corporate name or jurisdiction name as entry element | 4 | 4 | 1 | 1 | 1.0 | 0.0 | 1=4 |
all | 340$b | 5 | Physical Description | Physical Medium | Dimensions | 2 | 3 | 1 | 2 | 1.5 | 0.3535533905932738 | 1=1; 2=1 |
all | 363$a | 5 | Physical Description | Normalized Date and Sequential Designation | First level of enumeration | 1 | 1 | 1 | 1 | 1.0 | 0.0 | 1=1 |
all | 340$a | 5 | Physical Description | Physical Medium | Material base and configuration | 2 | 3 | 1 | 2 | 1.5 | 0.3535533905932738 | 1=1; 2=1 |
`packages.csv`: the completeness of packages.
- `documenttype`: the document type of the record
- `packageid`: the identifier of the package
- `name`: name of the package
- `label`: label of the package
- `iscoretag`: whether the package belongs to the Library of Congress MARC standard
- `count`: the number of records having at least one data element from this package
documenttype | packageid | name | label | iscoretag | count |
---|---|---|---|---|---|
all | 1 | 01X-09X | Numbers and Code | true | 1099 |
all | 2 | 1XX | Main Entry | true | 816 |
all | 6 | 4XX | Series Statement | true | 358 |
all | 5 | 3XX | Physical Description | true | 715 |
all | 8 | 6XX | Subject Access | true | 514 |
all | 4 | 25X-28X | Edition, Imprint | true | 1096 |
all | 7 | 5XX | Note | true | 354 |
all | 0 | 00X | Control Fields | true | 1099 |
all | 99 | unknown | unknown origin | false | 778 |
`libraries.csv`: lists the content of 852$a (it is useful only if the catalog is an aggregated catalog)
- `library`: the code of a library
- `count`: the number of records having a particular library code
library | count |
---|---|
"00Mf" | 713 |
"British Library" | 525 |
"Inserted article about the fires from the Courant after the title page." | 1 |
"National Library of Scotland" | 310 |
"StEdNL" | 1 |
"UkOxU" | 33 |
`libraries003.csv`: lists the content of the 003 field (it is useful only if the catalog is an aggregated catalog)
- `library`: the code of a library
- `count`: the number of records having a particular library code
library | count |
---|---|
"103861" | 1 |
"BA-SaUP" | 143 |
"BoCbLA" | 25 |
"CStRLIN" | 110 |
"DLC" | 3 |
`completeness.params.json`: the list of the actual parameters used in the analysis

An example with parameters used for analysing a MARC dataset. When the input is a complex expression it is displayed here in a parsed format. It also contains some metadata such as the versions of the MQAF API and QA catalogue.
{"args":["/path/to/input.xml.gz"],"marcVersion":"MARC21","marcFormat":"XML","dataSource":"FILE","limit":-1,"offset":-1,"id":null,"defaultRecordType":"BOOKS","alephseq":false,"marcxml":true,"lineSeparated":false,"trimId":false,"outputDir":"/path/to/_output/","recordIgnorator":{"conditions":null,"empty":true },"recordFilter":{"conditions":null,"empty":true },"ignorableFields":{"fields":null,"empty":true },"stream":null,"defaultEncoding":null,"alephseqLineType":null,"picaIdField":"003@$0","picaSubfieldSeparator":"$","picaSchemaFile":null,"picaRecordTypeField":"002@$0","schemaType":"MARC21","groupBy":null,"groupListFile":null,"format":"COMMA_SEPARATED","advanced":false,"onlyPackages":false,"replacementInControlFields":"#","marc21":true,"pica":false,"mqaf.version":"0.9.2","qa-catalogue.version":"0.7.0"}
For union catalogues the `marc-elements.csv` and `packages.csv` have a special version:

`completeness-grouped-marc-elements.csv`: the same as `marc-elements.csv` but with an extra element, `groupId`
- `groupId`: the library identifier available in the data element specified by the `--groupBy` parameter. `0` has a special meaning: all libraries
groupId | documenttype | path | packageid | package | tag | subfield | number-of-record | number-of-instances | min | max | mean | stddev | histogram |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
350 | all | 044K$9 | 50 | PICA+ bibliographic description | "Schlagwortfolgen (GBV, SWB, K10plus)" | PPN | 1 | 1 | 1 | 1 | 1.0 | 0.0 | 1=1 |
350 | all | 044K$7 | 50 | PICA+ bibliographic description | "Schlagwortfolgen (GBV, SWB, K10plus)" | Vorläufiger Link | 1 | 1 | 1 | 1 | 1.0 | 0.0 | 1=1 |
`completeness-grouped-packages.csv`: the same as `packages.csv` but with an extra element, `group`
- `group`: the library identifier available in the data element specified by the `--groupBy` parameter. `0` has a special meaning: all libraries
group | documenttype | packageid | name | label | iscoretag | count |
---|---|---|---|---|---|---|
0 | Druckschriften (einschließlich Bildbänden) | 50 | 0... | PICA+ bibliographic description | false | 987 |
0 | Druckschriften (einschließlich Bildbänden) | 99 | unknown | unknown origin | false | 3 |
0 | Medienkombination | 50 | 0... | PICA+ bibliographic description | false | 1 |
0 | Mikroform | 50 | 0... | PICA+ bibliographic description | false | 11 |
0 | Tonträger, Videodatenträger, Bildliche Darstellungen | 50 | 0... | PICA+ bibliographic description | false | 1 |
0 | all | 50 | 0... | PICA+ bibliographic description | false | 1000 |
0 | all | 99 | unknown | unknown origin | false | 3 |
100 | Druckschriften (einschließlich Bildbänden) | 50 | 0... | PICA+ bibliographic description | false | 20 |
100 | Medienkombination | 50 | 0... | PICA+ bibliographic description | false | 1 |
`completeness-groups.csv`: this is available for union catalogues, containing the groups
- `id`: the group identifier
- `group`: the name of the library
- `count`: the number of records from the particular library
id | group | count |
---|---|---|
0 | all | 1000 |
100 | Otto-von-Guericke-Universität, Universitätsbibliothek Magdeburg [DE-Ma9] | 21 |
1003 | Kreisarchäologie Rotenburg [DE-MUS-125322...] | 1 |
101 | Otto-von-Guericke-Universität, Universitätsbibliothek, Medizinische Zentralbibliothek (MZB), Magdeburg [DE-Ma14...] | 6 |
1012 | Mariengymnasium Jever [DE-Je1] | 19 |
`id-groupid.csv`: this is the very same file that validation creates. Completeness creates it if it is not yet available.
The `completeness-sqlite` step (which is launched by the `completeness` step, but could be launched independently as well) imports the `marc-elements.csv` or `completeness-grouped-marc-elements.csv` file into the `marc_elements` table. For catalogues without the `--groupBy` parameter the `groupId` column will be filled with `0`.
groupId INTEGER,
documenttype TEXT,
path TEXT,
packageid INTEGER,
package TEXT,
tag TEXT,
subfield TEXT,
number-of-record INTEGER,
number-of-instances INTEGER,
min INTEGER,
max INTEGER,
mean REAL,
stddev REAL,
histogram TEXT
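Once imported, the table can be queried directly; a sketch, assuming the `marc_elements` table ends up in the same `qa_catalogue.sqlite` database file (the output path is a placeholder):

```bash
# the ten data elements present in the fewest records
sqlite3 output/loc/qa_catalogue.sqlite \
  'SELECT path, "number-of-record" AS records
     FROM marc_elements
    ORDER BY records ASC
    LIMIT 10;'
```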
Kelly Thompson and Stacie Traill recently published their approach to calculating the quality of ebook records coming from different data sources. This analysis is an implementation of the scoring algorithm described in their article Leveraging Python to improve ebook metadata selection, ingest, and management. In Code4Lib Journal, Issue 38, 2017-10-18. http://journal.code4lib.org/articles/12828
java -cp $JAR de.gwdg.metadataqa.marc.cli.ThompsonTraillCompleteness [options] <file>
or with a bash script
./tt-completeness [options] <file>
or
catalogues/[catalogue].sh tt-completeness
or
./qa-catalogue --params="[options]" tt-completeness
options:
- general parameters
  - `-F <file>`, `--fileName <file>`: the name of the report the program produces. Default is `tt-completeness.csv`.
It produces a CSV file like this:
id,ISBN,Authors,Alternative Titles,Edition,Contributors,Series,TOC,Date 008,Date 26X,LC/NLM, \
  LoC,Mesh,Fast,GND,Other,Online,Language of Resource,Country of Publication,noLanguageOrEnglish, \
  RDA,total
"010002197",0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,4
"01000288X",0,0,1,0,0,1,0,1,2,0,0,0,0,0,0,0,0,0,0,0,5
"010004483",0,0,1,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,5
"010018883",0,0,0,0,1,0,0,1,2,0,0,0,0,0,0,0,1,1,0,0,6
"010023623",0,0,3,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,7
"010027734",0,0,3,0,1,2,0,1,2,0,0,0,0,0,0,0,1,0,0,0,10
This analysis is the implementation of the following paper:
Emma Booth (2020) Quality of Shelf-Ready Metadata. Analysis of survey responses and recommendations for suppliers. Pontefract (UK): National Acquisitions Group, 2020. p 31. https://nag.org.uk/wp-content/uploads/2020/06/NAG-Quality-of-Shelf-Ready-Metadata-Survey-Analysis-and-Recommendations_FINAL_June2020.pdf
The main purpose of the report is to highlight which fields of the printed and electronic book records are important when the records are coming from different suppliers. 50 libraries participated in the survey, and each selected which fields are important to them. The report listed those fields which get the highest scores.

The current calculation is based on this list of essential fields. If all data elements specified are available in the record, it gets the full score; if only some of them, it gets a proportional score. E.g. under 250 (edition statement) there are two subfields. If both are available, it gets score 44. If only one of them, it gets half of it, 22, and if none, it gets 0. For 1XX, 6XX, 7XX and 8XX the record gets the full score if at least one of those fields (with subfield $a) is available. The total score is the average. The theoretical maximum score would be 28.44, which could be achieved if all the data elements are available in the record.
java -cp $JAR de.gwdg.metadataqa.marc.cli.ShelfReadyCompleteness [options] <file>
with a bash script
./shelf-ready-completeness [options] <file>
or
catalogues/[catalogue].sh shelf-ready-completeness
or
./qa-catalogue --params="[options]" shelf-ready-completeness
options:
- general parameters
  - `-F <file>`, `--fileName <file>`: the report file name (default is `shelf-ready-completeness.csv`)
These scores are calculated for each continuing resource (the type of record (LDR/6) is language material ('a') and the bibliographic level (LDR/7) is serial component part ('b'), integrating resource ('i') or serial ('s')).
The calculation is based on a slightly modified version of the method publishedby Jamie Carlstone in the following paper:
Jamie Carlstone (2017) Scoring the Quality of E-Serials MARC Records Using Java, Serials Review, 43:3-4, pp. 271-277, DOI: 10.1080/00987913.2017.1350525. URL: https://www.tandfonline.com/doi/full/10.1080/00987913.2017.1350525
java -cp $JAR de.gwdg.metadataqa.marc.cli.SerialScore [options] <file>
with a bash script
./serial-score [options] <file>
or
catalogues/[catalogue].sh serial-score
or
./qa-catalogue --params="[options]" serial-score
options:
- general parameters
  - `-F <file>`, `--fileName <file>`: the report file name. Default is `shelf-ready-completeness.csv`.
The Functional Requirements for Bibliographic Records (FRBR) document's main part defines the primary and secondary entities which became famous as the FRBR models. Years later Tom Delsey created a mapping between the 12 functions and the individual MARC elements.

Tom Delsey (2002) Functional analysis of the MARC 21 bibliographic and holdings formats. Tech. report. Library of Congress, 2002. Prepared for the Network Development and MARC Standards Office, Library of Congress. Second Revision: September 17, 2003. https://www.loc.gov/marc/marc-functional-analysis/original_source/analysis.pdf.

This analysis shows how these functions are supported by the records. Low support means that only a small portion of the fields supporting a function are available in the records; strong support, on the contrary, means lots of fields are available. The analysis calculates the support of the 12 functions for each record, and returns summary statistics.

It is an experimental feature because it turned out that the mapping covers about 2000 elements (fields, subfields, indicators etc.), while an average record contains at most several hundred elements, with the result that even the best record has only about 10-15% of the totality of the elements supporting a given function. So the tool doesn't show you exact numbers, and the scale is not 0-100 but 0-[best score], which is different for every catalogue.
The 12 functions:

Discovery functions
- search (DiscoverySearch): Search for a resource corresponding to stated criteria (i.e., to search either a single entity or a set of entities using an attribute or relationship of the entity as the search criteria).
- identify (DiscoveryIdentify): Identify a resource (i.e., to confirm that the entity described or located corresponds to the entity sought, or to distinguish between two or more entities with similar characteristics).
- select (DiscoverySelect): Select a resource that is appropriate to the user’s needs (i.e., to choose an entity that meets the user’s requirements with respect to content, physical format, etc., or to reject an entity as being inappropriate to the user’s needs).
- obtain (DiscoveryObtain): Access a resource either physically or electronically through an online connection to a remote computer, and/or acquire a resource through purchase, licence, loan, etc.

Usage functions
- restrict (UseRestrict): Control access to or use of a resource (i.e., to restrict access to and/or use of an entity on the basis of proprietary rights, administrative policy, etc.).
- manage (UseManage): Manage a resource in the course of acquisition, circulation, preservation, etc.
- operate (UseOperate): Operate a resource (i.e., to open, display, play, activate, run, etc. an entity that requires specialized equipment, software, etc. for its operation).
- interpret (UseInterpret): Interpret or assess the information contained in a resource.

Management functions
- identify (ManagementIdentify): Identify a record, segment, field, or data element (i.e., to differentiate one logical data component from another).
- process (ManagementProcess): Process a record, segment, field, or data element (i.e., to add, delete, replace, output, etc. a logical data component by means of an automated process).
- sort (ManagementSort): Sort a field for purposes of alphabetic or numeric arrangement.
- display (ManagementDisplay): Display a field or data element (i.e., to display a field or data element with the appropriate print constant or as a tracing).
java -cp $JAR de.gwdg.metadataqa.marc.cli.FunctionalAnalysis [options] <file>
with a bash script
./functional-analysis [options] <file>
or
catalogues/<catalogue>.sh functional-analysis
or
./qa-catalogue --params="[options]" functional-analysis
options:
Output files:
- `functional-analysis.csv`: the list of the 12 functions and their average count (number of supporting fields) and average score (percentage of all supporting fields available in the record)
- `functional-analysis-mapping.csv`: the mapping of functions and data elements
- `functional-analysis-histogram.csv`: the histogram of scores and counts of records for each function (e.g. there are x records which have score j for function a)
It analyses the coverage of subject indexing/classification in the catalogue. It checks specific fields which might have subject indexing information, and provides details about how and which subject indexing schemes have been applied.
java -cp $JAR de.gwdg.metadataqa.marc.cli.ClassificationAnalysis [options] <file>
Rscript scripts/classifications/classifications-type.R <output directory>
with a bash script
./classifications [options] <file>
Rscript scripts/classifications/classifications-type.R <output directory>
or
catalogues/[catalogue].sh classifications
or
./qa-catalogue --params="[options]" classifications
options:
- general parameters
  - `-w`, `--emptyLargeCollectors`: empty large collectors periodically. It is a memory optimization parameter; turn it on if you run into a memory problem.
The output is a set of files:
- `classifications-by-records.csv`: general overview of how many records have any subject indexing
- `classifications-by-schema.csv`: which subject indexing schemas are available in the catalogues (such as DDC, UDC, MESH etc.) and where they are referred to
- `classifications-histogram.csv`: a frequency distribution of the number of subjects available in records (x records have 0 subjects, y records have 1 subject, z records have 2 subjects etc.)
- `classifications-frequency-examples.csv`: examples for particular distributions (one record ID which has 0 subjects, one which has 1 subject, etc.)
- `classifications-by-schema-subfields.csv`: the distribution of subfields of those fields which contain subject indexing information. It gives you a background on what other contextual information behind the subject term is available (such as the version of the subject indexing scheme)
- `classifications-collocations.csv`: how many records have a particular set of subject indexing schemes
- `classifications-by-type.csv`: returns the subject indexing schemes and their types in order of the number of records. The types are TERM_LIST (subtypes: DICTIONARY, GLOSSARY, SYNONYM_RING), METADATA_LIKE_MODEL (NAME_AUTHORITY_LIST, GAZETTEER), CLASSIFICATION (SUBJECT_HEADING, CATEGORIZATION, TAXONOMY, CLASSIFICATION_SCHEME), RELATIONSHIP_MODEL (THESAURUS, SEMANTIC_NETWORK, ONTOLOGY).
It analyses the coverage of authority names (persons, organisations, events, uniform titles) in the catalogue. It checks specific fields which might have authority names, and provides details about how and which schemes have been applied.
java -cp $JAR de.gwdg.metadataqa.marc.cli.AuthorityAnalysis [options] <file>
with a bash script
./authorities [options] <file>
or
catalogues/<catalogue>.sh authorities
or
./qa-catalogue --params="[options]" authorities
options:
- general parameters
- -w, --emptyLargeCollectors: empty large collectors periodically. It is a memory optimization parameter; turn it on if you run into a memory problem.
The output is a set of files:
- authorities-by-records.csv: general overview of how many records have any authority names
- authorities-by-schema.csv: which authority name schemas are available in the catalogue (such as ISNI, Gemeinsame Normdatei etc.) and where they are referenced
- authorities-histogram.csv: a frequency distribution of the number of authority names available in records (x records have 0 authority names, y records have 1 authority name, z records have 2 authority names etc.)
- authorities-frequency-examples.csv: examples for particular distributions (one record ID which has 0 authority names, one which has 1 authority name, etc.)
- authorities-by-schema-subfields.csv: the distribution of subfields of those fields which contain authority name information. It shows what other contextual information is available alongside the authority names (such as the version of the authority name scheme)
This analysis reveals the relative importance of some fields. The Pareto distribution is a kind of power-law distribution, and the Pareto rule (the 80-20 rule) states that 80% of outcomes are due to 20% of causes. In a catalogue the outcome is the total number of occurrences of data elements, and the causes are the individual data elements. In catalogues some data elements occur much more frequently than others. This analysis highlights the distribution of the data elements: whether or not it is similar to a Pareto distribution.
It produces charts for each document type and one for the whole catalogue showing the field frequency patterns. Each chart shows a line which is the function of field frequency: on the X-axis the subfields are ordered by frequency (how many times a given subfield occurs in the whole catalogue), from the most frequent top 1% to the least frequent 1% of subfields. The Y-axis represents the cumulative occurrence (from 0% to 100%).
Before running it you should first run the completeness calculation.
With a bash script
catalogues/[catalogue].sh pareto
or
./qa-catalogue --params="[options]" pareto
options:
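Since the Pareto charts are derived from the completeness results, a typical sequence looks like the sketch below (assuming your catalogue configuration is catalogues/loc.sh and that the completeness calculation mentioned above is run with a completeness task):

```bash
# run the completeness calculation first; the Pareto analysis reads its output
catalogues/loc.sh completeness

# then generate the Pareto charts
catalogues/loc.sh pareto
```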
This analysis is based on Benjamin Schmidt's blog post A brief visual history of MARC cataloging at the Library of Congress (Tuesday, May 16, 2017).

It produces a chart where the Y-axis is based on the "date entered on file" data element that indicates the date the MARC record was created (008/00-05), and the X-axis is based on the "Date 1" element (008/07-10).
Usage:
catalogues/[catalogue].sh marc-history
or
./qa-catalogue --params="[options]" marc-history
options:
This is just a helper function which imports the results of validation into an SQLite3 database.

The prerequisite of this step is to run validation first, since it uses the files produced there. If you run validation with the catalogues/<catalogue>.sh or ./qa-catalogue scripts, this importing step is already covered there.
Usage:
catalogues/[catalogue].sh sqlite
or
./qa-catalogue --params="[options]" sqlite
options:
Output:
- qa_catalogue.sqlite: the SQLite3 database with 3 tables: issue_details, issue_groups, and issue_summary.
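Once the database has been created you can inspect it with the sqlite3 command line client. A minimal sketch (the output/loc path is an assumption; the table names come from the list above):

```bash
# open the database produced by the sqlite step
sqlite3 output/loc/qa_catalogue.sqlite <<'EOF'
-- list the available tables
.tables
-- show the structure of the summary table
.schema issue_summary
-- peek at the first few rows of the summary
SELECT * FROM issue_summary LIMIT 5;
EOF
```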
Run indexer:
java -cp $JAR de.gwdg.metadataqa.marc.cli.MarcToSolr [options] [file]
With script:
catalogues/[catalogue].sh all-solr
or
./qa-catalogue --params="[options]" all-solr
options:
- general parameters
- -S <URL>, --solrUrl <URL>: the URL of the Solr server including the core (e.g. http://localhost:8983/solr/loc)
- -A, --doCommit: send commits to Solr regularly (not needed if you set up Solr as described below)
- -T <type>, --solrFieldType <type>: a Solr field type, one of the predefined values. See examples below.
  - marc-tags - the field names are MARC codes
  - human-readable - the field names are Self Descriptive MARC codes
  - mixed - the field names are mixed of the above (e.g. 245a_Title_mainTitle)
- -C, --indexWithTokenizedField: index data elements as tokenized fields as well (each bibliographical data element will be indexed twice: once as a phrase (fields suffixed with _ss), and once as a bag of words (fields suffixed with _txt)) [This parameter is available from v0.8.0]
- -D <int>, --commitAt <int>: commit index after this number of records [This parameter is available from v0.8.0]
- -E, --indexFieldCounts: index the count of field instances [This parameter is available from v0.8.0]
- -G, --indexSubfieldCounts: index the count of subfield instances [This parameter is available from v0.8.0]
- -F <arg>, --fieldPrefix <arg>: field prefix
The ./index file (which is used by the catalogues/[catalogue].sh and ./qa-catalogue scripts) has additional parameters:
- -Z <core>, --core <core>: The index name (core). If not set it will be extracted from the solrUrl parameter
- -Y <path>, --file-path <path>: File path
- -X <mask>, --file-mask <mask>: File mask
- -W, --purge: Purge index and exit
- -V, --status: Show the status of index(es) and exit
- -U, --no-delete: Do not delete documents in the index before starting indexing (by default the script clears the index)
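For example, an indexing run against a local Solr instance with the mixed field mapping and field counts enabled could look like the sketch below (the Solr URL, core name and catalogue setup are assumptions; adjust them to your installation):

```bash
./qa-catalogue \
  --params="--solrUrl http://localhost:8983/solr/loc --solrFieldType mixed --indexFieldCounts" \
  all-solr
```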
QA catalogue builds a Solr index which contains a) a set of fixed Solr fields that are the same for all bibliographic input, and b) Solr fields that depend on the field names of the metadata schema (MARC, PICA, UNIMARC etc.) - these fields should be mapped from the metadata schema to dynamic Solr fields by an algorithm.
- id: the record ID. This comes from the identifier of the bibliographic record, so 001 for MARC21
- record_sni: the JSON representation of the bibliographic record
- groupId_is: the list of group IDs. The content comes from the data element specified by the --groupBy parameter, split by commas (',')
- errorId_is: the list of error IDs that come from the result of the validation
The mapped fields are Solr fields that depend on the field names of the metadata schema. The final Solr field name is composed of a field prefix, a mapped value and a field suffix, as described below:
Field prefix:
With the --fieldPrefix parameter you can set a prefix that is applied to the variable fields. This might be needed because Solr has a limitation: field names starting with a number cannot be used in some Solr parameters, such as fl (the field list selected to be retrieved from the index). Unfortunately bibliographic schemas use field names starting with numbers. You can change the mapping parameter to produce a mapped value that resembles the BIBFRAME mapping of the MARC21 field, but not all fields have such a human readable association.
Field suffixes:
- *_sni: not indexed, stored string fields -- good for storing fields used for displaying information
- *_ss: not parsed, stored, indexed string fields -- good for display and facets
- *_tt: parsed, not stored, indexed string fields -- good for term searches (these fields will be available if the --indexWithTokenizedField parameter is applied)
- *_is: parsed, not stored, indexed integer fields -- good for searching for numbers, such as error or group identifiers (these fields will be available if the --indexFieldCounts parameter is applied)
The mapped value
With --solrFieldType you can select the algorithm that generates the mapped value. Right now there are three formats:

- marc-tags - the field names are MARC codes (245$a → 245a)
- human-readable - the field names are Self Descriptive MARC codes (245$a → Title_mainTitle)
- mixed - the field names are mixed of the above (e.g. 245a_Title_mainTitle)

marc-tags example:

"100a_ss":["Jung-Baek, Myong Ja"],
"100ind1_ss":["Surname"],
"245c_ss":["Vorgelegt von Myong Ja Jung-Baek."],
"245ind2_ss":["No nonfiling characters"],
"245a_ss":["S. Tret'jakov und China /"],
"245ind1_ss":["Added entry"],
"260c_ss":["1987."],
"260b_ss":["Georg-August-Universität Göttingen,"],
"260a_ss":["Göttingen :"],
"260ind1_ss":["Not applicable/No information provided/Earliest available publisher"],
"300a_ss":["141 p."],

human-readable example:

"MainPersonalName_type_ss":["Surname"],
"MainPersonalName_personalName_ss":["Jung-Baek, Myong Ja"],
"Title_responsibilityStatement_ss":["Vorgelegt von Myong Ja Jung-Baek."],
"Title_mainTitle_ss":["S. Tret'jakov und China /"],
"Title_titleAddedEntry_ss":["Added entry"],
"Title_nonfilingCharacters_ss":["No nonfiling characters"],
"Publication_sequenceOfPublishingStatements_ss":["Not applicable/No information provided/Earliest available publisher"],
"Publication_agent_ss":["Georg-August-Universität Göttingen,"],
"Publication_place_ss":["Göttingen :"],
"Publication_date_ss":["1987."],
"PhysicalDescription_extent_ss":["141 p."],

mixed example:

"100a_MainPersonalName_personalName_ss":["Jung-Baek, Myong Ja"],
"100ind1_MainPersonalName_type_ss":["Surname"],
"245a_Title_mainTitle_ss":["S. Tret'jakov und China /"],
"245ind1_Title_titleAddedEntry_ss":["Added entry"],
"245ind2_Title_nonfilingCharacters_ss":["No nonfiling characters"],
"245c_Title_responsibilityStatement_ss":["Vorgelegt von Myong Ja Jung-Baek."],
"260b_Publication_agent_ss":["Georg-August-Universität Göttingen,"],
"260a_Publication_place_ss":["Göttingen :"],
"260ind1_Publication_sequenceOfPublishingStatements_ss":["Not applicable/No information provided/Earliest available publisher"],
"260c_Publication_date_ss":["1987."],
"300a_PhysicalDescription_extent_ss":["141 p."],
A distinct project, metadata-qa-marc-web, provides a web application that utilizes this type of Solr index in a number of ways (a faceted search interface, term lists, search for validation errors etc.).
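Once the records are indexed, the mapped fields can be queried like any other Solr field. A sketch of a facet query against a local core (the URL and core name are assumptions; the field name comes from the human-readable example above):

```bash
# count records per name type using the human-readable style field
curl 'http://localhost:8983/solr/loc/select?q=*:*&rows=0&facet=true&facet.field=MainPersonalName_type_ss'
```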
The tool uses different Solr indices (aka cores) to store information. In the following example we use loc as the name of our catalogue. There are two main indices: loc and loc_dev. loc_dev is the target of the index process; it will be created from scratch. During the process loc is available and searchable. When the indexing has successfully finished, these two indices are swapped, so the previous loc becomes loc_dev, and the new index becomes loc. The web user interface will always use the latest version (not the dev one).
Besides these two indices there is a third index that contains different kinds of results of the analyses. At the time of writing it contains only the results of validation, but later it will cover other information as well. It can be set by the following parameter:

- -4, --solrForScoresUrl <arg>: the URL of the Solr server used to store scores (it is populated in the validate-sqlite process which runs after validation)
During the indexing process the content of this index is merged into the _dev index, so after the process has finished successfully this index is not needed anymore.
In order to make the automation easier and still flexible there are some auxiliary commands:

- ./qa-catalogue prepare-solr: creates these two indices and makes sure that their schemas contain the necessary fields
- ./qa-catalogue index: runs the indexing process
- ./qa-catalogue postprocess-solr: swaps the two Solr cores (the main core and the _dev core)
- ./qa-catalogue all-solr: runs all three steps
If you would like to maintain the Solr index yourself (e.g. because the Solr instance runs in a cloud environment), you should skip prepare-solr and postprocess-solr, and run only index. For maintaining the schema you can find a minimal viable schema among the test resources.
You can set autocommit the following way in solrconfig.xml (inside Solr):
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <maxDocs>5000</maxDocs>
  <openSearcher>true</openSearcher>
</autoCommit>
...
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>
This is needed if you choose to prevent QA catalogue from issuing commits (see the --commitAt parameter), which makes indexing faster.
In schema.xml (or in Solr web interface) you should be sure that you have the following dynamic fields:
<dynamicField name="*_ss" type="strings" indexed="true" stored="true"/>
<dynamicField name="*_tt" type="text_general" indexed="true" stored="false"/>
<dynamicField name="*_is" type="pints" indexed="true" stored="true"/>
<dynamicField name="*_sni" type="string_big" docValues="false" multiValued="false" indexed="false" stored="true"/>
<copyField source="*_ss" dest="_text_"/>
or use Solr API:
NAME=dnb
SOLR=http://localhost:8983/solr/$NAME/schema

# add dynamic field
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-dynamic-field":{
    "name":"*_sni",
    "type":"string",
    "indexed":false,
    "stored":true}
}' $SOLR

# add copy field
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-copy-field":{
    "source":"*_ss",
    "dest":["_text_"]}
}' $SOLR
...
See the solr-functions file for the full code.
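Following the same pattern, the remaining dynamic fields from the schema.xml snippet above can also be added through the Schema API. A sketch (the solr-functions file already covers this; the core name is an assumption):

```bash
SOLR=http://localhost:8983/solr/loc/schema

# stored string fields used for display and facets
curl -X POST -H 'Content-type:application/json' --data-binary \
  '{"add-dynamic-field":{"name":"*_ss","type":"strings","indexed":true,"stored":true}}' $SOLR

# tokenized fields used for term searches
curl -X POST -H 'Content-type:application/json' --data-binary \
  '{"add-dynamic-field":{"name":"*_tt","type":"text_general","indexed":true,"stored":false}}' $SOLR

# integer fields used for error and group identifiers
curl -X POST -H 'Content-type:application/json' --data-binary \
  '{"add-dynamic-field":{"name":"*_is","type":"pints","indexed":true,"stored":true}}' $SOLR
```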
QA catalogue has a helper script to get information about the status of the Solr index (Solr URL, location, the list of cores, number of documents, size on disk, and last modification):
$ ./index --status
Solr index status at http://localhost:8983
Solr directory: /opt/solr-9.3.0/server/solr

core           | location       | nr of docs | size       | last modified
...............| ...............| ...........| ...........| ...................
nls            | nls_1          | 403946     | 1002.22 MB | 2023-11-25 21:59:39
nls_dev        | nls_2          | 403943     | 987.22 MB  | 2023-11-11 15:59:49
nls_validation | nls_validation | 403946     | 17.89 MB   | 2023-11-25 21:35:44
yale           | yale_2         | 2346976    | 9.51 GB    | 2023-11-11 13:12:35
yale_dev       | yale_1         | 2346976    | 9.27 GB    | 2023-11-11 10:58:08
java -cp $JAR de.gwdg.metadataqa.marc.cli.utils.MarcJsonToSolr <Solr url> <MARC JSON file>
The MARC JSON file is a JSON serialization of a binary MARC file. See the MARC Pipeline project for more details.
See https://pkiraly.github.io/qa-catalogue/avram-schemas.html.
since v0.7.0. Note: This is an experimental feature.
The Shapes Constraint Language (SHACL) is a formal language for validating Resource Description Framework (RDF) graphs against a set of conditions (expressed also in RDF). Following this idea and implementing a subset of the language, the Metadata Quality Assessment Framework provides a mechanism to define SHACL-like rules for data sources in non-RDF based formats, such as XML, CSV and JSON (SHACL validates only RDF graphs). Shacl4Bib is the extension enabling the validation of bibliographic records. The rules can be defined either with YAML or JSON configuration files or with Java code. SHACL uses RDF notation to specify or "address" the data element about which the constraints are set. Shacl4Bib supports Carsten Klee's MARCspec for MARC records, and PICA Path for PICA. You can find more information and the full definition of the implemented subset of SHACL here: https://github.com/pkiraly/metadata-qa-api#defining-schema-with-a-configuration-file
Parameters:
- -C <file>, --shaclConfigurationFile <file>: specify the SHACL-like configuration file
- -O <file>, --shaclOutputFile <file>: output file (default: shacl4bib.csv)
- -P <type>, --shaclOutputType <type>: specify what the output files should contain. Possible values:
  - STATUS: status only (default), where the following values appear: 1 = the criteria are met, 0 = the criteria are not met, NA = the data element is not available in the record
  - SCORE: score only. Its value is calculated the following way:
    - if the criteria are met, it returns the value of the successScore property (or 0 if there is no such property)
    - if the criteria are not met, it returns the value of the failureScore property (or 0 if there is no such property)
  - BOTH: both status and score
Here is a simple example for setting up rules against a MARC subfield:
format: MARC
fields:
  - name: 040$a
    path: 040$a
    rules:
      - id: 040$a.minCount
        minCount: 1
      - id: 040$a.pattern
        pattern: ^BE-KBR00
- format represents the format of the input data. It can be either MARC or PICA
- fields: the list of fields we would like to investigate. Since this is a YAML example, the - and the indentation denote child elements. Here there is only one child, so we analyse a single subfield.
- name is how the data element is called within the rule set. It can be a machine-readable or a human-readable string.
- path is the "address" of the metadata element. It should be expressed in an addressing language such as MARCspec or PICA Path (040$a contains the original cataloging agency)
- rules: the parent element of the set of rules. Here we have two rules.
- id: the identifier of the rule. This will be the header of the column in the CSV, and it can be referenced elsewhere in the SHACL configuration file.
- minCount: this specifies the minimum number of instances of the data element in the record
- pattern: a regular expression which should match the values of all instances of the data element
The output contains an extra column, the record identifier, so it looks something like this:
id,040$a.minCount,040$a.pattern
17529680,1,1
18212975,1,1
18216050,1,1
18184955,1,1
18184431,1,1
9550740,NA,NA
19551181,NA,NA
118592844,1,1
18592704,1,1
18592557,1,1
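A sketch of how such a run could be started from the helper scripts (the shacl4bib task name and the file locations are assumptions; adjust them to your installation):

```bash
./qa-catalogue \
  --params="--shaclConfigurationFile shacl.yaml --shaclOutputType BOTH --shaclOutputFile shacl4bib.csv" \
  shacl4bib
```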
The project is available from Maven Central, the central repository of open source Java projects, as jar files. If you want to use it in your Java or Scala application, put this code snippet into the list of dependencies:
pom.xml
<dependency>
  <groupId>de.gwdg.metadataqa</groupId>
  <artifactId>metadata-qa-marc</artifactId>
  <version>0.6.0</version>
</dependency>
build.sbt
libraryDependencies += "de.gwdg.metadataqa" % "metadata-qa-marc" % "0.6.0"
or you can directly download the jars from http://repo1.maven.org
There is a web application (written in PHP) for displaying and navigating through the output of the tool:
https://github.com/pkiraly/metadata-qa-marc-web/
Here is a list of data sources I am aware of so far:
- Library of Congress —https://www.loc.gov/cds/products/marcDist.php. MARC21 (UTF-8 and MARC8 encoding), MARCXML formats, open access. Alternative access point:https://www.loc.gov/collections/selected-datasets/?fa=contributor:library+of+congress.+cataloging+distribution+service.
- Harvard University Library —https://library.harvard.edu/open-metadata. MARC21 format,CC0. Institution specific features are documentedhere
- Columbia University Library —https://library.columbia.edu/bts/clio-data.html. 10M records, MARC21 and MARCXML format,CC0.
- University of Michigan Library — https://www.lib.umich.edu/open-access-bibliographic-records. 1.3M records, MARC21 and MARCXML formats, CC0.
- University of Pennsylvania Libraries —https://www.library.upenn.edu/collections/digital-projects/open-data-penn-libraries. Two datasets are available:
- Catalog records created by Penn Libraries 572K records, MARCXML format,CC0,
- Catalog records derived from other sources, 6.5M records, MARCXML format, Open Data CommonsODC-BY, use in accordance with the OCLCcommunity norms.
- Yale University —https://guides.library.yale.edu/c.php?g=923429. Three datasets are available:
- National Library of Medicine (NLM) catalogue records —https://www.nlm.nih.gov/databases/download/catalog.html. 4.2 million records, NLMXML, MARCXML and MARC21 formats.NLM Terms and Conditions
- Deutsche Nationalbibliothek —https://www.dnb.de/DE/Professionell/Metadatendienste/Datenbezug/Gesamtabzuege/gesamtabzuege_node.html (note: the records are provided in utf-8 decomposed). 23.9M records, MARC21 and MARCXML format,CC0.
- Bibliotheksverbundes Bayern —https://www.bib-bvb.de/web/b3kat/open-data. 27M records, MARCXML format,CC0.
- Leibniz-Informationszentrum Technik und Naturwissenschaften Universitätsbibliothek (TIB) —https://www.tib.eu/de/forschung-entwicklung/entwicklung/open-data. (no download link, use OAI-PMH instead) Dublin Core, MARC21, MARCXML,CC0.
- K10plus-Verbunddatenbank (K10plus union catalogue of Bibliotheksservice-Zentrum Baden-Württemberg (BSZ) and Gemeinsamer Bibliotheksverbund (GBV)) — https://swblod.bsz-bw.de/od/. 87M records, MARCXML format, CC0.
- Universiteitsbibliotheek Gent — https://lib.ugent.be/info/exports. Weekly data dump in Aleph Sequential format. It contains some Aleph fields above the standard MARC21 fields. ODC ODbL.
- Toronto Public Library —https://opendata.tplcs.ca/. 2.5 million MARC21 records,Open Data Policy
- Répertoire International des Sources Musicales —https://opac.rism.info/index.php?id=8&id=8&L=1. 800K records, MARCXML, RDF/XML,CC-BY.
- ETH-Bibliothek (Swiss Federal Institute of Technology in Zurich) —http://www.library.ethz.ch/ms/Open-Data-an-der-ETH-Bibliothek/Downloads. 2.5M records, MARCXML format.
- British library —http://www.bl.uk/bibliographic/datafree.html#m21z3950 (no download link, use z39.50 instead after asking for permission). MARC21, usage will be strictly for non-commercial purposes.
- Talis —https://archive.org/details/talis_openlibrary_contribution. 5.5 million MARC21 records contributed by Talis to Open Library under theODC PDDL.
- Oxford Medicine Online (the catalogue of medicine books published by Oxford University Press) —https://oxfordmedicine.com/page/67/. 1790 MARC21 records.
- Fennica — the Finnish National Bibliography provided by the Finnish National Library —http://data.nationallibrary.fi/download/. 1 million records, MARCXML,CC0.
- Biblioteka Narodowa (Polish National Library) — https://data.bn.org.pl/databases. 6.5 million MARC21 records.
- Magyar Nemzeti Múzeum (Hungarian National Library) —https://mnm.hu/hu/kozponti-konyvtar/nyilt-bibliografiai-adatok, 67K records, MARC21, HUNMARC, BIBFRAME,CC0
- University of Amsterdam Library — https://uba.uva.nl/en/support/open-data/data-sets-and-publication-channels/data-sets-and-publication-channels.html. 2.7 million records, MARCXML, PDDL/ODC-BY. Note: the records for books are not downloadable, only those of other document types. One should request them via the website.
- Portugal National Library —https://opendata.bnportugal.gov.pt/downloads.htm. 1.13 million UNIMARC records in MARCXML, RDF XML, JSON, TURTLE and CSV formats.CC0
- National Library of Latvia National bibliography (2017–2020) —https://dati.lnb.lv/. 11K MARCXML records.
- Open datasets of the Czech National Library —https://www.en.nkp.cz/about-us/professional-activities/open-dataCC0
- Czech National Bibliography —https://aleph.nkp.cz/data/cnb.xml.gz
- National Authority File —https://aleph.nkp.cz/data/aut.xml.gz
- Online Catalogue of the National Library of the Czech Republic —https://aleph.nkp.cz/data/nkc.xml.gz
- Union Catalogue of the Czech Republic —https://aleph.nkp.cz/data/skc.xml.gz
- Articles from Czech Newspapers, Periodicals and Proceedings —https://aleph.nkp.cz/data/anl.xml.gz
- Online Catalogue of the Slavonic Library —https://aleph.nkp.cz/data/slk.xml.gz
- Estonian National Bibliography — as downloadable TSV or MARC21 via OAI-PMHhttps://digilab.rara.ee/en/datasets/estonian-national-bibliography/CC0
Thanks, Johann Rolschewski, Phú, and [Hugh Paterson III](https://twitter.com/thejourneyler) for their help in collecting this list! Do you know some more data sources? Please let me know.
There are two more data sources worth mentioning; however, they do not provide MARC records, only derivatives:
- Linked Open British National Bibliography 3.2M book records in N-Triplets and RDF/XML format, CC0 license
- Linked data of Bibliothèque nationale de France. N3, NT and RDF/XML formats,Licence Ouverte/Open Licence
The tool provides two levels of customization:
- project specific tags can be defined in their own Java package, such as these classes for Gent data: https://github.com/pkiraly/metadata-qa-marc/tree/master/src/main/java/de/gwdg/metadataqa/marc/definition/tags/genttags
- for existing tags one can use the API described below
The different MARC versions each have an identifier. This is defined in the code as an enumeration:
public enum MarcVersion {
  MARC21("MARC21", "MARC21"),
  DNB("DNB", "Deutsche Nationalbibliothek"),
  OCLC("OCLC", "OCLC"),
  GENT("GENT", "Universiteitsbibliotheek Gent"),
  SZTE("SZTE", "Szegedi Tudományegyetem"),
  FENNICA("FENNICA", "National Library of Finland")
  ;
  ...
}
When you add version specific modification, you have to use one of these values.
- Defining version specific indicator codes:
Indicator::putVersionSpecificCodes(MarcVersion,List<Code>);
Code is a simple object; it has two properties: code and label.
example:
public class Tag024 extends DataFieldDefinition {
  ...
  ind1 = new Indicator("Type of standard number or code")
    .setCodes(...)
    .putVersionSpecificCodes(
      MarcVersion.SZTE,
      Arrays.asList(
        new Code(" ", "Not specified")
      )
    )
  ...
}
- Defining version specific subfields:
DataFieldDefinition::putVersionSpecificSubfields(MarcVersion,List<SubfieldDefinition>)
SubfieldDefinition contains the definition of a subfield. You can construct it with three String parameters: a code, a label and a cardinality code which denotes whether the subfield is repeatable ("R") or not ("NR").
example:
public class Tag024 extends DataFieldDefinition {
  ...
  putVersionSpecificSubfields(
    MarcVersion.DNB,
    Arrays.asList(
      new SubfieldDefinition("9", "Standardnummer (mit Bindestrichen)", "NR")
    )
  );
}
- Marking indicator codes as obsolete:
Indicator::setHistoricalCodes(List<String>)
The list should be pairs of code and description.
public class Tag082 extends DataFieldDefinition {
  ...
  ind1 = new Indicator("Type of edition")
    .setCodes(...)
    .setHistoricalCodes(
      " ", "No edition information recorded (BK, MU, VM, SE) [OBSOLETE]",
      "2", "Abridged NST version (BK, MU, VM, SE) [OBSOLETE]"
    )
  ...
}
- Marking subfields as obsolete:
DataFieldDefinition::setHistoricalSubfields(List<String>);
The list should be pairs of code and description.
public class Tag020 extends DataFieldDefinition {
  ...
  setHistoricalSubfields(
    "b", "Binding information (BK, MP, MU) [OBSOLETE]"
  );
}
If you create a new package for a new MARC version, you should register it in several places:
a. add a case into src/main/java/de/gwdg/metadataqa/marc/Utils.java:

case "zbtags": version = MarcVersion.ZB; break;
b. add an item into the enumeration at src/main/java/de/gwdg/metadataqa/marc/definition/tags/TagCategory.java:

ZB(23, "zbtags", "ZB", "Locally defined tags of the Zentralbibliothek Zürich", false),
c. modify the expected number of data elements at src/test/java/de/gwdg/metadataqa/marc/utils/DataElementsStaticticsTest.java:

assertEquals(215, statistics.get(DataElementType.localFields));
d. ... and at src/test/java/de/gwdg/metadataqa/marc/utils/MarcTagListerTest.java:

assertEquals(2, (int) versionCounter2.get(MarcVersion.ZB));
assertEquals(2, (int) versionCounter.get("zbtags"));
- Universiteitsbibliotheek Gent, Gent, Belgium
- Biblioteksentralen, Oslo, Norway
- Deutsche Digitale Bibliothek, Frankfurt am Main/Berlin, Germany
- British Library, London/Boston Spa, United Kingdom
- Országgyűlési Könyvtár, Budapest, Hungary
- Studijní a vědecká knihovna Plzeňského kraje, Plzeň, Czech Republic
- Royal Library of Belgium (KBR), Brussels, Belgium
- Gemeinsamer Bibliotheksverbund (GBV), Göttingen, Germany
- Binghamton University Libraries, Binghamton, NY, USA
- Zentralbibliothek Zürich, Zürich, Switzerland
If you use this tool as well, please contact me: pkiraly (at) gwdg (dot) de. I would really like to hear about your use cases and ideas.
- Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG): hardware, time for research
- Gemeinsamer Bibliotheksverbund (GBV): contracting for feature development
- Royal Library of Belgium (KBR): contracting for feature development
- JetBrains s.r.o.: IntelliJ IDEA development tool community licence
"deployment" build (when deploying artifacts to Maven Central)
mvn clean deploy -Pdeploy
Build and test
# create the Java library
mvn clean install

# create the docker base image
docker compose -f docker/build.yml build app
The docker compose build command has multiple --build-arg arguments to override defaults:

- QA_CATALOGUE_VERSION: the QA catalogue version (default: 0.7.0, current development version is 0.8.0-SNAPSHOT)
- QA_CATALOGUE_WEB_VERSION: it might be a released version such as 0.7.0, or main (default) to use the main branch, or develop to use the develop branch
- SOLR_VERSION: the Apache Solr version you would like to use (default: 8.11.1)
- SOLR_INSTALL_SOURCE: if its value is remote, Docker will download Solr from http://archive.apache.org/. If its value is a local path pointing to a previously downloaded package (named solr-${SOLR_VERSION}.zip up to version 8.x.x or solr-${SOLR_VERSION}.tgz from version 9.x.x) the process will copy it from the host into the image. Depending on the internet connection, the download might take a long time; using a previously downloaded package speeds up the build process. (Note: it is not possible to specify files outside the current directory or to use symbolic links, but you can create hard links - see the example below.)
Using the current developer version:
docker compose -f docker/build.yml build app \
  --build-arg QA_CATALOGUE_VERSION=0.8.0-SNAPSHOT \
  --build-arg QA_CATALOGUE_WEB_VERSION=develop \
  --build-arg SOLR_VERSION=8.11.3
Using a downloaded Solr package:
# create a temporary link
mkdir download
ln ~/Downloads/solr/solr-8.11.3.zip download/solr-8.11.3.zip

# run docker
docker compose -f docker/build.yml build app \
  --build-arg QA_CATALOGUE_VERSION=0.8.0-SNAPSHOT \
  --build-arg QA_CATALOGUE_WEB_VERSION=develop \
  --build-arg SOLR_VERSION=8.11.3 \
  --build-arg SOLR_INSTALL_SOURCE=download/solr-8.11.3.zip

# delete the temporary link
rm -rf download
Then start the container with the environment variable IMAGE set to metadata-qa-marc and run the analyses as described above.
For maintainers only:
Upload to Docker Hub:
docker tag metadata-qa-marc:latest pkiraly/metadata-qa-marc:latest
docker login
docker push pkiraly/metadata-qa-marc:latest
Cleaning before and after:
# stop running container
docker stop $(docker ps --filter name=metadata-qa-marc -q)

# remove container
docker rm $(docker ps -a --filter name=metadata-qa-marc -q)

# remove image
docker rmi $(docker images metadata-qa-marc -q)

# clear build cache
docker builder prune -a -f
Feedback is welcome!