MARC pipeline for quality assessment preparation

pkiraly/marc-pipeline

The purpose of this project is to provide an automatic way to convert binary MARC or MARCXML files into JSON files ready to be processed by Apache Spark. It

  1. transforms binary MARC files to MARCXML (with yaz-marcdump)
  2. normalizes the UTF-8 encoding (with uconv)
  3. transforms MARCXML to JSON (with Catmandu)
  4. reformats the JSON files

The final JSON contains one record per line, which is the form in which Apache Spark ingests files. Other differences between the JSON produced by Catmandu and the JSON this project produces:

  • the order of the components is the same in every record (in the Librecat output the order of the components varies)
  • the datafield's subfield component is always an array of objects (in the Librecat output it is a single object if there is only one subfield)
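
The two differences above can be illustrated with a minimal Python sketch. It is not the project's formatCatmanduOutput.php script. The record shape (the keys leader, controlfield, tag, code, content) and the key order are assumptions made for illustration; only the datafield and subfield names come from the description above. The sketch also assumes that datafield is already a list.

```python
import json

# Hypothetical key order; the order the project actually uses is not documented here.
RECORD_KEY_ORDER = ["leader", "controlfield", "datafield"]


def normalize_record(record):
    """Return a copy with a fixed top-level key order and list-valued subfields."""
    ordered = {}
    # Known keys come first in a fixed order, any other keys follow alphabetically.
    known = [key for key in RECORD_KEY_ORDER if key in record]
    extra = sorted(key for key in record if key not in RECORD_KEY_ORDER)
    for key in known + extra:
        ordered[key] = record[key]

    if "datafield" in ordered:
        datafields = []
        for field in ordered["datafield"]:
            field = dict(field)  # copy, so the input record is left untouched
            subfield = field.get("subfield", [])
            # A lone subfield arrives as a single object; wrap it in a list.
            if isinstance(subfield, dict):
                subfield = [subfield]
            field["subfield"] = subfield
            datafields.append(field)
        ordered["datafield"] = datafields
    return ordered


def to_json_line(record):
    """Serialize one record on a single line (line-delimited JSON)."""
    return json.dumps(normalize_record(record), ensure_ascii=False)
```

Writing one `to_json_line(record)` result per line gives a file in which every line is a complete JSON document.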

prerequisite software

Catmandu requires a separate installation; the other two tools (yaz-marcdump and uconv) are available as standard *nix tools.

processing single files

  1. one-file-to-json.sh - converts XML to JSON with Catmandu
  2. one-json-to-formatted.sh - changes the JSON format generated by Catmandu with the formatCatmanduOutput.php script

processing multiple files

  1. marc-to-xml.sh - converts the binary MARC files in the marc directory to XML with yaz-marcdump, then splits the files with split-xml.php. Each new file contains at most 10,000 records.
  2. to-utf8.sh - converts each XML file in a directory to a normalized UTF-8 file with the uconv tool. The MARC to XML converters do not deal with decomposed characters. This step is needed if the accented characters in the XML remain decomposed (such as a + ¨ instead of ä). See Unicode normalization and Combining and precomposed characters.
  3. split-xml.sh - splits the MARCXML files in the marc directory and places the new files into splitted. The script makes use of split-xml.php. Each new file contains at most 10,000 records. If you start with binary MARC you don't have to apply this step, because marc-to-xml.sh already includes it.
  4. xml-to-json.sh - converts the XML files in the splitted directory with Catmandu. Moves the converted XML files to converted and the .json files to json/raw.
  5. format-json.sh - converts the .json files in json/raw into a more convenient JSON format. Saves the new files into the json/formatted directory and moves the source files into json/processed.
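
To make the normalization step (to-utf8.sh) more concrete, here is a small Python sketch of what composing decomposed characters means. It is only an illustration and not what the script runs (the script uses uconv). It assumes that the target form is NFC, and the function names are hypothetical.

```python
import unicodedata


def normalize_text(text):
    """Compose decomposed sequences (NFC), e.g. 'a' + U+0308 becomes 'ä'."""
    return unicodedata.normalize("NFC", text)


def normalize_file(source_path, target_path):
    """Read a UTF-8 file, NFC-normalize it line by line, and write the result."""
    # newline="" keeps the original line endings untouched.
    with open(source_path, encoding="utf-8", newline="") as source, \
         open(target_path, "w", encoding="utf-8", newline="") as target:
        for line in source:
            target.write(normalize_text(line))
```

After normalization the two code points `a` and U+0308 (combining diaeresis) become the single code point U+00E4 (ä), so string comparisons and counts on the resulting JSON behave as expected.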

directories

  • marc - put the original binary MARC or MARCXML files here
  • splitted - the scripts put the split XML files here temporarily
  • converted - after the JSON conversion the scripts move the split XML files here
  • json/raw - the place of the Catmandu-generated JSON files before formatting
  • json/processed - the final place of the Catmandu-generated JSON files
  • json/formatted - the formatted JSON files. This is the end result of the process. If everything went correctly, you can delete the content of the other directories.
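
Before deleting the intermediate directories it may be worth checking how many records arrived in json/formatted. Since the final files contain one record per line, counting non-empty lines is enough. The sketch below is a hypothetical helper, not part of the project, and it assumes that the formatted files have a .json extension.

```python
from pathlib import Path


def count_records(directory):
    """Count non-empty lines (one record each) across the .json files in a directory."""
    total = 0
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            total += sum(1 for line in handle if line.strip())
    return total
```

Calling `count_records("json/formatted")` from the working directory returns a number that can be compared with the record count of the source files.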

running the XML to JSON process with the cron scheduler

Edit crontab with the

crontab -e

command and add the following line:

*/1 * * * * cd /to/working/directory && php toJsonLauncher.php >> launch-report.log

This script runs the one-file-to-json.sh script on each file listed in the to-json-setlist.txt file.

running the JSON formatting process with the cron scheduler

Edit crontab with the

crontab -e

command and add the following line:

*/1 * * * * cd /to/working/directory && php toFormattedLauncher.php >> launch-report.log

This script runs the one-json-to-formatted.sh script on each file listed in the to-formatted-setlist.txt file.
