OCR-D post-correction module based on weighted finite-state transducers

ASVLeipzig/cor-asv-fst

OCR post-correction with error/lexicon Finite State Transducers and character-level LSTM language models

Introduction

Installation

Required Ubuntu packages:

  • Python (python or python3)
  • pip (python-pip or python3-pip)
  • virtualenv (python-virtualenv or python3-virtualenv)

Create and activate a virtualenv as usual.

Then, to install the Python dependencies and this module, run:

make deps install

This is equivalent to:

pip install -r requirements.txt
pip install -e .

In addition to the requirements listed in requirements.txt, the tool requires the pynini library, which has to be installed from source.

Usage

The package has two user interfaces:

Command Line Interface

The package contains a suite of CLI tools for working with plaintext data (prefix: cor-asv-fst-*). Minimal working examples and the data formats are described below. Each tool also has further optional parameters; for a detailed description, call the tool with the --help option.

cor-asv-fst-train

Train FST models. The basic invocation is as follows:

cor-asv-fst-train -l LEXICON_FILE -e ERROR_MODEL_FILE -t TRAINING_FILE

This will create two transducers, which will be stored in LEXICON_FILE and ERROR_MODEL_FILE, respectively. As the lexicon and the error model are trained independently, either of them can be skipped by omitting the respective parameter.

TRAINING_FILE is a plain text file in tab-separated, two-column format. Each row contains a line of OCR output in the first column and the corresponding ground truth line in the second:

» Bergebt mir, daß ih niht weiß, wie	»Vergebt mir, daß ich nicht weiß, wie
aus dem (Geiſte aller Nationen Mahrunq	aus dem Geiſte aller Nationen Nahrung
Kannſt Du mir die re<hée Bahn niché zeigen ?	Kannſt Du mir die rechte Bahn nicht zeigen?
frag zu bringen. —	trag zu bringen. —
ſie ins irdij<he Leben hinein, Mit leichtem,	ſie ins irdiſche Leben hinein. Mit leichtem,
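The two-column layout can be checked before training with a few lines of Python. The following is only a sketch: the function name `read_training_pairs` is hypothetical and not part of this package, and it assumes UTF-8 text with exactly one tab per row.

```python
import io


def read_training_pairs(stream):
    """Yield (ocr_line, gt_line) tuples from a tab-separated, two-column stream."""
    for number, raw in enumerate(stream, start=1):
        line = raw.rstrip("\n")
        if not line:
            continue  # tolerate blank rows (an assumption, not documented behaviour)
        columns = line.split("\t")
        if len(columns) != 2:
            raise ValueError(
                f"row {number}: expected 2 columns, got {len(columns)}")
        yield columns[0], columns[1]


# In practice the stream would be open(path, encoding="utf-8").
sample = io.StringIO(
    "aus dem (Geiſte aller Nationen Mahrunq\taus dem Geiſte aller Nationen Nahrung\n"
    "Kannſt Du mir die re<hée Bahn niché zeigen ?\tKannſt Du mir die rechte Bahn nicht zeigen?\n"
)
pairs = list(read_training_pairs(sample))
for ocr, gt in pairs:
    print(repr(ocr), "->", repr(gt))
```

Rows with a missing or extra tab raise an error instead of being skipped silently, which makes misaligned training data easier to spot.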

Each line is treated independently. As an alternative to the above, the training data may also be supplied as two files:

cor-asv-fst-train -l LEXICON_FILE -e ERROR_MODEL_FILE -i INPUT_FILE -g GT_FILE

In this variant, INPUT_FILE and GT_FILE are both in tab-separated, two-column format, in which the first column is the line ID and the second the line:

>=== INPUT_FILE ===<
alexis_ruhe01_1852_0018_022	ih denke. Aber was die ſelige Frau Geheimräth1n
alexis_ruhe01_1852_0035_019	„Das fann ich niht, c’esl absolument impos-
alexis_ruhe01_1852_0087_027	rend. In dem Augenbli> war 1hr niht wohl zu
alexis_ruhe01_1852_0099_012	ür die fle ſich ſchlugen.“
alexis_ruhe01_1852_0147_009	ſollte. Nur Über die Familien, wo man ſie einführen

>=== GT_FILE ===<
alexis_ruhe01_1852_0018_022	ich denke. Aber was die ſelige Frau Geheimräthin
alexis_ruhe01_1852_0035_019	„Das kann ich nicht, c'est absolument impos—
alexis_ruhe01_1852_0087_027	rend. Jn dem Augenblick war ihr nicht wohl zu
alexis_ruhe01_1852_0099_012	für die ſie ſich ſchlugen.“
alexis_ruhe01_1852_0147_009	ſollte. Nur über die Familien, wo man ſie einführen
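The two layouts carry the same information, so the ID-keyed files can be joined into the single-file TRAINING_FILE layout. The sketch below shows one way to do that; the helper names `read_id_lines` and `pair_by_id` are hypothetical, and dropping IDs that occur in only one file is an assumption, not documented behaviour of the tools.

```python
def read_id_lines(lines):
    """Map each line ID (first column) to its text (second column)."""
    result = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        # Split on the first tab only, so the text itself may contain tabs.
        line_id, _, text = line.partition("\t")
        result[line_id] = text
    return result


def pair_by_id(input_lines, gt_lines):
    """Return (line_id, ocr_text, gt_text) for every ID present in both files."""
    ocr = read_id_lines(input_lines)
    gt = read_id_lines(gt_lines)
    return [(line_id, ocr[line_id], gt[line_id])
            for line_id in ocr if line_id in gt]


input_lines = [
    "alexis_ruhe01_1852_0099_012\tür die fle ſich ſchlugen.“\n",
    "alexis_ruhe01_1852_0147_009\tſollte. Nur Über die Familien, wo man ſie einführen\n",
]
gt_lines = [
    "alexis_ruhe01_1852_0099_012\tfür die ſie ſich ſchlugen.“\n",
    "alexis_ruhe01_1852_0147_009\tſollte. Nur über die Familien, wo man ſie einführen\n",
]

triples = pair_by_id(input_lines, gt_lines)
# One "OCR<TAB>GT" row per matched ID, i.e. the single-file training layout.
training_rows = [f"{ocr}\t{gt}" for _, ocr, gt in triples]
print("\n".join(training_rows))
```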

cor-asv-fst-process

This tool applies a trained model to correct plaintext data on a line basis. The basic invocation is:

cor-asv-fst-process -i INPUT_FILE -o OUTPUT_FILE -l LEXICON_FILE -e ERROR_MODEL_FILE (-m LM_FILE)

INPUT_FILE is in the same two-column format (line ID, line) as INPUT_FILE for the training procedure. OUTPUT_FILE contains the post-correction results in the same format.

LM_FILE is an ocrd_keraslm language model. If supplied, it is used for rescoring.
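When the tool is called from a script, the optional language model is easiest to handle by building the argument list conditionally. The following sketch only assembles the command line from the flags shown above; the function name and all file names are placeholders.

```python
import shlex


def process_command(input_file, output_file, lexicon_file,
                    error_model_file, lm_file=None):
    """Build the argument list for cor-asv-fst-process; -m is added only if given."""
    argv = [
        "cor-asv-fst-process",
        "-i", input_file,
        "-o", output_file,
        "-l", lexicon_file,
        "-e", error_model_file,
    ]
    if lm_file is not None:
        argv += ["-m", lm_file]
    return argv


# Placeholder file names, for illustration only.
without_lm = process_command("ocr.tsv", "corrected.tsv", "lexicon.fst", "error.fst")
with_lm = process_command("ocr.tsv", "corrected.tsv", "lexicon.fst", "error.fst",
                          lm_file="lm.h5")
print(shlex.join(without_lm))
print(shlex.join(with_lm))
```

With the package installed, such a list could be passed to `subprocess.run(argv, check=True)`.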

cor-asv-fst-evaluate

This tool can be used to evaluate the post-correction results. The minimal working invocation is:

cor-asv-fst-evaluate -i INPUT_FILE -o OUTPUT_FILE -g GT_FILE

Additionally, the parameter -M can be used to select the evaluation measure (Levenshtein by default). The files should be in the same two-column format (line ID, line) as described above.
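For readers unfamiliar with the default measure, the sketch below computes a plain Levenshtein distance and a character error rate from it. It is a textbook illustration, not the code used by cor-asv-fst-evaluate, whose normalisation and aggregation over lines may differ; the function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_error_rate(hypothesis, reference):
    """Edit distance divided by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(hypothesis, reference) / len(reference)


print(levenshtein("Mahrunq", "Nahrung"))          # 2 (two substitutions)
print(character_error_rate("niht", "nicht"))      # 0.2 (one insertion over 5 characters)
```

Comparing the rate of INPUT_FILE against GT_FILE with that of OUTPUT_FILE against GT_FILE shows whether post-correction reduced the error.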

OCR-D processor interface ocrd-cor-asv-fst-process

To be used with PageXML documents in an OCR-D annotation workflow. Input files need a textual annotation (TextEquiv) on the given textequiv_level (currently only word!).

...

"tools": {
  "cor-asv-fst-process": {
    "executable": "cor-asv-fst-process",
    "categories": [
      "Text recognition and optimization"
    ],
    "steps": [
      "recognition/post-correction"
    ],
    "description": "Improve text annotation by FST error and lexicon model with character-level LSTM language model",
    "input_file_grp": [
      "OCR-D-OCR-TESS",
      "OCR-D-OCR-KRAK",
      "OCR-D-OCR-OCRO",
      "OCR-D-OCR-CALA",
      "OCR-D-OCR-ANY"
    ],
    "output_file_grp": [
      "OCR-D-COR-ASV"
    ],
    "parameters": {
      "textequiv_level": {
        "type": "string",
        "enum": ["word"],
        "default": "word",
        "description": "PAGE XML hierarchy level to read TextEquiv input on (output will always be word level)"
      },
      "errorfst_file": {
        "type": "string",
        "format": "uri",
        "content-type": "application/vnd.openfst",
        "description": "path of FST file for error model",
        "required": true,
        "cacheable": true
      },
      "lexiconfst_file": {
        "type": "string",
        "format": "uri",
        "content-type": "application/vnd.openfst",
        "description": "path of FST file for lexicon model",
        "required": true,
        "cacheable": true
      },
      "pruning_weight": {
        "type": "number",
        "format": "float",
        "description": "transition weight for pruning the hypotheses in each word window FST",
        "default": 5.0
      },
      "rejection_weight": {
        "type": "number",
        "format": "float",
        "description": "transition weight (per character) for unchanged input in each word window FST",
        "default": 1.5
      },
      "keraslm_file": {
        "type": "string",
        "format": "uri",
        "content-type": "application/x-hdf;subtype=bag",
        "description": "path of h5py weight/config file for language model trained with keraslm",
        "required": true,
        "cacheable": true
      },
      "beam_width": {
        "type": "number",
        "format": "integer",
        "description": "maximum number of best partial paths to consider during beam search in language modelling",
        "default": 100
      },
      "lm_weight": {
        "type": "number",
        "format": "float",
        "description": "share of the LM scores over the FST output confidences",
        "default": 0.5
      }
    }
  }
}

...
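A parameter set for the processor can be assembled and sanity-checked against the tool description above before it is written to a JSON file. The sketch below copies the defaults and required keys from that description; the function name `build_parameters` and the model file names are placeholders, and the checks are a simplification of what an OCR-D parameter validator would do.

```python
import json

# Defaults and required keys as listed in the tool description.
DEFAULTS = {
    "textequiv_level": "word",
    "pruning_weight": 5.0,
    "rejection_weight": 1.5,
    "beam_width": 100,
    "lm_weight": 0.5,
}
REQUIRED = ("errorfst_file", "lexiconfst_file", "keraslm_file")


def build_parameters(**overrides):
    """Merge overrides into the defaults and check the result."""
    unknown = set(overrides) - set(DEFAULTS) - set(REQUIRED)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params = {**DEFAULTS, **overrides}
    missing = [key for key in REQUIRED if key not in params]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    if params["textequiv_level"] != "word":
        raise ValueError("textequiv_level currently only supports 'word'")
    return params


# Placeholder model paths, for illustration only.
params = build_parameters(
    errorfst_file="error.fst",
    lexiconfst_file="lexicon.fst",
    keraslm_file="lm.h5",
    lm_weight=0.3,
)
print(json.dumps(params, indent=2))
```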

Testing

...
