# cor-asv-fst

OCR-D post-correction module based on weighted finite-state transducers
OCR post-correction with error/lexicon Finite State Transducers and character-level LSTM language models
## Installation

Required Ubuntu packages:

- Python (`python` or `python3`)
- pip (`python-pip` or `python3-pip`)
- virtualenv (`python-virtualenv` or `python3-virtualenv`)
Create and activate a virtualenv as usual.
To install Python dependencies and this module, do:

```
make deps install
```

This is equivalent to:

```
pip install -r requirements.txt
pip install -e .
```
In addition to the requirements listed in `requirements.txt`, the tool requires the `pynini` library, which has to be installed from source.
The package has two user interfaces:
The package contains a suite of CLI tools to work with plaintext data (prefix: `cor-asv-fst-*`). The minimal working examples and data formats are described below. Additionally, each tool has further optional parameters; for a detailed description, call the tool with the `--help` option.
### cor-asv-fst-train

Train FST models. The basic invocation is as follows:

```
cor-asv-fst-train -l LEXICON_FILE -e ERROR_MODEL_FILE -t TRAINING_FILE
```
This will create two transducers, which will be stored in `LEXICON_FILE` and `ERROR_MODEL_FILE`, respectively. As the training of the lexicon and the error model is done independently, either of them can be skipped by omitting the respective parameter.

`TRAINING_FILE` is a plain text file in tab-separated, two-column format containing a line of OCR output and the corresponding ground-truth line:
```
» Bergebt mir, daß ih niht weiß, wie	»Vergebt mir, daß ich nicht weiß, wie
aus dem (Geiſte aller Nationen Mahrunq	aus dem Geiſte aller Nationen Nahrung
Kannſt Du mir die re<hée Bahn niché zeigen ?	Kannſt Du mir die rechte Bahn nicht zeigen?
frag zu bringen. —	trag zu bringen. —
ſie ins irdij<he Leben hinein, Mit leichtem,	ſie ins irdiſche Leben hinein. Mit leichtem,
```
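Such a training file can also be produced programmatically. Below is a minimal sketch of writing and reading this tab-separated format; the filename `train.tsv` is illustrative, not prescribed by the package:

```python
import os
import tempfile

# OCR line and corresponding ground-truth line, tab-separated, one pair
# per row -- the format expected by cor-asv-fst-train via -t.
pairs = [
    ("» Bergebt mir, daß ih niht weiß, wie",
     "»Vergebt mir, daß ich nicht weiß, wie"),
    ("aus dem (Geiſte aller Nationen Mahrunq",
     "aus dem Geiſte aller Nationen Nahrung"),
]

path = os.path.join(tempfile.mkdtemp(), "train.tsv")  # illustrative name
with open(path, "w", encoding="utf-8") as f:
    for ocr_line, gt_line in pairs:
        f.write(f"{ocr_line}\t{gt_line}\n")

# Reading the pairs back; each row is an independent training example.
with open(path, encoding="utf-8") as f:
    loaded = [tuple(line.rstrip("\n").split("\t")) for line in f]
assert loaded == pairs
```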
Each line is treated independently. Alternatively to the above, the training data may also be supplied as two files:

```
cor-asv-fst-train -l LEXICON_FILE -e ERROR_MODEL_FILE -i INPUT_FILE -g GT_FILE
```
In this variant, `INPUT_FILE` and `GT_FILE` are both in tab-separated, two-column format, in which the first column is the line ID and the second the line:
```
>=== INPUT_FILE ===<
alexis_ruhe01_1852_0018_022	ih denke. Aber was die ſelige Frau Geheimräth1n
alexis_ruhe01_1852_0035_019	„Das fann ich niht, c’esl absolument impos-
alexis_ruhe01_1852_0087_027	rend. In dem Augenbli> war 1hr niht wohl zu
alexis_ruhe01_1852_0099_012	ür die fle ſich ſchlugen.“
alexis_ruhe01_1852_0147_009	ſollte. Nur Über die Familien, wo man ſie einführen

>=== GT_FILE ===<
alexis_ruhe01_1852_0018_022	ich denke. Aber was die ſelige Frau Geheimräthin
alexis_ruhe01_1852_0035_019	„Das kann ich nicht, c'est absolument impos—
alexis_ruhe01_1852_0087_027	rend. Jn dem Augenblick war ihr nicht wohl zu
alexis_ruhe01_1852_0099_012	für die ſie ſich ſchlugen.“
alexis_ruhe01_1852_0147_009	ſollte. Nur über die Familien, wo man ſie einführen
```
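In this two-file variant, rows are matched by their shared line ID rather than by position. A minimal sketch of that join (the file contents are inlined here for illustration; normally they would be read from disk):

```python
# Align INPUT_FILE and GT_FILE rows by the ID column (first column),
# as used by cor-asv-fst-train -i/-g.
input_text = (
    "alexis_ruhe01_1852_0018_022\tih denke. Aber was die ſelige Frau Geheimräth1n\n"
    "alexis_ruhe01_1852_0035_019\t„Das fann ich niht, c’esl absolument impos-\n"
)
gt_text = (
    "alexis_ruhe01_1852_0018_022\tich denke. Aber was die ſelige Frau Geheimräthin\n"
    "alexis_ruhe01_1852_0035_019\t„Das kann ich nicht, c'est absolument impos—\n"
)

def parse(two_column: str) -> dict:
    """Parse ID<TAB>line rows into a dict keyed by line ID."""
    rows = (line.split("\t", 1) for line in two_column.splitlines() if line)
    return {line_id: text for line_id, text in rows}

inputs, gts = parse(input_text), parse(gt_text)
# Training pairs: (OCR line, ground-truth line) joined on the shared ID.
pairs = [(inputs[i], gts[i]) for i in sorted(inputs) if i in gts]
assert len(pairs) == 2
```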
### cor-asv-fst-process

This tool applies a trained model to correct plaintext data on a line basis. The basic invocation is:

```
cor-asv-fst-process -i INPUT_FILE -o OUTPUT_FILE -l LEXICON_FILE -e ERROR_MODEL_FILE (-m LM_FILE)
```
`INPUT_FILE` is in the same format as for the training procedure. `OUTPUT_FILE` contains the post-correction results in the same format.

`LM_FILE` is an `ocrd_keraslm` language model; if supplied, it is used for rescoring.
### cor-asv-fst-evaluate

This tool can be used to evaluate the post-correction results. The minimal working invocation is:

```
cor-asv-fst-evaluate -i INPUT_FILE -o OUTPUT_FILE -g GT_FILE
```
Additionally, the parameter `-M` can be used to select the evaluation measure (`Levenshtein` by default). The files should be in the same two-column format as described above.
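The default `Levenshtein` measure is the standard character edit distance. For sanity-checking evaluation numbers by hand, a minimal reference implementation (a sketch, not the package's own code) looks like this:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # One-row dynamic-programming formulation.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution
            ))
        prev = curr
    return prev[-1]

assert levenshtein("niht", "nicht") == 1    # one insertion
assert levenshtein("Mahrunq", "Nahrung") == 2  # two substitutions
```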
## OCR-D processor interface `ocrd-cor-asv-fst-process`

To be used with PageXML documents in an OCR-D annotation workflow. Input files need a textual annotation (`TextEquiv`) on the given `textequiv_level` (currently only `word`!).
...
```json
"tools": {
  "cor-asv-fst-process": {
    "executable": "cor-asv-fst-process",
    "categories": [
      "Text recognition and optimization"
    ],
    "steps": [
      "recognition/post-correction"
    ],
    "description": "Improve text annotation by FST error and lexicon model with character-level LSTM language model",
    "input_file_grp": [
      "OCR-D-OCR-TESS",
      "OCR-D-OCR-KRAK",
      "OCR-D-OCR-OCRO",
      "OCR-D-OCR-CALA",
      "OCR-D-OCR-ANY"
    ],
    "output_file_grp": [
      "OCR-D-COR-ASV"
    ],
    "parameters": {
      "textequiv_level": {
        "type": "string",
        "enum": ["word"],
        "default": "word",
        "description": "PAGE XML hierarchy level to read TextEquiv input on (output will always be word level)"
      },
      "errorfst_file": {
        "type": "string",
        "format": "uri",
        "content-type": "application/vnd.openfst",
        "description": "path of FST file for error model",
        "required": true,
        "cacheable": true
      },
      "lexiconfst_file": {
        "type": "string",
        "format": "uri",
        "content-type": "application/vnd.openfst",
        "description": "path of FST file for lexicon model",
        "required": true,
        "cacheable": true
      },
      "pruning_weight": {
        "type": "number",
        "format": "float",
        "description": "transition weight for pruning the hypotheses in each word window FST",
        "default": 5.0
      },
      "rejection_weight": {
        "type": "number",
        "format": "float",
        "description": "transition weight (per character) for unchanged input in each word window FST",
        "default": 1.5
      },
      "keraslm_file": {
        "type": "string",
        "format": "uri",
        "content-type": "application/x-hdf;subtype=bag",
        "description": "path of h5py weight/config file for language model trained with keraslm",
        "required": true,
        "cacheable": true
      },
      "beam_width": {
        "type": "number",
        "format": "integer",
        "description": "maximum number of best partial paths to consider during beam search in language modelling",
        "default": 100
      },
      "lm_weight": {
        "type": "number",
        "format": "float",
        "description": "share of the LM scores over the FST output confidences",
        "default": 0.5
      }
    }
  }
}
```
...