# cor-asv-fst

OCR-D post-correction module based on weighted finite-state transducers
OCR post-correction with error/lexicon Finite State Transducers and character-level LSTM language models
## Installation

Required Ubuntu packages:

- Python (`python` or `python3`)
- pip (`python-pip` or `python3-pip`)
- virtualenv (`python-virtualenv` or `python3-virtualenv`)
Create and activate a virtualenv as usual.
To install Python dependencies and this module, do:

```
make deps install
```

This is equivalent to:

```
pip install -r requirements.txt
pip install -e .
```
In addition to the requirements listed in `requirements.txt`, the tool requires the `pynini` library, which has to be installed from source.
The package has two user interfaces:
The package contains a suite of CLI tools to work with plaintext data (prefix: `cor-asv-fst-*`). The minimal working examples and data formats are described below. Additionally, each tool has further optional parameters; for a detailed description, call the tool with the `--help` option.
### cor-asv-fst-train

Train FST models. The basic invocation is as follows:

```
cor-asv-fst-train -l LEXICON_FILE -e ERROR_MODEL_FILE -t TRAINING_FILE
```
This will create two transducers, which will be stored in `LEXICON_FILE` and `ERROR_MODEL_FILE`, respectively. As the training of the lexicon and the error model is done independently, either of them can be skipped by omitting the respective parameter.

`TRAINING_FILE` is a plain text file in tab-separated, two-column format containing a line of OCR output and the corresponding ground-truth line:
```
» Bergebt mir, daß ih niht weiß, wie	»Vergebt mir, daß ich nicht weiß, wie
aus dem (Geiſte aller Nationen Mahrunq	aus dem Geiſte aller Nationen Nahrung
Kannſt Du mir die re<hée Bahn niché zeigen ?	Kannſt Du mir die rechte Bahn nicht zeigen?
frag zu bringen. —	trag zu bringen. —
ſie ins irdij<he Leben hinein, Mit leichtem,	ſie ins irdiſche Leben hinein. Mit leichtem,
```
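Such a training file can also be produced programmatically. Below is a minimal sketch of writing and reading this tab-separated format; the filename `train.tsv` is illustrative, not prescribed by the package:

```python
import os
import tempfile

# OCR line and corresponding ground-truth line, tab-separated, one pair
# per row -- the format expected by cor-asv-fst-train via -t.
pairs = [
    ("» Bergebt mir, daß ih niht weiß, wie",
     "»Vergebt mir, daß ich nicht weiß, wie"),
    ("aus dem (Geiſte aller Nationen Mahrunq",
     "aus dem Geiſte aller Nationen Nahrung"),
]

path = os.path.join(tempfile.mkdtemp(), "train.tsv")  # illustrative name
with open(path, "w", encoding="utf-8") as f:
    for ocr_line, gt_line in pairs:
        f.write(f"{ocr_line}\t{gt_line}\n")

# Reading the pairs back; each row is an independent training example.
with open(path, encoding="utf-8") as f:
    loaded = [tuple(line.rstrip("\n").split("\t")) for line in f]
assert loaded == pairs
```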
Each line is treated independently. Alternatively to the above, the training data may also be supplied as two files:

```
cor-asv-fst-train -l LEXICON_FILE -e ERROR_MODEL_FILE -i INPUT_FILE -g GT_FILE
```
In this variant, `INPUT_FILE` and `GT_FILE` are both in tab-separated, two-column format, in which the first column is the line ID and the second the line:
```
>=== INPUT_FILE ===<
alexis_ruhe01_1852_0018_022	ih denke. Aber was die ſelige Frau Geheimräth1n
alexis_ruhe01_1852_0035_019	„Das fann ich niht, c’esl absolument impos-
alexis_ruhe01_1852_0087_027	rend. In dem Augenbli> war 1hr niht wohl zu
alexis_ruhe01_1852_0099_012	ür die fle ſich ſchlugen.“
alexis_ruhe01_1852_0147_009	ſollte. Nur Über die Familien, wo man ſie einführen

>=== GT_FILE ===<
alexis_ruhe01_1852_0018_022	ich denke. Aber was die ſelige Frau Geheimräthin
alexis_ruhe01_1852_0035_019	„Das kann ich nicht, c'est absolument impos—
alexis_ruhe01_1852_0087_027	rend. Jn dem Augenblick war ihr nicht wohl zu
alexis_ruhe01_1852_0099_012	für die ſie ſich ſchlugen.“
alexis_ruhe01_1852_0147_009	ſollte. Nur über die Familien, wo man ſie einführen
```
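In this two-file variant, rows are matched by their shared line ID rather than by position. A minimal sketch of that join (the file contents are inlined here for illustration; normally they would be read from disk):

```python
# Align INPUT_FILE and GT_FILE rows by the ID column (first column),
# as used by cor-asv-fst-train -i/-g.
input_text = (
    "alexis_ruhe01_1852_0018_022\tih denke. Aber was die ſelige Frau Geheimräth1n\n"
    "alexis_ruhe01_1852_0035_019\t„Das fann ich niht, c’esl absolument impos-\n"
)
gt_text = (
    "alexis_ruhe01_1852_0018_022\tich denke. Aber was die ſelige Frau Geheimräthin\n"
    "alexis_ruhe01_1852_0035_019\t„Das kann ich nicht, c'est absolument impos—\n"
)

def parse(two_column: str) -> dict:
    """Parse ID<TAB>line rows into a dict keyed by line ID."""
    rows = (line.split("\t", 1) for line in two_column.splitlines() if line)
    return {line_id: text for line_id, text in rows}

inputs, gts = parse(input_text), parse(gt_text)
# Training pairs: (OCR line, ground-truth line) joined on the shared ID.
pairs = [(inputs[i], gts[i]) for i in sorted(inputs) if i in gts]
assert len(pairs) == 2
```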
### cor-asv-fst-process

This tool applies a trained model to correct plaintext data on a line basis. The basic invocation is:

```
cor-asv-fst-process -i INPUT_FILE -o OUTPUT_FILE -l LEXICON_FILE -e ERROR_MODEL_FILE (-m LM_FILE)
```
`INPUT_FILE` is in the same format as for the training procedure. `OUTPUT_FILE` contains the post-correction results in the same format.

`LM_FILE` is an `ocrd_keraslm` language model; if supplied, it is used for rescoring.
### cor-asv-fst-evaluate

This tool can be used to evaluate the post-correction results. The minimal working invocation is:

```
cor-asv-fst-evaluate -i INPUT_FILE -o OUTPUT_FILE -g GT_FILE
```
Additionally, the parameter `-M` can be used to select the evaluation measure (`Levenshtein` by default). The files should be in the same two-column format as described above.
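The default `Levenshtein` measure is the standard character edit distance. For sanity-checking evaluation numbers by hand, a minimal reference implementation (a sketch, not the package's own code) looks like this:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # One-row dynamic-programming formulation.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution
            ))
        prev = curr
    return prev[-1]

assert levenshtein("niht", "nicht") == 1    # one insertion
assert levenshtein("Mahrunq", "Nahrung") == 2  # two substitutions
```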
## OCR-D processor interface `ocrd-cor-asv-fst-process`

To be used with PageXML documents in an OCR-D annotation workflow. Input files need a textual annotation (`TextEquiv`) on the given `textequiv_level` (currently only `word`!).
...
```json
"tools": {
  "cor-asv-fst-process": {
    "executable": "cor-asv-fst-process",
    "categories": [
      "Text recognition and optimization"
    ],
    "steps": [
      "recognition/post-correction"
    ],
    "description": "Improve text annotation by FST error and lexicon model with character-level LSTM language model",
    "input_file_grp": [
      "OCR-D-OCR-TESS",
      "OCR-D-OCR-KRAK",
      "OCR-D-OCR-OCRO",
      "OCR-D-OCR-CALA",
      "OCR-D-OCR-ANY"
    ],
    "output_file_grp": [
      "OCR-D-COR-ASV"
    ],
    "parameters": {
      "textequiv_level": {
        "type": "string",
        "enum": ["word"],
        "default": "word",
        "description": "PAGE XML hierarchy level to read TextEquiv input on (output will always be word level)"
      },
      "errorfst_file": {
        "type": "string",
        "format": "uri",
        "content-type": "application/vnd.openfst",
        "description": "path of FST file for error model",
        "required": true,
        "cacheable": true
      },
      "lexiconfst_file": {
        "type": "string",
        "format": "uri",
        "content-type": "application/vnd.openfst",
        "description": "path of FST file for lexicon model",
        "required": true,
        "cacheable": true
      },
      "pruning_weight": {
        "type": "number",
        "format": "float",
        "description": "transition weight for pruning the hypotheses in each word window FST",
        "default": 5.0
      },
      "rejection_weight": {
        "type": "number",
        "format": "float",
        "description": "transition weight (per character) for unchanged input in each word window FST",
        "default": 1.5
      },
      "keraslm_file": {
        "type": "string",
        "format": "uri",
        "content-type": "application/x-hdf;subtype=bag",
        "description": "path of h5py weight/config file for language model trained with keraslm",
        "required": true,
        "cacheable": true
      },
      "beam_width": {
        "type": "number",
        "format": "integer",
        "description": "maximum number of best partial paths to consider during beam search in language modelling",
        "default": 100
      },
      "lm_weight": {
        "type": "number",
        "format": "float",
        "description": "share of the LM scores over the FST output confidences",
        "default": 0.5
      }
    }
  }
}
```
...