Genalog is an open source, cross-platform python package allowing generation of synthetic document images with custom degradations and text alignment capabilities.

microsoft.github.io/genalog/

License

MIT license

Genalog is an open source, cross-platform Python package for generating document images with synthetic noise that mimics scanned analog documents (thus the name genalog). You can also add various text degradations to these images. The tool provides a fast and efficient way to generate synthetic documents from text data, using layouts from templates that you create in simple HTML format.

This repo is now in maintenance mode with limited support.

Overview

Genalog has various capabilities:

Flexible format Image Generation
Custom image degradation
Extract Text from Images using Cognitive Search Pipeline
Get OCR Performance Metrics

The aim of this project is to provide a complete solution for generating synthetic images from any text data rich in natural language, and to imitate most of the OCR noise found in scanned text documents.
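To make the template-driven idea concrete, here is a minimal sketch that fills a simple HTML layout with plain-text paragraphs using only the Python standard library. It does not use Genalog's API: `PAGE_TEMPLATE` and `render_page` are hypothetical names introduced for illustration, and the layout is not one of Genalog's bundled templates.

```python
from html import escape
from string import Template

# Hypothetical single-column layout (an assumption for this sketch,
# not one of Genalog's bundled templates).
PAGE_TEMPLATE = Template("<html><body><h1>$title</h1>\n$body\n</body></html>")


def render_page(title, paragraphs):
    """Fill the HTML layout with escaped text, one <p> element per paragraph."""
    body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return PAGE_TEMPLATE.substitute(title=escape(title), body=body)


# Split raw text into paragraphs on blank lines, then render the page.
text = "First paragraph.\n\nSecond paragraph with <angle> brackets."
page = render_page("Sample document", text.split("\n\n"))
print(page)
```

The resulting HTML string is the kind of input an HTML-to-image renderer would turn into a page image; that rendering step is left out here.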

Please refer to our Genalog documentation for more tutorials.

Installation

See the Genalog install guide for more details.

To install the latest release:

pip install genalog

Extra Installation Steps on macOS and Windows

Genalog depends on WeasyPrint, which in turn has non-Python dependencies, including Pango, cairo and GDK-PixBuf, that need to be installed separately.

So far, the Pango, cairo and GDK-PixBuf libraries are available by default in Ubuntu 18.04 and later.

If you are running on Windows, macOS, or another Linux distribution, please see the installation instructions from WeasyPrint.

NOTE: If you encounter an error like `no library called "libcairo-2" was found`, the likely cause is that these three extra dependencies are missing.
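As a quick diagnostic, the sketch below asks the Python standard library whether the three native libraries can be located on the current system. The library names in `NATIVE_LIBS` are assumptions based on common Linux naming and may differ on Windows or macOS, so a missing entry is a hint rather than proof that WeasyPrint will fail.

```python
from ctypes.util import find_library

# Assumed library names (common Linux naming); adjust for your platform.
NATIVE_LIBS = ("cairo", "pango-1.0", "gdk_pixbuf-2.0")


def check_native_libs(names=NATIVE_LIBS):
    """Map each library name to the location found by ctypes, or None."""
    return {name: find_library(name) for name in names}


# Report which libraries the dynamic loader lookup could find.
for name, location in check_native_libs().items():
    status = location if location is not None else "NOT FOUND"
    print(f"{name}: {status}")
```

If any entry is reported as not found, follow the WeasyPrint installation instructions for your platform before retrying.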

Getting Started

The following is a summary of the common application scenarios of Genalog. Please refer to the Jupyter notebook examples, which make use of Genalog's core code base and repository utilities.

TLDR

If you are interested in a full document generation and degradation pipeline, please see the following notebook:

	Description	In-depth Jupyter Notebook Examples
1	Analog Document Generation Pipeline	Demo Notebook

Otherwise, we provide in-depth walkthroughs of each Genalog module.

	Steps	In-depth Jupyter Notebook Examples	Quick Start Guides
1	Create Template for Image Generation	Demo Notebook	Here is our guide to Document Generation
2	Degrade Prebuilt Images	Demo Notebook	Here is our guide to Image Degradation
3	Get Text From Images Using OCR	Demo Notebook	Here is our guide to Extracting Text
4	Align Text Produced from OCR with Ground Truth Text	Demo Notebook	Here is our guide to Text Alignment
5	NER Label Propagation from Ground Truth to OCR Tokens	Demo Notebook	Here is our guide to Label Propagation
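To give a feel for what step 4 produces, here is a minimal sketch of character-level alignment between ground truth and OCR output. It is not Genalog's implementation: it uses `difflib.SequenceMatcher` from the standard library as a stand-in, and the `align` function and the `@` gap character are choices made for this illustration only.

```python
from difflib import SequenceMatcher

GAP = "@"  # gap placeholder chosen for this sketch


def align(ground_truth, ocr_text, gap=GAP):
    """Pad both strings with gap characters so they have equal length."""
    aligned_gt, aligned_ocr = [], []
    matcher = SequenceMatcher(None, ground_truth, ocr_text, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        gt_part, ocr_part = ground_truth[i1:i2], ocr_text[j1:j2]
        # Pad the shorter side so both segments span the same width.
        width = max(len(gt_part), len(ocr_part))
        aligned_gt.append(gt_part.ljust(width, gap))
        aligned_ocr.append(ocr_part.ljust(width, gap))
    return "".join(aligned_gt), "".join(aligned_ocr)


gt = "New York is big"
ocr = "New Yo rkis big"
aligned_gt, aligned_ocr = align(gt, ocr)
print(aligned_gt)
print(aligned_ocr)
```

Once both strings have the same length, labels attached to ground-truth tokens can be mapped onto OCR tokens by position, which is the idea behind step 5.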

We also provide notebooks for the complete end-to-end scenario of generating a synthetic dataset connecting all the components of genalog:

	Scenario	In-depth Jupyter Notebook
1	Synthetic Dataset Generation with LABELED NER Dataset	Demo Notebook

Other Requirements:

If you want to use the OCR capabilities of Azure to extract text from the images, you will need the following resources:
1. Azure Cognitive Search Service (see its Quickstart Guide)
2. Azure Blob Storage (see its Quickstart Guide)
See the Azure Docs for more information on Azure Cognitive Search.

Package Release

Please see RELEASE.md for more details on the release process.

Development with the Repo

We use tox to orchestrate most of the CI procedure. This ensures maximum environment parity between local dev boxes and remote CI pipelines.

git clone https://github.com/microsoft/genalog.git
pip install tox
To run static analysis: `tox -e flake8`
To run the test suites: `tox -e -- -m "not azure"`

Repo Structure

genalog
├────genalog
│       ├──── generation                     # generate text images
│       ├──── degradation                    # methods for image degradation
│       ├──── ocr                            # running the Azure Search Pipeline
│       └──── text                           # methods to Align OCR Output Text with
├────devops                                  # CI/CD pipelines
├────docs                                    # containing online documentation
├────examples                                # example Jupyter Notebooks for Various
├────tests                                   # tests
├────tox.ini                                 # CI orchestration and configurations
├────README.md
└────LICENSE

Trademark Notice

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.

Microsoft Open Source Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Contribution Guidelines

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repositories using our CLA.

Citing `genalog`

If you find genalog helpful to your work, please consider citing our tool and paper using the following BibTeX entry:

@article{
  gupte2021genalog,
  title={Lights, Camera, Action! A Framework to Improve NLP Accuracy over OCR documents},
  author={Gupte, Amit and Romanov, Alexey and Mantravadi, Sahitya and Banda, Dalitso and Liu, Jianjie and Khan, Raza and Meenal, Lakshmanan Ramu and Han, Benjamin and Srinivasan, Soundar},
  journal={Document Intelligence Workshop at KDD 2021},
  year={2021}
}

Collaborators

Genalog was originally developed by the MAIDAP team at Microsoft Cambridge NERD in association with the Text Analytics Team in Redmond.
