elint-tech/wordmaze

License

MIT license

WordMaze

Words and textboxes made amazing.

About

WordMaze is a standardized format for text extracted from documents.

When designing OCR engines, developers have to decide how to give their clients the list of extracted textboxes, including their position on the page, the text they contain and the confidence associated with that extraction.

Many patterns arise in the wild, for instance:

    (x1, x2, y1, y2, text, confidence)  # a flat tuple
    ((x1, y1), (x2, y2), text, confidence)  # nested tuples
    {'x1': x1, 'x2': x2, 'y1': y1, 'y2': y2, 'text': text, 'confidence': confidence}  # a dict
    {'x': x1, 'y': y1, 'w': width, 'h': height, 'text': text, 'conf': confidence}  # another dict
    ...  # and many others

With WordMaze, textboxes are defined using a unified interface:

    from wordmaze import TextBox

    textbox = TextBox(x1=x1, x2=x2, y1=y1, y2=y2, text=text, confidence=confidence)
    # or
    textbox = TextBox(x1=x, width=w, y1=y, height=h, text=text, confidence=conf)
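To make the motivation concrete, here is a minimal sketch of the adapter code that the ad hoc formats listed above tend to force on client code. It uses only the standard library; `NormalizedBox` and `normalize` are hypothetical names chosen for illustration and are not part of WordMaze.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedBox:
    """Hypothetical stand-in for a unified textbox record (not WordMaze's TextBox)."""
    x1: float
    x2: float
    y1: float
    y2: float
    text: str
    confidence: float


def normalize(raw) -> NormalizedBox:
    """Convert one of the four ad hoc layouts listed above into a NormalizedBox."""
    if isinstance(raw, dict):
        if 'w' in raw:  # {'x', 'y', 'w', 'h', 'text', 'conf'}
            return NormalizedBox(
                x1=raw['x'], x2=raw['x'] + raw['w'],
                y1=raw['y'], y2=raw['y'] + raw['h'],
                text=raw['text'], confidence=raw['conf'],
            )
        # {'x1', 'x2', 'y1', 'y2', 'text', 'confidence'}
        return NormalizedBox(**raw)
    if len(raw) == 4:  # ((x1, y1), (x2, y2), text, confidence)
        (x1, y1), (x2, y2), text, confidence = raw
        return NormalizedBox(x1, x2, y1, y2, text, confidence)
    # (x1, x2, y1, y2, text, confidence)
    return NormalizedBox(*raw)


samples = [
    (3, 14, 15, 92, 'hello', 0.85),
    ((3, 15), (14, 92), 'hello', 0.85),
    {'x1': 3, 'x2': 14, 'y1': 15, 'y2': 92, 'text': 'hello', 'confidence': 0.85},
    {'x': 3, 'y': 15, 'w': 11, 'h': 77, 'text': 'hello', 'conf': 0.85},
]
normalized = [normalize(raw) for raw in samples]
# All four layouts describe the same textbox once normalized.
assert len(set(normalized)) == 1
```

Every provider-specific layout needs its own branch in a function like this. A shared textbox type removes the need for that branching.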

Usage

Perhaps the best example of usage is `pdfmap.PDFMaze`, the first application of WordMaze in a public repository.

The exact expected behaviour of every piece of code in WordMaze can be checked in the `tests` folder.

There are three main groups of objects defined in WordMaze:

Textboxes

`Box`es

The first and most fundamental (data)class is the `Box`, which contains only positional information of a textbox inside a document's page:

    from wordmaze import Box

    box1 = Box(x1=3, x2=14, y1=15, y2=92)  # using coordinates
    box2 = Box(x1=3, width=11, y1=15, height=77)  # using coordinates and sizes
    box3 = Box(x1=3, x2=14, y2=92, height=77)  # mixing everything

We enforce `x1 <= x2` and `y1 <= y2` (if `x1 > x2`, for instance, their values are automatically swapped upon initialization). Whether `(y1, y2)` means `(top, bottom)` or `(bottom, top)` depends on the context.
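The coordinate rules above can be sketched for a single axis as follows. This is not WordMaze's implementation: `resolve_axis` is a hypothetical helper, and requiring exactly two of the three values is an assumption made for this sketch, since the README does not say how over- or under-specified arguments are handled.

```python
from typing import Optional, Tuple


def resolve_axis(
    start: Optional[float] = None,
    end: Optional[float] = None,
    size: Optional[float] = None,
) -> Tuple[float, float]:
    """Return (low, high) from any two of start, end and size."""
    given = sum(value is not None for value in (start, end, size))
    if given != 2:
        # Assumption: exactly two values are required in this sketch.
        raise ValueError('provide exactly two of start, end and size')
    if start is None:
        start = end - size
    elif end is None:
        end = start + size
    # Enforce low <= high by swapping when needed.
    return (start, end) if start <= end else (end, start)


print(resolve_axis(start=3, end=14))   # (3, 14)
print(resolve_axis(start=3, size=11))  # (3, 14)
print(resolve_axis(end=92, size=77))   # (15, 92)
print(resolve_axis(start=14, end=3))   # (3, 14)
```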

Boxes expose a few attributes that simplify further calculations:

    from wordmaze import Box

    box = Box(x1=1, x2=3, y1=10, y2=22)
    # coordinates:
    print(box.x1)  # 1
    print(box.x2)  # 3
    print(box.y1)  # 10
    print(box.y2)  # 22
    # sizes:
    print(box.height)  # 12
    print(box.width)  # 2
    # midpoints:
    print(box.xmid)  # 2
    print(box.ymid)  # 16

`TextBox`es

To include textual information in a textbox, use a `TextBox`:

    from wordmaze import TextBox

    textbox = TextBox(
        # Box arguments:
        x1=3,
        x2=14,
        y1=15,
        height=77,
        # textual content:
        text='Dr. White.',
        # confidence with which this text was extracted:
        confidence=0.85  # 85% confidence
    )

Note that `TextBox`es inherit from `Box`es, so you can inspect `.x1`, `.width` and so on as shown previously. Moreover, you have two more properties:

    # textbox from the previous example
    print(textbox.text)  # Dr. White.
    print(textbox.confidence)  # 0.85

`PageTextBox`es

If you also wish to include the page number from which your textbox was extracted, you can use a `PageTextBox`:

    from wordmaze import PageTextBox

    textbox = PageTextBox(
        # TextBox arguments:
        x1=2,
        x2=10,
        y1=5,
        height=20,
        text='Sichermann and Sichelero and the same person!',
        confidence=0.6,
        # page info:
        page=3  # this textbox was extracted from the 4th page of the document
    )
    print(textbox.page)  # 3

Note that page counting starts from `0`, as is common in Python, so page #3 is the 4th page of the document.

Pages

The basics

Pages are a representation of a document's page. They contain information regarding their size, their coordinate system's origin and their textboxes. For instance:

    from wordmaze import Page, Shape, Origin

    page = Page(
        shape=Shape(height=210, width=297),  # A4 page size in mm
        origin=Origin.TOP_LEFT
    )
    print(page.shape.height)  # 210
    print(page.shape.width)  # 297
    print(page.origin)  # Origin.TOP_LEFT

A `Page` is a `MutableSequence` of `TextBox`es:

    page = Page(
        shape=Shape(height=210, width=297),  # A4 page size in mm
        origin=Origin.TOP_LEFT,
        entries=[  # define textboxes at initialization
            TextBox(...),
            TextBox(...),
            ...
        ]
    )

    page.append(TextBox(...))  # list-like append
    for textbox in page:  # iteration
        assert isinstance(textbox, TextBox)
    print(page[3])  # 4th textbox

Different origins

There are two `Origin`s your page may have:

- `Origin.TOP_LEFT`: `y == 0` means top, `y == page.shape.height` means bottom;
- `Origin.BOTTOM_LEFT`: `y == 0` means bottom, `y == page.shape.height` means top.

If a textbox provider returns textboxes in `Origin.BOTTOM_LEFT` coordinates, but you'd like to have them in `Origin.TOP_LEFT` coordinates, you can use `Page.rebase` as follows:

    bad_page = Page(
        shape=Shape(width=10, height=10),
        origin=Origin.BOTTOM_LEFT,
        entries=[
            TextBox(x1=2, x2=3, y1=7, y2=8, text='Lofi defi', confidence=0.99)
        ]
    )
    nice_page = bad_page.rebase(Origin.TOP_LEFT)
    assert nice_page.shape == bad_page.shape  # rebasing preserves page shape
    print(nice_page[0].y1, nice_page[0].y2)  # 2 3
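The arithmetic behind the printed `2 3` can be sketched as a vertical flip. This assumes that rebasing between the two origins amounts to mirroring `y` coordinates against the page height, which is consistent with the output above; `flip_vertical` is a hypothetical helper, not WordMaze's code.

```python
from typing import Tuple


def flip_vertical(y1: float, y2: float, page_height: float) -> Tuple[float, float]:
    """Convert a vertical span between TOP_LEFT and BOTTOM_LEFT coordinates."""
    # A distance measured from one edge becomes page_height minus that distance
    # when measured from the opposite edge; sorting keeps y1 <= y2.
    flipped = sorted((page_height - y1, page_height - y2))
    return flipped[0], flipped[1]


print(flip_vertical(7, 8, page_height=10))  # (2, 3)
# Flipping twice gives back the original span.
assert flip_vertical(*flip_vertical(7, 8, page_height=10), page_height=10) == (7, 8)
```

Horizontal coordinates are untouched in this sketch because both origins sit on the left edge.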

Transforming and filtering `TextBox`es

You can easily modify and filter out `TextBox`es contained in a `Page` using `Page.map` and `Page.filter`, which behave like `map` and `filter` where the iterable is fixed and equal to the page's textboxes:

    page = Page(...)

    def pad(textbox: TextBox, horizontal, vertical) -> TextBox:
        return TextBox(
            x1=textbox.x1 - horizontal,
            x2=textbox.x2 + horizontal,
            y1=textbox.y1 - vertical,
            y2=textbox.y2 + vertical,
            text=textbox.text,
            confidence=textbox.confidence
        )

    # get a new page with textboxes padded by 3 to the left and to the right
    # and by 5 to the top and to the bottom
    padded_page = page.map(lambda textbox: pad(textbox, horizontal=3, vertical=5))
    # filters out textboxes with low confidence
    good_page = padded_page.filter(lambda textbox: textbox.confidence >= 0.25)

`Page.map` and `Page.filter` also accept keyword arguments. Each keyword names a textbox property and takes a function that operates on that property's value. The previous padding and filtering can be equivalently written as:

    # get a new page with textboxes padded by 3 to the left and to the right
    # and by 5 to the top and to the bottom
    padded_page = page.map(
        x1=lambda x1: x1 - 3,
        x2=lambda x2: x2 + 3,
        y1=lambda y1: y1 - 5,
        y2=lambda y2: y2 + 5,
    )
    # filters out textboxes with low confidence
    good_page = padded_page.filter(confidence=lambda conf: conf >= 0.25)
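For readers curious how keyword-based mapping and filtering might work, here is a minimal standard-library sketch. `Item`, `map_fields` and `filter_fields` are hypothetical stand-ins, not WordMaze's classes or methods. Combining several keyword predicates with a logical AND is an assumption of this sketch.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for a textbox with a few fields."""
    x1: float
    x2: float
    confidence: float


def map_fields(items: List[Item], **funcs: Callable) -> List[Item]:
    """Apply each keyword's function to the field of the same name."""
    return [
        replace(item, **{name: func(getattr(item, name)) for name, func in funcs.items()})
        for item in items
    ]


def filter_fields(items: List[Item], **preds: Callable) -> List[Item]:
    """Keep items whose named fields satisfy every predicate (assumed AND)."""
    return [
        item for item in items
        if all(pred(getattr(item, name)) for name, pred in preds.items())
    ]


items = [Item(x1=5, x2=9, confidence=0.9), Item(x1=1, x2=2, confidence=0.1)]
padded = map_fields(items, x1=lambda x1: x1 - 3, x2=lambda x2: x2 + 3)
good = filter_fields(padded, confidence=lambda conf: conf >= 0.25)
print(good)  # [Item(x1=2, x2=12, confidence=0.9)]
```

Fields that are not named in a call are left unchanged by `map_fields` and ignored by `filter_fields`.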

`tuple`s and `dict`s

You can also convert a page's textboxes to `tuple`s or `dict`s with `Page.tuples` and `Page.dicts`:

    page = Page(...)

    for tpl in page.tuples():
        # prints a tuple in the form
        # (x1, x2, y1, y2, text, confidence)
        print(tpl)

    for dct in page.dicts():
        # prints a dict in the form
        # {'x1': x1, 'x2': x2, 'y1': y1, 'y2': y2, 'text': text, 'confidence': confidence}
        print(dct)
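The tuple and dict layouts above can be reproduced with a plain dataclass, which may help when writing tests or fixtures without WordMaze installed. `Record` is a hypothetical stand-in whose field order mirrors the tuple form shown above; it is not a WordMaze class.

```python
from dataclasses import asdict, astuple, dataclass


@dataclass
class Record:
    """Hypothetical stand-in with the same field order as the tuples above."""
    x1: float
    x2: float
    y1: float
    y2: float
    text: str
    confidence: float


record = Record(x1=3, x2=14, y1=15, y2=92, text='Dr. White.', confidence=0.85)
print(astuple(record))  # (3, 14, 15, 92, 'Dr. White.', 0.85)
print(asdict(record))
# {'x1': 3, 'x2': 14, 'y1': 15, 'y2': 92, 'text': 'Dr. White.', 'confidence': 0.85}
```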

`WordMaze`s

The top-level class from WordMaze is, of course, a `WordMaze`. `WordMaze`s are simply sequences of `Page`s:

    from wordmaze import WordMaze

    wm = WordMaze([
        Page(...),
        Page(...),
        ...
    ])

    for page in wm:  # iterating
        print(page.shape)

    first_page = wm[0]  # indexing

`WordMaze` objects also provide `WordMaze.map` and `WordMaze.filter` functions, which work the same way as `Page.map` and `Page.filter`.

If you wish to access the shapes of a `WordMaze`'s pages, there is the property `WordMaze.shapes`, which is a `tuple` satisfying `wm.shapes[N] == wm[N].shape`.

Additionally, you can iterate over a `WordMaze`'s textboxes in two ways:

    wm = WordMaze(...)

    # 1
    for page in wm:
        for textbox in page:
            print(textbox)

    # 2
    for textbox in wm.textboxes():
        print(textbox)

The main difference between #1 and #2 is that the textboxes in #1 are instances of `TextBox`, whereas the ones in #2 are `PageTextBox`es, which include the index of their containing page.
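The second iteration style is conceptually a flattening of the nested loop that tags each entry with its page index. A simplified sketch, in which pages are plain lists and `with_page_index` is a hypothetical helper rather than a WordMaze function:

```python
from typing import Iterator, List, Tuple

# Each page is a list of (text, confidence) pairs in this simplified sketch.
Pages = List[List[Tuple[str, float]]]


def with_page_index(pages: Pages) -> Iterator[Tuple[str, float, int]]:
    """Yield every entry together with the 0-based index of its page."""
    for page_number, page in enumerate(pages):
        for text, confidence in page:
            yield text, confidence, page_number


pages = [[('first', 0.9)], [], [('second', 0.8), ('third', 0.7)]]
print(list(with_page_index(pages)))
# [('first', 0.9, 0), ('second', 0.8, 2), ('third', 0.7, 2)]
```

Empty pages contribute no entries but still advance the page index, so the index always refers to the position of the page in the document.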

`WordMaze` objects also have `WordMaze.tuples` and `WordMaze.dicts`, which behave just like their `Page` counterparts except that they also return the page number:

    wm = WordMaze(...)

    for tpl in wm.tuples():
        # prints a tuple in the form
        # (x1, x2, y1, y2, text, confidence, page_number)
        print(tpl)

    for dct in wm.dicts():
        # prints a dict in the form
        # {'x1': x1, 'x2': x2, 'y1': y1, 'y2': y2, 'text': text, 'confidence': confidence, 'page': page_number}
        print(dct)

Installing

Install WordMaze from PyPI:

pip install wordmaze

Projects using WordMaze

elint-tech/pdfmap: easily extract textboxes from PDF files.

Releases

Latest release: v0.3.6 (Jun 4, 2021); 8 releases in total. PyPI page: pypi.org/project/wordmaze/
