Words and textboxes made amazing.
elint-tech/wordmaze
WordMaze is a standardized format for text extracted from documents.
When designing OCR engines, developers have to decide how to give their clients the list of extracted textboxes, including their position on the page, the text they contain and the confidence associated with that extraction.
Many patterns arise in the wild, for instance:
    (x1, x2, y1, y2, text, confidence)  # a flat tuple
    ((x1, y1), (x2, y2), text, confidence)  # nested tuples
    {'x1': x1, 'x2': x2, 'y1': y1, 'y2': y2, 'text': text, 'confidence': confidence}  # a dict
    {'x': x1, 'y': y1, 'w': width, 'h': height, 'text': text, 'conf': confidence}  # another dict
    ...  # and many others
With WordMaze, textboxes are defined using a unified interface:
    from wordmaze import TextBox

    textbox = TextBox(x1=x1, x2=x2, y1=y1, y2=y2, text=text, confidence=confidence)
    # or
    textbox = TextBox(x1=x, width=w, y1=y, height=h, text=text, confidence=conf)
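To make the motivation concrete, here is a minimal standalone sketch of the conversion code that a unified interface saves you from writing by hand. It does not use WordMaze: the helper names `from_flat_tuple` and `from_xywh_dict` and the canonical dict layout are assumptions made purely for illustration.

```python
from typing import Any, Dict, Tuple

Canonical = Dict[str, Any]


def from_flat_tuple(record: Tuple[float, float, float, float, str, float]) -> Canonical:
    """Convert a flat (x1, x2, y1, y2, text, confidence) tuple (hypothetical helper)."""
    x1, x2, y1, y2, text, confidence = record
    return {'x1': x1, 'x2': x2, 'y1': y1, 'y2': y2,
            'text': text, 'confidence': confidence}


def from_xywh_dict(record: Dict[str, Any]) -> Canonical:
    """Convert a {'x', 'y', 'w', 'h', 'text', 'conf'} dict (hypothetical helper)."""
    return {
        'x1': record['x'],
        'x2': record['x'] + record['w'],  # right edge = left edge + width
        'y1': record['y'],
        'y2': record['y'] + record['h'],  # second edge = first edge + height
        'text': record['text'],
        'confidence': record['conf'],
    }


# The same textbox, as two different providers might report it.
a = from_flat_tuple((3, 14, 15, 92, 'Dr. White.', 0.85))
b = from_xywh_dict({'x': 3, 'y': 15, 'w': 11, 'h': 77,
                    'text': 'Dr. White.', 'conf': 0.85})
print(a == b)  # True
```

Every additional provider format needs another converter of this kind, which is the duplication a shared textbox type is meant to remove.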
Perhaps the best example of usage is `pdfmap.PDFMaze`, the first application of WordMaze in a public repository.
The exact expected behaviour of every piece of code in WordMaze can be checked in the tests folder.
There are three main groups of objects defined in WordMaze: textboxes (`Box`, `TextBox` and `PageTextBox`), pages (`Page`) and the top-level `WordMaze`.
The first and most fundamental (data)class is the `Box`, which contains only the positional information of a textbox inside a document's page:
    from wordmaze import Box

    box1 = Box(x1=3, x2=14, y1=15, y2=92)  # using coordinates
    box2 = Box(x1=3, width=11, y1=15, height=77)  # using coordinates and sizes
    box3 = Box(x1=3, x2=14, y2=92, height=77)  # mixing everything
We enforce `x1 <= x2` and `y1 <= y2` (if `x1 > x2`, for instance, their values are automatically swapped upon initialization). Whether `(y1, y2)` means `(top, bottom)` or `(bottom, top)` depends on the context.
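The rule can be illustrated for a single axis without WordMaze. The sketch below is an assumption about how such a resolution could work, not WordMaze's actual implementation; `resolve_interval` is a hypothetical helper that derives the missing coordinate from a size and then swaps the pair if it is out of order.

```python
from typing import Optional, Tuple


def resolve_interval(
    low: Optional[float] = None,
    high: Optional[float] = None,
    size: Optional[float] = None,
) -> Tuple[float, float]:
    """Return (low, high) given any two of low, high and size (hypothetical helper)."""
    if low is not None and high is not None:
        pass  # both coordinates given; size, if any, is ignored in this sketch
    elif low is not None and size is not None:
        high = low + size
    elif high is not None and size is not None:
        low = high - size
    else:
        raise ValueError('need at least two of: low, high, size')
    # Enforce low <= high by swapping when needed.
    return (low, high) if low <= high else (high, low)


print(resolve_interval(low=3, size=11))    # (3, 14)
print(resolve_interval(high=92, size=77))  # (15, 92)
print(resolve_interval(low=14, high=3))    # (3, 14), swapped
```

The three calls mirror the `x` axis of `box2`, the `y` axis of `box3`, and the swapping case described above.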
`Box`es have some useful attributes to facilitate further calculations:
    from wordmaze import Box

    box = Box(x1=1, x2=3, y1=10, y2=22)
    # coordinates:
    print(box.x1)  # 1
    print(box.x2)  # 3
    print(box.y1)  # 10
    print(box.y2)  # 22
    # sizes:
    print(box.height)  # 12
    print(box.width)  # 2
    # midpoints:
    print(box.xmid)  # 2
    print(box.ymid)  # 16
To include textual information in a textbox, use a `TextBox`:
    from wordmaze import TextBox

    textbox = TextBox(
        # Box arguments:
        x1=3,
        x2=14,
        y1=15,
        height=77,
        # textual content:
        text='Dr. White.',
        # confidence with which this text was extracted:
        confidence=0.85  # 85% confidence
    )
Note that `TextBox`es inherit from `Box`es, so you can inspect `.x1`, `.width` and so on as shown previously. Moreover, you have two more properties:
    # textbox from the previous example
    print(textbox.text)  # Dr. White.
    print(textbox.confidence)  # 0.85
If you also wish to include the page number from which your textbox was extracted, you can use a `PageTextBox`:
    from wordmaze import PageTextBox

    textbox = PageTextBox(
        # TextBox arguments:
        x1=2,
        x2=10,
        y1=5,
        height=20,
        text='Sichermann and Sichelero and the same person!',
        confidence=0.6,
        # page info:
        page=3  # this textbox was extracted from the 4th page of the document
    )
    print(textbox.page)  # 3
Note that page counting starts from `0`, as is common in Python, so that page #3 is the 4th page of the document.
`Page`s are a representation of a document's page. They contain information regarding their size, their coordinate system's origin and their textboxes. For instance:
    from wordmaze import Page, Shape, Origin

    page = Page(
        shape=Shape(height=210, width=297),  # A4 page size in mm (landscape)
        origin=Origin.TOP_LEFT
    )
    print(page.shape.height)  # 210
    print(page.shape.width)  # 297
    print(page.origin)  # Origin.TOP_LEFT
A `Page` is a `MutableSequence` of `TextBox`es:
    from wordmaze import Origin, Page, Shape, TextBox

    page = Page(
        shape=Shape(height=210, width=297),  # A4 page size in mm (landscape)
        origin=Origin.TOP_LEFT,
        entries=[  # define textboxes at initialization
            TextBox(...),
            TextBox(...),
            ...
        ]
    )
    page.append(TextBox(...))  # list-like append
    for textbox in page:  # iteration
        assert isinstance(textbox, TextBox)
    print(page[3])  # 4th textbox
There are two `Origin`s your page may have:
- `Origin.TOP_LEFT`: `y == 0` means top, `y == page.shape.height` means bottom;
- `Origin.BOTTOM_LEFT`: `y == 0` means bottom, `y == page.shape.height` means top.
If a textbox provider returns textboxes in `Origin.BOTTOM_LEFT` coordinates, but you'd like to have them in `Origin.TOP_LEFT` coordinates, you can use `Page.rebase` as follows:
    from wordmaze import Origin, Page, Shape, TextBox

    bad_page = Page(
        shape=Shape(width=10, height=10),
        origin=Origin.BOTTOM_LEFT,
        entries=[
            TextBox(x1=2, x2=3, y1=7, y2=8, text='Lofi defi', confidence=0.99)
        ]
    )
    nice_page = bad_page.rebase(Origin.TOP_LEFT)
    assert nice_page.shape == bad_page.shape  # rebasing preserves page shape
    print(nice_page[0].y1, nice_page[0].y2)  # 2 3
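The numbers in this example are consistent with mirroring the `y` coordinates about the page height. The standalone sketch below reproduces that arithmetic under this assumption; `flip_vertical` is a hypothetical helper and is not taken from WordMaze's source.

```python
from typing import Tuple


def flip_vertical(y1: float, y2: float, page_height: float) -> Tuple[float, float]:
    """Mirror a vertical interval with y1 <= y2 about the page height (hypothetical helper)."""
    # The old upper bound becomes the new lower bound, so y1 <= y2 still holds.
    new_y1 = page_height - y2
    new_y2 = page_height - y1
    return new_y1, new_y2


print(flip_vertical(7, 8, page_height=10))  # (2, 3)
```

Applying the same flip twice returns the original interval, which is why one function covers both directions between the two origins.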
You can easily modify and filter the `TextBox`es contained in a `Page` using `Page.map` and `Page.filter`, which behave like `map` and `filter` with the iterable fixed to the page's textboxes:
    from wordmaze import Page, TextBox

    page = Page(...)

    def pad(textbox: TextBox, horizontal, vertical) -> TextBox:
        return TextBox(
            x1=textbox.x1 - horizontal,
            x2=textbox.x2 + horizontal,
            y1=textbox.y1 - vertical,
            y2=textbox.y2 + vertical,
            text=textbox.text,
            confidence=textbox.confidence
        )

    # get a new page with textboxes padded by 3 to the left and to the right
    # and by 5 to the top and to the bottom
    padded_page = page.map(lambda textbox: pad(textbox, horizontal=3, vertical=5))
    # filters out textboxes with low confidence
    good_page = padded_page.filter(lambda textbox: textbox.confidence >= 0.25)
`Page.map` and `Page.filter` also accept keyword arguments. Each keyword names a textbox property and takes a function that operates on that property's value. The previous padding and filtering can be equivalently written as:
    # get a new page with textboxes padded by 3 to the left and to the right
    # and by 5 to the top and to the bottom
    padded_page = page.map(
        x1=lambda x1: x1 - 3,
        x2=lambda x2: x2 + 3,
        y1=lambda y1: y1 - 5,
        y2=lambda y2: y2 + 5,
    )
    # filters out textboxes with low confidence
    good_page = padded_page.filter(confidence=lambda conf: conf >= 0.25)
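To show the idea behind the keyword form, here is a minimal sketch that applies per-field functions to plain dicts instead of WordMaze objects. The helpers `map_fields` and `filter_fields` are hypothetical, and combining several filter keywords with a logical AND is an assumption of this sketch rather than documented WordMaze behaviour.

```python
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]


def map_fields(records: Iterable[Record], **funcs: Callable[[Any], Any]) -> List[Record]:
    """Apply each keyword's function to the field of the same name."""
    return [
        {key: funcs[key](value) if key in funcs else value
         for key, value in record.items()}
        for record in records
    ]


def filter_fields(records: Iterable[Record], **preds: Callable[[Any], bool]) -> List[Record]:
    """Keep the records for which every keyword's predicate holds."""
    return [
        record for record in records
        if all(pred(record[key]) for key, pred in preds.items())
    ]


records = [
    {'x1': 10, 'x2': 20, 'text': 'keep', 'confidence': 0.9},
    {'x1': 30, 'x2': 40, 'text': 'drop', 'confidence': 0.1},
]
padded = map_fields(records, x1=lambda x1: x1 - 3, x2=lambda x2: x2 + 3)
good = filter_fields(padded, confidence=lambda conf: conf >= 0.25)
print(good)  # [{'x1': 7, 'x2': 23, 'text': 'keep', 'confidence': 0.9}]
```

Fields without a matching keyword pass through unchanged, so the keyword form only needs to mention the properties being modified or tested.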
You can also convert a page's textboxes to `tuple`s or `dict`s with `Page.tuples` and `Page.dicts`:
    page = Page(...)

    for tpl in page.tuples():
        # prints a tuple in the form
        # (x1, x2, y1, y2, text, confidence)
        print(tpl)

    for dct in page.dicts():
        # prints a dict in the form
        # {'x1': x1, 'x2': x2, 'y1': y1, 'y2': y2, 'text': text, 'confidence': confidence}
        print(dct)
The top-level class from WordMaze is, of course, a `WordMaze`. `WordMaze`s are simply sequences of `Page`s:
    from wordmaze import WordMaze

    wm = WordMaze([
        Page(...),
        Page(...),
        ...
    ])
    for page in wm:  # iterating
        print(page.shape)
    first_page = wm[0]  # indexing
`WordMaze` objects also provide the `WordMaze.map` and `WordMaze.filter` functions, which work the same way as `Page.map` and `Page.filter`.
If you wish to access the shapes of a `WordMaze`'s pages, use the property `WordMaze.shapes`, which is a `tuple` satisfying `wm.shapes[N] == wm[N].shape`.
Additionally, you can iterate over a `WordMaze`'s textboxes in two ways:
    wm = WordMaze(...)

    # 1
    for page in wm:
        for textbox in page:
            print(textbox)

    # 2
    for textbox in wm.textboxes():
        print(textbox)
The main difference between #1 and #2 is that the textboxes in #1 are instances of `TextBox`, whereas the ones in #2 are `PageTextBox`es, which include the index of their containing page.
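The relationship between the two iteration styles can be sketched with plain lists and dicts. This is only an illustration of the idea, assuming the page index comes from the page's position in the sequence; `iter_with_page` is a hypothetical helper, not part of WordMaze.

```python
from typing import Any, Dict, Iterator, List


def iter_with_page(pages: List[List[Dict[str, Any]]]) -> Iterator[Dict[str, Any]]:
    """Yield every textbox with its zero-based page index attached."""
    for page_number, page in enumerate(pages):
        for textbox in page:
            # Copy the textbox so the input data is left untouched.
            yield {**textbox, 'page': page_number}


pages = [
    [{'text': 'first'}, {'text': 'second'}],
    [],  # an empty page still counts towards the numbering
    [{'text': 'third'}],
]
flat = list(iter_with_page(pages))
print([(t['text'], t['page']) for t in flat])
# [('first', 0), ('second', 0), ('third', 2)]
```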
`WordMaze` objects also have `WordMaze.tuples` and `WordMaze.dicts`, which behave just like their `Page` counterparts except that they also return their page's number:
    wm = WordMaze(...)

    for tpl in wm.tuples():
        # prints a tuple in the form
        # (x1, x2, y1, y2, text, confidence, page_number)
        print(tpl)

    for dct in wm.dicts():
        # prints a dict in the form
        # {'x1': x1, 'x2': x2, 'y1': y1, 'y2': y2, 'text': text, 'confidence': confidence, 'page': page_number}
        print(dct)
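Because these dicts are flat, they fit standard tabular tooling. The sketch below writes rows of that shape to CSV using only the standard library. The rows are hand-written stand-ins for what `wm.dicts()` is described as yielding, and `to_csv` is a hypothetical helper.

```python
import csv
import io
from typing import Any, Dict, Iterable

FIELDS = ['x1', 'x2', 'y1', 'y2', 'text', 'confidence', 'page']


def to_csv(rows: Iterable[Dict[str, Any]]) -> str:
    """Serialize textbox dicts with the keys listed in FIELDS to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Stand-in data in the documented dict form.
rows = [
    {'x1': 2, 'x2': 3, 'y1': 7, 'y2': 8,
     'text': 'Lofi defi', 'confidence': 0.99, 'page': 0},
]
csv_text = to_csv(rows)
print(csv_text)
```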
Install WordMaze from PyPI:
    pip install wordmaze
- elint-tech/pdfmap: easily extract textboxes from PDF files.