
Words and textboxes made amazing.

elint-tech/wordmaze

About

WordMaze is a standardized format for text extracted from documents.

When designing OCR engines, developers have to decide how to give their clients the list of extracted textboxes, including their position in the page, the text they contain and the confidence associated with that extraction.

Many patterns arise in the wild, for instance:

    (x1, x2, y1, y2, text, confidence)  # a flat tuple
    ((x1, y1), (x2, y2), text, confidence)  # nested tuples
    {'x1': x1, 'x2': x2, 'y1': y1, 'y2': y2, 'text': text, 'confidence': confidence}  # a dict
    {'x': x1, 'y': y1, 'w': width, 'h': height, 'text': text, 'conf': confidence}  # another dict
    ...  # and many others
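To make the problem concrete, the sketch below converts each of the four layouts listed above into one common dict, using only the standard library. The function name `normalize` and the rules it uses to tell the layouts apart are assumptions made for illustration; none of this is part of WordMaze.

```python
def normalize(raw):
    """Convert one of the ad-hoc textbox layouts into a single dict form.

    Hypothetical helper for illustration only; not part of WordMaze.
    """
    if isinstance(raw, dict):
        if 'w' in raw:  # {'x', 'y', 'w', 'h', 'text', 'conf'} layout
            x1, y1 = raw['x'], raw['y']
            x2, y2 = x1 + raw['w'], y1 + raw['h']
            text, confidence = raw['text'], raw['conf']
        else:  # {'x1', 'x2', 'y1', 'y2', 'text', 'confidence'} layout
            x1, x2, y1, y2 = raw['x1'], raw['x2'], raw['y1'], raw['y2']
            text, confidence = raw['text'], raw['confidence']
    elif len(raw) == 4:  # ((x1, y1), (x2, y2), text, confidence)
        (x1, y1), (x2, y2), text, confidence = raw
    else:  # (x1, x2, y1, y2, text, confidence)
        x1, x2, y1, y2, text, confidence = raw
    return {'x1': x1, 'x2': x2, 'y1': y1, 'y2': y2,
            'text': text, 'confidence': confidence}


samples = [
    (3, 14, 15, 92, 'hello', 0.9),
    ((3, 15), (14, 92), 'hello', 0.9),
    {'x1': 3, 'x2': 14, 'y1': 15, 'y2': 92, 'text': 'hello', 'confidence': 0.9},
    {'x': 3, 'y': 15, 'w': 11, 'h': 77, 'text': 'hello', 'conf': 0.9},
]
normalized = [normalize(sample) for sample in samples]
# All four inputs describe the same textbox, so the results are identical.
assert all(item == normalized[0] for item in normalized)
```

Every consumer of OCR output would otherwise need a helper of this kind for each engine it talks to.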

With WordMaze, textboxes are defined using a unified interface:

    from wordmaze import TextBox

    textbox = TextBox(x1=x1, x2=x2, y1=y1, y2=y2, text=text, confidence=confidence)
    # or
    textbox = TextBox(x1=x, width=w, y1=y, height=h, text=text, confidence=conf)

Usage

Perhaps the best example of usage is pdfmap.PDFMaze, the first application of WordMaze in a public repository.

The exact expected behaviour of every piece of code in WordMaze can be checked in the tests folder.

There are three main groups of objects defined in WordMaze: textboxes, pages and WordMazes.

Textboxes

Boxes

The first and most fundamental (data)class is the Box, which contains only positional information of a textbox inside a document's page:

    from wordmaze import Box

    box1 = Box(x1=3, x2=14, y1=15, y2=92)  # using coordinates
    box2 = Box(x1=3, width=11, y1=15, height=77)  # using coordinates and sizes
    box3 = Box(x1=3, x2=14, y2=92, height=77)  # mixing everything

We enforce x1 <= x2 and y1 <= y2 (if x1 > x2, for instance, their values are automatically swapped upon initialization). Whether (y1, y2) means (top, bottom) or (bottom, top) depends on the context.
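The following sketch shows one way such a constructor could resolve a mix of coordinates and sizes and then enforce the ordering rule. `SketchBox`, `resolve` and `make_box` are hypothetical names; this is not WordMaze's actual Box implementation, and it assumes at least two of the three values are given per axis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchBox:
    """Hypothetical stand-in for a box; not WordMaze's actual Box class."""
    x1: float
    x2: float
    y1: float
    y2: float

    @property
    def width(self):
        return self.x2 - self.x1

    @property
    def height(self):
        return self.y2 - self.y1


def resolve(lo=None, hi=None, size=None):
    """Return an ordered (lo, hi) pair from any two of lo, hi and size."""
    if lo is None:
        lo = hi - size
    elif hi is None:
        hi = lo + size
    # Swap so that lo <= hi always holds.
    return (lo, hi) if lo <= hi else (hi, lo)


def make_box(x1=None, x2=None, y1=None, y2=None, width=None, height=None):
    """Build a SketchBox from coordinates, sizes, or a mix of both."""
    x1, x2 = resolve(x1, x2, width)
    y1, y2 = resolve(y1, y2, height)
    return SketchBox(x1=x1, x2=x2, y1=y1, y2=y2)


box_a = make_box(x1=3, x2=14, y1=15, y2=92)
box_b = make_box(x1=3, width=11, y1=15, height=77)
box_c = make_box(x1=3, x2=14, y2=92, height=77)
swapped = make_box(x1=14, x2=3, y1=15, y2=92)  # x1 > x2 on purpose
assert box_a == box_b == box_c == swapped
```

All four calls describe the same rectangle, which is why the final assertion holds.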

Boxes have some interesting attributes to facilitate further calculation using them:

    from wordmaze import Box

    box = Box(x1=1, x2=3, y1=10, y2=22)
    # coordinates:
    print(box.x1)  # 1
    print(box.x2)  # 3
    print(box.y1)  # 10
    print(box.y2)  # 22
    # sizes:
    print(box.height)  # 12
    print(box.width)  # 2
    # midpoints:
    print(box.xmid)  # 2
    print(box.ymid)  # 16

Textboxes

To include textual information in a textbox, use a TextBox:

    from wordmaze import TextBox

    textbox = TextBox(
        # Box arguments:
        x1=3,
        x2=14,
        y1=15,
        height=77,
        # textual content:
        text='Dr. White.',
        # confidence with which this text was extracted:
        confidence=0.85  # 85% confidence
    )

Note that TextBoxes inherit from Boxes, so you can inspect .x1, .width and so on as shown previously. Moreover, you have two more properties:

    # textbox from the previous example
    print(textbox.text)  # Dr. White.
    print(textbox.confidence)  # 0.85

PageTextBoxes

If you also wish to include the page number from which your textbox was extracted, you can use a PageTextBox:

    from wordmaze import PageTextBox

    textbox = PageTextBox(
        # TextBox arguments:
        x1=2,
        x2=10,
        y1=5,
        height=20,
        text='Sichermann and Sichelero and the same person!',
        confidence=0.6,
        # page info:
        page=3  # this textbox was extracted from the 4th page of the document
    )
    print(textbox.page)  # 3

Note that page counting starts from 0 as is common in Python, so that page #3 is the 4th page of the document.

Pages

The basics

Pages are a representation of a document's page. They contain information regarding their size, their coordinate system's origin and their textboxes. For instance:

    from wordmaze import Page, Shape, Origin

    page = Page(
        shape=Shape(height=210, width=297),  # A4 page size in mm
        origin=Origin.TOP_LEFT
    )
    print(page.shape.height)  # 210
    print(page.shape.width)  # 297
    print(page.origin)  # Origin.TOP_LEFT

A Page is a MutableSequence of TextBoxes:

    page = Page(
        shape=Shape(height=210, width=297),  # A4 page size in mm
        origin=Origin.TOP_LEFT,
        entries=[  # define textboxes at initialization
            TextBox(...),
            TextBox(...),
            ...
        ]
    )

    page.append(TextBox(...))  # list-like append
    for textbox in page:  # iteration
        assert isinstance(textbox, TextBox)
    print(page[3])  # 4th textbox

Different origins

There are two Origins your page may have:

  • Origin.TOP_LEFT: y == 0 means top, y == page.shape.height means bottom;
  • Origin.BOTTOM_LEFT: y == 0 means bottom, y == page.shape.height means top.

If one textbox provider returned textboxes in Origin.BOTTOM_LEFT coordinates, but you'd like to have them in Origin.TOP_LEFT coordinates, you can use Page.rebase as follows:

    bad_page = Page(
        shape=Shape(width=10, height=10),
        origin=Origin.BOTTOM_LEFT,
        entries=[
            TextBox(x1=2, x2=3, y1=7, y2=8, text='Lofi defi', confidence=0.99)
        ]
    )
    nice_page = bad_page.rebase(Origin.TOP_LEFT)
    assert nice_page.shape == bad_page.shape  # rebasing preserves page shape
    print(nice_page[0].y1, nice_page[0].y2)  # 2 3
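The values printed above are consistent with the usual vertical flip, where each y becomes page height minus y and the pair is then reordered. The sketch below shows that arithmetic in isolation. `flip_y` is a hypothetical helper, and the formula is an assumption about what rebasing does rather than WordMaze's own code.

```python
def flip_y(y1, y2, page_height):
    """Convert a vertical span between TOP_LEFT and BOTTOM_LEFT origins.

    Hypothetical helper; the same formula works in both directions.
    """
    new_y1, new_y2 = page_height - y1, page_height - y2
    # Flipping reverses the order, so sort to keep y1 <= y2.
    return min(new_y1, new_y2), max(new_y1, new_y2)


# Same numbers as the example above: a 10-unit-tall page, y1=7, y2=8.
assert flip_y(7, 8, page_height=10) == (2, 3)
# Flipping twice returns the original span.
assert flip_y(*flip_y(7, 8, page_height=10), page_height=10) == (7, 8)
```

Horizontal coordinates are untouched because both origins sit on the left edge of the page.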

Transforming and filtering TextBoxes

You can easily modify and filter out TextBoxes contained in a Page using Page.map and Page.filter, which behave like map and filter where the iterable is fixed and equal to the page's textboxes:

    page = Page(...)

    def pad(textbox: TextBox, horizontal, vertical) -> TextBox:
        return TextBox(
            x1=textbox.x1 - horizontal,
            x2=textbox.x2 + horizontal,
            y1=textbox.y1 - vertical,
            y2=textbox.y2 + vertical,
            text=textbox.text,
            confidence=textbox.confidence
        )

    # get a new page with textboxes padded by 3 to the left and to the right
    # and by 5 to the top and to the bottom
    padded_page = page.map(lambda textbox: pad(textbox, horizontal=3, vertical=5))

    # filters out textboxes with low confidence
    good_page = padded_page.filter(lambda textbox: textbox.confidence >= 0.25)

Page.map and Page.filter also accept keyword arguments. Each keyword names a textbox property and takes a function that operates on that property. The previous padding and filtering can equivalently be written as:

    # get a new page with textboxes padded by 3 to the left and to the right
    # and by 5 to the top and to the bottom
    padded_page = page.map(
        x1=lambda x1: x1 - 3,
        x2=lambda x2: x2 + 3,
        y1=lambda y1: y1 - 5,
        y2=lambda y2: y2 + 5,
    )

    # filters out textboxes with low confidence
    good_page = padded_page.filter(confidence=lambda conf: conf >= 0.25)
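As a rough illustration of the keyword style, the sketch below applies per-field functions to plain dicts. `map_fields` and `filter_fields` are hypothetical helpers written for this example; they only mimic the calling convention described above and say nothing about how WordMaze implements it.

```python
def map_fields(boxes, **transforms):
    """Return new dicts with each named field passed through its function."""
    return [
        {key: transforms.get(key, lambda value: value)(value)
         for key, value in box.items()}
        for box in boxes
    ]


def filter_fields(boxes, **predicates):
    """Keep only the dicts for which every named predicate returns True."""
    return [
        box for box in boxes
        if all(test(box[key]) for key, test in predicates.items())
    ]


boxes = [
    {'x1': 10, 'x2': 20, 'y1': 30, 'y2': 40, 'text': 'keep', 'confidence': 0.9},
    {'x1': 50, 'x2': 60, 'y1': 70, 'y2': 80, 'text': 'drop', 'confidence': 0.1},
]
padded = map_fields(
    boxes,
    x1=lambda x1: x1 - 3,
    x2=lambda x2: x2 + 3,
    y1=lambda y1: y1 - 5,
    y2=lambda y2: y2 + 5,
)
good = filter_fields(padded, confidence=lambda conf: conf >= 0.25)
assert good == [
    {'x1': 7, 'x2': 23, 'y1': 25, 'y2': 45, 'text': 'keep', 'confidence': 0.9}
]
```

Fields without a matching keyword pass through unchanged, so each call only has to name the properties it cares about.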

tuples and dicts

You can also convert a page's textboxes to tuples or dicts with Page.tuples and Page.dicts:

    page = Page(...)

    for tpl in page.tuples():
        # prints a tuple in the form
        # (x1, x2, y1, y2, text, confidence)
        print(tpl)

    for dct in page.dicts():
        # prints a dict in the form
        # {'x1': x1, 'x2': x2, 'y1': y1, 'y2': y2, 'text': text, 'confidence': confidence}
        print(dct)

WordMazes

The top-level class from WordMaze is, of course, a WordMaze. WordMazes are simply sequences of Pages:

    from wordmaze import WordMaze

    wm = WordMaze([
        Page(...),
        Page(...),
        ...
    ])

    for page in wm:  # iterating
        print(page.shape)

    first_page = wm[0]  # indexing

WordMaze objects also provide WordMaze.map and WordMaze.filter, which work the same way as Page.map and Page.filter.

If you wish to access the shapes of a WordMaze's pages, there is the property WordMaze.shapes, which is a tuple satisfying wm.shapes[N] == wm[N].shape.

Additionally, you can iterate over a WordMaze's textboxes in two ways:

    wm = WordMaze(...)

    # 1
    for page in wm:
        for textbox in page:
            print(textbox)

    # 2
    for textbox in wm.textboxes():
        print(textbox)

The main difference between #1 and #2 is that the textboxes in #1 are instances of TextBox, whereas the ones in #2 are PageTextBoxes including their containing page index.
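The relationship between the two loops can be sketched with plain lists: flattening the nested loop while remembering the page index gives the extra information that the second form carries. `textboxes_with_pages` is a hypothetical helper, and pages are modelled here as ordinary lists of strings rather than WordMaze objects.

```python
def textboxes_with_pages(pages):
    """Yield (page_index, textbox) pairs from a list of pages.

    Hypothetical helper: each page is modelled as a plain list of textboxes.
    """
    for page_index, page in enumerate(pages):  # counting starts from 0
        for textbox in page:
            yield page_index, textbox


pages = [['title', 'author'], [], ['footnote']]
flat = list(textboxes_with_pages(pages))
assert flat == [(0, 'title'), (0, 'author'), (2, 'footnote')]
```

Empty pages contribute nothing to the flat sequence, but they still advance the page index.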

WordMaze objects also have WordMaze.tuples and WordMaze.dicts, which behave just like their Page counterparts except that they also return the page number:

    wm = WordMaze(...)

    for tpl in wm.tuples():
        # prints a tuple in the form
        # (x1, x2, y1, y2, text, confidence, page_number)
        print(tpl)

    for dct in wm.dicts():
        # prints a dict in the form
        # {'x1': x1, 'x2': x2, 'y1': y1, 'y2': y2, 'text': text, 'confidence': confidence, 'page': page_number}
        print(dct)

Installing

Install WordMaze from PyPI:

pip install wordmaze

Projects using WordMaze

