Define models to represent a textual document, e.g. a PDF, preserving the hierarchy of the content.
OneOffTech/parse-document-model-python
Parse Document Model (Python) provides Pydantic models for representing text documents using a hierarchical model. This library allows you to define documents as a hierarchy of (specialised) nodes where each node can represent a document, page, text, heading, body, and more.
These models aim to preserve the underlying structure of text documents for further processing, such as creating a table of contents or transforming between formats, e.g. converting a parsed PDF to Markdown.
- Hierarchical structure: The document is modelled as a hierarchy of nodes. Each node can represent a part of the document: the document itself, pages, text.
- Rich text support: Nodes can represent not only the content but also the formatting (e.g. bold, italic) applied to the text.
- Attributes: Each node can have attributes that provide additional information such as page number, bounding box, etc.
- Built-in validation and types: Built with Pydantic, ensuring type safety, validation and effortless creation of complex document structures.
Requirements
- Python 3.12 or above (Python 3.9, 3.10 and 3.11 are supported on a best-effort basis).
Next steps
We want to represent the document structure using a hierarchy so that the inherited structure is preserved when chapters, sections and headings are used. Consider a generic document with two pages, one heading per page and one paragraph of text. The resulting representation might be the following.
Document
├─Page
│  ├─Text (category: heading)
│  └─Text (category: body)
└─Page
   ├─Text (category: heading)
   └─Text (category: body)
At a glance you can see the structure: the document is composed of two pages, and each page has one heading and one block of body text. To achieve this, we defined a hierarchy around the concept of a Node, like a node in a graph.
classDiagram
    class Node
    Node <|-- StructuredNode
    Node <|-- Text
    StructuredNode <|-- Document
    StructuredNode <|-- Page
Node is the abstract class from which all other nodes inherit.
Each node has:
- category: The type of the node (e.g., doc, page, heading).
- attributes: Optional field to attach extra data to a node. See Attributes.
StructuredNode extends Node. It is used to represent the hierarchy as a node whose content is a list of other nodes, such as Document and Page.
- content: List of Node.
Document is the root node of a document.
- category: Always set to "doc".
- attributes: Document-wide attributes can be set here.
- content: List of Page nodes that form the document.
Page represents a page in the document:
- category: Always set to "page".
- attributes: Can contain metadata like page number.
- content: List of Text nodes on the page.
Text represents a paragraph, a heading or any other text within the document.
- category: The category of the text within the document, e.g. heading, title.
- content: A string representing the textual content.
- marks: List of marks applied to the text, such as bold, italic, etc.
- attributes: Can contain metadata like the bounding box representing where this portion of text is located in the page.
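The node types described above can be sketched with nothing but the standard library. The classes below are illustrative stand-ins built on dataclasses, not the library's Pydantic models: they reuse the names from the class diagram, while the `outline` helper and the plain `dict` used for attributes are assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Node:
    """Base of the hierarchy: a category plus optional attributes."""
    category: str
    attributes: Optional[dict[str, Any]] = None


@dataclass
class Text(Node):
    """Leaf node: a string of content and the marks applied to it."""
    content: str = ""
    marks: list[str] = field(default_factory=list)


@dataclass
class StructuredNode(Node):
    """Node whose content is a list of other nodes."""
    content: list[Node] = field(default_factory=list)


@dataclass
class Page(StructuredNode):
    category: str = "page"


@dataclass
class Document(StructuredNode):
    category: str = "doc"


def outline(node: Node, depth: int = 0) -> list[str]:
    """Return one indented line per node, visiting the tree depth-first."""
    label = type(node).__name__
    if isinstance(node, Text):
        label += f" (category: {node.category})"
    lines = ["  " * depth + label]
    if isinstance(node, StructuredNode):
        for child in node.content:
            lines.extend(outline(child, depth + 1))
    return lines


doc = Document(content=[
    Page(content=[
        Text(category="heading", content="Introduction"),
        Text(category="body", content="First paragraph."),
    ]),
    Page(content=[
        Text(category="heading", content="Conclusion"),
        Text(category="body", content="Last paragraph."),
    ]),
])

print("\n".join(outline(doc)))
```

Unlike the Pydantic models, these dataclasses perform no validation; they only show how the inheritance chain and the nested content fit together.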
Each block of text is assigned a category.
- abstract: The abstract of the document.
- acknowledgments: Section acknowledging contributors.
- affiliation: Author's institutional affiliation.
- appendix: Text within an appendix.
- authors: List of authors.
- body: Main body text of the document.
- caption: Caption associated with a figure or table.
- categories: Categories or topics listed in the document.
- figure: Represents a figure or an image.
- footer: Represents the footer of the page.
- footnote: Text at the bottom of the page providing additional information.
- formula: Mathematical formula or equation.
- general-terms: General terms section.
- heading: Any heading within the document.
- keywords: List of keywords.
- itemize-item: Item in a list or bullet point.
- other: Any other unclassified text.
- page-header: Represents the header of the page.
- reference: References or citations within the document.
- table: Represents a table.
- title: The title of the document.
- toc: Table of contents.
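Categories make it straightforward to pick out specific parts of a document, for instance to build a table of contents. The sketch below works on plain dictionaries that mirror the document, page and text nesting. The dictionary layout and the `table_of_contents` helper are assumptions made for this example and are not part of the library.

```python
from typing import Any


def table_of_contents(document: dict[str, Any]) -> list[tuple[int, str]]:
    """Collect (page position, heading text) pairs in reading order."""
    entries = []
    for page_position, page in enumerate(document["content"], start=1):
        for text in page["content"]:
            if text["category"] == "heading":
                entries.append((page_position, text["content"]))
    return entries


document = {
    "category": "doc",
    "content": [
        {"category": "page", "content": [
            {"category": "title", "content": "A Sample Paper"},
            {"category": "heading", "content": "Introduction"},
            {"category": "body", "content": "Some introductory text."},
        ]},
        {"category": "page", "content": [
            {"category": "heading", "content": "Conclusion"},
            {"category": "footnote", "content": "A closing remark."},
        ]},
    ],
}

toc = table_of_contents(document)
for page_position, heading in toc:
    print(f"{heading} (page {page_position})")
```

The page position here is simply the index of the page in the content list; a real document would more likely take the number from the page attributes.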
Marks are used to add style or functionality to the text within a Text node. Examples include bold text, italic text, links and custom styles such as font or colour.
Mark Types
- Bold: Represents bold text.
- Italic: Represents italic text.
- TextStyle: Allows customization of font and color.
- Link: Represents a hyperlink.
Marks are validated and enforced with the help of Pydantic model validators.
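As an illustration of how marks could drive a format conversion, the sketch below wraps a piece of text in Markdown delimiters. It assumes marks are given as plain strings such as "bold" and "italic"; the library's own mark objects may carry more information (for example a link target or a font), which this example ignores. `MARK_DELIMITERS` and `apply_marks` are hypothetical names.

```python
# Markdown delimiters for the two simplest mark types (assumed string names).
MARK_DELIMITERS = {"bold": "**", "italic": "*"}


def apply_marks(content: str, marks: list[str]) -> str:
    """Wrap content in the Markdown delimiter of each recognised mark."""
    for mark in marks:
        delimiter = MARK_DELIMITERS.get(mark)
        if delimiter is not None:
            content = f"{delimiter}{content}{delimiter}"
    return content


print(apply_marks("Welcome", ["bold"]))
print(apply_marks("Note", ["bold", "italic"]))
```

Marks without an entry in the mapping leave the text unchanged, so the helper degrades gracefully for styles that Markdown cannot express.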
Attributes are optional fields that can store additional information for each node. Some predefined attributes are:
- DocumentAttributes: General attributes for the document (currently reserved for the future).
- PageAttributes: Specific page related attributes, such as the page number.
- TextAttributes: Text related attributes, such as bounding boxes or level.
- BoundingBox: A box that specifies the position of a text in the page.
- Level: The specific level of the text within a document, for example, for headings.
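One use of page numbers and bounding boxes is to restore reading order when text blocks arrive unsorted from a parser. The sketch below is a minimal illustration under stated assumptions: the `Box` and `Block` classes and their field names (`min_x`, `min_y`, `max_x`, `max_y`, `page`) are hypothetical stand-ins for this example, and the coordinate origin is assumed to be the top-left corner of the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box; origin assumed at the page's top-left corner."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


@dataclass(frozen=True)
class Block:
    """A piece of text together with the page and box where it was found."""
    content: str
    page: int
    box: Box


def reading_order(blocks: list[Block]) -> list[Block]:
    """Sort by page, then top-to-bottom, then left-to-right."""
    return sorted(blocks, key=lambda b: (b.page, b.box.min_y, b.box.min_x))


blocks = [
    Block("Right column", page=1, box=Box(300, 100, 500, 120)),
    Block("Page two heading", page=2, box=Box(50, 40, 500, 60)),
    Block("Left column", page=1, box=Box(50, 100, 250, 120)),
    Block("Page one heading", page=1, box=Box(50, 40, 500, 60)),
]

ordered = [block.content for block in reading_order(blocks)]
print(ordered)
```

This simple key reads two blocks on the same line from left to right; multi-column pages with columns of different heights would need a more careful layout analysis.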
Parse Document Model is distributed via PyPI. You can install it with pip.
pip install parse-document-model
Here’s how you can represent a simple document with one page and some text:
from document_model_python.document import Document, Page, Text

doc = Document(
    category="doc",
    content=[
        Page(
            category="page",
            content=[
                Text(
                    category="heading",
                    content="Welcome to parse-document-model",
                    marks=["bold"],
                ),
                Text(
                    category="body",
                    content="This is an example text using the document model.",
                ),
            ],
        )
    ],
)
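To show what further processing of such a document might look like, here is a small sketch that turns the same content into Markdown. It operates on plain dictionaries with the same category and content fields rather than on the library's models. The `to_markdown` helper and the mapping from categories to Markdown heading prefixes are assumptions made for this example.

```python
from typing import Any

# Assumed mapping from text category to a Markdown prefix.
HEADING_PREFIX = {"title": "# ", "heading": "## "}


def to_markdown(document: dict[str, Any]) -> str:
    """Render every text node as a Markdown paragraph or heading."""
    paragraphs = []
    for page in document["content"]:
        for text in page["content"]:
            prefix = HEADING_PREFIX.get(text["category"], "")
            paragraphs.append(prefix + text["content"])
    return "\n\n".join(paragraphs)


document = {
    "category": "doc",
    "content": [
        {
            "category": "page",
            "content": [
                {
                    "category": "heading",
                    "content": "Welcome to parse-document-model",
                },
                {
                    "category": "body",
                    "content": "This is an example text using the document model.",
                },
            ],
        }
    ],
}

markdown = to_markdown(document)
print(markdown)
```

Categories without a prefix, such as body, are emitted as ordinary paragraphs.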
Parse Document Model is tested using pytest. Tests run for each commit and pull request.
Install the dependencies.
pip install -r requirements.txt -r requirements-dev.txt
Execute the test suite.
pytest
Thank you for considering contributing to the Parse Document Model! The contribution guide can be found in the CONTRIBUTING.md file.
[NOTE] Consider opening a discussion before submitting a pull request with changes to the model structures.
Please review our security policy on how to report security vulnerabilities.
The project is provided and supported by OneOff-Tech (UG).
The format and structure take inspiration from ProseMirror.
The MIT License (MIT). Please see License File for more information.