LyraPDF: convert a PDF to JSON or MarkDown


LyraPDF is a project based on pdfminer.six, which extracts text from a PDF document and processes it so that the original structure of the text is reconstructed.

The output format can be MarkDown or JSON.

Getting started

Download the repository or clone it:

git clone https://github.com/vpec/lyrapdf.git

Running this project requires Python 3:

sudo apt install python3
sudo apt install python3-pip

Also, some libraries must be installed:

pip install pdfminer.six
pip install sspipe

To convert PDFs inside a directory to JSON (more information in Usage examples) use:

python3 -m lyrapdf folder/with/pdfs

This repository also includes a Natural Language Understanding proof of concept in the docbot directory, which uses the following dependency:

pip install snips-nlu

How it works

The original document is processed by a pipeline composed of multiple processing steps until it reaches the final output format: JSON. It is also possible to get the output file in MarkDown format.

Pipeline diagram

Almost all processing is based on regular expressions. More details can be found by looking at the code, as each small step is contained inside its own method.
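As a rough illustration of the idea, the sketch below chains a few small regex-based steps into one pipeline. The step names, their order, and `run_pipeline` are hypothetical and are not taken from the LyraPDF code; they only show how a chain of single-purpose methods can be composed.

```python
import re
from functools import reduce


def strip_trailing_spaces(text: str) -> str:
    """Remove spaces and tabs at the end of every line."""
    return re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)


def collapse_spaces(text: str) -> str:
    """Replace runs of spaces or tabs inside a line with one space."""
    return re.sub(r"[ \t]{2,}", " ", text)


def collapse_blank_lines(text: str) -> str:
    """Reduce runs of three or more newlines to a single blank line."""
    return re.sub(r"\n{3,}", "\n\n", text)


# Hypothetical step order, chosen only for this example.
STEPS = [strip_trailing_spaces, collapse_spaces, collapse_blank_lines]


def run_pipeline(text: str, steps=STEPS) -> str:
    """Feed the text through each step in order."""
    return reduce(lambda acc, step: step(acc), steps, text)


if __name__ == "__main__":
    print(repr(run_pipeline("a  b   \n\n\n\nc\t\n")))
```

Keeping each step as a pure function from string to string makes it easy to test steps in isolation and to reorder them.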

Features

Keep only the content you actually need

It removes unnecessary elements such as page numbers. This is possible thanks to the position within the page that the HTML format provides.

Document before page number removal

Document after page number removal
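One way to picture position-based filtering is the sketch below. It does not parse real pdfminer.six HTML; it assumes the blocks have already been extracted into a hypothetical `TextBlock` record, and the 10% margin and the page-number pattern are illustrative guesses rather than the values LyraPDF uses.

```python
import re
from dataclasses import dataclass


@dataclass
class TextBlock:
    """Hypothetical record for one positioned block of text."""
    text: str
    top: float          # distance from the top of the page, in pixels
    page_height: float  # height of the page, in pixels


# Matches "12" or "Page 12"; an assumption about what page numbers look like.
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d{1,4}\s*$", re.IGNORECASE)


def is_page_number(block: TextBlock, margin: float = 0.1) -> bool:
    """True if the block looks like a page number in the top or bottom margin."""
    relative = block.top / block.page_height
    in_margin = relative < margin or relative > 1 - margin
    return in_margin and PAGE_NUMBER.match(block.text) is not None


def drop_page_numbers(blocks):
    """Return the blocks that are not detected as page numbers."""
    return [b for b in blocks if not is_page_number(b)]


if __name__ == "__main__":
    page = [
        TextBlock("Introduction", 80, 1000),
        TextBlock("2020", 500, 1000),
        TextBlock("12", 960, 1000),
    ]
    print([b.text for b in drop_page_numbers(page)])
```

Combining the position test with the text pattern keeps numbers that appear in the body of the page, such as years, from being removed.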

Smart text relevance

Not all text is equally relevant. Some words are titles, others are just... text.

Text is classified based on the size of each text block. All text that falls within 95% of the cumulative character size of the document is considered standard. Larger sizes are considered titles.
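The 95% rule can be read in more than one way. The sketch below takes one plausible interpretation: count characters per font size, walk the sizes from smallest to largest, and treat the size at which the running total reaches 95% of all characters as the largest standard size. The function name and the `(font_size, text)` input shape are assumptions made for this example.

```python
from collections import Counter


def standard_size_threshold(blocks, share=0.95):
    """Return the largest font size still considered standard text.

    blocks: iterable of (font_size, text) pairs (hypothetical input shape).
    """
    chars_per_size = Counter()
    for size, text in blocks:
        chars_per_size[size] += len(text)

    total = sum(chars_per_size.values())
    if total == 0:
        raise ValueError("no text to classify")

    cumulative = 0
    for size in sorted(chars_per_size):
        cumulative += chars_per_size[size]
        if cumulative >= share * total:
            return size
    return max(chars_per_size)


if __name__ == "__main__":
    doc = [(10, "x" * 900), (12, "y" * 60), (18, "z" * 30), (24, "t" * 10)]
    threshold = standard_size_threshold(doc)
    titles = sorted({size for size, _ in doc if size > threshold})
    print(threshold, titles)
```

Under this reading, every block larger than the returned size is a title candidate, and the remaining sizes still have to be ranked against each other.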

To rank titles by relevance, several size intervals are established using a deterministic and optimal version of the K-means algorithm.
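The README does not say which variant of K-means is used. For one-dimensional data such as font sizes, an exact and deterministic solution can be computed with dynamic programming, because optimal clusters are always contiguous once the values are sorted. The sketch below shows that technique; `kmeans_1d` is a hypothetical name and may differ from what LyraPDF implements.

```python
def kmeans_1d(values, k):
    """Split values into k contiguous clusters with minimal squared error.

    Returns the clusters as lists, ordered from smallest to largest values.
    """
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")

    # Prefix sums give the cost of any segment in constant time.
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        prefix[i + 1] = prefix[i] + x
        prefix_sq[i + 1] = prefix_sq[i] + x * x

    def cost(i, j):
        """Sum of squared deviations from the mean of xs[i:j]."""
        s = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i)

    inf = float("inf")
    # best[c][j]: minimal cost of splitting xs[:j] into c clusters.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                candidate = best[c - 1][i] + cost(i, j)
                if candidate < best[c][j]:
                    best[c][j] = candidate
                    cut[c][j] = i

    # Walk the stored cut points backwards to recover the clusters.
    clusters, j = [], n
    for c in range(k, 0, -1):
        i = cut[c][j]
        clusters.append(xs[i:j])
        j = i
    return clusters[::-1]


if __name__ == "__main__":
    print(kmeans_1d([14, 14, 16, 24, 26, 40], 3))
```

Unlike the usual iterative K-means, this approach has no random initialisation, so the same input always yields the same intervals.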

The result of this process is a MarkDown document.

Document after conversion to MarkDown

Paragraph debugging

Text is cleaned up so that each paragraph is merged into a single line. Some other minor processing is done to remove defects from the text.

Debugged MarkDown text
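A minimal sketch of this kind of paragraph merging is shown below. It assumes paragraphs are separated by blank lines and that headings start with `#`; the function name and the exact rules are illustrative and not copied from LyraPDF.

```python
import re


def merge_paragraph_lines(markdown: str) -> str:
    """Join the lines of each paragraph into a single line."""
    paragraphs = re.split(r"\n\s*\n", markdown.strip())
    merged = []
    for paragraph in paragraphs:
        # Leave heading blocks untouched apart from surrounding whitespace.
        if paragraph.lstrip().startswith("#"):
            merged.append(paragraph.strip())
            continue
        # Rejoin words that were hyphenated across a line break.
        text = re.sub(r"(\w)-\n(\w)", r"\1\2", paragraph)
        # Turn the remaining line breaks into single spaces.
        text = re.sub(r"\s*\n\s*", " ", text)
        # Collapse leftover runs of spaces.
        text = re.sub(r"[ \t]{2,}", " ", text)
        merged.append(text.strip())
    return "\n\n".join(merged)


if __name__ == "__main__":
    sample = "# Title\n\nThis is a para-\ngraph split\nover lines.\n"
    print(merge_paragraph_lines(sample))
```

Note that the hyphen rule is a simplification: it also joins genuine compounds that happen to break at the hyphen, so a real implementation would likely need a more careful check.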

Structured format

The output can also be produced in a structured JSON format, converted from the MarkDown, in which the level of relevance of each piece of text is indicated (1 is the most important, 7 is standard text).

JSON output file
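The exact JSON schema is not described here, so the following is only a sketch under stated assumptions: heading depth (one to six `#` characters) maps to levels 1 to 6, everything else is level 7, and each line becomes a flat record with hypothetical `level` and `text` keys. The real output may be nested or use different field names.

```python
import json
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
STANDARD_LEVEL = 7  # level used for standard text, as stated above


def markdown_to_records(markdown: str):
    """Convert MarkDown lines into a flat list of {level, text} records."""
    records = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            level, text = len(match.group(1)), match.group(2)
        else:
            level, text = STANDARD_LEVEL, line
        records.append({"level": level, "text": text})
    return records


if __name__ == "__main__":
    sample = "# Guide\n\n## Setup\n\nInstall it.\n"
    print(json.dumps(markdown_to_records(sample), indent=2))
```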

Usage examples

I want to extract and process the text of the PDFs inside a directory and convert it to JSON. From the git root directory, run:

python3 -m lyrapdf folder/with/pdfs

I want to extract and process the text of the PDFs inside a directory and convert it to JSON, using 4 CPU threads. From the git root directory, run:

python3 -m lyrapdf folder/with/pdfs --threads 4

I want to extract and process the text of the PDFs inside a directory and convert it to MarkDown, using 4 CPU threads. From the git root directory, run:

python3 -m lyrapdf folder/with/pdfs --format md --threads 4

or

python3 -m lyrapdf folder/with/pdfs --format markdown --threads 4

I want to print information about the command line arguments. From the git root directory, run:

python3 -m lyrapdf --help

About

This tool was made as a Bachelor's Degree Final Project in Computer Science at Universidad de Zaragoza.
