vpec/lyrapdfPublic

NotificationsYou must be signed in to change notification settings
Fork1
Star10

LyraPDF: convert a PDF to JSON or MarkDown

License

MIT license

10 stars 1 fork Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 126 Commits
docbot		docbot
lyrapdf		lyrapdf
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
auto_diff.sh		auto_diff.sh
generate_dataset.sh		generate_dataset.sh

Repository files navigation

LyraPDF: convert a PDF to JSON or MarkDown

LyraPDF is a project based onpdfminer.six,which extracts text from a PDF document and processes it so that the originalstructure from the text is reconstructed.

The output format can be MarkDown or JSON.

Getting started

Download the repository or clone it:

git clone https://github.com/vpec/lyrapdf.git

In order to execute this project,python3 is needed.

sudo apt install python3sudo apt install python3-pip

Also, some libraries must be installed:

pip install pdfminer.sixpip install sspipe

To convert PDFs inside a directory to JSON (more information inUsage examples) use:

python3 -m lyrapdf folder/with/pdfs

This repository also includes a Natural Lenguage Understanding proof of conceptindocbot directory, which uses the following dependencies:

pip install snips-nlu

How it works

The original document is processed by a pipeline composed by multipleprocessing steps until reaching the final output format: JSON. It's alsopossible to get the output file in MarkDown format.

Almost all processing is based on the use of regular expressions. More details canbe found looking at the code, as each small step is contained inside a method.

Features

Keep only the content you actually need

It removes unnecessary elements as page numbers. This is made thanks to the positioninside the page provided by the HTML format.

Smart text relevance

The whole text does not have the same relevance. Some words are titles, others are just... text.

A classification has been made based on the size of each text block.All text that is contained within 95% of the cumulative character sizewithin the document is considered standard. Larger sizes are considered titles.

For relevance of each title, several intervals have been established,thanks to a deterministic and optimal version of the K-means algorithm.

The result of this process is a MarkDown document.

Paragraph debugging

Text is debugged in order to get paragraphs merged into one line.Some other small processing is done to remove defects from the text.

Structured format

It is possible to obtain an output in JSON structured format, from aconversion from MarkDown, where the level of relevance of the textis indicated (1 is the most important, 7 is the standard text).

Usage examples

I want to extract, process and convert to JSON text from PDFs inside a directory.From git root directory, run:

python3 -m lyrapdf folder/with/pdfs

I want to extract, process and convert to JSON text from PDFs inside a directory, using 4 CPU threads.From git root directory, run:

python3 -m lyrapdf folder/with/pdfs --threads 4

I want to extract, process and convert to MARKDOWN text from PDFs inside a directory, using 4 CPU threads.From git root directory, run:

python3 -m lyrapdf folder/with/pdfs --format md --threads 4

python3 -m lyrapdf folder/with/pdfs --format markdown --threads 4

I want to print information about command line arguments.From git root directory, run:

python3 -m lyrapdf --help

About

This tool was made as a Bachelor’s Degree Final Project inComputer Science at Universidad de Zaragoza.

About

LyraPDF: convert a PDF to JSON or MarkDown

Releases

No releases published

Packages

No packages published

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

License

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

LyraPDF: convert a PDF to JSON or MarkDown

Getting started

How it works

Features

Keep only the content you actually need

Smart text relevance

Paragraph debugging

Structured format

Usage examples

About

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages

Languages

Movatterモバイル変換

License

vpec/lyrapdf

Folders and files

Latest commit

History

Repository files navigation

LyraPDF: convert a PDF to JSON or MarkDown

Getting started

How it works

Features

Keep only the content you actually need

Smart text relevance

Paragraph debugging

Structured format

Usage examples

About

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages0

Languages

Packages