LyraPDF: convert a PDF to JSON or MarkDown
LyraPDF is a project based on pdfminer.six, which extracts text from a PDF document and processes it so that the original structure of the text is reconstructed.
The output format can be MarkDown or JSON.
Download the repository or clone it:
git clone https://github.com/vpec/lyrapdf.git
To run this project, python3 is needed.
sudo apt install python3
sudo apt install python3-pip
Also, some libraries must be installed:
pip install pdfminer.six
pip install sspipe
To convert PDFs inside a directory to JSON (more information in Usage examples) use:
python3 -m lyrapdf folder/with/pdfs
This repository also includes a Natural Language Understanding proof of concept in the docbot directory, which uses the following dependency:
pip install snips-nlu
The original document is processed by a pipeline composed of multiple processing steps until it reaches the final output format: JSON. It's also possible to get the output file in MarkDown format.
Almost all processing is based on regular expressions. More details can be found by looking at the code, as each small step is contained inside a method.
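As a rough illustration of that design, the sketch below chains a few regex-based steps, each in its own function. The step names and the specific patterns are hypothetical and are not taken from the LyraPDF code; they only show the general shape such a pipeline could have.

```python
import re
from functools import reduce


def collapse_spaces(text: str) -> str:
    # Replace runs of spaces or tabs with a single space.
    return re.sub(r"[ \t]+", " ", text)


def strip_trailing_spaces(text: str) -> str:
    # Remove spaces and tabs at the end of every line.
    return re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)


def limit_blank_lines(text: str) -> str:
    # Allow at most one empty line between blocks.
    return re.sub(r"\n{3,}", "\n\n", text)


# Hypothetical step order; the real project defines its own steps.
STEPS = [collapse_spaces, strip_trailing_spaces, limit_blank_lines]


def run_pipeline(text: str, steps=STEPS) -> str:
    # Feed the output of each step into the next one.
    return reduce(lambda acc, step: step(acc), steps, text)
```

Keeping every step as a small function makes it easy to test a single regular expression in isolation or to reorder steps.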
The pipeline removes unnecessary elements such as page numbers. This is possible thanks to the position inside the page, which the HTML format provides.
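A minimal sketch of position-based filtering is shown below. It assumes each text block is a `div` whose inline style contains `top:NNNpx` measured from the top of its page; that block shape, the `margin` fraction and the function name are all assumptions made for illustration, not the project's actual rules.

```python
import re

# Assumed block shape: <div style="... top:NNNpx; ...">text</div>
BLOCK_RE = re.compile(r'<div style="[^"]*?top:(\d+)px[^"]*">(.*?)</div>', re.DOTALL)
# A block that contains nothing but a short number looks like a page number.
PAGE_NUMBER_RE = re.compile(r"^\s*\d{1,4}\s*$")


def drop_page_numbers(page_html: str, page_height: int, margin: float = 0.1) -> list:
    """Return block texts, skipping bare numbers near the top or bottom edge."""
    kept = []
    for match in BLOCK_RE.finditer(page_html):
        top = int(match.group(1))
        text = match.group(2)
        # Hypothetical rule: the outer 10% of the page is header/footer area.
        in_margin = top < page_height * margin or top > page_height * (1 - margin)
        if in_margin and PAGE_NUMBER_RE.match(text):
            continue
        kept.append(text.strip())
    return kept
```

A bare number in the middle of the page is kept, since only its position distinguishes it from a page number.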
Not all text has the same relevance. Some words are titles, others are just... text.
Text is classified based on the size of each text block. All text that is contained within 95% of the cumulative character size within the document is considered standard. Larger sizes are considered titles.
To rank the relevance of each title, several size intervals have been established using a deterministic and optimal version of the K-means algorithm.
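One possible reading of this rule is sketched below: count characters per font size, walk the sizes from smallest to largest, and take the first size at which the running total reaches 95% of all characters as the upper bound for standard text. The `(size, text)` input shape and the function names are assumptions for illustration; the actual implementation may compute the threshold differently.

```python
from collections import Counter


def standard_size_threshold(blocks, share=0.95):
    """Return the largest font size still counted as standard text.

    blocks is an iterable of (font_size, text) pairs (assumed shape).
    """
    chars_per_size = Counter()
    for size, text in blocks:
        chars_per_size[size] += len(text)
    total = sum(chars_per_size.values())
    if total == 0:
        raise ValueError("document contains no text")
    cumulative = 0
    for size in sorted(chars_per_size):
        cumulative += chars_per_size[size]
        # Stop at the first size that covers the requested share of characters.
        if cumulative >= share * total:
            return size
    return max(chars_per_size)


def classify(blocks, share=0.95):
    # Blocks larger than the threshold are treated as titles.
    blocks = list(blocks)
    threshold = standard_size_threshold(blocks, share)
    return [("title" if size > threshold else "standard", text) for size, text in blocks]
```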
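For one-dimensional data such as font sizes, an optimal and deterministic clustering can be computed with dynamic programming instead of random initialisation. The sketch below shows that general technique; whether LyraPDF uses this exact algorithm is an assumption, and the function name is hypothetical.

```python
def optimal_1d_clusters(values, k):
    """Split values into k contiguous groups (in sorted order) that
    minimise the total within-group squared error."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")

    # Prefix sums give the squared error of any slice in constant time.
    prefix, prefix_sq = [0.0], [0.0]
    for x in xs:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    def cost(i, j):
        # Sum of squared deviations from the mean for xs[i:j], with i < j.
        s = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i)

    inf = float("inf")
    # best[c][j]: minimal error when the first j values form c groups.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                candidate = best[c - 1][i] + cost(i, j)
                if candidate < best[c][j]:
                    best[c][j] = candidate
                    cut[c][j] = i

    # Walk the stored cut points backwards to recover the groups.
    groups, j = [], n
    for c in range(k, 0, -1):
        i = cut[c][j]
        groups.append(xs[i:j])
        j = i
    return groups[::-1]
```

Each resulting group can then be mapped to a title level, with the group of largest sizes being the most relevant.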
The result of this process is a MarkDown document.
The text is cleaned up so that each paragraph is merged into a single line. Some other minor processing is done to remove defects from the text.
It is possible to obtain structured JSON output, converted from the MarkDown, in which the level of relevance of the text is indicated (1 is the most important, 7 is the standard text).
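The paragraph merge could look roughly like the sketch below. It assumes paragraphs are separated by blank lines and that headings start with `#`; the function name and the hyphenation rule are hypothetical additions for illustration.

```python
import re


def merge_paragraph_lines(markdown: str) -> str:
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", markdown.strip())
    merged = []
    for paragraph in paragraphs:
        if paragraph.lstrip().startswith("#"):
            # Headings are left as they are.
            merged.append(paragraph.strip())
            continue
        # Rejoin words split across a line break ("struc-\nture").
        # Caveat: this also joins genuine hyphenated compounds.
        text = re.sub(r"(\w)-\n(\w)", r"\1\2", paragraph)
        # Turn the remaining line breaks into single spaces.
        text = re.sub(r"\s*\n\s*", " ", text)
        merged.append(text.strip())
    return "\n\n".join(merged)
```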
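A minimal sketch of such a conversion follows. It assumes heading depth (`#` to `######`) maps to levels 1 to 6 and everything else is level 7; the flat list of `level`/`text` records is a hypothetical schema and may differ from the JSON that LyraPDF actually writes.

```python
import json
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")
STANDARD_LEVEL = 7  # level of plain text, as stated in the README


def markdown_to_records(markdown: str) -> list:
    records = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADING_RE.match(line)
        if match:
            # The number of '#' characters gives the relevance level.
            records.append({"level": len(match.group(1)), "text": match.group(2).strip()})
        else:
            records.append({"level": STANDARD_LEVEL, "text": line})
    return records


def markdown_to_json(markdown: str) -> str:
    # ensure_ascii=False keeps accented characters readable in the output.
    return json.dumps(markdown_to_records(markdown), ensure_ascii=False, indent=2)
```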
I want to extract text from the PDFs inside a directory, process it and convert it to JSON. From the git root directory, run:
python3 -m lyrapdf folder/with/pdfs
I want to extract text from the PDFs inside a directory, process it and convert it to JSON, using 4 CPU threads. From the git root directory, run:
python3 -m lyrapdf folder/with/pdfs --threads 4
I want to extract text from the PDFs inside a directory, process it and convert it to MarkDown, using 4 CPU threads. From the git root directory, run:
python3 -m lyrapdf folder/with/pdfs --format md --threads 4
or
python3 -m lyrapdf folder/with/pdfs --format markdown --threads 4
I want to print information about command line arguments. From the git root directory, run:
python3 -m lyrapdf --help
This tool was made as a Bachelor’s Degree Final Project in Computer Science at Universidad de Zaragoza.