lazyFrogLOL/llmdocparser

A package for parsing PDFs and analyzing their content using LLMs.

This package is an improvement based on the concept of gptpdf.

Method

gptpdf uses PyMuPDF to parse PDFs, identifying both text and non-text regions. It then merges or filters the text regions based on certain rules, and inputs the final results into a multimodal model for parsing. This method is particularly effective.

Based on this concept, I made some minor improvements.

Main Process

Using a layout analysis model, each page of the PDF is parsed to identify the type of each region, which includes Text, Title, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, and Equation. The coordinates of each region are also obtained.

Layout Analysis Result Example:

[{'header': ((101, 66, 436, 102), 0)},
 {'header': ((1038, 81, 1088, 95), 1)},
 {'title': ((106, 215, 947, 284), 2)},
 {'text': ((101, 319, 835, 390), 3)},
 {'text': ((100, 565, 579, 933), 4)},
 {'text': ((100, 967, 573, 1025), 5)},
 {'text': ((121, 1055, 276, 1091), 6)},
 {'reference': ((101, 1124, 562, 1429), 7)},
 {'text': ((610, 565, 1089, 930), 8)},
 {'text': ((613, 976, 1006, 1045), 9)},
 {'title': ((612, 1114, 726, 1129), 10)},
 {'text': ((611, 1165, 1089, 1431), 11)},
 {'title': ((1011, 1471, 1084, 1492), 12)}]

This result includes the type, coordinates, and reading order of each region. By using this result, more precise rules can be set to parse the PDF.
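As a sketch of how such a layout result can be consumed, the region entries can be flattened and sorted by reading order before cropping (a minimal illustration assuming the list-of-dicts shape shown above; `order_regions` is a hypothetical helper, not part of the package):

```python
def order_regions(layout):
    """Flatten a layout-analysis result into (reading_order, type, bbox)
    tuples, sorted by reading order so that cropped regions can later be
    reassembled into a coherent document."""
    regions = []
    for item in layout:
        for region_type, (bbox, order) in item.items():
            regions.append((order, region_type, bbox))
    return sorted(regions)  # tuples sort by reading order first

layout = [{'title': ((106, 215, 947, 284), 2)},
          {'header': ((101, 66, 436, 102), 0)},
          {'text': ((101, 319, 835, 390), 3)}]
print([r[1] for r in order_regions(layout)])  # ['header', 'title', 'text']
```

Each bbox can then be handed to an image-cropping call (for example, Pillow's `Image.crop`) to produce the per-region images that are sent to the multimodal model.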

Finally, input the images of the corresponding regions into a multimodal model, such as GPT-4o or Qwen-VL, to directly obtain text blocks that are friendly to RAG solutions.
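That last step can be sketched by building an OpenAI-style chat-completions payload with an image input (illustrative only: `build_vision_request` is a hypothetical helper, and the model name and prompt are assumptions, not the package's actual implementation):

```python
import base64

def build_vision_request(image_path, region_type, model="gpt-4o"):
    """Build a chat-completions payload asking a multimodal model to
    transcribe one cropped region image into RAG-friendly text."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Transcribe this {region_type} region into clean, "
                         f"RAG-friendly text."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
```

The returned dict can be posted to any chat-completions-compatible endpoint (Azure OpenAI, OpenAI, or a Qwen-VL-compatible gateway), which is why the package can switch providers via a single `llm_type` parameter.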

| img_path | type | page_no | filename | content | filepath |
|----------|------|---------|----------|---------|----------|
| {absolute_path}/page_1_title.png | Title | 1 | attention is all you need | [Text Block 1] | {file_absolute_path} |
| {absolute_path}/page_1_text.png | Text | 1 | attention is all you need | [Text Block 2] | {file_absolute_path} |
| {absolute_path}/page_2_figure.png | Figure | 2 | attention is all you need | [Text Block 3] | {file_absolute_path} |
| {absolute_path}/page_2_figure_caption.png | Figure caption | 2 | attention is all you need | [Text Block 4] | {file_absolute_path} |
| {absolute_path}/page_3_table.png | Table | 3 | attention is all you need | [Text Block 5] | {file_absolute_path} |
| {absolute_path}/page_3_table_caption.png | Table caption | 3 | attention is all you need | [Text Block 6] | {file_absolute_path} |
| {absolute_path}/page_1_header.png | Header | 1 | attention is all you need | [Text Block 7] | {file_absolute_path} |
| {absolute_path}/page_2_footer.png | Footer | 2 | attention is all you need | [Text Block 8] | {file_absolute_path} |
| {absolute_path}/page_3_reference.png | Reference | 3 | attention is all you need | [Text Block 9] | {file_absolute_path} |
| {absolute_path}/page_1_equation.png | Equation | 1 | attention is all you need | [Text Block 10] | {file_absolute_path} |
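A table in this shape drops straight into a RAG pipeline. As a sketch, assuming the table is saved as a CSV with the column names above (`rows_to_chunks` is a hypothetical helper, not part of the package):

```python
import csv

def rows_to_chunks(csv_path):
    """Convert the parser's output table into RAG-ready chunks, keeping
    the region image path and page number as retrievable metadata."""
    chunks = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            chunks.append({
                "text": row["content"],
                "metadata": {
                    "img_path": row["img_path"],
                    "type": row["type"],
                    "page_no": int(row["page_no"]),
                    "source": row["filepath"],
                },
            })
    return chunks
```

Keeping `img_path` in the metadata lets a RAG application show the original region image alongside the retrieved text block.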

See the main function in llm_parser.py for more details.

Installation

pip install llmdocparser

Installation from Source

To install this project from source, follow these steps:

  1. Clone the Repository:

    First, clone the repository to your local machine. Open your terminal and run the following commands:

    git clone https://github.com/lazyFrogLOL/llmdocparser.git
    cd llmdocparser
  2. Install Dependencies:

    This project uses Poetry for dependency management. Make sure you have Poetry installed. If not, you can follow the instructions in the Poetry Installation Guide.

    Once Poetry is installed, run the following command in the project's root directory to install the dependencies:

    poetry install

    This will read the pyproject.toml file and install all the required dependencies for the project.

Usage

from llmdocparser.llm_parser import get_image_content

content, cost = get_image_content(
    llm_type="azure",
    pdf_path="path/to/your/pdf",
    output_dir="path/to/output/directory",
    max_concurrency=5,
    azure_deployment="azure-gpt-4o",
    azure_endpoint="your_azure_endpoint",
    api_key="your_api_key",
    api_version="your_api_version"
)
print(content)
print(cost)

Parameters

  • llm_type: str

    The options are azure, openai, dashscope.

  • pdf_path: str

    Path to the PDF file.

  • output_dir: str

    Output directory to store all parsed images.

  • max_concurrency: int

    Number of GPT parsing worker threads. Batch calling details: Batch Support

If using Azure, the azure_deployment and azure_endpoint parameters must be provided; otherwise, only the API key is needed.
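The effect of max_concurrency can be sketched with a bounded worker pool that fans region images out to per-image model calls (an illustration of the idea, not the package's actual internals):

```python
from concurrent.futures import ThreadPoolExecutor

def parse_regions_concurrently(image_paths, parse_fn, max_concurrency=5):
    """Run `parse_fn` (one model call per region image) across
    `image_paths` with at most `max_concurrency` worker threads,
    preserving input order in the results."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(parse_fn, image_paths))
```

Threads suit this workload because each call is I/O-bound (waiting on the LLM API), so the pool size mainly controls how many requests are in flight at once.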

Cost

Using the 'Attention Is All You Need' paper for analysis, with GPT-4o as the model, the cost is as follows:

Total Tokens: 44063
Prompt Tokens: 33812
Completion Tokens: 10251
Total Cost (USD): $0.322825

Average cost per page: $0.0215

Star History

[Star History Chart]
