🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based


`pd3f`

Experimental, use with care.

pd3f is a PDF text extraction pipeline that is self-hosted, local-first and Docker-based. It reconstructs the original continuous text with the help of machine learning.

pd3f can OCR scanned PDFs with OCRmyPDF (Tesseract) and extracts tables with Camelot and Tabula. It's built upon the output of Parsr. Parsr detects hierarchies of text and splits the text into words, lines and paragraphs.

Even though Parsr brings some structure to the PDF, the text is still scrambled, e.g., due to hyphens. The underlying Python package pd3f-core tries to reconstruct the original continuous text by removing hyphens, new lines and/or spaces. It uses language models to guess what the original text looked like.
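To make the reconstruction problem concrete, here is a minimal sketch of a purely rule-based de-hyphenation. It is not how pd3f-core works (pd3f-core uses language models, as described above); the function name `naive_dehyphenate` and its rules are hypothetical and only illustrate the kind of decision that has to be made at every line break.

```python
def naive_dehyphenate(lines):
    """Join extracted lines into continuous text (hypothetical helper).

    Rules, deliberately simplistic:
    - If the text so far ends in "-" and the next line starts with a
      lowercase letter, treat it as one word split across lines:
      drop the hyphen and join without a space.
    - Otherwise join with a single space.
    """
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if text.endswith("-") and line[0].islower():
            text = text[:-1] + line
        elif text:
            text = text + " " + line
        else:
            text = line
    return text


scrambled = [
    "Die Bundesregie-",
    "rung hat den Antrag",
    "abge-",
    "lehnt.",
]
print(naive_dehyphenate(scrambled))
# prints: Die Bundesregierung hat den Antrag abgelehnt.
```

A fixed rule like this breaks quickly: a compound such as "E-Mail-" followed by "Adresse" must keep its hyphen and must not get a space. Ambiguous cases like that are where a guess informed by a language model can help.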

pd3f is especially useful for languages with long words such as German. It was mainly developed to parse German letters and official documents. Besides German, pd3f supports English, Spanish, French and Italian. More languages will be added at a later stage.

pd3f includes a web-based GUI and a Flask-based microservice (API). You can find a demo at demo.pd3f.com.

Documentation

Check out the full documentation at: https://pd3f.com/docs/

Future Work / TODO

PDFs are hard to process and it's hard to extract information from them, so the results of this tool may not satisfy you. There will be more work to improve this software, but altogether it's unlikely that it will successfully extract all the information anytime soon.

Here are some things that will be improved.

- statistics about how long processing (per page) took in the past
  - calculate runtime based on `job.started_at` and `job.ended_at`
  - get average runtime of jobs and store data in a Redis list
- more information about the PDF
  - NER
  - entity linking
  - extract keywords
  - use textacy
- add more languages
  - check if flair has a model
  - what to do if there is no fast model?
- Python client
  - simple client based on request
  - send whole folders
- Markdown / HTML export
  - go beyond text
- use pdf-scripts / allow more processing
  - reduce size
  - repair PDF
  - detect if scanned
  - force to OCR again
- improve logs / get better feedback
  - show uncertainty of ML model
  - allow different log levels
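The runtime statistics item could look roughly like the sketch below. It is not existing pd3f code: the key name, the sample cap, the `num_pages` argument and all function names are assumptions. The only parts taken from the list above are the two timestamps (`job.started_at`, `job.ended_at`) and the idea of keeping samples in a Redis list. `store` can be any object with redis-py style `rpush`, `ltrim` and `lrange` methods; a small in-memory stand-in is included so the example runs without a Redis server.

```python
from datetime import datetime, timedelta

RUNTIME_KEY = "pd3f:runtimes"  # hypothetical Redis key name
MAX_SAMPLES = 100              # hypothetical cap on stored samples


def record_runtime(store, started_at, ended_at, num_pages):
    """Append the per-page runtime (seconds) of a finished job to a list."""
    seconds = (ended_at - started_at).total_seconds()
    per_page = seconds / max(num_pages, 1)
    store.rpush(RUNTIME_KEY, per_page)
    # Keep only the newest MAX_SAMPLES entries.
    store.ltrim(RUNTIME_KEY, -MAX_SAMPLES, -1)
    return per_page


def average_runtime(store):
    """Average per-page runtime over stored samples, or None if empty."""
    samples = [float(x) for x in store.lrange(RUNTIME_KEY, 0, -1)]
    if not samples:
        return None
    return sum(samples) / len(samples)


class InMemoryStore:
    """Stand-in that mimics the three Redis list commands used above."""

    def __init__(self):
        self.lists = {}

    def _slice(self, key, start, end):
        # Translate Redis's inclusive, possibly negative indices.
        items = self.lists.get(key, [])
        n = len(items)
        start = max(n + start, 0) if start < 0 else start
        end = n + end if end < 0 else end
        return items[start:end + 1]

    def rpush(self, key, *values):
        self.lists.setdefault(key, []).extend(values)

    def ltrim(self, key, start, end):
        self.lists[key] = self._slice(key, start, end)

    def lrange(self, key, start, end):
        return self._slice(key, start, end)


store = InMemoryStore()
t0 = datetime(2020, 1, 1, 12, 0, 0)
record_runtime(store, t0, t0 + timedelta(seconds=30), num_pages=10)  # 3 s/page
record_runtime(store, t0, t0 + timedelta(seconds=10), num_pages=2)   # 5 s/page
print(average_runtime(store))  # prints 4.0
```

With a real server, passing a `redis.Redis(...)` client as `store` should work the same way, since only `rpush`, `ltrim` and `lrange` are used.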

Related Work

- https://github.com/axa-group/Parsr
- https://github.com/jzillmann/pdf-to-markdown
- some PDF processing tools in my blog post

Development

Install and use poetry.

Initially run:

./dev.sh --build

Omit `--build` if the Docker images do not need to be built. Right now Docker + poetry is not able to cache the installs, so building the image all the time is uncool.

Contributing

If you have a question, found a bug or want to propose a new feature, have a look at the issues page.

Pull requests are especially welcome when they fix bugs or improve the code quality.

License

Affero General Public License 3.0

