
pd3f

Experimental, use with care.

pd3f is a PDF text extraction pipeline that is self-hosted, local-first and Docker-based. It reconstructs the original continuous text with the help of machine learning.

pd3f can OCR scanned PDFs with OCRmyPDF (Tesseract) and extracts tables with Camelot and Tabula. It's built upon the output of Parsr. Parsr detects hierarchies of text and splits the text into words, lines and paragraphs.

Even though Parsr brings some structure to the PDF, the text is still scrambled, e.g., due to hyphens. The underlying Python package pd3f-core tries to reconstruct the original continuous text by removing hyphens, new lines and/or spaces. It uses language models to guess what the original text looked like.
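The idea of choosing between candidate joins can be illustrated with a minimal sketch. This is not the pd3f-core implementation: `join_lines`, `toy_score` and `TOY_VOCAB` are hypothetical names, and the scorer is a tiny vocabulary lookup standing in for a real language model.

```python
from typing import Callable, Iterable

# Hypothetical stand-in vocabulary; a real setup would score with a language model.
TOY_VOCAB = frozenset({"the", "document", "is", "well-known"})


def toy_score(text: str) -> float:
    """Fraction of words found in TOY_VOCAB (placeholder for a language model score)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w.strip(".,;:!?") in TOY_VOCAB for w in words) / len(words)


def join_lines(lines: Iterable[str], score: Callable[[str], float] = toy_score) -> str:
    """Merge extracted lines into continuous text.

    When a line ends with a hyphen, build both candidates (hyphen removed,
    hyphen kept) and keep the one the scorer rates higher.
    """
    cleaned = [line.strip() for line in lines if line.strip()]
    if not cleaned:
        return ""
    text = cleaned[0]
    for nxt in cleaned[1:]:
        if text.endswith("-"):
            candidates = [
                text[:-1] + nxt,  # hyphen was only a line-break artifact
                text + nxt,       # hyphen belongs to the word
            ]
        else:
            candidates = [text + " " + nxt]
        text = max(candidates, key=score)
    return text


if __name__ == "__main__":
    print(join_lines(["The docu-", "ment is well-", "known."]))
```

The same structure would work with any scorer that returns a higher value for more plausible text, which is where a language model would plug in.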

pd3f is especially useful for languages with long words such as German. It was mainly developed to parse German letters and official documents. Besides German, pd3f supports English, Spanish, French and Italian. More languages will be added at a later stage.

pd3f includes a web-based GUI and a Flask-based microservice (API). You can find a demo at demo.pd3f.com.

Documentation

Check out the full documentation at: https://pd3f.com/docs/

Future Work / TODO

PDFs are hard to process and it's hard to extract information from them. So the results of this tool may not satisfy you. There will be more work to improve this software, but altogether it's unlikely that it will successfully extract all the information anytime soon.

Here are some things that will get improved.

statistics about how long processing (per page) took in the past

  • calculate runtime based on job.started_at and job.ended_at
  • get average runtime of jobs and store data in a Redis list
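A rough sketch of how this TODO item could look, assuming `started_at` and `ended_at` are `datetime` values and the connection object offers `rpush`, `ltrim` and `lrange` as redis-py does. The key name `pd3f:runtimes`, the function names and the `InMemoryList` stand-in are hypothetical and only there so the example runs without a Redis server.

```python
from datetime import datetime, timedelta

RUNTIME_KEY = "pd3f:runtimes"  # hypothetical key name


def runtime_per_page(started_at: datetime, ended_at: datetime, num_pages: int) -> float:
    """Seconds spent per page for one finished job."""
    return (ended_at - started_at).total_seconds() / max(num_pages, 1)


def record_runtime(conn, seconds: float, key: str = RUNTIME_KEY, keep: int = 100) -> None:
    """Append a runtime to the list and keep only the newest `keep` entries."""
    conn.rpush(key, seconds)
    conn.ltrim(key, -keep, -1)


def average_runtime(conn, key: str = RUNTIME_KEY):
    """Average of the stored runtimes, or None if nothing is stored yet."""
    values = [float(v) for v in conn.lrange(key, 0, -1)]
    return sum(values) / len(values) if values else None


class InMemoryList:
    """Minimal stand-in for the three Redis list commands used above."""

    def __init__(self):
        self.data = {}

    def _slice(self, key, start, end):
        items = self.data.get(key, [])
        n = len(items)
        start = max(start + n, 0) if start < 0 else start
        end = end + n if end < 0 else end
        return items[start:end + 1]  # Redis ranges include the end index

    def rpush(self, key, *values):
        self.data.setdefault(key, []).extend(str(v) for v in values)
        return len(self.data[key])

    def ltrim(self, key, start, end):
        self.data[key] = self._slice(key, start, end)

    def lrange(self, key, start, end):
        return self._slice(key, start, end)


if __name__ == "__main__":
    conn = InMemoryList()  # swap for redis.Redis(...) in a real setup
    started = datetime(2024, 1, 1, 12, 0, 0)
    ended = started + timedelta(seconds=30)
    record_runtime(conn, runtime_per_page(started, ended, num_pages=10))
    print(average_runtime(conn))
```

Trimming the list on every write keeps the average focused on recent jobs and bounds memory use; the cutoff of 100 entries is an arbitrary choice.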

more information about PDF

  • NER
  • entity linking
  • extract keywords
  • use textacy

add more languages

  • check if flair has a model
  • what to do if there is no fast model?

Python client

  • simple client based on requests
  • send whole folders

Markdown / HTML export

  • go beyond text

use pdf-scripts / allow more processing

  • reduce size
  • repair PDF
  • detect if scanned
  • force to OCR again

improve logs / get better feedback

  • show uncertainty of ML model
  • allow different log levels

Related Work

Development

Install and use poetry.

Initially run:

./dev.sh --build

Omit --build if the Docker images do not need to be built. Right now Docker + poetry is not able to cache the installs, so building the image all the time is uncool.

Contributing

If you have a question, found a bug or want to propose a new feature, have a look at the issues page.

Pull requests are especially welcome when they fix bugs or improve the code quality.

License

Affero General Public License 3.0

