pdf-parser

Easily deployable and scalable backend server that efficiently converts various document formats (PDF, DOCX, PPTX, HTML, images, etc.) into Markdown. It supports both CPU and GPU processing, is suited to large-scale workflows, and offers text/table extraction, OCR, and batch processing with sync/async endpoints.

api markdown-parser pdf-converter pdf-conversion pdf-parsing pdf-parser fastapi pdf-chatbot pdf-to-markdown

Updated Mar 4, 2025
Python

titipata/scipdf_parser

Star 399

Python PDF parser for scientific publications: content and figures

pdf parser pdf-parser python-parser grobid scipdf-parser

Updated Mar 21, 2024
Python

iamarunbrahma/vision-parse

Star 321

Parse PDFs into markdown using Vision LLMs

text-extraction pdf-parser document-parser pdf-to-markdown

Updated Feb 8, 2025
Python

lazyFrogLOL/llmdocparser

Star 266

A package for parsing PDFs and analyzing their content using LLMs.

nlp ocr chunking document-analysis pdf-parser pdfparser rag llm text-chunking

Updated Aug 6, 2024
Python

michelcrypt4d4mus/pdfalyzer

Star 258

Analyze PDFs. With colors. And Yara.

pdf malware-analysis pdf-documents pdf-format pdf-parser malicious-pdf-files

Updated Dec 14, 2024
Python

ispras/dedoc

Star 222

Dedoc is a library (service) for automating document parsing and converting documents to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser

html pdf ocr table-of-contents excel html-parser docx documents doc scanned-documents txt document-analysis odt pdf-parser table-recognition docx-parser document-content-extraction logical-structure-extraction

Updated Feb 14, 2025
Python

sypht-team/sypht-python-client

Star 162

A Python client for the Sypht API

python extract api-client python3 information-extraction data-extraction invoice python3-library pdf-parser receipt-scanner extract-data-from-pdf extract-fields receipt-capture document-capture sypht sypht-api sypht-python-client invoice-parser receipt-reader receipt-scanning

Updated Jul 10, 2024
Python

codereverser/casparser

Star 142

Parser for Consolidated Account Statements (CAS) generated from CAMS/Karvy/Kfintech

parser python3 cas capital-gain mutual-funds cams pdf-parser capital-gains capital-gains-calculator consolidated-account-statements karvy mutual-fund-portfolio kfintech 112a

Updated Feb 26, 2025
Python

sypht-team/sypht-java-client

Star 87

A Java client for the Sypht API

java information-retrieval extract api-client java8 data-extraction invoice information-retrieval-engine pdf-parser receipt-scanner extract-data-from-pdf extract-fields receipt-capture document-capture sypht sypht-api sypht-java-client invoice-parser receipt-reader receipt-scanning

Updated Jun 4, 2021
Java

datalogics/adobe-pdf-library-samples

Star 81

Sample code for the Datalogics C++, Java, and .NET interfaces of the Adobe PDF Library

pdf ocr pdf-converter pdf-document pdf-conversion pdf-generation pdf-to-text pdf-manipulation pdfa pdf-split pdf-merger pdf-parser pdf-to-image pdf-tools pdf-compression pdf-lib pdf-render ocr-pdf pdf-to-office

Updated May 22, 2023

BitMiracle/Docotic.Pdf.Samples

Star 75

C# and VB.NET samples for Docotic.Pdf library

pdf-forms extract-images html-to-pdf pdf-generation pdf-to-text extract-text pdf-manipulation net-core pdf-merge pdf-library pdf-parser pdf-to-image pdf-signature print-pdf pdf-compression pdf-annotation pdf-flattener sign-pdf images-to-pdf docotic-pdf

Updated Mar 5, 2025
Visual Basic .NET

drmingler/smart-llm-loader

Star 59

smart-llm-loader is a lightweight yet powerful Python package that transforms any document into LLM-ready chunks. Spend less time on preprocessing headaches and more time building what matters. From RAG systems to chatbots to document Q&A, SmartLLMLoader handles the heavy lifting so you can focus on creating exceptional AI applications.

markdown chatbot pdf-converter gemini openai chunking pdf-parser claude rag langchain llama-index pdf-to-markdown

Updated Feb 14, 2025
Python

tuffstuff9/nextjs-pdf-parser

Star 59

Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.

nextjs content-extraction pdf-parsing react-pdf pdf-parser pdf2json filepond pdf-upload pdf-parse nextjs-pdf-parser nextjs-pdf react-pdf-parser nextjs-pdf-parse nextjs-pdf-parsing