document-parsing

Here are 57 public repositories matching this topic...

By language: Python 32, Jupyter Notebook 6, HTML 4, TypeScript 3, Java 2, C++ 1, JavaScript 1, PHP 1, Shell 1

Sorted by: most stars

PaddlePaddle/PaddleOCR

Star 65.3k

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

ocr pdf-parser kie document-translation rag chineseocr ai4science pp-ocr document-parsing pp-structure pdf-extractor-rag pdf2markdown paddleocr-vl

Updated Nov 28, 2025
Python

docling-project/docling

Star 45.3k

Get your documents ready for gen AI

html markdown pdf ai convert xlsx pdf-converter docx documents pptx pdf-to-text tables document-parser pdf-to-json document-parsing

Updated Nov 27, 2025
Python

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

nlp pdf machine-learning natural-language-processing information-retrieval ocr deep-learning ml docx preprocessing pdf-to-text data-pipelines donut document-image-processing document-parser pdf-to-json document-image-analysis llm document-parsing langchain

Updated Nov 24, 2025
HTML

run-llama/llama_cloud_services

Star 4.2k

Knowledge Agents and Management in the Cloud

pdf parsing document pptx structured-data pdf-to-text pdf-to-excel tables docx-to-markdown document-parser pdf-document-processor pdf-to-json document-parsing ppt-to-json pdf-to-markdown ppt-to-markdown

Updated Nov 28, 2025
TypeScript

enoch3712/ExtractThinker

Star 1.5k

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

python nlp pdf machine-learning ocr ai openai pdf-to-text document-processing document-image-analysis document-intelligence llm document-parsing langchain

Updated Aug 27, 2025
Python

NanoNets/docstrange

Star 1.1k

Extract and convert data from any document, image, PDF, Word doc, PPT, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

markdown ocr ai structured-data tables pdf-parser document-parser structured-data-capture pdf-to-json llm document-parsing image-to-markdown pdf-to-markdown

Updated Oct 31, 2025
Python

opendataloader-project/opendataloader-pdf

Star 784

Safe, Open, High-Performance — PDF for AI

html markdown pdf json sdk recognition ai pdf-converter documents dataloader tables ocr-recognition document-parser pdf-to-html pdf-to-json document-parsing pdf-to-markdown

Updated Nov 27, 2025
Java

edenai/edenai-apis

Star 459

Eden AI: simplify the use and deployment of AI technologies by providing a unique API that connects to the best possible AI engines

python nlp api natural-language-processing text-to-speech ocr ai computer-vision aggregator machine-translation image-processing speech-recognition speech-to-text optical-character-recognition ai-as-a-service video-recognition pre-trained-model document-parsing

Updated Nov 25, 2025
Python

harishdeivanayagam/rowfill

Star 366

Open-source spreadsheet platform for deep research and document processing

pdf ocr nextjs vision openai document llama pdfs vision-api unstructured unstructured-data document-extraction image-ocr ocr-javascript llm document-parsing ollama langgraph

Updated Sep 25, 2025
TypeScript

GiftMungmeeprued/document-parsers-list

Star 165

A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each is tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.

pdf ocr preprocessing pdf-to-text document-image-processing data-pipeline document-parser document-parsing langchain

Updated Jul 14, 2025

AdemBoukhris457/Documents-Parsing-Lab

Star 75

Jupyter notebooks testing different OCR models for document parsing (Dolphin, MonkeyOCR, Marker, Nanonets, ...)

ocr ai parsing-data document-parsing genai

Updated Nov 1, 2025
Jupyter Notebook

CycloneBoy/pdf_table

Star 53

A Unified Toolkit for Deep Learning-Based Table Extraction

pdf ocr ai table layout-analysis pdf-to-html table-recognition document-parsing

Updated Nov 21, 2024
Python

papercast-dev/papercast

Star 52

A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, Semantic Scholar, or PDF files with GROBID and LangChain, and listen to them as podcasts. Customize your own pipelines.

python nlp pipeline podcast pdf-converter tts arxiv pdf-to-text dag document-parser pdf-document-processor grobid semantic-scholar document-parsing

Updated Mar 17, 2025
Python

Unstructured-IO/community

Star 29

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

open-source community machine-learning deep-learning nlp-parsing data-pipeline ocr-python document-ai preprocessing-data document-parsing

Updated Apr 7, 2023

Hyland/DocumentFilters

Star 24

Document Filters is an SDK for applications like content indexing, e-discovery, data migration, and feeding data into AI/ML models by extracting data from unstructured sources. It gives the ability to perform deep inspection, data extraction, output manipulation, and conversion for virtually any type of document, in any programming language.

html markdown pdf machine-learning sdk ai convert xlsx ml pdf-converter docx pptx preprocessing tables document-parser llm document-parsing

Updated Nov 12, 2025
C++

docling-project/docling4j

Star 20

Docling4j brings the functionalities of Docling in document understanding to Java® projects

java pdf ai pdf-converter documents document-parser pdf-to-json document-understanding document-parsing docling

Updated Mar 31, 2025
Java

acenji/ats

Star 13

Applicant Tracking System (ATS): A powerful platform leveraging generative AI and soft-match algorithms to analyze resumes against job descriptions. Built with React and Node.js, it streamlines hiring insights. Future plans include expanding to investor pitches and other structured documents.

nodejs reactjs sorting-algorithms ats keyword-extraction nlp-machine-learning job-matching resume-analysis applicant-tracking-system document-parsing generative-ai investor-pitches