document-parsing
Here are 57 public repositories matching this topic...
Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.
- Updated Nov 28, 2025 - Python
Get your documents ready for gen AI
- Updated Nov 27, 2025 - Python
Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.
- Updated Nov 24, 2025 - HTML
Knowledge Agents and Management in the Cloud
- Updated Nov 28, 2025 - TypeScript
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
- Updated Aug 27, 2025 - Python
Extract and convert data from any document, image, PDF, Word doc, PPT, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.
- Updated Oct 31, 2025 - Python
Safe, Open, High-Performance — PDF for AI
- Updated Nov 27, 2025 - Java
Eden AI: simplify the use and deployment of AI technologies by providing a unique API that connects to the best possible AI engines
- Updated Nov 25, 2025 - Python
Open-source spreadsheet platform for deep research and document processing
- Updated Sep 25, 2025 - TypeScript
A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each is tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.
- Updated Jul 14, 2025
Jupyter notebooks testing different OCR models for document parsing (Dolphin, MonkeyOCR, Marker, Nanonets, ...)
- Updated Nov 1, 2025 - Jupyter Notebook
A Unified Toolkit for Deep Learning-Based Table Extraction
- Updated Nov 21, 2024 - Python
A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, or PDF with GROBID and LangChain, or listen to them as a podcast. Customize your own pipelines.
- Updated Mar 17, 2025 - Python
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
- Updated Apr 7, 2023
Document Filters is an SDK for applications like content indexing, e-discovery, data migration, and feeding data into AI/ML models by extracting data from unstructured sources. It gives the ability to perform deep inspection, data extraction, output manipulation, and conversion for virtually any type of document, in any programming language.
- Updated Nov 12, 2025 - C++
Docling4j brings Docling's document-understanding functionality to Java® projects
- Updated Mar 31, 2025 - Java
Applicant Tracking System (ATS): A powerful platform leveraging generative AI and soft-match algorithms to analyze resumes against job descriptions. Built with React and Node.js, it streamlines hiring insights. Future plans include expanding to investor pitches and other structured documents.
- Updated Apr 15, 2025 - JavaScript
Official implementation of our ECCVW paper "μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context"
- Updated Aug 30, 2024 - Python
Tool for converting First National Bank (FNB) bank statement PDFs into useful structured data
- Updated Oct 31, 2024 - Python
The metadata and text content extractor for almost every file type.
- Updated Feb 3, 2025 - Python