document-processing
Here are 171 public repositories matching this topic...
A system for agentic LLM-powered data processing and ETL
- Updated Jul 8, 2025 - Python
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
- Updated Jun 9, 2025 - Python
Generic framework for historical document processing
- Updated Jul 9, 2021 - Python
TWIX is an open-source data extraction tool that reconstructs structured data from documents at scale, accurately and at low cost, by inferring the shared underlying visual template across documents.
- Updated May 29, 2025 - Python
⚡ Cloud-native, AI-powered, document processing pipelines on AWS.
- Updated Mar 18, 2025 - TypeScript
A full-featured Document Management Platform / Document Layer for your application, providing storage, discovery, processing, and retrieval. Deploys directly into your Amazon Web Services Cloud. Please 🌟 star to support our work!
- Updated Jul 11, 2025 - Java
A Python framework for multi-modal document understanding with Amazon Bedrock
- Updated Jun 19, 2025 - Python
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
- Updated Nov 22, 2024 - Python
Retrieval of fully structured data made easy. Use LLMs or custom models. Specialized in PDF and HTML files. Extensive support for tabular data extraction and multimodal queries.
- Updated Jul 8, 2025 - Python
An include filter for Pandoc
- Updated Dec 6, 2020 - Haskell
A Model Context Protocol (MCP) server implementation that exposes document processing capabilities through natural language, supporting both direct human interaction and AI agent tool calling.
- Updated Jul 4, 2025 - TypeScript
Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.
- Updated Jun 15, 2025 - JavaScript
Unofficial mirror of git://git.lyx.org/lyx.git (updates daily; not affiliated with lyx.org).
- Updated Mar 21, 2023 - C++
ResumeTex is an AI-powered tool that converts standard PDF resumes into professionally formatted LaTeX documents. This service helps you create elegant, structured resumes without needing to learn LaTeX syntax.
- Updated Jul 11, 2025 - JavaScript
A comprehensive list of annotated training datasets classified by use case.
- Updated Jul 8, 2022
An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.
- Updated Aug 17, 2024 - Python
Semantic extraction from conference proceedings.
- Updated Jul 26, 2020 - Python
This library builds a graph representation of the content of PDFs. The graph is then clustered, and the resulting page segments are classified and returned. Tables are returned formatted as CSV.
- Updated Sep 11, 2020 - Python
Low-Cost LLM-Powered Data Processing with Theoretical Guarantees
- Updated May 1, 2025 - Python
tokyo is a REST API that, given any type of document 📄, identifies its MIME type 🧐, suggests a file extension 🦔, and also extracts the text 💪.
- Updated Jun 13, 2020 - Clojure