document-processing

TWIX is an open-source data extraction tool that reconstructs structured data from documents at scale, accurately and at low cost, by inferring the visual template shared across documents.

document-processing document-data-extraction

Updated May 29, 2025
Python

awslabs/project-lakechain

Star 181

⚡ Cloud-native, AI-powered, document processing pipelines on AWS.

aws machine-learning natural-language-processing computer-vision serverless hacktoberfest document-processing aws-cdk generative-ai retrieval-augmented-generation

Updated Mar 18, 2025
TypeScript

A full-featured Document Management Platform / Document Layer for your application, providing storage, discovery, processing, and retrieval. Deploys directly into your Amazon Web Services Cloud. Please 🌟 star to support our work!

aws ocr serverless headless cloud-storage document-database amazon-web-services dms document-management optical-character-recognition document-processing document-management-system document-api document-apis intelligent-document-processing document-layer

Updated Jul 19, 2025
Java

awslabs/rhubarb

Star 93

A Python framework for multi-modal document understanding with Amazon Bedrock

multi-modal document-processing generative-ai intelligent-document-processing amazon-bedrock

Updated Jun 19, 2025
Python

iamarunbrahma/pdf-to-markdown

Star 86

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

python information-retrieval document-conversion pdf-converter text-extraction pdf-parsing document-processing rag pdf-extraction retrieval-augmented-generation pdf-to-markdown

Updated Nov 22, 2024
Python

parsee-ai/parsee-core

Star 72

Retrieval of fully structured data made easy. Use LLMs or custom models. Specialized in PDFs and HTML files. Extensive support for tabular data extraction and multimodal queries.

structured-data document-processing multimodal llm

Updated Jul 8, 2025
Python

steindani/pandoc-include

Star 62

An include filter for Pandoc

markdown pandoc pandoc-filter document-processing

Updated Dec 6, 2020
Haskell

PSPDFKit/nutrient-document-engine-mcp-server

Star 55

A Model Context Protocol (MCP) server implementation that exposes document processing capabilities through natural language, supporting both direct human interaction and AI agent tool calling.

document-processing document-processor agentic-ai mcp-server

Updated Jul 14, 2025
TypeScript

aws-solutions/enhanced-document-understanding-on-aws

Star 38

Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.

document-analysis document-processing

Updated Jun 15, 2025
JavaScript

cburschka/lyx

Star 37

Unofficial mirror of git://git.lyx.org/lyx.git (updates daily; not affiliated with lyx.org).

latex mirror lyx document-processing

Updated Mar 21, 2023
C++

abdullahshafiq-20/ResumeTex

Star 37

ResumeTex is an AI-powered tool that converts standard PDF resumes into professionally formatted LaTeX documents. This service helps you create elegant, structured resumes without needing to learn LaTeX syntax.

nodejs resume open-source tex automation express latex reactjs developer-tools job-application pdf-parsing document-processing tailwindcss pdf-to-latex google-generative-ai ai-resume-generator resume-converter

Updated Jul 18, 2025
JavaScript

kili-technology/awesome-datasets

Star 35

A comprehensive list of annotated training datasets classified by use case.

nlp data ocr annotation opendata dataset corpora datasets public-data ner entity-extraction open-datasets document-processing entity-recognition awesome-public-datasets awesome-data-science public-dataset public-datasets opendatasets awesome-datasets

Updated Jul 8, 2022

jmanhype/DSPy-Multi-Document-Agents

Star 33

An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.

nlp distributed-systems ai query-optimization knowledge-management document-processing vector-search

Updated Aug 17, 2024
Python

afrozas/proceedings

Star 31

Semantic extraction from conference proceedings.

semantic conferences spacy document-processing

Updated Jul 26, 2020
Python

MBAigner/PDFSegmenter

Star 23

This library builds a graph representation of the content of PDFs. The graph is then clustered, and the resulting page segments are classified and returned. Tables are returned formatted as CSV.

python pdf csv table annotations cluster-analysis document-processing layout-analysis detection-model page-segmentation
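
The graph-then-cluster idea in the PDFSegmenter description can be illustrated with a small sketch. This is not PDFSegmenter's API or its actual algorithm; it is a generic illustration that assumes text boxes are given as hypothetical `(x0, y0, x1, y1)` tuples, links boxes whose gap is below an assumed `max_gap` threshold, and treats each connected component of that graph as one page segment. The names `gap`, `segment`, and `max_gap` are made up for this example.

```python
from itertools import combinations


def gap(a, b):
    """Axis-aligned gap between two boxes (x0, y0, x1, y1); 0 if they overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return max(dx, dy)


def segment(boxes, max_gap=10.0):
    """Group box indices into segments via connected components.

    An edge joins two boxes when their gap is at most max_gap
    (an assumed threshold, not a value taken from any library).
    """
    parent = list(range(len(boxes)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Build the graph implicitly: merge the components of nearby boxes.
    for i, j in combinations(range(len(boxes)), 2):
        if gap(boxes[i], boxes[j]) <= max_gap:
            parent[find(i)] = find(j)

    # Collect box indices by component root.
    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two stacked lines close together, plus one box far down the page.
boxes = [(0, 0, 50, 10), (0, 12, 50, 22), (0, 100, 50, 110)]
segments = segment(boxes)
print(segments)
```

A real segmenter would go on to classify each segment (paragraph, table, figure) and would likely use richer edge features than plain distance, such as font or alignment; the sketch covers only the grouping step.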