document-analysis
Here are 228 public repositories matching this topic...
Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.
- Updated Dec 16, 2025 - Python
The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.
- Updated Dec 17, 2025 - Python
A system for agentic LLM-powered data processing and ETL
- Updated Nov 29, 2025 - Python
Read and extract text and other content from PDFs in C# (port of PDFBox)
- Updated Dec 7, 2025 - C#
An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)
- Updated Aug 25, 2025 - Python
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
- Updated Apr 9, 2025 - C++
A curated list of resources for Document Understanding (DU) topic
- Updated Jun 2, 2023
Open-source platform for extracting structured data from documents using AI.
- Updated May 15, 2025 - JavaScript
AI-powered document analysis platform built with Next.js, LangChain, PostgreSQL + pgvector. Upload, organize, and chat with documents. Includes predictive missing-document detection, role-based workflows, and page-level insight extraction.
- Updated Dec 18, 2025 - JavaScript
This repository provides training and test code, a dataset, detection and recognition annotations, an evaluation script, an annotation tool, and a ranking.
- Updated Jul 20, 2020 - Jupyter Notebook
Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser
- Updated Dec 16, 2025 - Python
Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)
- Updated Jul 25, 2024 - Python
AssemblyLine 4: File triage and malware analysis
- Updated Dec 18, 2025 - Python
Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)
- Updated Oct 31, 2022 - Python
Pandora is an analysis framework to discover if a file is suspicious and conveniently show the results
- Updated Dec 15, 2025 - Python
A package for parsing PDFs and analyzing their content using LLMs.
- Updated Aug 6, 2024 - Python
RObust document image BINarization
- Updated Aug 2, 2024 - Python
YOLO models trained on DocLayNet: power your document intelligence with layout analysis
- Updated Aug 3, 2025 - Python
Document Visual Question Answering
- Updated Jul 30, 2020 - Python
Powerful web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it ideal for efficient document retrieval and summarization.
- Updated Jul 4, 2024 - Python