pdf-extraction
Here are 66 public repositories matching this topic...
A polyglot document intelligence framework with a Rust core. Extract text, metadata, and structured information from PDFs, Office documents, images, and 50+ formats. Available for Rust, Python, Ruby, Go, and TypeScript/Node.js—or use via CLI, REST API, or MCP server.
Updated Nov 29, 2025 - HTML
Free open-source web software for signing PDFs (alone or with others), organizing pages, editing metadata, and compressing PDFs
Updated Nov 9, 2025 - JavaScript
Use TradeRepublic in the terminal and mass-download all documents
Updated Nov 29, 2025 - Python
JavaScript bindings for MuPDF
Updated Aug 25, 2025
Convert your PDFs and EPUBs into audiobooks effortlessly. Features intelligent text extraction, customizable text-to-speech settings, and efficient processing for low-resource systems.
Updated Mar 28, 2025 - Python
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
Updated Nov 22, 2024 - Python
Translate many large PDF Reports for free using Python.
Updated Dec 31, 2022 - Jupyter Notebook
Powerful PDF data extraction library powered by AI vision models. Transform PDFs into structured, validated data using TypeScript, Zod, and AI providers like Scaleway and Ollama.
Updated Sep 14, 2025 - TypeScript
Extract presentation slides from videos with accurate timestamps
Updated Aug 25, 2025 - Shell
This sample project provides a preview of the PDF Extract API. Using the sample project and this documentation, you will easily be able to integrate the PDF Extract API in your own server-side code.
Updated Apr 8, 2024 - Java
A tool to automatically extract GRI disclosure codes from corporate sustainability reports, enabling efficient analysis of environmental, social, and governance (ESG) data. Supports English and Indonesian reports.
Updated Jun 9, 2025 - Python
Anyparser TypeScript SDK for RAG/ETL pipelines: file content extraction. Supports extraction from various file formats including PDF, Microsoft Office documents, OCR/Image to Text, Audio to Text, and Website to Text.
Updated Feb 26, 2025 - TypeScript
A high-performance Python library for extracting structured content from PDF documents with layout-aware text extraction. pdf_to_json preserves document structure including headings (H1-H6) and body text, outputting clean JSON format.
Updated Nov 24, 2025 - Python
A Python + C implementation for image-based PDF page layout analysis and content extraction.
Updated Apr 13, 2023 - C++
AI-powered financial forecasting agent that extracts quarterly metrics, runs RAG on earnings transcripts, and generates structured next-quarter outlook via FastAPI + Ollama.
Updated Nov 28, 2025 - Python
Analyze your resume, GitHub profile, and a job description together. Extract skills from each source, compare them, and get insights on skill gaps, overlaps, and match scores to improve your resume and public profile.
Updated Nov 26, 2025 - Python
MCP server for academic paper search and retrieval
Updated Oct 7, 2025 - Python
PDF Analyzer is an efficient Python tool for the automatic analysis of PDF documents.
Updated Jun 30, 2025 - Python