pdf-extraction

Here are 66 public repositories matching this topic...

Goldziher/kreuzberg

Stars: 2.5k

A polyglot document intelligence framework with a Rust core. Extract text, metadata, and structured information from PDFs, Office documents, images, and 50+ formats. Available for Rust, Python, Ruby, Go, and TypeScript/Node.js—or use via CLI, REST API, or MCP server.

ruby python java rust golang node ffi wasm tesseract text-extraction metadata-extraction table-extraction pdfium rag pdf-extraction document-intelligence

Updated Nov 29, 2025
HTML

24eme/signaturepdf

Stars: 689

Free open-source web software for signing PDFs (alone or with others), organizing pages, editing metadata, and compressing PDFs

php pdf js signature pdf-manipulation pdf-merge pdf-format pdf-rotate pdf-merger pdf-meta-editor pdf-tools pdf-signature pdf-compression pdf-editor pdf-sign pdf-extraction pdf-signer pdf-metadata pdf-compressor

Updated Nov 9, 2025
JavaScript

pytr-org/pytr

Stars: 633

Use TradeRepublic in the terminal and mass-download all documents

portfolio finance terminal-app portfolio-performance pdf-extraction traderepublic-statements traderepublic

Updated Nov 29, 2025
Python

ArtifexSoftware/mupdf.js

Stars: 585

JavaScript bindings for MuPDF

javascript pdf typescript wasm mupdf pdf-viewer pdf-extraction

Updated Aug 25, 2025

mateogon/pdf-narrator

Stars: 140

Convert your PDFs and EPUBs into audiobooks effortlessly. Features intelligent text extraction, customizable text-to-speech settings, and efficient processing for low-resource systems.

pdf text-to-speech audiobook tts epub low-resource pdf-extraction pdf-to-audiobook immersive-reading kokoro-tts audiobook-generator pdf-audiobook

Updated Mar 28, 2025
Python

iamarunbrahma/pdf-to-markdown

Stars: 104

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

python information-retrieval document-conversion pdf-converter text-extraction pdf-parsing document-processing rag pdf-extraction retrieval-augmented-generation pdf-to-markdown

Updated Nov 22, 2024
Python

pcschreiber1/PDF_Extraction-Translation

Stars: 34

Translate many large PDF Reports for free using Python.

python pdf-extraction pdf-translation

Updated Dec 31, 2022
Jupyter Notebook

aidalinfo/extract-kit

Stars: 7

Powerful PDF data extraction library powered by AI vision models. Transform PDFs into structured, validated data using TypeScript, Zod, and AI providers like Scaleway and Ollama.

pdf document-processing ai-sdk pdf-extraction vision-llm

Updated Sep 14, 2025
TypeScript

MarkShawn2020/video2ppt

Stars: 6

Extract presentation slides from videos with accurate timestamps

python opencv video-processing cli-tool frame-extraction pdf-extraction video-to-slides presentation-extraction

Updated Aug 25, 2025
Shell

adobe/pdftools-extract-java-sdk-samples

Stars: 6

This sample project provides a preview of the PDF Extract API. Using the sample project and this documentation, you can integrate the PDF Extract API into your own server-side code.

java pdf extract pdf-extraction

Updated Apr 8, 2024
Java

rrayhka/GRI-Extractor

Stars: 3

A tool to automatically extract GRI disclosure codes from corporate sustainability reports, enabling efficient analysis of environmental, social, and governance (ESG) data. Supports English and Indonesian reports.

python nlp machine-learning pattern-matching tf-idf gri groq pdf-extraction streamlit sustainability-developoment-goals llm sustainability-reporting

Updated Jun 9, 2025
Python

anyparser/anyparserjs

Stars: 3

Anyparser TypeScript SDK for RAG/ETL pipelines: file content extraction. Supports extraction from various file formats including PDF, Microsoft Office documents, OCR/image to text, audio to text, and website to text.

crawler ocr microsoft-word web-crawler text-extraction artificial-intelligence knowledgebase ms-office microsoft-office etl-pipeline rag pdf-extraction n8n-nodes langchain retrieval-augmented-generation graph-rag cache-augmented-generation anyparser

Updated Feb 26, 2025
TypeScript

Rushi-Balapure/pdf_to_json

Stars: 2

A high-performance Python library for extracting structured content from PDF documents with layout-aware text extraction. pdf_to_json preserves document structure, including headings (H1-H6) and body text, and outputs clean JSON.

python nlp pdf json cross-platform offline python-library text-extraction data-extraction pdf-parser cli-tool document-processing layout-analysis pdf-to-json pdf-processing pdf-extraction document-parsing cpu-only structure-extraction

Updated Nov 24, 2025
Python

heshiming/paddlefish

Stars: 2

A Python + C implementation for image-based PDF page layout analysis and content extraction.

pdf image-processing image-segmentation image-analysis pdf-extractor table-extraction layout-analysis pdf-extraction

Updated Apr 13, 2023
C++

abhay-yemekar/forecastgpt-financial-outlook-agent

Stars: 2

AI-powered financial forecasting agent that extracts quarterly metrics, runs RAG on earnings transcripts, and generates a structured next-quarter outlook via FastAPI + Ollama.

mysql ai-agents faiss rag fastapi pdf-extraction financial-forecasting langchain ollama llama3

Updated Nov 28, 2025
Python

souvik03-136/TenderBot

Stars: 2

Task

opencv machine-learning ocr deep-learning curl text-recognition postman flask-api data-parsing table-extraction document-processing camelot pytesseract json-output github-actions ai-pipeline pdf-extraction pdfplumber google-tapas google-gemini

Updated Mar 15, 2025
Python

JaweriaAsif745/Resume-github-job-analyzer

Stars: 1

Analyze your resume, GitHub profile, and a job description together. Extract skills from each source, compare them, and get insights on skill gaps, overlaps, and match scores to improve your resume and public profile.

pdf-extraction nlp-project streamlit-app resume-analyzer github-skill-extractor