pdf-text-extraction

hyeonsangjeon /PDF2LLM-Tuning-Studio

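A solution that automatically generates high-quality question-answering (QA) data from PDF documents with GPU-accelerated processing and efficiently fine-tunes LLMs. It generates domain-specific QA pairs using the Unstructured library and AWS Bedrock Claude, and trains lightweight models with LoRA.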

processing docker aws gpu cuda bedrock data-extraction pdf-generation claude unstructured distillation finetuning sagemaker pdf-text-extraction data-argumantation llm awosome unsloth processing-job text-disti

Updated Nov 11, 2025
Jupyter Notebook

vijayengineer /PDFTextSpeechConverter

Star 6

Converts scanned and ordinary documents into MP3 speech using Amazon Polly.

pdf text images speech aws-polly audiobook synthesis scanned-documents pdf-text-extraction

Updated Dec 30, 2020
Python

PrathameshDhande22 /PdfTxtBot

Star 4

A Telegram bot that extracts text from PDFs and also extracts images of the PDF pages. Made with Python.

python telegram telegram-bot python3 python-telegram-bot image-extractor python-telegram pdf-text pdf-text-extraction pdf-image

Updated Feb 27, 2023
Python

eli64s /pdflex

Star 3

CLI for merging PDF contexts.

pdf-converter pdf-document pdf-generator pdf-manipulation pdf-extractor pdf-library pdf-parser pdf-data-extraction pdf-processor pdf-tools pdf-document-processor python-pdf pdf-search pdf-text-extraction pdf-python pdf-automation python-pdf-tools pdf-document-parser pdf-regex

Updated Mar 20, 2025
Python

Zeeshanahmad4 /NLP-Pdf-Minning-Extracting-text-from-pdf

Star 3

NLP PDF mining: extracting text from PDFs.

python pdf pdf-converter text-extraction pdfkit pdf-files extract-text pdftotext pdf-format pdf-document-processor pdftoimage pdftools pdftohtml pdf-text-extraction pdfcon

Updated Apr 2, 2020
Python

kushalpatel0265 /Resume-Parser

Star 3

A resume parser that extracts key details from PDF files using Groq's LLM

python nlp api google-colab pdf-text-extraction streamlit-webapp llm

Updated Apr 14, 2025
Jupyter Notebook

rithulkamesh /docproc

Star 2

Opinionated and Sophisticated Document Region Analyzer.

python machine-learning ocr text-classification text-extraction data-extraction region-detection content-extraction document-analysis layout-analysis pdf-processing pdf-text-extraction document-parsing equation-detection mathematical-symbols

Updated Apr 13, 2025
Python

VirajMadhu /pdf_key_matcher

Star 2

Highlights the key matches between a given PDF and the description text.

python open-source pdf cv python-script python3 text-extraction terminal-based ats text-compression pdf-text-extraction virajmadhu

Updated Dec 4, 2024
Python

bladeacer /pdf-fmt

Star 1

A PDF text extractor, processor and formatter. Supports regex based exclusions and other niceties.

python pdf text-formatting pdf-text-extraction

Updated Nov 8, 2025
Python

holasoymas /text-finder

Star 1

Console app that finds text in a PDF and reports the page number of each match.

csharp console-app pdf-text-extraction pdf-text-processing

Updated Mar 20, 2025
C#

rmottanet /unchainedtext

Star 1

UnchainedText: Break free from PDFs! Easily extract raw text to .txt for preprocessing.

extractor text-extraction data-extraction text-processing pdf-text-extraction text-extraction-tool

Updated Apr 2, 2024
Python

towfique-elahe /pdf-to-structured-csv

Star 0

A Python-based tool for extracting structured data from PDFs using OCR and regex, and exporting it to CSV. Ideal for processing invoices, logs, or scanned documents into organized, usable datasets.

ocr data-extraction pdf-to-csv document-processing pytesseract pdf2image python-automation pdf-text-extraction structured-data-extraction regex-parsing

Updated Oct 30, 2024
Jupyter Notebook

Spikes2012 /DjangoBusPriority

Star 0

Built for the Technology Application Project at Swinburne University of Technology.

django file-upload text-extraction image-to-text webapplication pdf-text-extraction

Updated Jun 6, 2023
Python

RealBlueSwan /BSPDFDataExtractor

Star 0

Extracts data from a provided PDF, using keywords to identify relevant data points. Uses UglyToad PdfPig (great lib, btw).

pdf-text-extraction

Updated Jul 20, 2024
C#

ahsan-javed-ds /file-text-extractor-java-project

Star 0

Multi-format (PDF/DOC/DOCX/XLSX/XLS/CSV) text extraction utility written in Java.

java maven log4j intellij text-extraction pdfbox java-programming java-project apache-tika apache-poi apache-maven pdf-text-extraction jdk17 doc-text-extraction docx-text-extraction xls-text-extraction xlsx-text-extraction csv-text-extraction

Updated Oct 24, 2024
Java

nsourlos /OCR_and_RAG

Star 0

Tests of OCR and RAG with LLMs

information-retrieval ocr gemini openai mistral document-processing cohere rag pdf-text-extraction colpali qwen2-vl

Updated Jun 23, 2025
Jupyter Notebook

simonpierreboucher /Crawler

A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.

rate-limiting http-requests error-handling html-parsing data-collection text-processing web-crawling content-extraction yaml-configuration data-scraping python-crawler modular-design metadata-storage url-normalization pdf-text-extraction structured-data-storage concurrent-crawling data-extraction-pipeline data-preservation-and-recovery

Updated Nov 18, 2024
Python

Keremunce /nodejs-pdf-extractor

Star 0

Node.js + Express app that extracts plain text from uploaded PDFs, with a browser UI for manual tests and pdf-parse driving the extraction pipeline.

nodejs javascript express rest-api web-app file-upload document-processing backend-service pdf-parse pdf-text-extraction

Updated Nov 5, 2025
HTML


Here are 20 public repositories matching this topic...

houking-can /PDFSDK

mamiriqbal1 /rag_book_qa_prompt

hyeonsangjeon /PDF2LLM-Tuning-Studio

vijayengineer /PDFTextSpeechConverter

PrathameshDhande22 /PdfTxtBot

eli64s /pdflex

Zeeshanahmad4 /NLP-Pdf-Minning-Extracting-text-from-pdf

kushalpatel0265 /Resume-Parser

rithulkamesh /docproc

VirajMadhu /pdf_key_matcher

bladeacer /pdf-fmt

holasoymas /text-finder

rmottanet /unchainedtext

towfique-elahe /pdf-to-structured-csv

Spikes2012 /DjangoBusPriority

RealBlueSwan /BSPDFDataExtractor

ahsan-javed-ds /file-text-extractor-java-project

nsourlos /OCR_and_RAG

simonpierreboucher /Crawler

Keremunce /nodejs-pdf-extractor
