content-extraction

Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.

nextjs content-extraction pdf-parsing react-pdf pdf-parser pdf2json filepond pdf-upload pdf-parse nextjs-pdf-parser nextjs-pdf react-pdf-parser nextjs-pdf-parse nextjs-pdf-parsing

Updated Dec 8, 2023
TypeScript

gregors/boilerpipe-ruby

Star 43

Pure Ruby implementation of the Boilerpipe content extraction algorithm, tuned for online articles

news webscraping content-extraction boilerpipe boilerpipe-algorithm

Updated Feb 21, 2021
Ruby

oiwn/dom-content-extraction

Star 37

DOM Based Content Extraction via Text Density

scraping content-extraction dom-based

Updated Sep 23, 2025
Rust

nikitautiu/learnhtml

Star 34

Web content extraction using machine learning

html deep-learning content-extraction

Updated Mar 3, 2021
HTML

spences10/mcp-jinaai-reader

Star 29

🔍 Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader

mcp documentation-tool text-extraction web-scraping content-extraction web-content jinaai llm-tools model-context-protocol

Updated Apr 5, 2025
JavaScript

gdamdam/sumo

Star 20

Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summaries, and more

nlp nltk automatic-summarization content-extraction semantic-analysis sentence-extraction entity-recognition

Updated Jan 15, 2019
Python

pdfix/pdfix_sdk_example_cpp

Star 19

Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...

html metadata pdf converter accessibility conversion tagging pdf-converter pdf-forms wcag digital-signature sign extract-data watermark autotag pdf-manipulation content-extraction pdf-data-extraction pdfua pdf2html

Updated Oct 7, 2025
C++

bencmc/youtube_video_summarizer

Star 15

This repository houses a Python application for extracting YouTube video transcripts and summarizing their content.

python natural-language-processing youtube-api video-processing openai text-summarization text-processing natural content-extraction streamlit transcript-analysis gpt-35-turbo langchain-python

Updated Sep 29, 2023
Python

timoteostewart/benson

Star 14

Benson turns a list of URLs into mp3s of the contents of each web page - take control over your reading backlog!

productivity web-scraping content-extraction boilerplate-removal

Updated Oct 30, 2024
Python

LandWhale2/TD-Spider

Star 13

A simple web crawler in Go based on text density

golang data-mining opensource dom web-crawler scraping content-extraction keyword-search text-density

Updated Mar 19, 2023
Go

peremenov/seize

Star 12

Seize is a lightweight Node or browser web-page content extractor inspired by Arc90 Readability and Safari Reader

dom extract reader readability content-extraction text-score

Updated May 20, 2017
HTML

kamjin3086/Crawell

Star 11

📸 Crawell: a free browser extension for one-click image and article extraction, Markdown conversion, and bulk download, with 100% local processing.

react chrome-extension markdown typescript firefox-addon web-scraping browser-extension edge-extension content-extraction image-downloader tailwindcss privacy-first

Updated Jul 31, 2025
TypeScript

amirthfultehrani/Youtube-Transcript-Copier

Star 11

A userscript that adds a button to YouTube video pages for copying the transcript with or without timestamps.

javascript productivity clipboard helper automation youtube web video utilities tool accessibility userscript text-extraction greasemonkey transcript data-extraction tampermonkey browser-extension content-extraction violentmonkey

Updated Oct 11, 2025
JavaScript

vakharwalad23/mark-minion

Star 11

The Ultimate Web Content Extraction & Conversion Tool for AI/LLM Applications. Convert almost any web content into clean Markdown with intelligent AI processing.

typescript web-scraping content-extraction document-processing tweets-extraction markdown-conversion puppeteer cloudflare-worker ai-powered