content-extraction

A powerful MCP server extension providing web search and content extraction capabilities. Integrates DuckDuckGo search functionality and URL content extraction into your MCP environment, enabling AI assistants to search the web and extract webpage content programmatically.

crawler cheerio mcp web-crawler duckduckgo web-scraper web-scraping google-search content-extraction duckduckgo-search web-search ai-assistant ai-tools web-content mcp-server web-search-agent

Updated Feb 13, 2026
JavaScript

mvasilkov/readability2

Star 109

Readability2 converts HTML to plain text.

javascript html readability plaintext content-extraction

Updated Dec 12, 2018
TypeScript

tuffstuff9/nextjs-pdf-parser

Star 67

Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.

nextjs content-extraction pdf-parsing react-pdf pdf-parser pdf2json filepond pdf-upload pdf-parse nextjs-pdf-parser nextjs-pdf react-pdf-parser nextjs-pdf-parse nextjs-pdf-parsing

Updated Dec 8, 2023
TypeScript

gregors/boilerpipe-ruby

Star 43

Pure Ruby implementation of the Boilerpipe content extraction algorithm, tuned for online articles

news webscraping content-extraction boilerpipe boilerpipe-algorithm

Updated Feb 21, 2021
Ruby

oiwn/dom-content-extraction

Star 38

DOM Based Content Extraction via Text Density

rust scraping web-crawling content-extraction dom-based

Updated Sep 23, 2025
Rust

nikitautiu/learnhtml

Star 34

Web content extraction using machine learning

html deep-learning content-extraction

Updated Mar 3, 2021
HTML

spences10/mcp-jinaai-reader

Star 30

🔍 Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader

mcp documentation-tool text-extraction web-scraping content-extraction web-content jinaai llm-tools model-context-protocol

Updated Apr 5, 2025
JavaScript

gdamdam/sumo

Star 20

Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summaries and more

nlp nltk automatic-summarization content-extraction semantic-analysis sentence-extraction entity-recognition

Updated Jan 15, 2019
Python

pdfix/pdfix_sdk_example_cpp

Star 19

Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...

html metadata pdf converter accessibility conversion tagging pdf-converter pdf-forms wcag digital-signature sign extract-data watermark autotag pdf-manipulation content-extraction pdf-data-extraction pdfua pdf2html

Updated Feb 20, 2026
C++

timoteostewart/benson

Star 16

Benson turns a list of URLs into mp3s of the contents of each web page - take control over your reading backlog!

productivity web-scraping content-extraction boilerplate-removal

Updated Oct 30, 2024
Python

bencmc/youtube_video_summarizer

Star 15

This repository houses a Python application for extracting YouTube video transcripts and summarizing their content.

python natural-language-processing youtube-api video-processing openai text-summarization text-processing natural content-extraction streamlit transcript-analysis gpt-35-turbo langchain-python

Updated Sep 29, 2023
Python

amirthfultehrani/Youtube-Transcript-Copier

Star 14

A userscript that adds a button to YouTube video pages for copying the transcript with or without timestamps.

javascript productivity clipboard helper automation youtube web video utilities tool accessibility userscript text-extraction greasemonkey transcript data-extraction tampermonkey browser-extension content-extraction violentmonkey

Updated Jan 25, 2026
JavaScript

jocmp/mercury-parser

Star 14

Extract meaningful content from the chaos of a web page

nodejs javascript rss web-scraping html-parser readability content-extraction article-parser mercury-parser reader-mode

Updated Feb 20, 2026
JavaScript

LandWhale2/TD-Spider

Star 13

Simple web crawler in Go based on text density

golang data-mining opensource dom web-crawler scraping content-extraction keyword-search text-density

Updated Mar 19, 2023
Go

peremenov/seize

Star 12

Seize is a lightweight web-page content extractor for Node or the browser, inspired by Arc90 Readability and Safari Reader

dom extract reader readability content-extraction text-score

Updated May 20, 2017
HTML

kamjin3086/Crawell

Star 12

📸 Crawell: a free browser extension for one-click image and article extraction, Markdown conversion and bulk download, with 100% local processing.

react chrome-extension markdown typescript firefox-addon web-scraping browser-extension edge-extension content-extraction image-downloader tailwindcss privacy-first

Updated Jul 31, 2025
TypeScript
