web-data-extraction
Here are 47 public repositories matching this topic...
🔥 The Web Data API for AI - Turn entire websites into LLM-ready markdown or structured data
- Updated Feb 20, 2026 - TypeScript
Open-source, production-grade web scraping engine built for LLMs. Scrape and crawl the entire web, clean markdown, ready for your agents.
- Updated Feb 2, 2026 - TypeScript
AI-based web wrapper for web content extraction
- Updated Feb 6, 2023 - Python
qcrawl - fast async web crawling & scraping framework for Python.
- Updated Dec 7, 2025 - Python
Using LLMs and AI browser automation to robustly extract web data
- Updated Sep 30, 2025 - TypeScript
The this.url class fetches and parses URL data, returning an object with structured information that can be kept in a database or other storage and used by machine learning algorithms.
- Updated Aug 26, 2025 - JavaScript
Quick guide, with a code example, on how to use Java for web scraping
- Updated Dec 18, 2024
GNewsScraper is a TypeScript package that scrapes article data from Google News based on a keyword or phrase. It returns the results as an array of JSON objects, making the scraped information convenient to access and use.
- Updated Aug 19, 2023 - TypeScript
An API wrapper for Scrappey.com written in Node.js (Cloudflare bypass & solver)
- Updated Jan 10, 2024 - JavaScript
AI-based web extractor
- Updated Feb 25, 2023 - Python
Open-source web crawler
- Updated Jul 21, 2018 - Python
The Tableau Web Data Connector for Facebook Insights API
- Updated Jun 26, 2017 - JavaScript
Java framework used by the Web Data Commons project to extract Microdata, Microformats, and RDFa data, Web graphs, and HTML tables from the web crawls provided by the Common Crawl Foundation.
- Updated Dec 13, 2022 - Java
A pipeline that scrapes, extracts, and analyzes book data from web pages to produce insights.
- Updated Jan 1, 2026 - HTML
RealShotPDF is a Chrome extension designed to simplify the process of creating PDF documents from web content. The extension allows users to navigate through selected webpages, parse and display links in a tree view, and generate PDFs for the chosen pages. It operates locally without sending any data to external servers.
- Updated Mar 1, 2024 - TypeScript
Lightfeed SDK to search and filter web data
- Updated Jun 7, 2025 - Python
High-performance HTML to Markdown converter with full GitHub Flavored Markdown support. Written in Rust, available for Node.js and as a native Rust crate.
- Updated Jan 29, 2026 - Rust
This repository contains the code and data download links to reproduce the building process of the 2021 Schema.org Table Corpus.
- Updated May 12, 2021 - Python