web-data-extraction
Here are 40 public repositories matching this topic...
🔥 The Web Data API for AI - Turn entire websites into LLM-ready markdown or structured data
- Updated Dec 18, 2025 - TypeScript
AI-based web wrapper for web content extraction
- Updated Feb 6, 2023 - Python
The this.url class fetches and parses URL data and returns an object with structured information, which can be saved to a database or other storage and used by machine learning algorithms.
- Updated Aug 26, 2025 - JavaScript
Using LLMs and AI browser automation to robustly extract web data
- Updated Sep 30, 2025 - TypeScript
qcrawl - fast async web crawling & scraping framework for Python.
- Updated Dec 7, 2025 - Python
Quick guide, with a code example, on how to use Java for web scraping
- Updated Dec 18, 2024
GNewsScraper is a TypeScript package that scrapes article data from Google News based on a keyword or phrase. It returns the results as an array of JSON objects, making the scraped information convenient to access and use.
- Updated Aug 19, 2023 - TypeScript
AI-based web extractor
- Updated Feb 25, 2023 - Python
An API wrapper for Scrappey.com written in Node.js (Cloudflare bypass & solver)
- Updated Jan 10, 2024 - JavaScript
Open-source web crawler
- Updated Jul 21, 2018 - Python
Java framework used by the Web Data Commons project to extract Microdata, Microformats, and RDFa data, web graphs, and HTML tables from the web crawls provided by the Common Crawl Foundation.
- Updated Dec 13, 2022 - Java
The Tableau Web Data Connector for Facebook Insights API
- Updated Jun 26, 2017 - JavaScript
A pipeline that scrapes, extracts, and analyzes book data from web pages to produce insights.
- Updated Sep 30, 2025 - HTML
RealShotPDF is a Chrome extension designed to simplify the process of creating PDF documents from web content. The extension allows users to navigate through selected webpages, parse and display links in a tree view, and generate PDFs for the chosen pages. It operates locally without sending any data to external servers.
- Updated Mar 1, 2024 - TypeScript
Lightfeed SDK to search and filter web data
- Updated Jun 7, 2025 - Python
This repository contains the code and data download links to reproduce the building process of the 2021 Schema.org Table Corpus.
- Updated May 12, 2021 - Python
Get and process multiple resources from the web, using asyncio (aiohttp) to fetch the data and multiprocessing/multithreading to process it.
- Updated Mar 4, 2021 - Python
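The fetch-then-process split described in the entry above can be illustrated with a minimal sketch. This is not that repository's code: `fetch`, `process`, `fetch_and_process`, and `run` are hypothetical names, `fetch` is a stand-in that fabricates page text instead of calling aiohttp, and a thread pool is assumed for the processing step (a process pool would be the multiprocessing variant).

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fetch(url: str) -> str:
    """Hypothetical stand-in for an aiohttp request; returns fake page text."""
    await asyncio.sleep(0)  # yield to the event loop, as real network I/O would
    return f"<html><title>{url}</title></html>"


def process(body: str) -> int:
    """Placeholder processing step: here it only measures the body length."""
    return len(body)


async def fetch_and_process(urls: list[str], pool: ThreadPoolExecutor) -> list[int]:
    loop = asyncio.get_running_loop()

    async def one(url: str) -> int:
        body = await fetch(url)
        # Hand the processing step to the pool so the event loop stays free.
        return await loop.run_in_executor(pool, process, body)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(one(u) for u in urls))


def run(urls: list[str]) -> list[int]:
    """Synchronous entry point: owns the pool and the event loop."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return asyncio.run(fetch_and_process(urls, pool))
```

Calling `run(["https://example.com/a", "https://example.com/b"])` would return one processed value per URL. In a real setup the body of `fetch` would presumably be replaced by an aiohttp session request, and the pool type chosen according to whether processing is CPU-bound.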
A web data extraction library written in golang.
- Updated Nov 20, 2025 - Go