web-data-extraction

neurons-me/this.url

The this.url class fetches and parses URL data, returning an object with structured information that can be kept in a database or other storage and then used by machine learning algorithms.

web-scraping url-parsing metadata-extraction web-data-extraction neurons-me-ecosystem structured-url-data machine-learning-urls data-driven-web-analysis intelligent-link-processing ai-ready-url-processing

Updated Aug 26, 2025
JavaScript

luminati-io/java-web-scraping

Star 27

Quick guide, with a code example, on how to use Java for web scraping

java maven scraping-websites web-data-extraction

Updated Dec 18, 2024

dstark5/gnews-scraper

GNewsScraper is a TypeScript package that scrapes article data from Google News based on a keyword or phrase. It returns the results as an array of JSON objects, making the scraped information convenient to access and use.

typescript web-scraping json-parsing web-crawling google-news data-scraping google-news-scraper web-data-extraction web-automation keyword-search gnews news-scraping gnews-api article-extraction gnews-scraper

Updated Aug 19, 2023
TypeScript

DemonMartin/scrappey-wrapper

Star 12

An API wrapper for Scrappey.com written in Node.js (cloudflare bypass & solver)

web-scraping data-extraction web-data-extraction scraping-framework scraping-tool cloudflare-bypass web-scraping-solution cloudflare-solver api-scraping scraping-solution website-data-extraction scraping-library cloudflare-anti-bot scraping-service data-scraping-tool website-scraping-tool turnstile-solver

Updated Jan 10, 2024
JavaScript

jjonescz/awe

Star 12

AI-based web extractor

deep-learning information-extraction web-scraping web-data-extraction structured-web-data

Updated Feb 25, 2023
Python

Boomslet/Web_Crawler

Star 9

Open-source web crawler

python url html open-source website opensource links web-crawler urls free data-extraction webcrawler web-crawling web-data-extraction urllib web-crawler-python

Updated Jul 21, 2018
Python

kaizenplatform/FacebookInsightsConnector

Star 8

The Tableau Web Data Connector for Facebook Insights API

facebook tableau facebook-insights web-data-extraction

Updated Jun 26, 2017
JavaScript

wbsg-uni-mannheim/WDCFramework

Star 8

Java framework used by the Web Data Commons project to extract Microdata, Microformats, and RDFa data, Web graphs, and HTML tables from the web crawls provided by the Common Crawl Foundation.

schema-org json-ld microdata web-data-extraction

Updated Dec 13, 2022
Java

SaurabhSSB/BookMiner

Star 8

A pipeline that scrapes, extracts, and analyzes book data from web pages to produce insights.

python books jupyter-notebook eda data-visualization web-scraping data-analysis html-parsing beautifulsoup csv-export data-pipeline web-data-extraction data-science-project project-portfolio book-dataset

Updated Jan 1, 2026
HTML

lekhmanrus/real-shot-pdf

Star 7

RealShotPDF is a Chrome extension designed to simplify the process of creating PDF documents from web content. The extension allows users to navigate through selected webpages, parse and display links in a tree view, and generate PDFs for the chosen pages. It operates locally without sending any data to external servers.

chrome-extension pdf angular web-scraping knowledgebase browser-extension knowledge-base pdf-generator pdf-generation web-crawling hacktoberfest pdf-merger web-data-extraction pdf-downloader data-preservation webpage-to-pdf link-parsing gpt-integration web-content-capture local-data-processing

Updated Mar 1, 2024
TypeScript

lightfeed/sdk

Star 5

Lightfeed SDK to search and filter web data

crawler etl extract data-engineering business-intelligence data-extraction data-integration structured-data knowledge-base webscraping data-pipeline web-data-extraction ai-agents rag vector-database web-data-management llm embedding-search llm-scraper llm-extraction

Updated Jun 7, 2025
Python

oxpath/oxpath

Star 5

OXPath from Oxford

scraper web ajax web-data-extraction

Updated May 20, 2022
Java

vakra-dev/supermarkdown

Star 4

High-performance HTML to Markdown converter with full GitHub Flavored Markdown support. Written in Rust, available for Node.js and as a native Rust crate.

html markdown rust converter github-flavored-markdown ai html-to-markdown wasm commonmark web-scraping html-parser data-extraction text-processing content-extraction web-data-extraction ai-agents llm

Updated Jan 29, 2026
Rust

wbsg-uni-mannheim/schemaorg-tables

Star 3

This repository contains the code and the data download links needed to reproduce the construction of the 2021 Schema.org Table Corpus.

schema-org web-data-extraction web-tables

Updated May 12, 2021
Python

Here are 47 public repositories matching this topic...

firecrawl/firecrawl

ScrapeGraphAI/Scrapegraph-ai

vakra-dev/reader

MohamedHmini/iww

crawlcore/qcrawl

lightfeed/extractor

neurons-me/this.url

luminati-io/java-web-scraping

dstark5/gnews-scraper

DemonMartin/scrappey-wrapper

jjonescz/awe

Boomslet/Web_Crawler

kaizenplatform/FacebookInsightsConnector

wbsg-uni-mannheim/WDCFramework

SaurabhSSB/BookMiner

lekhmanrus/real-shot-pdf

lightfeed/sdk

oxpath/oxpath

vakra-dev/supermarkdown

wbsg-uni-mannheim/schemaorg-tables
