web-crawling

Here are 319 public repositories matching this topic...

Repositories by language: Python (157), Jupyter Notebook (43), JavaScript (22), HTML (16), Java (13), TypeScript (11), C# (8), Go (8), PHP (5), C++ (4)

apify/crawlee

Stars: 18.5k

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

nodejs javascript npm crawler scraper automation typescript web-crawler headless scraping crawling web-scraping web-crawling headless-chrome apify puppeteer playwright

Updated Jul 18, 2025
TypeScript

apify/crawlee-python

Stars: 5.8k

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

python crawler scraper automation web-crawler headless scraping crawling pip web-scraping beautifulsoup web-crawling hacktoberfest headless-chrome apify playwright

Updated Jul 18, 2025
Python

omkarcloud/botasaurus

Stars: 2.1k

The All in One Framework to Build Undefeatable Scrapers

anti-bot web-crawling bot-detection python-scraper anti-detect undetected scraping-framework undetectable python-web-scraper scraping-tool cloudflare-bypass scraping-python python-web-scraping anti-detection cloudflare-scrape bypass-cloudflare web-scraping-python undetected-chromedriver antidetect-browser anti-detect-browser

Updated Jun 11, 2025
Python

brightdata/brightdata-mcp

Stars: 953

A powerful Model Context Protocol (MCP) server that provides an all-in-one solution for public web access.

mcp scraping web-scraping data-extraction data-collection structured-data web-crawling browser-automation ai-agents web-data scraping-tools anti-bot-detection llm ai-integrations mcp-server modelcontextprotocol

Updated Jul 10, 2025
JavaScript

platonai/PulsarRPA

Stars: 895

PulsarRPA: An AI-Enabled, Super-Fast, Thread-Safe Browser Automation Solution! 💖

web-crawler web-scraper web-scraping dom-manipulation web-crawling browser-automation dom-api ai-agents web-extraction rpa web-extractor llm ai-crawler text-to-action browser-use ai-rpa ai-browser-control

Updated Jul 13, 2025
Kotlin

cxcscmu/Craw4LLM

Stars: 633

Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"

crawler web-crawler crawling web-crawling pre-training pretraining large-language-models llm

Updated Feb 24, 2025
Python

scrapehero-code/amazon-scraper

Stars: 394

A simple web scraper to extract Product Data and Pricing from Amazon

web-scraping web-crawling page-scraper web-scraping-tutorials amazon-scraper scrape-products

Updated Jun 13, 2023
Python

crwlrsoft/crawler

Stars: 366

Library for Rapid (Web) Crawler and Scraper Development

php crawler scraper web-crawler scraping crawling web-scraper web-scraping scraping-websites web-crawling hacktoberfest

Updated Jun 10, 2025
PHP

godkingjay/selenium-twitter-scraper

Stars: 277

A Twitter scraper that uses Selenium to collect tweets. It can scrape tweets from the home timeline, user profiles, hashtags, search queries, and advanced searches.

scraper twitter selenium collaborate web-crawling hacktoberfest twitter-scraper selenium-scraper hacktoberfest-accepted

Updated Apr 12, 2025
Jupyter Notebook

spyboy-productions/omnisci3nt

Stars: 270

Omnisci3nt: See What They've Tried to Hide. Extract deep intelligence from any domain. From subdomains to SSL certs, archived secrets to exposed ports, Omnisci3nt gives you the full picture in seconds.

osint whois ssl-certificate ip-lookup web-crawling directory-enumeration port-scanning admin-panel-finder admin-login-finder website-hacking admin-panel-finder-of-any-website subdomain-enumeration pentesting-tools technology-analysis web-reconnaissance dns-enumeration reconnaissance-tool wayback-machine-access dmarc-record-examination social-media-and-email-discovery

Updated Apr 15, 2025
Python

jrbadiabo/Bet-on-Sibyl

Stars: 264

Machine Learning Model for Sport Predictions (Football, Basketball, Baseball, Hockey, Soccer & Tennis)

python machine-learning algorithms scikit-learn machine-learning-algorithms selenium web-scraping beautifulsoup machinelearning predictive-analysis python-2 web-crawling sports-stats sportsanalytics

Updated Feb 12, 2017
Jupyter Notebook

TurnerSoftware/InfinityCrawler

Stars: 253

A simple but powerful web crawler library for .NET

crawler spider web-crawler robots-txt web-crawling

Updated Dec 15, 2023
C#

ayakashi-io/ayakashi

Stars: 214

⚡ Ayakashi.io - The next generation web scraping framework

data-mining automation web-scraping web-crawling headless-chrome

Updated Jun 29, 2023
TypeScript

serpapi/clauneck

Stars: 182

A tool for scraping emails, social media accounts, and much more information from websites using Google Search Results.

ruby open-source rubygem automation command-line email email-marketing data-extraction serp command-line-tool webscraping web-crawling data-extractor email-extractor email-scraper social-media-scraper email-extraction email-extract-with-proxy

Updated Mar 19, 2024
Ruby

scrapinghub/scrapy-training

Stars: 174

Scrapy Training companion code

python training web-scraping scrapy web-crawling

Updated Jan 30, 2019
Python

brianmadden/krawler

Stars: 131

A web crawling framework written in Kotlin

kotlin link-checker framework web-crawler webcrawler web-crawling crawler4j

Updated Jun 29, 2021
Kotlin

leogregianin/bancocentralbrasil

Stars: 125

💵 💰 🇧🇷 Official daily rates for inflation, Selic, savings (Poupança), the dollar, the PTAX dollar, the euro, and the PTAX euro, taken from the Banco Central do Brasil website

money brazil web-scraping brasil web-crawling banco-central