web-crawler

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

nodejs javascript npm crawler scraper automation typescript web-crawler headless scraping crawling web-scraping web-crawling headless-chrome apify puppeteer playwright

Updated Dec 17, 2025
TypeScript

crawlab-team/crawlab

Star 12.1k

Distributed web crawler admin platform for managing spiders, regardless of language or framework.

go docker platform crawler spider web-crawler scrapy webcrawler scrapyd-ui webspider crawling-tasks crawlab spiders-management

Updated Dec 5, 2025
Go

ssssssss-team/spider-flow

Star 11.1k

A next-generation crawler platform that defines crawl flows graphically, so crawlers can be built without writing code.

crawler spider web-crawler jsoup xpath webcrawler webspider web-spider spider-flow

Updated Jun 14, 2023
Java

apify/crawlee-python

Star 7.3k

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

python crawler scraper automation web-crawler headless scraping crawling pip web-scraping beautifulsoup web-crawling hacktoberfest headless-chrome apify playwright

Updated Dec 17, 2025
Python

BruceDone/awesome-crawler

Star 7k

A collection of awesome web crawlers and spiders in different languages

crawler scraper awesome spider web-crawler web-scraper node-crawler

Updated Jun 16, 2024

adithya-s-k/omniparse

Star 6.8k

Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks

ocr parser-library web-crawler parse-server whisper-api ingestion-api vision-transformer omniparser

Updated Dec 12, 2025
Python

firecrawl/firecrawl-mcp-server

Star 5.1k

🔥 Official Firecrawl MCP Server - Adds powerful web scraping and search to Cursor, Claude and any other LLM clients.

mcp web-crawler web-scraping data-collection batch-processing content-extraction search-api claude llm-tools firecrawl model-context-protocol mcp-server firecrawl-ai javascript-rendering

Updated Nov 22, 2025
JavaScript

apache/nutch

Star 3.1k

Apache Nutch is an extensible and scalable web crawler

java hadoop web-crawler nutch crawling apache

Updated Dec 11, 2025
Java

jasonxtn/Argus

Star 2.5k

The Ultimate Information Gathering Toolkit

osint web-crawler whois-lookup virustotal information-gathering server-info dns-lookup reconnaissance cms-detection recon-tools email-harvester ssl-analitcs directory-finder txt-records pastebin-monitoring

Updated Dec 10, 2025
Python

oxylabs/ai-crawler-py

Star 2.3k

Crawl a website starting from a URL, find relevant pages, and extract data – all guided by your natural language prompt.

ai web-crawler ai-agents ai-crawler ai-studio ai-scraping ai-web-crawler crawl-agent

Updated Oct 13, 2025

sjdirect/abot

Star 2.3k

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.

c-sharp unit-testing crawler spider csharp parsing cross-platform web-crawler netcore pluggable spiders csharp-library abot netcore2 netstandard20 netcore3 javascript-renderer netstandard21 abot-nuget netsta

Updated Sep 9, 2024
C#

xianhu/PSpider

Star 1.8k

A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

python crawler multi-threading spider multiprocessing web-crawler proxies python-spider web-spider

Updated Jun 10, 2022
Python

MarginaliaSearch/MarginaliaSearch

Star 1.6k

Internet search engine for text-oriented websites. Indexing the small, old and weird web.

java search-engine web-crawler indexer language-processing web-scale internet-search no-cloud self-hostable small-web alt-search

Updated Dec 10, 2025
HTML

JustinBeckwith/linkinator

Star 1.1k

Broken link checker that crawls websites and validates links. Find broken links, dead links, and invalid URLs in websites, documentation, and local files. Perfect for SEO audits and CI/CD.

nodejs testing html link-checker typescript seo web-crawler ci-cd 404 broken-links dead-links website-crawler seo-tools link-validator broken-link-checker url-validator