🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!
Easy, effortless Web Scraping as it should be!
Selection methods · Choosing a fetcher · CLI · MCP mode · Migrating from BeautifulSoup
Stop fighting anti-bot systems. Stop rewriting selectors after every website update.
Scrapling isn't just another Web Scraping library. It's the first adaptive scraping library that learns from website changes and evolves with them. While other libraries break when websites update their structure, Scrapling automatically relocates your elements and keeps your scrapers running.
Built for the modern Web, Scrapling features its own rapid parsing engine and fetchers to handle all Web Scraping challenges you face or will face. Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone.
```python
>> from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
>> StealthyFetcher.adaptive = True
>> # Fetch websites' source under the radar!
>> page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
>> print(page.status)
200
>> products = page.css('.product', auto_save=True)  # Scrape data that survives website design changes!
>> # Later, if the website structure changes, pass `adaptive=True`
>> products = page.css('.product', adaptive=True)  # and Scrapling still finds them!
```
Do you want to show your ad here? Click here and choose the tier that suits you!
- HTTP Requests: Fast and stealthy HTTP requests with the `Fetcher` class. Can impersonate browsers' TLS fingerprints and headers, and use HTTP/3.
- Dynamic Loading: Fetch dynamic websites with full browser automation through the `DynamicFetcher` class, supporting Playwright's Chromium, real Chrome, and a custom stealth mode.
- Anti-bot Bypass: Advanced stealth capabilities with `StealthyFetcher`, using a modified version of Firefox and fingerprint spoofing. Can easily bypass all types of Cloudflare's Turnstile and Interstitial challenges with automation.
- Session Management: Persistent session support with the `FetcherSession`, `StealthySession`, and `DynamicSession` classes for cookie and state management across requests.
- Async Support: Complete async support across all fetchers, plus dedicated async session classes.
- 🔄 Smart Element Tracking: Relocate elements after website changes using intelligent similarity algorithms.
- 🎯 Smart Flexible Selection: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
- 🔍 Find Similar Elements: Automatically locate elements similar to the ones you found.
- 🤖 MCP Server to be used with AI: Built-in MCP server for AI-assisted Web Scraping and data extraction. The MCP server features custom, powerful capabilities that utilize Scrapling to extract targeted content before passing it to the AI (Claude/Cursor/etc.), thereby speeding up operations and reducing costs by minimizing token usage. (demo video)
- 🚀 Lightning Fast: Optimized performance outperforming most Python scraping libraries.
- 🔋 Memory Efficient: Optimized data structures and lazy loading for a minimal memory footprint.
- ⚡ Fast JSON Serialization: 10x faster than the standard library.
- 🏗️ Battle Tested: Not only does Scrapling have 92% test coverage and full type-hint coverage, but it has also been used daily by hundreds of Web Scrapers over the past year.
- 🎯 Interactive Web Scraping Shell: Optional built-in IPython shell with Scrapling integration, shortcuts, and new tools that speed up the development of Web Scraping scripts, like converting curl requests to Scrapling requests and viewing request results in your browser.
- 🚀 Use It Directly from the Terminal: Optionally, you can use Scrapling to scrape a URL without writing a single line of code!
- 🛠️ Rich Navigation API: Advanced DOM traversal with parent, sibling, and child navigation methods.
- 🧬 Enhanced Text Processing: Built-in regex, cleaning methods, and optimized string operations.
- 📝 Auto Selector Generation: Generate robust CSS/XPath selectors for any element (see the sketch after this list).
- 🔌 Familiar API: Similar to Scrapy/BeautifulSoup, with the same pseudo-elements used in Scrapy/Parsel.
- 📘 Complete Type Coverage: Full type hints for excellent IDE support and code completion.
- 🔋 Ready Docker Image: With each release, a Docker image containing all browsers is automatically built and pushed.
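As an example of the auto selector generation feature, here is a minimal sketch. The `generate_css_selector` and `generate_xpath_selector` element properties are taken from the documentation; treat the exact property names as an assumption:

```python
from scrapling.parser import Selector

page = Selector('<div class="quote"><span class="text">Hi</span></div>')
element = page.css_first('.text')

# Assumption: `generate_css_selector`/`generate_xpath_selector` are element
# properties that return a robust selector string pointing back to this element.
print(element.generate_css_selector)
print(element.generate_xpath_selector)
```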
```python
from scrapling.fetchers import Fetcher, StealthyFetcher, DynamicFetcher
from scrapling.fetchers import FetcherSession, StealthySession, DynamicSession

# HTTP requests with session support
with FetcherSession(impersonate='chrome') as session:  # Use the latest version of Chrome's TLS fingerprint
    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
    quotes = page.css('.quote .text::text')

# Or use one-off requests
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text')

# Advanced stealth mode (keep the browser open until you finish)
with StealthySession(headless=True, solve_cloudflare=True) as session:
    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
    data = page.css('#padded_content a')

# Or use the one-off request style; it opens the browser for this request, then closes it after finishing
page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
data = page.css('#padded_content a')

# Full browser automation (keep the browser open until you finish)
with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:
    page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
    data = page.xpath('//span[@class="text"]/text()')  # XPath selector if you prefer it

# Or use the one-off request style; it opens the browser for this request, then closes it after finishing
page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
data = page.css('.quote .text::text')
```
```python
from scrapling.fetchers import Fetcher

# Rich element selection and navigation
page = Fetcher.get('https://quotes.toscrape.com/')

# Get quotes with multiple selection methods
quotes = page.css('.quote')                        # CSS selector
quotes = page.xpath('//div[@class="quote"]')       # XPath
quotes = page.find_all('div', {'class': 'quote'})  # BeautifulSoup-style
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote')  # and so on...

# Find elements by text content
quotes = page.find_by_text('quote', tag='div')

# Advanced navigation
first_quote = page.css_first('.quote')
quote_text = first_quote.css('.text::text')
quote_text = page.css('.quote').css_first('.text::text')  # Chained selectors
quote_text = page.css_first('.quote .text').text  # Using `css_first` is faster than `css` if you want the first element
author = first_quote.next_sibling.css('.author::text')
parent_container = first_quote.parent

# Element relationships and similarity
similar_elements = first_quote.find_similar()
below_elements = first_quote.below_elements()
```
If you don't want to fetch websites, you can use the parser directly on HTML you already have, like below:
```python
from scrapling.parser import Selector

page = Selector("<html>...</html>")
```
And it works precisely the same way!
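For instance, you can feed it any HTML string (the snippet below is illustrative) and use the same selection API shown earlier:

```python
from scrapling.parser import Selector

html = '''
<div class="quote">
    <span class="text">"Be yourself; everyone else is already taken."</span>
    <small class="author">Oscar Wilde</small>
</div>
'''

page = Selector(html)
print(page.css_first('.text::text'))    # The quote's text
print(page.css_first('.author::text'))  # Oscar Wilde
```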
```python
import asyncio

from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession

async with FetcherSession(http3=True) as session:  # `FetcherSession` is context-aware and can work in both sync/async patterns
    page1 = session.get('https://quotes.toscrape.com/')
    page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')

# Async session usage
async with AsyncStealthySession(max_pages=2) as session:
    tasks = []
    urls = ['https://example.com/page1', 'https://example.com/page2']

    for url in urls:
        task = session.fetch(url)
        tasks.append(task)

    print(session.get_pool_stats())  # Optional - the status of the browser tabs pool (busy/free/error)
    results = await asyncio.gather(*tasks)
    print(session.get_pool_stats())
```
Scrapling v0.3 includes a powerful command-line interface:
```bash
# Launch the interactive Web Scraping shell
scrapling shell

# Extract pages to a file directly without programming (extracts the content inside the `body` tag by default)
# If the output file ends with `.txt`, the text content of the target will be extracted.
# If it ends with `.md`, it will be a Markdown representation of the HTML content, and `.html` will be the HTML content as-is.
scrapling extract get 'https://example.com' content.md
scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome'  # All elements matching the CSS selector '#fromSkipToProducts'
scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
```
Note
There are many additional features, such as the MCP server and the interactive Web Scraping shell, but we want to keep this page concise. Check out the full documentation here.
Scrapling isn't just powerful—it's also blazing fast, and the updates since version 0.3 have delivered exceptional performance improvements across all operations.
| # | Library | Time (ms) | vs Scrapling |
|---|---|---|---|
| 1 | Scrapling | 1.92 | 1.0x |
| 2 | Parsel/Scrapy | 1.99 | 1.036x |
| 3 | Raw Lxml | 2.33 | 1.214x |
| 4 | PyQuery | 20.61 | ~11x |
| 5 | Selectolax | 80.65 | ~42x |
| 6 | BS4 with Lxml | 1283.21 | ~668x |
| 7 | MechanicalSoup | 1304.57 | ~679x |
| 8 | BS4 with html5lib | 3331.96 | ~1735x |
Scrapling's adaptive element finding capabilities significantly outperform alternatives:
| Library | Time (ms) | vs Scrapling |
|---|---|---|
| Scrapling | 1.87 | 1.0x |
| AutoScraper | 10.24 | 5.476x |
All benchmarks represent averages of 100+ runs. See benchmarks.py for methodology.
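If you want to reproduce a comparable measurement yourself, here is a minimal sketch of such a micro-benchmark. The HTML fixture and run count are illustrative, not the repository's actual setup from benchmarks.py:

```python
import timeit

from scrapling.parser import Selector

# Illustrative fixture: 500 repeated quote blocks
html = '<html><body>' + '<div class="quote"><span class="text">Hello</span></div>' * 500 + '</body></html>'

def extract_with_scrapling():
    # Parse the document and extract the text of every matching node
    return Selector(html).css('.quote .text::text')

runs = 100  # Mirror the "averages of 100+ runs" methodology
total = timeit.timeit(extract_with_scrapling, number=runs)
print(f'Average: {total / runs * 1000:.2f} ms per run')
```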
Scrapling requires Python 3.10 or higher:
```bash
pip install scrapling
```
Starting with v0.3.2, this installation includes only the parser engine and its dependencies, without any fetchers or command-line dependencies.
If you are going to use any of the extra features below, the fetchers, or their classes, then you need to install the fetchers' dependencies, and then their browser dependencies, with

```bash
pip install "scrapling[fetchers]"
scrapling install
```

This downloads all browsers with their system dependencies and fingerprint manipulation dependencies.

Extra features:

- Install the MCP server feature: `pip install "scrapling[ai]"`
- Install shell features (the Web Scraping shell and the `extract` command): `pip install "scrapling[shell]"`
- Install everything: `pip install "scrapling[all]"`

Don't forget that you need to install the browser dependencies with `scrapling install` after any of these extras (if you didn't already).
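As a quick sanity check that the core package and the fetchers extra installed correctly, you can confirm that the imports used throughout this page resolve (a hypothetical one-liner, not an official command):

```bash
python -c "from scrapling.parser import Selector; from scrapling.fetchers import Fetcher; print('Scrapling is ready')"
```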
You can also pull a Docker image with all extras and browsers from DockerHub with the following command:
```bash
docker pull pyd4vinci/scrapling
```
Or download it from the GitHub registry:
```bash
docker pull ghcr.io/d4vinci/scrapling:latest
```
This image is automatically built and pushed through GitHub Actions on the repository's main branch.
We welcome contributions! Please read our contributing guidelines before getting started.
Caution
This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect the terms of service of websites and robots.txt files.
This work is licensed under the BSD-3-Clause License.
This project includes code adapted from:
- Parsel (BSD License), used for the translator submodule
- Daijro's brilliant work on BrowserForge and Camoufox
- Vinyzu's brilliant work on Botright and PatchRight
- brotector for browser detection bypass techniques
- fakebrowser and BotBrowser for fingerprinting research