Protego is a pure-Python robots.txt parser with support for modern conventions.
To install Protego, simply use pip:

```
pip install protego
```
```python
>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m  # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'
```
Using Protego with Requests:

```python
>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']
```
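In a real crawler, the answer from `can_fetch` is typically combined with the crawl delay before each request. The sketch below shows one way to wire the two together with Requests; the helper name `polite_get` and the one-second fallback delay are illustrative assumptions, not part of Protego's API:

```python
import time
from urllib.parse import urljoin

import requests
from protego import Protego


def polite_get(url, user_agent="mybot", fallback_delay=1.0):
    """Fetch url only if robots.txt allows it, honoring any Crawl-delay."""
    # robots.txt lives at the root of the target host.
    robots_url = urljoin(url, "/robots.txt")
    rp = Protego.parse(requests.get(robots_url).text)
    if not rp.can_fetch(url, user_agent):
        return None  # the URL is disallowed for this user agent
    delay = rp.crawl_delay(user_agent)
    time.sleep(delay if delay is not None else fallback_delay)
    return requests.get(url, headers={"User-Agent": user_agent})
```

A production crawler would also cache the parsed robots.txt per host rather than re-fetching it on every call.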
The following table compares Protego to the most popular robots.txt parsers implemented in Python or featuring Python bindings:
| | Protego | RobotFileParser | Reppy | Robotexclusionrulesparser |
|---|---|---|---|---|
| Implementation language | Python | Python | C++ | Python |
| Reference specification | Google's | Martijn Koster's 1996 draft | Martijn Koster's 1996 draft | Martijn Koster's 1996 draft |
| Wildcard support | ✓ | | ✓ | ✓ |
| Length-based precedence | ✓ | | ✓ | |
| Performance (relative to Protego) | | +40% | +1300% | -25% |
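The percentages above come from the project's own comparison. As a rough, hypothetical way to run such a comparison yourself, you can time Protego against the standard library's `RobotFileParser` with `timeit`; note that the two libraries take the `can_fetch` arguments in opposite order:

```python
import timeit

# Shared setup: both parsers see the same robots.txt body.
setup = """
from urllib.robotparser import RobotFileParser
from protego import Protego

body = "User-agent: *\\nDisallow: /search\\nAllow: /search/about\\n"
"""

protego_stmt = """
rp = Protego.parse(body)
rp.can_fetch("https://example.com/search", "mybot")  # (url, user_agent)
"""

stdlib_stmt = """
rp = RobotFileParser()
rp.parse(body.splitlines())
rp.can_fetch("mybot", "https://example.com/search")  # (user_agent, url)
"""

print("Protego:        ", timeit.timeit(protego_stmt, setup=setup, number=10_000))
print("RobotFileParser:", timeit.timeit(stdlib_stmt, setup=setup, number=10_000))
```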
Class `protego.Protego`:

Properties:

* `sitemaps` {`list_iterator`} A list of sitemaps specified in robots.txt.
* `preferred_host` {string} Preferred host specified in robots.txt.

Methods:

* `parse(robotstxt_body)` Parse robots.txt and return a new instance of `protego.Protego`.
* `can_fetch(url, user_agent)` Return `True` if the user agent can fetch the URL, otherwise return `False`.
* `crawl_delay(user_agent)` Return the crawl delay specified for the user agent as a float. If nothing is specified, return `None`.
* `request_rate(user_agent)` Return the request rate specified for the user agent as a named tuple `RequestRate(requests, seconds, start_time, end_time)`. If nothing is specified, return `None`.
* `visit_time(user_agent)` Return the visit time specified for the user agent as a named tuple `VisitTime(start_time, end_time)`. If nothing is specified, return `None`.
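As a closing illustration (this helper is not part of Protego's API), the values returned by `crawl_delay` and `request_rate` can be folded into a single per-request delay by taking the more conservative of the two:

```python
from protego import Protego

robotstxt = """
User-agent: *
Crawl-delay: 4
Request-rate: 10/1m
"""
rp = Protego.parse(robotstxt)


def delay_for(user_agent):
    """Return the stricter of Crawl-delay and the Request-rate interval."""
    candidates = []
    crawl_delay = rp.crawl_delay(user_agent)
    if crawl_delay is not None:
        candidates.append(crawl_delay)
    rate = rp.request_rate(user_agent)
    if rate is not None:
        # 10 requests per 60 seconds -> at most one request every 6 seconds.
        candidates.append(rate.seconds / rate.requests)
    return max(candidates, default=None)


print(delay_for("mybot"))  # 6.0: the request rate is stricter than the crawl delay
```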