Protego is a pure-Python `robots.txt` parser with support for modern conventions.
To install Protego, simply use pip:

```
pip install protego
```
```python
>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m         # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'
```
Using Protego with Requests:

```python
>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']
```
The following table compares Protego to the most popular `robots.txt` parsers implemented in Python or featuring Python bindings:
|  | Protego | RobotFileParser | Reppy | Robotexclusionrulesparser |
|---|---|---|---|---|
| Implementation language | Python | Python | C++ | Python |
| Reference specification | Google | Martijn Koster’s 1996 draft | Martijn Koster’s 1996 draft | Martijn Koster’s 1996 draft |
| Wildcard support | ✓ |  | ✓ | ✓ |
| Length-based precedence | ✓ |  | ✓ |  |
| Performance |  | +40% | +1300% | -25% |
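To make the wildcard and length-based precedence rows concrete, here is a small sketch. The rules and URLs are made up for illustration (they are not from Protego's documentation); it shows a `*` wildcard pattern matching, and a longer `Allow` rule taking precedence over a shorter `Disallow` rule:

```python
>>> from protego import Protego
>>> rp = Protego.parse(
...     "User-agent: *\n"
...     "Disallow: /*.pdf$\n"     # wildcard: blocks any path ending in .pdf
...     "Disallow: /folder\n"
...     "Allow: /folder/page\n"   # longer rule, so it wins over "Disallow: /folder"
... )
>>> rp.can_fetch("https://example.com/docs/report.pdf", "mybot")
False
>>> rp.can_fetch("https://example.com/folder/page", "mybot")
True
```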
Class `protego.Protego`:

* `sitemaps` {`list_iterator`} A list of sitemaps specified in `robots.txt`.
* `preferred_host` {string} Preferred host specified in `robots.txt`.
* `parse(robotstxt_body)` Parse `robots.txt` and return a new instance of `protego.Protego`.
* `can_fetch(url, user_agent)` Return `True` if the user agent can fetch the URL, otherwise return `False`.
* `crawl_delay(user_agent)` Return the crawl delay specified for the user agent as a float. If nothing is specified, return `None`.
* `request_rate(user_agent)` Return the request rate specified for the user agent as a named tuple `RequestRate(requests, seconds, start_time, end_time)`. If nothing is specified, return `None`.
* `visit_time(user_agent)` Return the visit time specified for the user agent as a named tuple `VisitTime(start_time, end_time)`. If nothing is specified, return `None`.
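Putting the API together, a minimal polite-crawler sketch might look like the following. The target site, the URL list, the user agent string `mybot`, and the urllib-based fetching are illustrative assumptions; only the `Protego` calls come from the API above:

```python
import time
from urllib.request import urlopen

from protego import Protego

ROBOTS_URL = "https://example.com/robots.txt"  # illustrative target site

# Fetch and parse the site's robots.txt once, up front.
with urlopen(ROBOTS_URL) as response:
    rp = Protego.parse(response.read().decode("utf-8"))

urls = ["https://example.com/", "https://example.com/private/page"]
delay = rp.crawl_delay("mybot") or 1.0  # fall back to 1s if no Crawl-delay

for url in urls:
    if rp.can_fetch(url, "mybot"):  # honour Allow/Disallow rules
        print("fetching", url)
        time.sleep(delay)           # honour the advertised crawl delay
    else:
        print("skipping", url)
```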