urllib.robotparser — Parser for robots.txt

Source code: Lib/urllib/robotparser.py
This module provides a single class, RobotFileParser, which answers questions about whether or not a particular user agent can fetch a URL on the web site that published the robots.txt file. For more details on the structure of robots.txt files, see http://www.robotstxt.org/orig.html.
- class urllib.robotparser.RobotFileParser(url='')
  This class provides methods to read, parse and answer questions about the robots.txt file at url.
- set_url(url)
  Sets the URL referring to a robots.txt file.
- read()
  Reads the robots.txt URL and feeds it to the parser.
- parse(lines)
  Parses the lines argument.
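Because parse() accepts the rules as a sequence of lines, it can be fed from any source rather than a fetched URL. A minimal sketch, using an invented robots.txt body and example.com URLs purely for illustration:

```python
import urllib.robotparser

# A hypothetical robots.txt body supplied directly as lines,
# bypassing read() and any network access.
lines = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(lines)

print(rp.can_fetch("*", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("*", "http://example.com/index.html"))         # True
```

This is convenient for testing crawler logic offline or when the robots.txt content arrives through a custom HTTP client.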
- can_fetch(useragent, url)
  Returns True if the useragent is allowed to fetch the url according to the rules contained in the parsed robots.txt file.
- mtime()
  Returns the time the robots.txt file was last fetched. This is useful for long-running web spiders that need to check for new robots.txt files periodically.
- modified()
  Sets the time the robots.txt file was last fetched to the current time.
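Together, mtime() and modified() support the periodic-refresh pattern described above. A minimal sketch; the URL and the one-hour refresh interval are illustrative assumptions, not part of the module:

```python
import time
import urllib.robotparser

MAX_AGE = 3600  # assumed policy: re-read robots.txt once an hour

rp = urllib.robotparser.RobotFileParser("http://example.com/robots.txt")

def refresh_if_stale(parser):
    # mtime() is 0 until the file has been fetched or parsed, so the
    # first call always triggers a fetch; a successful read() refreshes
    # the timestamp via the parser.
    if time.time() - parser.mtime() > MAX_AGE:
        parser.read()
```

A long-running spider would call refresh_if_stale(rp) before each batch of can_fetch() checks.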
- crawl_delay(useragent)
  Returns the value of the Crawl-delay parameter from robots.txt for the useragent in question. If there is no such parameter or it doesn't apply to the useragent specified or the robots.txt entry for this parameter has invalid syntax, return None.
  Added in version 3.6.
- request_rate(useragent)
  Returns the contents of the Request-rate parameter from robots.txt as a named tuple RequestRate(requests, seconds). If there is no such parameter or it doesn't apply to the useragent specified or the robots.txt entry for this parameter has invalid syntax, return None.
  Added in version 3.6.
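Both crawl_delay() and request_rate() can be exercised without network access by combining them with parse(); the rule values below are invented for illustration:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 10",
    "Request-rate: 1/5",  # 1 request per 5 seconds
])

print(rp.crawl_delay("*"))           # 10
rate = rp.request_rate("*")
print(rate.requests, rate.seconds)   # 1 5
```

If either directive were missing or malformed, the corresponding method would return None instead.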
The following example demonstrates basic use of the RobotFileParser class:
```python
>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rrate = rp.request_rate("*")
>>> rrate.requests
3
>>> rrate.seconds
20
>>> rp.crawl_delay("*")
6
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True
```