urllib.robotparser — Parser for robots.txt

Source code: Lib/urllib/robotparser.py


This module provides a single class, RobotFileParser, which answers questions about whether or not a particular user agent can fetch a URL on the web site that published the robots.txt file. For more details on the structure of robots.txt files, see http://www.robotstxt.org/orig.html.

class urllib.robotparser.RobotFileParser(url='')

This class provides methods to read, parse and answer questions about the robots.txt file at url.

set_url(url)

Sets the URL referring to a robots.txt file.
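
The URL can also be supplied when the parser is constructed, as the class signature above shows; the host www.example.com below is only a placeholder:

>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.example.com/robots.txt")
>>> # equivalent: pass the URL directly to the constructor
>>> rp = urllib.robotparser.RobotFileParser("http://www.example.com/robots.txt")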

read()

Reads the robots.txt URL and feeds it to the parser.

parse(lines)

Parses the lines argument.
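
parse() accepts the lines of a robots.txt file obtained by any means, which makes it easy to experiment without a network round trip. The rules below are a hypothetical file used only for illustration:

>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.parse([
...     "User-agent: *",
...     "Disallow: /private/",
... ])
>>> rp.can_fetch("*", "https://example.com/private/page.html")
False
>>> rp.can_fetch("*", "https://example.com/index.html")
True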

can_fetch(useragent,url)

Returns True if the useragent is allowed to fetch the url according to the rules contained in the parsed robots.txt file.

mtime()

Returns the time the robots.txt file was last fetched. This is useful for long-running web spiders that need to check for new robots.txt files periodically.

modified()

Sets the time the robots.txt file was last fetched to the current time.
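
A minimal sketch of a periodic re-check in a long-running crawler, assuming an arbitrary one-hour refresh interval and a placeholder URL:

>>> import time
>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser("http://www.example.com/robots.txt")
>>> if time.time() - rp.mtime() > 3600:
...     rp.read()       # re-fetch and re-parse robots.txt
...     rp.modified()   # record the current time as the fetch time
...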

crawl_delay(useragent)

Returns the value of the Crawl-delay parameter from robots.txt for the useragent in question. If there is no such parameter or it doesn’t apply to the useragent specified or the robots.txt entry for this parameter has invalid syntax, return None.

Added in version 3.6.

request_rate(useragent)

Returns the contents of the Request-rate parameter from robots.txt as a named tuple RequestRate(requests, seconds). If there is no such parameter or it doesn’t apply to the useragent specified or the robots.txt entry for this parameter has invalid syntax, return None.

Added in version 3.6.
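
Both Crawl-delay and Request-rate can be inspected without fetching anything by feeding parse() a set of rules directly; the values below are made up for illustration:

>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.parse([
...     "User-agent: *",
...     "Crawl-delay: 5",
...     "Request-rate: 3/20",
... ])
>>> rp.crawl_delay("*")
5
>>> rp.request_rate("*")
RequestRate(requests=3, seconds=20)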

site_maps()

Returns the contents of the Sitemap parameter from robots.txt in the form of a list(). If there is no such parameter or the robots.txt entry for this parameter has invalid syntax, return None.

Added in version 3.8.
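
For example, with a hypothetical robots.txt that lists one sitemap:

>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.parse([
...     "Sitemap: https://example.com/sitemap.xml",
...     "User-agent: *",
...     "Disallow: /private/",
... ])
>>> rp.site_maps()
['https://example.com/sitemap.xml']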

The following example demonstrates basic use of the RobotFileParser class:

>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rrate = rp.request_rate("*")
>>> rrate.requests
3
>>> rrate.seconds
20
>>> rp.crawl_delay("*")
6
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True