
scrapy/protego


Protego is a pure-Python robots.txt parser with support for modern conventions.

Install

To install Protego, simply use pip:

```
pip install protego
```

Usage

```python
>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m  # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'
```

Using Protego with Requests:

```python
>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']
```

Comparison

The following table compares Protego to the most popular robots.txt parsers implemented in Python or featuring Python bindings:

|                         | Protego | RobotFileParser             | Reppy  | Robotexclusionrulesparser |
|-------------------------|---------|-----------------------------|--------|---------------------------|
| Implementation language | Python  | Python                      | C++    | Python                    |
| Reference specification | Google  | Martijn Koster's 1996 draft |        |                           |
| Wildcard support        | ✓       |                             | ✓      | ✓                         |
| Length-based precedence | ✓       |                             | ✓      |                           |
| Performance             |         | +40%                        | +1300% | -25%                      |

API Reference

Class protego.Protego:

Properties

  • sitemaps {list_iterator} A list of sitemaps specified in robots.txt.
  • preferred_host {string} Preferred host specified in robots.txt.

Methods

  • parse(robotstxt_body) Parse robots.txt and return a new instance of protego.Protego.
  • can_fetch(url, user_agent) Return True if the user agent can fetch the URL, otherwise return False.
  • crawl_delay(user_agent) Return the crawl delay specified for the user agent as a float. If nothing is specified, return None.
  • request_rate(user_agent) Return the request rate specified for the user agent as a named tuple RequestRate(requests, seconds, start_time, end_time). If nothing is specified, return None.
  • visit_time(user_agent) Return the visit time specified for the user agent as a named tuple VisitTime(start_time, end_time). If nothing is specified, return None.
