
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.
Created on 2015-10-14 01:21 by pwirtz, last changed 2022-04-11 14:58 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| robotparser_crawl_delay.patch | pwirtz, 2015-10-14 01:21 | patch |
| robotparser_crawl_delay_v2.patch | pwirtz, 2015-10-14 18:35 | |
| issue25400_v2.diff | berker.peksag, 2016-09-18 15:36 | |
| issue25400_v3.diff | berker.peksag, 2016-09-18 16:01 | |
| Pull Requests | | |
|---|---|---|
| URL | Status | Linked |
| PR 552 | closed | dstufft, 2017-03-31 16:36 |
| Messages (8) | | |
|---|---|---|
| msg252971 | Author: Peter Wirtz (pwirtz) | Date: 2015-10-14 01:21 |
After changeset http://hg.python.org/lookup/dbed7cacfb7e, calling the crawl_delay method for a robots.txt file that has a crawl-delay for * user agents always returns None.

Ex:

    Python 3.6.0a0 (default:1aae9b6a6929+, Oct 9 2015, 22:08:05)
    [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import urllib.robotparser
    >>> parser = urllib.robotparser.RobotFileParser()
    >>> parser.set_url('https://www.carthage.edu/robots.txt')
    >>> parser.read()
    >>> parser.crawl_delay('test_robotparser')
    >>> parser.crawl_delay('*')
    >>> print(parser.default_entry.delay)
    120
    >>>

Excerpt from https://www.carthage.edu/robots.txt:

    User-agent: *
    Crawl-Delay: 120
    Disallow: /cgi-bin

I have written a patch that solves this. With the patch, the output is:

    Python 3.6.0a0 (default:1aae9b6a6929+, Oct 9 2015, 22:08:05)
    [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import urllib.robotparser
    >>> parser = urllib.robotparser.RobotFileParser()
    >>> parser.set_url('https://www.carthage.edu/robots.txt')
    >>> parser.read()
    >>> parser.crawl_delay('test_robotparser')
    120
    >>> parser.crawl_delay('*')
    120
    >>> print(parser.default_entry.delay)
    120
    >>>

This also applies to the request_rate method.
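The report above can be checked without network access by passing the robots.txt excerpt to `RobotFileParser.parse()` directly. This is a minimal sketch, assuming a Python version that already contains the fix for this issue (3.6 or later). The `Request-rate: 3/15` line is an assumption added for illustration and is not part of the carthage.edu excerpt.

```python
import urllib.robotparser

# Excerpt from the report, plus a Request-rate line that is NOT in the
# original robots.txt (added here only to exercise request_rate()).
ROBOTS_TXT = """\
User-agent: *
Crawl-Delay: 120
Request-rate: 3/15
Disallow: /cgi-bin
"""

parser = urllib.robotparser.RobotFileParser()
# parse() accepts an iterable of lines, so no HTTP request is needed.
parser.parse(ROBOTS_TXT.splitlines())

# No record names this agent, so the lookup falls back to the "*" record.
delay_named = parser.crawl_delay('test_robotparser')
delay_star = parser.crawl_delay('*')
rate = parser.request_rate('test_robotparser')

print(delay_named, delay_star)      # 120 120 on a fixed version
print(rate.requests, rate.seconds)  # 3 15 on a fixed version
```

On an interpreter that still has the bug, both `crawl_delay` calls would be expected to return None instead, as in the first session above.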
| msg252972 | Author: Peter Wirtz (pwirtz) | Date: 2015-10-14 01:25 |
This fix breaks the unit tests though. I am not sure how to go about checking those, as this would be my first contribution to Python and to an open source project in general.
| msg253015 | Author: Peter Wirtz (pwirtz) | Date: 2015-10-14 18:16 |
On further inspection of the tests, it appears that, because of the way they are written, a test case can only be tested for one user agent at a time. I will attempt to rework the tests so they work correctly. Any advice would be much appreciated.
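One way to check several user agents within a single test case is `unittest`'s `subTest`. The sketch below is an illustration only and is not taken from test_robotparser.py or from any of the attached patches. The `figtree` record, the class name, and the expected values are assumptions, and it presumes a Python version that includes the fix for this issue.

```python
import unittest
import urllib.robotparser

# Hypothetical robots.txt with one agent-specific record and one "*" record.
ROBOTS_TXT = """\
User-agent: figtree
Crawl-delay: 3
Disallow: /tmp

User-agent: *
Crawl-delay: 120
Disallow: /cgi-bin
"""


class CrawlDelayTest(unittest.TestCase):
    def setUp(self):
        self.parser = urllib.robotparser.RobotFileParser()
        self.parser.parse(ROBOTS_TXT.splitlines())

    def test_crawl_delay_per_agent(self):
        # Each agent is reported as its own sub-test, so one failure
        # does not hide the results for the other agents.
        expected = {"figtree": 3, "test_robotparser": 120, "*": 120}
        for agent, delay in expected.items():
            with self.subTest(agent=agent):
                self.assertEqual(self.parser.crawl_delay(agent), delay)


# Run the case explicitly instead of calling unittest.main(),
# so the snippet does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CrawlDelayTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```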
| msg253016 | Author: Berker Peksag (berker.peksag) | Date: 2015-10-14 18:22 |
Thanks for the patch, Peter (and welcome to Python and open source development). I have a WIP patch to rewrite test_robotparser in a less magic way, so we can ignore test failures for now. I'll take a closer look at your patch.
| msg253017 | Author: Peter Wirtz (pwirtz) | Date: 2015-10-14 18:35 |
OK, for the meantime, I reworked the test so that it appears to test correctly and the tests pass. There does seem to be some magic, so I do hope I did not overlook anything. Here is the new patch.
| msg275776 | Author: Berker Peksag (berker.peksag) | Date: 2016-09-11 11:55 |
I've now updated Lib/test/test_robotparser.py (issue 25497). Peter, do you have time to update your patch? Thanks!
| msg276897 | Author: Berker Peksag (berker.peksag) | Date: 2016-09-18 15:36 |
Here's an updated patch.
| msg276900 | Author: Roundup Robot (python-dev) | Date: 2016-09-18 17:17 |
New changeset d5d910cfd288 by Berker Peksag in branch '3.6':
Issue #25400: RobotFileParser now correctly returns default values for crawl_delay and request_rate
https://hg.python.org/cpython/rev/d5d910cfd288

New changeset 911070065e38 by Berker Peksag in branch 'default':
Issue #25400: Merge from 3.6
https://hg.python.org/cpython/rev/911070065e38
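The behaviour named in the commit message, returning the default (`*`) values when no agent-specific record matches, can be modelled in a few lines. This is a standalone sketch of the fallback idea and not the committed patch. `Rule` and `lookup_delay` are hypothetical names that do not exist in `urllib.robotparser`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    """Hypothetical stand-in for one robots.txt record."""
    agents: List[str]
    delay: Optional[int] = None

    def applies_to(self, useragent: str) -> bool:
        # Simple case-insensitive substring match on the agent name.
        name = useragent.lower()
        return any(agent.lower() in name for agent in self.agents)


def lookup_delay(rules: List[Rule], default_rule: Optional[Rule],
                 useragent: str) -> Optional[int]:
    # Agent-specific records take precedence.
    for rule in rules:
        if rule.applies_to(useragent):
            return rule.delay
    # Otherwise fall back to the "*" record instead of returning None outright.
    if default_rule is not None:
        return default_rule.delay
    return None


rules = [Rule(["figtree"], delay=3)]
default_rule = Rule(["*"], delay=120)

print(lookup_delay(rules, default_rule, "figtree"))   # 3
print(lookup_delay(rules, default_rule, "otherbot"))  # 120
print(lookup_delay(rules, None, "otherbot"))          # None
```

The same fallback would apply to a request-rate lookup, which is why the commit message mentions both methods.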
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:58:22 | admin | set | github: 69586 |
| 2017-03-31 16:36:30 | dstufft | set | pull_requests: +pull_request1034 |
| 2016-09-18 17:18:17 | berker.peksag | set | status: open -> closed, resolution: fixed, stage: patch review -> resolved |
| 2016-09-18 17:17:29 | python-dev | set | nosy: +python-dev, messages: +msg276900 |
| 2016-09-18 16:01:20 | berker.peksag | set | files: +issue25400_v3.diff |
| 2016-09-18 15:36:23 | berker.peksag | set | files: +issue25400_v2.diff, messages: +msg276897, versions: + Python 3.7 |
| 2016-09-11 11:55:50 | berker.peksag | set | messages: +msg275776 |
| 2015-10-14 18:35:14 | pwirtz | set | files: +robotparser_crawl_delay_v2.patch, messages: +msg253017 |
| 2015-10-14 18:22:35 | berker.peksag | set | messages: +msg253016, stage: patch review |
| 2015-10-14 18:16:25 | pwirtz | set | messages: +msg253015 |
| 2015-10-14 09:10:17 | berker.peksag | set | nosy: +berker.peksag |
| 2015-10-14 01:25:01 | pwirtz | set | messages: +msg252972 |
| 2015-10-14 01:21:42 | pwirtz | create | |