[issue25400] robotparser doesn't return crawl delay for default entry
Peter Wirtz
report at bugs.python.org
Tue Oct 13 21:21:42 EDT 2015
New submission from Peter Wirtz:
After changeset http://hg.python.org/lookup/dbed7cacfb7e, calling the crawl_delay method on a robots.txt file that sets a crawl-delay for the * user agent always returns None, even though the value is parsed and stored on the default entry.
Ex:
Python 3.6.0a0 (default:1aae9b6a6929+, Oct 9 2015, 22:08:05)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.robotparser
>>> parser = urllib.robotparser.RobotFileParser()
>>> parser.set_url('https://www.carthage.edu/robots.txt')
>>> parser.read()
>>> parser.crawl_delay('test_robotparser')
>>> parser.crawl_delay('*')
>>> print(parser.default_entry.delay)
120
>>>
Excerpt from https://www.carthage.edu/robots.txt:
User-agent: *
Crawl-Delay: 120
Disallow: /cgi-bin
I have written a patch that solves this. With the patch applied, the output is:
Python 3.6.0a0 (default:1aae9b6a6929+, Oct 9 2015, 22:08:05)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.robotparser
>>> parser = urllib.robotparser.RobotFileParser()
>>> parser.set_url('https://www.carthage.edu/robots.txt')
>>> parser.read()
>>> parser.crawl_delay('test_robotparser')
120
>>> parser.crawl_delay('*')
120
>>> print(parser.default_entry.delay)
120
>>>
This also applies to the request_rate method.
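The behavior can also be checked offline, without fetching the live file. The following is a minimal sketch, not part of the attached patch: it feeds the rules from the excerpt above to RobotFileParser.parse() directly. The Request-rate line and the helper name check_defaults are assumptions added for illustration; the Carthage excerpt does not include a Request-rate line.

```python
import urllib.robotparser

# Rules from the excerpt above, supplied inline instead of fetched.
# The Request-rate line is an assumption added to exercise request_rate().
ROBOTS_TXT = """\
User-agent: *
Crawl-Delay: 120
Request-rate: 1/5
Disallow: /cgi-bin
"""


def check_defaults(text=ROBOTS_TXT):
    """Hypothetical helper: compare per-agent lookups with the default entry."""
    parser = urllib.robotparser.RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed.
    parser.parse(text.splitlines())
    return {
        "crawl_delay_named": parser.crawl_delay("test_robotparser"),
        "crawl_delay_star": parser.crawl_delay("*"),
        "request_rate_star": parser.request_rate("*"),
        "default_delay": parser.default_entry.delay,
        "default_req_rate": parser.default_entry.req_rate,
    }


if __name__ == "__main__":
    for name, value in check_defaults().items():
        print(name, "=", value)
```

On an interpreter affected by this issue, the crawl_delay and request_rate lookups should come back as None while the default_entry attributes hold the parsed values. With the fix, the lookups should match the default entry.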
----------
components: Library (Lib)
files: robotparser_crawl_delay.patch
keywords: patch
messages: 252971
nosy: pwirtz
priority: normal
severity: normal
status: open
title: robotparser doesn't return crawl delay for default entry
type: behavior
versions: Python 3.6
Added file: http://bugs.python.org/file40777/robotparser_crawl_delay.patch
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue25400>
_______________________________________