[Python-checkins] cpython (3.3): #17403: urllib.robotparser normalizes URLs before adding them to the rule line.

senthil.kumaran python-checkins at python.org
Wed May 29 14:59:12 CEST 2013


http://hg.python.org/cpython/rev/30128355f53b
changeset:   83971:30128355f53b
branch:      3.3
parent:      83969:968f6094788b
user:        Senthil Kumaran <senthil at uthcode.com>
date:        Wed May 29 05:54:31 2013 -0700
summary:
  #17403: urllib.robotparser normalizes URLs before adding them to the rule line.
This helps handle certain types of invalid URLs in a conservative manner.
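
To illustrate the normalization described above, here is a minimal sketch of
what the added line does to a rule path that ends in an empty query. The helper
name normalize_rule_path is hypothetical and is not part of the changeset; it
only repeats the urlparse/urlunparse round trip followed by quote() that the
diff shows inside RuleLine.

```python
from urllib.parse import quote, urlparse, urlunparse


def normalize_rule_path(path):
    """Hypothetical helper: repeat the two steps applied to a rule path."""
    # Round-trip through urlparse/urlunparse. An empty query (a "?" with
    # nothing after it) is dropped when the URL is reassembled.
    normalized = urlunparse(urlparse(path))
    # Percent-encode what remains, as the existing code already did.
    return quote(normalized)


# Before the change: the bare "?" survives and gets percent-encoded.
print(quote("/another/path?"))                # /another/path%3F

# After the change: the empty query is removed before quoting.
print(normalize_rule_path("/another/path?"))  # /another/path
```

With the trailing "?" encoded as "%3F", the stored rule would presumably not
prefix-match a requested URL whose own empty query has been dropped, which
appears to be the case that test 16 in this changeset covers.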

files:
  Lib/test/test_robotparser.py |  12 ++++++++++++
  Lib/urllib/robotparser.py    |   1 +
  Misc/NEWS                    |   4 ++++
  3 files changed, 17 insertions(+), 0 deletions(-)


diff --git a/Lib/test/test_robotparser.py b/Lib/test/test_robotparser.py
--- a/Lib/test/test_robotparser.py
+++ b/Lib/test/test_robotparser.py
@@ -234,6 +234,18 @@
 
 RobotTest(15, doc, good, bad)
 
+# 16. Empty query (issue #17403). Normalizing the url first.
+doc = """
+User-agent: *
+Allow: /some/path?
+Disallow: /another/path?
+"""
+
+good = ['/some/path?']
+bad = ['/another/path?']
+
+RobotTest(16, doc, good, bad)
+
 
 class NetworkTestCase(unittest.TestCase):
 
diff --git a/Lib/urllib/robotparser.py b/Lib/urllib/robotparser.py
--- a/Lib/urllib/robotparser.py
+++ b/Lib/urllib/robotparser.py
@@ -157,6 +157,7 @@
         if path == '' and not allowance:
             # an empty value means allow all
             allowance = True
+        path = urllib.parse.urlunparse(urllib.parse.urlparse(path))
         self.path = urllib.parse.quote(path)
         self.allowance = allowance
 
diff --git a/Misc/NEWS b/Misc/NEWS
--- a/Misc/NEWS
+++ b/Misc/NEWS
@@ -24,6 +24,10 @@
 Library
 -------
 
+- Issue #17403: urllib.parse.robotparser normalizes the urls before adding to
+  ruleline. This helps in handling certain types invalid urls in a conservative
+  manner.
+
 - Issue #18025: Fixed a segfault in io.BufferedIOBase.readinto() when raw
   stream's read() returns more bytes than requested.
 

-- 
Repository URL: http://hg.python.org/cpython

