[issue20559] urllib/http fail to sanitize a non-ascii url
Bill Winslow
report at bugs.python.org
Sat Feb 8 05:34:24 CET 2014
New submission from Bill Winslow:
The following code will produce a UnicodeEncodeError about a character being non-ascii:
from urllib import request, parse, error
url = 'http://en.wikipedia.org/wiki/Antonio Vallejo-Nájera'
req = request.Request(url)
response = request.urlopen(req)
This fails as follows:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/urllib/request.py", line 156, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.3/urllib/request.py", line 469, in open
    response = self._open(req, data)
  File "/usr/lib/python3.3/urllib/request.py", line 487, in _open
    '_open', req)
  File "/usr/lib/python3.3/urllib/request.py", line 447, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.3/urllib/request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.3/urllib/request.py", line 1248, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/usr/lib/python3.3/http/client.py", line 1067, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.3/http/client.py", line 1095, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/lib/python3.3/http/client.py", line 959, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 27: ordinal not in range(128)
I examined the library code in question: line 958 in http/client.py, the line before the one that barfs, contains the following comment:
# Non-ASCII characters should have been eliminated earlier
I added a print statement to the library code:
print(request)
self._output(request.encode('ascii'))
This prints the following:
>>> response = request.urlopen(req)
GET /wiki/Antonio Vallejo-Nájera HTTP/1.1
Traceback (most recent call last): ...
I confirmed that the character at position 27 of the request line (counting from zero), as mentioned in the traceback, is in fact the á in the last name. Clearly either urllib or http is not properly sanitizing the URL. Unfortunately, I don't know the code well enough to determine where the actual error is; hopefully this report contains enough detail to make that easy to find.
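For reference, the position reported in the traceback and one possible caller-side workaround can be sketched as follows. This is only a minimal sketch, not a fix for the library itself: the helper name `quote_url` is hypothetical, it assumes the server expects UTF-8 percent-encoding, and it does not handle non-ASCII host names (which would need IDNA encoding).

```python
from urllib import parse

url = 'http://en.wikipedia.org/wiki/Antonio Vallejo-Nájera'

# The request line that http.client tries to encode as ASCII.
request_line = 'GET /wiki/Antonio Vallejo-Nájera HTTP/1.1'
offending = request_line[27]  # 'á', matching "position 27" in the traceback


def quote_url(raw_url):
    """Percent-encode the path and query of a URL (hypothetical helper)."""
    parts = parse.urlsplit(raw_url)
    # quote() encodes str input as UTF-8 by default; '/' and '%' are kept
    # so that path separators and existing escapes are left alone.
    path = parse.quote(parts.path, safe='/%')
    query = parse.quote(parts.query, safe='=&%+')
    return parse.urlunsplit(
        (parts.scheme, parts.netloc, path, query, parts.fragment))


safe_url = quote_url(url)
# → 'http://en.wikipedia.org/wiki/Antonio%20Vallejo-N%C3%A1jera'
safe_url.encode('ascii')  # no UnicodeEncodeError for this URL
```

Passing `safe_url` to `request.Request` in place of `url` should avoid the encoding error shown above, since the selector then contains only ASCII characters.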
----------
components: Library (Lib), Unicode
messages: 210587
nosy: Dubslow, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: urllib/http fail to sanitize a non-ascii url
type: behavior
versions: Python 3.3
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue20559>
_______________________________________