[issue1712522] urllib.quote throws exception on Unicode URL

Mon Jul 19 15:22:01 CEST 2010

Antoine Pitrou <pitrou at free.fr> added the comment:

> Now one of the major goals of Python 2.6/2.7 is to allow the writing
> of code which ports smoothly to Python 3. Unicode support is a major
> issue here.

I understand the argument. But 2.7 is a bugfix branch and shouldn't
receive new features, even backports. If we wanted 2.x to converge
further toward 3.x, we would do a 2.8, which we have decided not to do.

> I don't consider use of Unicode strings in Python 2.7 to be
> "accidental". In my experience with Python 2, pretty much everything
> already works with Unicode strings, and it's best practice to use
> them.

Not true. From the urllib module itself:

$ touch /tmp/hé
$ python -c 'import urllib; urllib.urlretrieve("file:///tmp/hé")'
$ python -c 'import urllib; urllib.urlretrieve(u"file:///tmp/hé")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib64/python2.6/urllib.py", line 93, in urlretrieve
    return _urlopener.retrieve(url, filename, reporthook, data)
  File "/usr/lib64/python2.6/urllib.py", line 225, in retrieve
    url = unwrap(toBytes(url))
  File "/usr/lib64/python2.6/urllib.py", line 1027, in toBytes
    " contains non-ASCII characters")
UnicodeError: URL u'file:///tmp/h\xc3\xa9' contains non-ASCII characters
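
The traceback above ends in a check that accepts a unicode URL only when it is pure ASCII. A rough Python 3 sketch of that kind of check follows. The name to_ascii_bytes is hypothetical and chosen for illustration; this is not the actual urllib source.

```python
def to_ascii_bytes(url):
    """Return url as bytes, rejecting text that is not pure ASCII.

    Hypothetical helper, written only to illustrate the failure mode.
    """
    if isinstance(url, str):
        try:
            return url.encode("ascii")
        except UnicodeEncodeError:
            # Mirror the message seen in the traceback.
            raise UnicodeError("URL %r contains non-ASCII characters" % url)
    # Already bytes: passed through unchanged, whatever the encoding.
    return url
```

Under this sketch, a byte string holding UTF-8 data passes through untouched, while the equivalent text string is rejected, which matches the difference between the two urlretrieve calls.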

> Having functions in Python 2.7 which don't accept Unicode (or worse,
> raise random exceptions) runs against best practices for moving to
> Python 3.

There are lots of them, and urllib.quote() isn't an exception:

>>> zlib.compress("ha")
'x\x9c\xcbH\x04\x00\x013\x00\xca'
>>> zlib.compress(u"hà")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 1: ordinal not in range(128)

>>> pwd.getpwnam("root")
pwd.struct_passwd(pw_name='root', pw_passwd='x', pw_uid=0, pw_gid=0, pw_gecos='root', pw_dir='/root', pw_shell='/bin/bash')
>>> pwd.getpwnam(u"rooté")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 4: ordinal not in range(128)

> In fact, most code written to work with strings naturally works with
> Unicode because unicode strings support the same basic operations.

What should zlib compression of a unicode string result in?
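
To make the question concrete, here is a small Python 3 sketch showing that the answer depends on which encoding is picked first. The choice of UTF-8 and Latin-1 is an assumption made for illustration; nothing in zlib prefers either.

```python
import zlib

text = "h\u00e0"  # the two-character string "hà"

# The same text yields different byte strings under different encodings.
utf8_bytes = text.encode("utf-8")      # two bytes for the accented letter
latin1_bytes = text.encode("latin-1")  # one byte for the accented letter

# zlib only ever sees bytes, so the compressed results differ as well.
compressed_utf8 = zlib.compress(utf8_bytes)
compressed_latin1 = zlib.compress(latin1_bytes)

# Round-tripping recovers the text only if the matching encoding is
# used to decode the decompressed bytes.
restored = zlib.decompress(compressed_utf8).decode("utf-8")
```

The caller has to choose the encoding explicitly; a compression function has no way to guess which one the receiver expects.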

> > The original issue is against robotparser, and clearly states a bug
> > (robotparser doesn't work in some cases).
> 
> I don't know why this keeps coming back to robotparser. The original
> bug was not against robotparser; it is called "quote throws exception
> on Unicode URL" and that is the bug. Robotparser was just one
> demonstrative piece of code which failed because of it.

Well, there are two different concerns:
- robotparser fails on certain Web pages, which is a bug (unless the Web
  pages are clearly malformed)
- urllib.quote() should accept any kind of unicode string and perform
  the appropriate encoding, with the ability to override the default
  encoding parameters: this is a feature request
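
As an illustration of what the second item asks for, the sketch below encodes text explicitly and then percent-encodes the resulting bytes. It is written against Python 3's urllib.parse.quote, not the 2.x urllib.quote under discussion, and quote_text is a hypothetical helper name; the UTF-8 default is an assumption.

```python
from urllib.parse import quote


def quote_text(text, encoding="utf-8", safe="/"):
    """Percent-encode text after encoding it explicitly.

    Hypothetical helper: the encoding is an overridable parameter,
    with UTF-8 assumed as the default.
    """
    return quote(text.encode(encoding), safe=safe)


default_result = quote_text("h\u00e9")                     # 'h%C3%A9'
latin1_result = quote_text("h\u00e9", encoding="latin-1")  # 'h%E9'
```

The two results differ, which is why an overridable encoding parameter matters: the right choice depends on what the server on the other end expects.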

The OP himself (John Nagle) said:
“The problem is down inside a library module. "robotparser" is calling
"urllib.quote". One of those two library modules needs to be fixed.”

It seems to imply that the primary concern was robotparser not working.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue1712522>
_______________________________________