[issue3300] urllib.quote and unquote - Unicode issues

Sun Aug 10 07:05:09 CEST 2008

Matt Giuca <matt.giuca at gmail.com> added the comment:

I've been thinking more about the errors="strict" default. I think this
was Guido's suggestion. I've decided I'd rather stick with errors="replace".

I changed errors="replace" to errors="strict" in patch 8, but now I'm
worried that will cause problems, specifically for unquote. Once again,
all the code in the stdlib which calls unquote doesn't provide an errors
option, so the default will be the only choice when using these other
services.

I'm concerned that there'll be lots of unhandled exceptions flying
around for URLs which aren't encoded with UTF-8, and a conscientious
programmer will not be able to protect against user errors.

Take the cgi module as an example. Typical usage is to write:
> fields = cgi.FieldStorage()
> foo = fields.getfirst("foo")

If the QUERY_STRING is "foo=w%FCt" (Latin-1), with errors='strict', you
get a UnicodeDecodeError when you call cgi.FieldStorage(). With
errors='replace', the variable foo will be "w�t". I think in general I'd
rather have '�'s in my program (representing invalid user input) than
exceptions, since this is usually a user input error, not a programming
error.

(One problem is that the only way to handle this is to catch the
UnicodeDecodeError on the call to FieldStorage, and then I can't access
any of the data.)
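
To make the difference concrete, here is a minimal sketch of the two
error modes applied to the value from the example above. It assumes the
"encoding" and "errors" keyword arguments as they appear on
urllib.parse.unquote in released Python 3; the variable names are only
for illustration.

```python
from urllib.parse import unquote

raw = "w%FCt"  # "wüt" percent-encoded as Latin-1, not UTF-8

# errors="replace": the invalid byte becomes U+FFFD and the caller carries on.
replaced = unquote(raw, encoding="utf-8", errors="replace")

# errors="strict": the same input raises, so no value is returned at all.
try:
    strict = unquote(raw, encoding="utf-8", errors="strict")
except UnicodeDecodeError:
    strict = None

# Only a caller that already knows the real encoding recovers the text.
latin1 = unquote(raw, encoding="latin-1")
```

With "replace" the program still gets a string it can display or reject;
with "strict" the caller has to wrap every call site that might see
non-UTF-8 input.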

Now maybe something we can think about is propagating the "encoding" and
"errors" arguments through to a few other major functions (such as
cgi.parse_qsl, cgi.FieldStorage and urllib.parse.urlencode), but that
should be done separately from this patch.
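
As a rough sketch of what that propagation could look like, the
hypothetical helper below parses a query string and hands "encoding" and
"errors" straight through to urllib.parse.unquote_plus. The name
parse_qsl_with_errors is made up for this example and is not part of the
patch; it also skips details such as blank-value handling that the real
parse_qsl deals with.

```python
from urllib.parse import unquote_plus


def parse_qsl_with_errors(qs, encoding="utf-8", errors="replace"):
    """Hypothetical helper: split a query string into (name, value) pairs,
    passing encoding/errors through to the percent-decoding step."""
    pairs = []
    for field in qs.split("&"):
        if not field:
            continue  # skip empty segments such as a trailing "&"
        name, _, value = field.partition("=")
        pairs.append((
            unquote_plus(name, encoding=encoding, errors=errors),
            unquote_plus(value, encoding=encoding, errors=errors),
        ))
    return pairs


# The caller, not the library default, now decides how bad input is handled.
lenient = parse_qsl_with_errors("foo=w%FCt")
as_latin1 = parse_qsl_with_errors("foo=w%FCt", encoding="latin-1")
```

A caller that wants exceptions can still pass errors="strict" and catch
UnicodeDecodeError itself.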

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue3300>
_______________________________________