[Web-SIG] Python 3: Form data encoding issues in cgi and urllib modules

Wed Apr 15 23:16:08 CEST 2009

On Sat, Apr 11, 2009 at 8:48 PM, Miles Kaufmann wrote:
> The first issue is that there doesn't seem to be a way to parse
> x-www-form-urlencoded query strings in a character set other than
> UTF-8, for example:
>
> 'premier=un&deuxi%E8me=deux' # latin-1
>
> The urllib.parse.unquote* functions take encoding and errors
> parameters, but none of the higher-level functions do.  The solution
> to me seems to be that the functions that build on top of
> them--urllib.parse.parse*, cgi.parse*, and the cgi.FieldStorage
> constructor--should grow encoding and errors parameters that they pass
> through to the lower-level functions.
>
> The second issue is that the FieldStorage classes work with text input
> streams.  However, with multipart/form-data posts, posted files aren't
> necessarily in the same encoding as form fields, or may be binary and
> not text at all.  I would suggest that FieldStorage should be changed
> to take a binary input stream.
>
> [...]
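
To make the first issue concrete, here is a minimal sketch of the kind of
pass-through being proposed.  The helper name parse_qsl_encoded is made up
for illustration and is not an existing stdlib function; the only library
call it relies on is urllib.parse.unquote_plus, which already accepts
encoding and errors.

```python
from urllib.parse import unquote_plus


def parse_qsl_encoded(qs, encoding="utf-8", errors="replace"):
    """Hypothetical helper: split a query string into (name, value) pairs,
    passing encoding/errors through to the lower-level unquote function."""
    pairs = []
    for field in qs.split("&"):
        if not field:
            continue  # skip empty segments such as a trailing '&'
        name, _, value = field.partition("=")
        pairs.append((
            unquote_plus(name, encoding=encoding, errors=errors),
            unquote_plus(value, encoding=encoding, errors=errors),
        ))
    return pairs


qs = "premier=un&deuxi%E8me=deux"  # the latin-1 example from above

# Decoded with the charset the form was actually submitted in.
latin1_pairs = parse_qsl_encoded(qs, encoding="latin-1")

# Decoded with the UTF-8 default: %E8 is not valid UTF-8 on its own,
# so errors="replace" substitutes U+FFFD for it.
utf8_pairs = parse_qsl_encoded(qs)

print(latin1_pairs)
print(utf8_pairs)
```

The point is only that the charset has to be chosen by the caller; a
higher-level function that hard-codes UTF-8 cannot recover the original
field name.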

I'm not quite sure how to interpret the lack of response I've gotten
on this topic.  Is it just that there's little interest in the cgi
module?  Should I raise this issue on the python-dev list, or just
open a bug report and start submitting patches?

There's been a lot of discussion recently about bytes vs. str in email
headers and WSGI environ variables, but I haven't been able to find a
substantive discussion on this specific topic.  Here are some of the
related quotes I've come across.

Martin v. Löwis wrote [1]:
> In a CGI application, you shouldn't be using sys.stdin or print().
> Instead, you should be using sys.stdin.buffer (or sys.stdin.buffer.raw),
> and sys.stdout.buffer.raw. A CGI script essentially does binary IO;
> if you use TextIO, there likely will be bugs (e.g. if you have
> attachments of type application/octet-stream).
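
A rough sketch of what reading the request body as bytes could look like,
following the advice above.  The helper read_request_body and the
simulated environment are assumptions for illustration only; a real CGI
script would be handed CONTENT_LENGTH by the web server and would read
from sys.stdin.buffer.

```python
import io
import os
import sys


def read_request_body(environ=None, stream=None):
    """Hypothetical helper: read CONTENT_LENGTH bytes from a binary stream.

    Defaults to os.environ and sys.stdin.buffer, i.e. the binary layer
    underneath the text-mode sys.stdin.
    """
    if environ is None:
        environ = os.environ
    if stream is None:
        stream = sys.stdin.buffer
    length = int(environ.get("CONTENT_LENGTH") or 0)
    return stream.read(length)


# Simulated upload: arbitrary binary data that is not valid UTF-8 text.
payload = b"\x89PNG\r\n\x1a\n" + bytes(range(256))
fake_environ = {"CONTENT_LENGTH": str(len(payload))}

body = read_request_body(fake_environ, io.BytesIO(payload))

# The bytes survive unchanged, whereas decoding them as text fails.
try:
    payload.decode("utf-8")
    decodes_as_text = True
except UnicodeDecodeError:
    decodes_as_text = False

print(len(body), decodes_as_text)
```

Anything that needs to be text (field names, field values) can then be
decoded explicitly with whatever charset applies, while file contents
stay as bytes.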

bobince wrote [2]:
> Evan Fosmark wrote:
>> bobince wrote:
>>> So yeah, it's a bug in cgi.py, yet another victim of 2to3 conversion
>>> that hasn't been fixed properly for the new string model. It should
>>> be converting the incoming byte stream to characters before
>>> passing them to urllib.
>>>
>>> Did I mention Python 3.0's libraries (especially web-related
>>> ones) still being rather shonky? :-)
>>
>> Yeah. So far I've noticed huge problems with cgi, urllib, and
>> wsgiref. I hope they get fixed soon. :(
>
> Indeed. Momentum in WEB-SIG seems to have ground to a halt; no-one
> seems to want ownership of the issue. Very disappointing.

There's also this bug report [3], but it doesn't directly propose the
changes I'm suggesting here.

So: does anyone agree, or disagree, that cgi.FieldStorage should be
changed to take byte streams, and many of the cgi and urllib.parse
functions should become encoding-aware, preferably in time for Python
3.1?  The byte-stream change will break compatibility with Python
3.0, but I strongly feel that treating POST data as text is wrong and
should not continue to be supported.

-Miles Kaufmann

[1]: http://mail.python.org/pipermail/python-dev/2009-April/088727.html
[2]: http://stackoverflow.com/questions/540342/python-3-0-urllib
[3]: http://bugs.python.org/issue4953