[Web-SIG] Python 3: Form data encoding issues in cgi and urllib modules

Wed Apr 15 23:23:59 CEST 2009

2009/4/16 Miles Kaufmann <milesck at umich.edu>:
> On Sat, Apr 11, 2009 at 8:48 PM, Miles Kaufmann wrote:
>> The first issue is that there doesn't seem to be a way to parse
>> x-www-form-urlencoded query strings in a character set other than
>> UTF-8, for example:
>>
>> 'premier=un&deuxi%E8me=deux' # latin-1
>>
>> The urllib.parse.unquote* functions take encoding and errors
>> parameters, but none of the higher-level ones do.  The solution to me
>> seems to be that the functions that build on top of
>> them--urllib.parse.parse*, cgi.parse*, and the cgi.FieldStorage
>> constructor--should grow encoding and errors parameters that they pass
>> through to the lower-level functions.
>>
>> The second issue is that the FieldStorage classes work with text input
>> streams.  However, with multipart/form-data posts, posted files aren't
>> necessarily in the same encoding as form fields, or may be binary and
>> not text at all.  I would suggest that FieldStorage should be changed
>> to take a binary input stream.
>>
>> [...]
>
> I'm not quite sure how to interpret the lack of response I've gotten
> on this topic.  Is it just that there's little interest in the cgi
> module?  Should I raise this issue on the python-dev list, or just
> open a bug report and start submitting patches?
>
> There's been a lot of discussion recently about bytes vs. str in email
> headers and WSGI environ variables, but I haven't been able to find a
> substantive discussion on this specific topic.  Here are some of the
> related quotes I've come across.
>
> Martin v. Löwis wrote [1]:
>> In a CGI application, you shouldn't be using sys.stdin or print().
>> Instead, you should be using sys.stdin.buffer (or sys.stdin.buffer.raw),
>> and sys.stdout.buffer.raw. A CGI script essentially does binary IO;
>> if you use TextIO, there likely will be bugs (e.g. if you have
>> attachments of type application/octet-stream).
>
> bobince wrote [2]:
>> Evan Fosmark wrote:
>>> bobince wrote:
>>>> So yeah, it's a bug in cgi.py, yet another victim of 2to3 conversion
>>>> that hasn't been fixed properly for the new string model. It should
>>>> be converting the incoming byte stream to characters before
>>>> passing them to urllib.
>>>>
>>>> Did I mention Python 3.0's libraries (especially web-related
>>>> ones) still being rather shonky? :-)
>>>
>>> Yeah. So far I've noticed huge problems with cgi, urllib, and
>>> wsgiref. I hope they get fixed soon. :(
>>
>> Indeed. Momentum in WEB-SIG seems to have ground to a halt; no-one
>> seems to want ownership of the issue. Very disappointing.
>
> There's also this bug report[3], but it doesn't directly propose the
> changes that I have.
>
> So: does anyone agree, or disagree, that cgi.FieldStorage should be
> changed to take byte streams, and many of the cgi and urllib.parse
> functions should become encoding-aware, preferably in time for Python
> 3.1?  The byte-stream change will break compatibility with Python
> 3.0, but I strongly feel that treating POST data as text is wrong and
> should not continue to be supported.
>
> -Miles Kaufmann
>
> [1]: http://mail.python.org/pipermail/python-dev/2009-April/088727.html
> [2]: http://stackoverflow.com/questions/540342/python-3-0-urllib
> [3]: http://bugs.python.org/issue4953
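
The first issue in the quoted message can be illustrated with a short
sketch. It assumes Python 3.2 or later, where urllib.parse.parse_qs
accepts encoding and errors parameters; those parameters did not exist
in the 3.0 library under discussion. The query string is the one from
the quoted message.

```python
from urllib.parse import parse_qs, unquote

# Query string from the quoted message; %E8 is "e-grave" in latin-1.
query = 'premier=un&deuxi%E8me=deux'

# Lower-level function: takes encoding and errors directly.
key = unquote('deuxi%E8me', encoding='latin-1', errors='strict')

# Higher-level function: encoding/errors are passed through to the
# unquoting step (assumes Python 3.2+, not the 3.0 behaviour).
fields = parse_qs(query, encoding='latin-1', errors='strict')

print(key)
print(fields)
```

With the default UTF-8 decoding, the lone %E8 byte is not valid and is
replaced rather than decoded, which is the failure the message
describes.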

Have you read:

  http://bugs.python.org/issue3300

This was referenced in a prior post here and is likely relevant. A lot
of the discussion about it happened on the developers list for Python
3.0.

I'm not sure why someone was taking issue with the WEB-SIG list over cgi
FieldStorage, as I don't recall us having any substantive discussion
about it or any problems it has.

Graham
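
For the second issue raised in the quoted message (treating POST data
as bytes rather than text), a minimal sketch follows. The helper name
read_form_body is hypothetical, io.BytesIO stands in for the binary
stream a CGI script would get from sys.stdin.buffer, and the encoding
parameter of parse_qs again assumes Python 3.2 or later.

```python
import io
from urllib.parse import parse_qs

def read_form_body(stream, content_length, encoding='utf-8'):
    """Hypothetical helper: read a urlencoded POST body as bytes."""
    # Read exactly the declared number of bytes from the binary stream.
    raw = stream.read(content_length)
    # A urlencoded body is ASCII with %-escapes, so decode it as ASCII
    # and let parse_qs apply the form's character set to the escapes.
    text = raw.decode('ascii')
    return parse_qs(text, encoding=encoding, errors='strict')

# Stand-in for sys.stdin.buffer and the CONTENT_LENGTH variable.
body = b'premier=un&deuxi%E8me=deux'
fake_stdin = io.BytesIO(body)
post_fields = read_form_body(fake_stdin, len(body), encoding='latin-1')
print(post_fields)
```

This covers only application/x-www-form-urlencoded bodies; a
multipart/form-data post would need each part handled separately, since
uploaded files may be binary or in a different encoding.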