[Web-SIG] parsing of urlencoded data and Unicode

Manlio Perillo manlio_perillo at libero.it
Tue Jul 29 18:39:10 CEST 2008


Bill Janssen ha scritto:
>>> That's probably wrong.  We went through this recently on the
>>> python-dev list.  While it's possible to tell the encoding of
>>> multipart/form-data, 
>> With multipart/form-data the problem should be the same.
>> The content type is defined only for file fields.
> 
> Actually, it's defined for all fields, isn't it?  From RFC 2388:
> 
> ``As with all multipart MIME types, each part has an optional
> "Content-Type", which defaults to text/plain.''
> 
> So the type is "text/plain" unless it says something else.  And,
> according to RFC 2046, the default charset for "text/plain" is
> "US-ASCII".
> 

That is fine in theory. But in practice:

<form action="" method="post" accept-charset="utf-8"
       enctype="multipart/form-data">


Content-Type: multipart/form-data; boundary=abcde

--abcde
Content-Disposition: form-data; name="Title"

hello
--abcde
Content-Disposition: form-data; name="body"

àèìòù
--abcde--


In theory I should assume the body field is ASCII-encoded; and since 
that data cannot be decoded as ASCII, I should treat it as a byte string.
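In modern Python 3 terms, that exchange can be reproduced with the standard email parser. This is only a sketch: the wire format is hand-built for illustration, and the non-ASCII field value is assumed to be the Italian accented vowels.

```python
# Sketch: parse a hand-built multipart/form-data body with the stdlib
# email package and show that the non-ASCII field defeats the RFC 2046
# default charset (US-ASCII).
import email
from email.policy import default

raw = (
    b"Content-Type: multipart/form-data; boundary=abcde\r\n\r\n"
    b"--abcde\r\n"
    b'Content-Disposition: form-data; name="Title"\r\n\r\n'
    b"hello\r\n"
    b"--abcde\r\n"
    b'Content-Disposition: form-data; name="body"\r\n\r\n'
    + "\u00e0\u00e8\u00ec\u00f2\u00f9".encode("utf-8") + b"\r\n"
    b"--abcde--\r\n"
)

msg = email.message_from_bytes(raw, policy=default)
for part in msg.iter_parts():
    name = part.get_param("name", header="content-disposition")
    # No explicit Content-Type on the part, so this defaults to "text/plain".
    ctype = part.get_content_type()
    payload = part.get_payload(decode=True)  # raw bytes
    try:
        text = payload.decode("ascii")       # RFC 2046 default charset
    except UnicodeDecodeError:
        text = payload                       # keep it as a byte string
    print(name, ctype, repr(text))
```

The Title field decodes cleanly as US-ASCII, while the body field comes back as a byte string even though the browser actually sent UTF-8.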

However, the body field is actually encoded in UTF-8; and if I add a 
hidden _charset_ field, Firefox and IE include that field in the 
submission, set to the charset used for the encoding.
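A tolerant decoder can honour the charset advertised by such a hidden _charset_ field and fall back gracefully. A minimal sketch; decode_field and charset_hint are illustrative names, not an existing API:

```python
def decode_field(raw, charset_hint=None):
    # Try the charset advertised by a hidden _charset_ field first,
    # then the RFC 2046 default (US-ASCII); if both fail, return the
    # raw byte string and let the application decide what to do.
    for encoding in (charset_hint, "ascii"):
        if encoding is None:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            pass
    return raw
```

With a hint of "utf-8" the accented body decodes to a Unicode string; without the hint, the same bytes fail the ASCII pass and come back unchanged as a byte string.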


I think that it is safe to decode the QUERY_STRING and the POST data 
to Unicode, and to return 400 Bad Request in case of errors.
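That policy (decode to Unicode, answer 400 Bad Request on failure) can be sketched in Python 3 terms with the stdlib; query_to_unicode is an illustrative name, not the wsgix API:

```python
from urllib.parse import parse_qs


def query_to_unicode(environ, encoding="utf-8"):
    # errors="strict" makes malformed percent-encoded byte sequences
    # raise UnicodeDecodeError; a WSGI app can catch that and answer
    # 400 Bad Request instead of silently guessing a charset.
    qs = environ.get("QUERY_STRING", "")
    return parse_qs(qs, encoding=encoding, errors="strict")
```

A well-formed query such as "q=caf%C3%A8" decodes to Unicode; a query carrying bytes that are invalid in the expected charset, such as "q=%e8" under UTF-8, raises and can be mapped to 400.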

If the user has specialized needs, they can use the low-level parsing 
functions.

In wsgix the high-level functions are parse_query_string and 
parse_simple_post_data; the low-level function is parse_qs.

 > [...]



Thanks   Manlio Perillo

