[Web-SIG] parsing of urlencoded data and Unicode

Tue Jul 29 22:04:04 CEST 2008

On Jul 29, 2008, at 3:18 PM, Deron Meranda wrote:

> In what way is RFC 2388 wrong or not MIME?
>
> Per RFC 2388 sect. 3:
>  "The media-type multipart/form-data follows the rules of all
>   multipart MIME data streams as outlined in [RFC 2046]."
>
> So it is MIME, right?

No: RFC 2388 says it is MIME, but in real life it is not. RFC 2388 is  
wrong.

> Now you can successfully argue that many user agents do not
> follow the RFC carefully enough.  But that's not a problem with
> the RFC itself.

Common practice is by now long established, and cannot simply be  
changed 10 years after the fact to conform to what the standard says  
it should've been. Therefore, it *is* now a problem with the standard:  
the standard is wrong. If you follow it, you're going to create  
totally broken software.

For instance, the RFC says you should treat form posts as 7bit unless
they carry a Content-Transfer-Encoding header. But that's an absolutely
nonsensical thing to do: your code would not work with any existing web
browser if you did. Or, if you're writing a web browser: don't even
think of using Content-Transfer-Encoding to encode your submission. Few
servers/frameworks would understand it if you tried.
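
To make the point above concrete, here is a minimal Python sketch of
how a part typically looks when a browser submits it: raw 8-bit bytes
and no Content-Transfer-Encoding header. The function name split_part
and the sample part are made up for illustration; this is not a full
multipart parser, just enough to show that the body has to be kept as
bytes rather than assumed to be 7bit.

```python
def split_part(part: bytes):
    """Split one multipart/form-data part into (headers, body).

    Hypothetical helper for illustration only. The body is returned
    as raw bytes, untouched, whatever the headers say or omit.
    """
    header_blob, _, body = part.partition(b"\r\n\r\n")
    headers = {}
    for line in header_blob.split(b"\r\n"):
        name, _, value = line.partition(b":")
        # latin-1 never fails, so odd header bytes cannot crash us here.
        key = name.strip().lower().decode("latin-1")
        headers[key] = value.strip().decode("latin-1")
    return headers, body


# A part as browsers commonly send it: UTF-8 bytes in the body and
# no Content-Transfer-Encoding header at all (sample data, assumed).
sample_part = (
    b'Content-Disposition: form-data; name="city"\r\n'
    b"\r\n"
    b"Z\xc3\xbcrich"
)

headers, body = split_part(sample_part)

# Reading the missing header as "7bit" would reject this body,
# because it contains bytes above 0x7F.
try:
    body.decode("ascii")
    is_seven_bit = True
except UnicodeDecodeError:
    is_seven_bit = False
```

Here is_seven_bit ends up False even though the part is exactly what a
real browser would send, which is why treating the body as opaque
bytes is the only workable default.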

> But, at this point, can one consider the content of form post to be  
> encoded "text" string?
>
> Or it should be considered encoded "byte" string?

I'd recommend treating it as an encoded byte string, certainly at the
lower levels. A higher-level API can look at the available hints to
figure out how to decode the non-file fields: e.g. if the magic
_charset_ parameter is present, use that; otherwise use whatever the
developer tells you they put in accept-charset, or the encoding they
sent the page in.
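
A rough sketch of that higher-level decoding step, in Python. The
function name decode_form_fields and the fallback_encoding parameter
are assumptions made for this example, not part of any existing
framework; fallback_encoding stands in for whatever the developer says
they used in accept-charset or for the page itself.

```python
import codecs


def decode_form_fields(raw_fields, fallback_encoding="utf-8"):
    """Decode non-file form fields from bytes to text.

    raw_fields maps field names (str) to raw byte values, as a
    lower-level parser might hand them over. Hypothetical API.
    """
    encoding = fallback_encoding

    # Hint 1: the magic _charset_ field, if the browser filled it in.
    hint = raw_fields.get("_charset_")
    if hint:
        try:
            candidate = hint.decode("ascii").strip()
            codecs.lookup(candidate)  # raises LookupError if unknown
        except (LookupError, ValueError):
            pass  # unusable hint: keep the developer-supplied fallback
        else:
            encoding = candidate

    # Hint 2 (the fallback) is already in `encoding` at this point.
    return {
        name: value.decode(encoding, "replace")
        for name, value in raw_fields.items()
        if name != "_charset_"
    }


# Example input: the browser reports windows-1252 via _charset_.
fields = decode_form_fields(
    {"_charset_": b"windows-1252", "name": b"caf\xe9"}
)
```

Undecodable bytes are replaced rather than raising, on the assumption
that a form handler would rather show a damaged value than fail the
whole request; a stricter API could pass "strict" instead.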

James