[Python-Dev] Bytes path support

Isaac Morland ijmorlan at uwaterloo.ca
Mon Aug 25 18:46:46 CEST 2014


On Sat, 23 Aug 2014, Marko Rauhamaa wrote:

> Isaac Morland <ijmorlan at uwaterloo.ca>:
>
>>>  HTTP/1.1 200 OK
>>>  Content-Type: text/html; charset=ISO-8859-1
>>>
>>>  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
>>>  <html>
>>>  <head>
>>>  <meta http-equiv="Content-Type" content="text/html; charset=utf-16">
>>
>> For HTML it's not quite so bad.  According to the HTML 4 standard:
>> [...]
>>
>> The Content-Type header takes precedence over a <meta> element. I
>> thought I read once that the reason was to allow proxy servers to
>> transcode documents but I don't have a cite for that. Also, the <meta>
>> element "must only be used when the character encoding is organized
>> such that ASCII-valued bytes stand for ASCII characters" so the
>> initial UTF-16 example wouldn't be conformant in HTML.
>
> That's not how I read it:
>
>   The META declaration must only be used when the character encoding is
>   organized such that ASCII characters stand for themselves (at least
>   until the META element is parsed). META declarations should appear as
>   early as possible in the HEAD element.
>
>   <URL: http://www.w3.org/TR/1998/REC-html40-19980424/charset.html#doc-char-set>
>
> IOW, you must obey the HTTP character encoding until you have parsed a
> conflicting META content-type declaration.

From the same document:

--------------------------------------------------------------------------
To sum up, conforming user agents must observe the following priorities 
when determining a document's character encoding (from highest priority to 
lowest):

     An HTTP "charset" parameter in a "Content-Type" field.
     A META declaration with "http-equiv" set to "Content-Type" and a
       value set for "charset".
     The charset attribute set on an element that designates an external
       resource.
--------------------------------------------------------------------------

(In the original they are numbered)

This is a priority list: if the Content-Type header gives a charset, it 
takes precedence, and all other sources for the encoding are ignored.  The 
"charset=" attribute on an <a>, <link> or <script> element is only used if 
it is the only source for the encoding.
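
As a rough illustration of that priority order, here is a minimal Python 
sketch.  The function names (charset_param, pick_charset) and the regular 
expression are my own assumptions for the example, not anything from the 
specification or from a real library, and the parsing is deliberately 
simplistic.

```python
import re
from typing import Optional

# Hypothetical helper pattern: pulls "charset=..." out of a Content-Type value.
_CHARSET_RE = re.compile(r'charset\s*=\s*"?([^\s";]+)', re.IGNORECASE)


def charset_param(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    match = _CHARSET_RE.search(content_type)
    return match.group(1) if match else None


def pick_charset(http_content_type: Optional[str] = None,
                 meta_content: Optional[str] = None,
                 link_charset: Optional[str] = None) -> Optional[str]:
    """Apply the priority list: HTTP header, then META, then charset attribute."""
    candidates = (
        charset_param(http_content_type),  # 1. HTTP "charset" parameter
        charset_param(meta_content),       # 2. META http-equiv declaration
        link_charset,                      # 3. charset attribute on a linking element
    )
    for candidate in candidates:
        if candidate:
            return candidate
    return None
```

With the example from earlier in the thread, 
pick_charset("text/html; charset=ISO-8859-1", "text/html; charset=utf-16") 
picks the HTTP value and the META declaration is never consulted.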

The "at least until the META element is parsed" bit allows for the use of 
encodings which make use of shifting.  So maybe they start out 
ASCII-compatible, but after a particular shift byte is seen those bytes 
now stand for Japanese Kanji characters until another shift byte is seen. 
This is allowed by the specification, as long as none of the 
non-ASCII-compatible stuff is seen before the META element.
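
To make the shifting case concrete, here is a small sketch using Python's 
"iso2022_jp" codec as one example of such an encoding.  The sample document 
and the ascii_prefix helper are made up for illustration; the spec text 
does not name any particular encoding.

```python
ESC = b"\x1b"  # ISO-2022 shift sequences begin with the escape byte


def ascii_prefix(data: bytes) -> bytes:
    """Return the bytes that precede the first shift (escape) sequence."""
    index = data.find(ESC)
    return data if index == -1 else data[:index]


# Hypothetical document: the META element comes before any non-ASCII text.
head = '<meta http-equiv="Content-Type" content="text/html; charset=ISO-2022-JP">'
kanji = "\u6f22\u5b57"  # the two characters of the word "kanji"
document = head + "<p>" + kanji + "</p>"

encoded = document.encode("iso2022_jp")

# Everything up to the first shift sequence is plain ASCII, so a parser can
# read the META element before it knows the real encoding.
readable = ascii_prefix(encoded).decode("ascii")

# The whole stream is still 7-bit; after the shift the same byte values
# simply stand for different characters.
seven_bit = all(byte < 0x80 for byte in encoded)
```

Here "readable" holds the META element and the opening <p>, and decoding 
"encoded" with the same codec gives back the original document.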

> The author of the standard keeps a straight face and continues:

I like your way of putting this - "straight face" indeed.  The third 
option really is a hack to allow working around nonsensical situations 
(and even the META tag is pretty questionable).  All this complexity 
because people can't be bothered to do things properly.

>   For cases where neither the HTTP protocol nor the META element
>   provides information about the character encoding of a document, HTML
>   also provides the charset attribute on several elements. By combining
>   these mechanisms, an author can greatly improve the chances that,
>   when the user retrieves a resource, the user agent will recognize the
>   character encoding.

Isaac Morland			CSCF Web Guru
DC 2554C, x36650		WWW Software Specialist

