Any reason why cStringIO in 2.5 behaves different from 2.4?

Stefan Behnel stefan.behnel-n05pAM at web.de
Fri Jul 27 02:39:27 EDT 2007


Stefan Scholl wrote:
> Stefan Behnel <stefan.behnel-n05pAM at web.de> wrote:
>> The XML is *not* well-formed if you pass Python unicode instead of a byte
>> encoded string. Read the XML spec.
> 
> Pointers, please.

There you have it:

http://www.w3.org/TR/xml/#charencoding

"""
In the absence of information provided by an external transport protocol (e.g.
HTTP or MIME), it is a *fatal error* for an entity including an encoding
declaration to be presented to the XML processor in an encoding other than
that named in the declaration, or for an entity which begins with neither a
Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.
"""

"""
Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with
the Byte Order Mark ...
"""

Python does not use BOMs internally (although that again may be platform
specific). You might argue that a Python Unicode string counts as some kind of
"external transport protocol" (I used that excuse when I implemented Unicode
parsing support in lxml), but Python's Unicode objects are strictly a
character stream, not a byte stream. XML is only defined for streams of bytes.
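
To make the character-stream versus byte-stream point concrete, here is a
minimal sketch. It assumes Python 3 (where str is the Unicode type and bytes
is the byte string) and the standard library's xml.etree.ElementTree rather
than lxml; the sample document and the variable names are made up for
illustration only.

```python
import xml.etree.ElementTree as ET

# Assumption: Python 3. This is Unicode text, i.e. a character stream.
# The declaration names an encoding, but no bytes exist yet.
text = '<?xml version="1.0" encoding="iso-8859-1"?><root>caf\u00e9</root>'

# Serialise to the encoding the declaration names, then parse the bytes.
data = text.encode("iso-8859-1")
matching = ET.fromstring(data).text

# Serialise to a different encoding than the one declared. The spec calls
# this a fatal error; the parser used here cannot always detect it and may
# hand back garbled text instead.
wrong = text.encode("utf-8")
mismatched = ET.fromstring(wrong).text

print(repr(matching))
print(repr(mismatched))
```

The first parse gives back the original text because bytes and declaration
agree. The second does not, which is the practical reason to hand a parser
bytes in the declared encoding rather than relying on whatever the in-memory
representation happens to be.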

Also, there is no requirement for an XML processor to be able to parse
anything but UTF-8 and UTF-16, especially if the encoding is *undefined* and
*platform-specific*, like that of a Python Unicode string.
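
As a rough illustration of the two encodings every processor has to accept,
the sketch below encodes the same text as UTF-8 and as UTF-16 with a Byte
Order Mark and parses both without any encoding declaration. Again this
assumes Python 3 and the standard library's xml.etree.ElementTree, not lxml,
and the sample text is invented.

```python
import codecs
import xml.etree.ElementTree as ET

# Assumption: Python 3. No encoding declaration in the document.
text = '<root>caf\u00e9</root>'

utf8_bytes = text.encode("utf-8")
# Python's generic "utf-16" codec prepends a BOM in native byte order.
utf16_bytes = text.encode("utf-16")
has_bom = utf16_bytes.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# Both byte streams are self-describing: UTF-8 is the default, and the BOM
# identifies UTF-16.
from_utf8 = ET.fromstring(utf8_bytes).text
from_utf16 = ET.fromstring(utf16_bytes).text

print(has_bom, repr(from_utf8), repr(from_utf16))
```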

Anything else I can help you understand?

Stefan


