just a bug

Carsten Haese carsten at uniqsys.com
Fri May 25 11:07:34 EDT 2007


On Fri, 2007-05-25 at 17:30 +0300, Maksim Kasimov wrote:
> I insist - my message is correct and not contradicts no any point of w3.org xml-specification.

The fact that you believe this so strongly and we disagree just as
strongly indicates a fundamental misunderstanding: you are conflating
bytes with Unicode code points.

The content of an XML document is a sequence of Unicode code points,
encoded into a sequence of bytes using some character encoding. The
<?xml...?> header should identify that encoding. In the absence of an
explicit encoding declaration, the parser falls back to a default. In
your case, the declaration is absent, so the parser assumes UTF-8, but
your string is not a valid UTF-8 string: it ends in the middle of a
multi-byte character.
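
To make the failure concrete, here is a minimal sketch. It assumes
Python 3, where bytes and str are separate types, and it uses a short
made-up fragment rather than your full message: one complete Cyrillic
letter followed by a lone lead byte, which is how your CDATA ends.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# One complete two-byte character ("\xd0\xbd") followed by a lone lead
# byte ("\xd1") whose continuation byte is missing.
fragment = b"\xd0\xbd\xd1"

# Decoding the fragment on its own as UTF-8 fails.
try:
    fragment.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# No encoding declaration, so the parser treats the bytes as UTF-8 and
# rejects the same truncated sequence inside the CDATA section.
document = (b'<?xml version="1.0"?>\n<message><Data><![CDATA['
            + fragment + b"]]></Data></message>")
try:
    minidom.parseString(document)
    parsed_ok = True
except ExpatError:
    parsed_ok = False

print(decoded_ok, parsed_ok)
```

Both checks fail for the same reason: the bytes are not a complete
UTF-8 encoding of any sequence of code points.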

If you want to convey an arbitrary sequence of bytes as if they were
characters, you need to pick a character encoding in which every byte
sequence is valid. UTF-8 is not such an encoding. ISO-8859-1 is, because
it maps each byte value to exactly one code point, but you need to
specify the encoding explicitly. Observe what happens if I take your
example and insert an encoding specification:

>>> from xml.dom import minidom
>>> iMessage = ('<?xml version="1.0" encoding="ISO-8859-1"?>\n<message>\n'
...     '<Data><![CDATA[\xd0\x94\xd0\xb0\xd0\xbd\xd0\xbd\xd1\x8b\xd0\xb5 \xd0\xbf'
...     '\xd0\xbe\xd0\xbf\xd1\x83\xd0\xbb\xd1\x8f\xd1\x80\xd0\xbd\xd1\x8b\xd1\x85'
...     '\xd0\xb7\xd0\xb0\xd0\xbf\xd1\x80\xd0\xbe\xd1\x81\xd0\xbe\xd0\xb2 \xd0'
...     '\xbc\xd0\xbe\xd0\xb6\xd0\xbd\xd0\xbe \xd1\x83\xd1\x87\xd0\xb8\xd1\x82'
...     '\xd1\x8b\xd0\xb2\xd0\xb0\xd1\x82\xd1\x8c \xd0\xbf\xd1\x80\xd0\xb8 \xd1'
...     '\x81\xd0\xbe\xd0\xb1\xd1\x81\xd1\x82\xd0\xb2\xd0\xb5\xd0\xbd\xd0\xbd\xd1'
...     '\x8b\xd1\x85 \xd1\x80\xd0\xb5\xd0\xba\xd0\xbb\xd0\xb0\xd0\xbc\xd0\xbd'
...     '\xd1]]></Data>\n</message>\n\n')
>>> minidom.parseString(iMessage)
<xml.dom.minidom.Document instance at 0xb7c157ac>
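
The reason ISO-8859-1 works here is that it maps byte value N to code
point N for all 256 byte values. A small sketch of that property,
assuming Python 3 (the variable names are only illustrative):

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding with ISO-8859-1 never fails: each byte becomes the code point
# with the same number.
as_text = all_bytes.decode("iso-8859-1")
code_points_match = [ord(ch) for ch in as_text] == list(all_bytes)

# Encoding the text again gives back exactly the original bytes.
round_trip_ok = as_text.encode("iso-8859-1") == all_bytes

print(code_points_match, round_trip_ok)
```

Note that this only shows the encoding is lossless. XML itself still
forbids most control characters below 0x20, so a payload containing
those bytes would be rejected regardless of the declared encoding.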

Of course, when you extract your CDATA, it will come out as a Unicode
string, which you'll have to encode with ISO-8859-1 to turn it back into
a sequence of bytes. Then you append the sequence of bytes from the next
message, and once you've collected and assembled all fragments, the
result should be a valid UTF-8-encoded string.
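
The reassembly described above could look roughly like this. It is a
sketch under assumptions, not your actual code: it assumes Python 3,
the helpers wrap() and extract_payload() are hypothetical names I made
up for illustration, and the payload must not contain "]]>" or
characters that XML forbids.

```python
from xml.dom import minidom


def wrap(payload: bytes) -> bytes:
    """Hypothetical helper: put raw bytes into a message declared as ISO-8859-1."""
    return (b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
            b"<message><Data><![CDATA[" + payload + b"]]></Data></message>")


def extract_payload(document: bytes) -> bytes:
    """Hypothetical helper: parse one message and return the CDATA content as bytes."""
    dom = minidom.parseString(document)
    data = dom.getElementsByTagName("Data")[0]
    # The parser hands back text (code points); join all pieces of it.
    text = "".join(node.data for node in data.childNodes)
    # Each code point is below 256, so ISO-8859-1 restores the original bytes.
    return text.encode("iso-8859-1")


# UTF-8 bytes of a short Cyrillic word, split in the middle of a character.
original = b"\xd0\x94\xd0\xb0\xd0\xbd\xd0\xbd\xd1\x8b\xd0\xb5"
fragments = [original[:3], original[3:]]

# Each fragment travels in its own message; neither is valid UTF-8 alone.
messages = [wrap(part) for part in fragments]

# Recover the bytes of every fragment, concatenate, and only then decode.
reassembled = b"".join(extract_payload(msg) for msg in messages)
text = reassembled.decode("utf-8")
print(text)
```

The important point is the order of operations: decode as UTF-8 only
after all fragments have been joined back into one byte string.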

Hope this helps,

-- 
Carsten Haese
http://informixdb.sourceforge.net




