just a bug (done)

Maksim Kasimov maksim.kasimov at gmail.com
Fri May 25 12:03:44 EDT 2007


Carsten Haese:

> If you want to convey an arbitrary sequence of bytes as if they were
> characters, you need to pick a character encoding that can handle an
> arbitrary sequence of bytes. utf-8 can not do that. ISO-8859-1 can, but
> you need to specify the encoding explicitly. Observe what happens if I
> take your example and insert an encoding specification:
> 
>>>> iMessage = '<?xml version="1.0" encoding="ISO-8859-1"?>\n<message>\n
> <Data><![CDATA[\xd0\x94\xd0\xb0\xd0\xbd\xd0\xbd\xd1\x8b\xd0\xb5 \xd0\xbf
> \xd0\xbe\xd0\xbf\xd1\x83\xd0\xbb\xd1\x8f\xd1\x80\xd0\xbd\xd1\x8b\xd1\x85
> \xd0\xb7\xd0\xb0\xd0\xbf\xd1\x80\xd0\xbe\xd1\x81\xd0\xbe\xd0\xb2 \xd0
> \xbc\xd0\xbe\xd0\xb6\xd0\xbd\xd0\xbe \xd1\x83\xd1\x87\xd0\xb8\xd1\x82
> \xd1\x8b\xd0\xb2\xd0\xb0\xd1\x82\xd1\x8c \xd0\xbf\xd1\x80\xd0\xb8 \xd1
> \x81\xd0\xbe\xd0\xb1\xd1\x81\xd1\x82\xd0\xb2\xd0\xb5\xd0\xbd\xd0\xbd\xd1
> \x8b\xd1\x85 \xd1\x80\xd0\xb5\xd0\xba\xd0\xbb\xd0\xb0\xd0\xbc\xd0\xbd
> \xd1]]></Data>\n</message>\n\n'
>>>> minidom.parseString(iMessage)
> <xml.dom.minidom.Document instance at 0xb7c157ac>
> 
> Of course, when you extract your CDATA, it will come out as a unicode
> string which you'll have to encode with ISO-8859-1 to turn it into a
> sequence of bytes. Then you add the sequence of bytes from the next
> message, and in the end that should yield a valid utf-8-encoded string
> once you've collected and assembled all fragments.
> 
> Hope this helps,
> 


Hi Carsten! Thanks for your suggestion - the problem can indeed be fixed that way.
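
A minimal sketch of the round trip Carsten describes, assuming Python 3 and the standard library's minidom. The helper names wrap_fragment and extract_fragment are hypothetical, and the payload is just the first few bytes of the quoted example. Note that this only works for bytes XML itself allows: most control characters below 0x20 and a literal "]]>" inside the data would still break it.

```python
from xml.dom import minidom

def wrap_fragment(raw):
    # Hypothetical helper: declare ISO-8859-1 so that each byte maps to
    # exactly one character, even if the bytes are incomplete UTF-8.
    return (b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
            b'<message><Data><![CDATA[' + raw + b']]></Data></message>')

def extract_fragment(xml_bytes):
    # Hypothetical helper: parse, collect the CDATA text, and encode it
    # back with ISO-8859-1 to recover the original byte sequence.
    doc = minidom.parseString(xml_bytes)
    data = doc.getElementsByTagName('Data')[0]
    text = ''.join(node.data for node in data.childNodes)
    return text.encode('iso-8859-1')

# First bytes of the quoted message (UTF-8 encoded Cyrillic text).
payload = b'\xd0\x94\xd0\xb0\xd0\xbd\xd0\xbd\xd1\x8b\xd0\xb5'

# Split in the middle of a two-byte character, as a chunked sender might.
first, second = payload[:5], payload[5:]

# Each fragment travels in its own message; the receiver joins the bytes
# and decodes as UTF-8 only after all fragments have arrived.
assembled = (extract_fragment(wrap_fragment(first)) +
             extract_fragment(wrap_fragment(second)))
text = assembled.decode('utf-8')
```

The point of the design is that decoding to UTF-8 happens once, on the assembled bytes, never on a single fragment.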


BTW: I've found "xmlproc" and tried parsing the message with its command-line tool, xpcmd.py.
It gives me
"Parse complete, 0 error(s) and 0 warning(s)"

I did not specify the "ISO-8859-1" character encoding.

(But switching to that library is another problem: I would have to recode/retest/redoc/re* a lot of things.)
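
For comparison, a small sketch (assuming Python 3) of how the standard library's expat-based minidom treats the same kind of data when no encoding is declared. The helper is_well_formed is a hypothetical name, and the payload is a shortened version of the quoted example that ends in the middle of a UTF-8 character.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def is_well_formed(xml_bytes):
    # Hypothetical helper: True if minidom (expat) accepts the document.
    try:
        minidom.parseString(xml_bytes)
    except ExpatError:
        return False
    return True

# UTF-8 bytes cut off after the lead byte \xd1 of a two-byte character.
truncated = b'\xd0\x94\xd0\xb0\xd0\xbd\xd0\xbd\xd1'

def build(declaration):
    return (declaration + b'\n<message><Data><![CDATA[' +
            truncated + b']]></Data></message>')

# Without an encoding declaration the parser assumes UTF-8.
as_utf8 = is_well_formed(build(b'<?xml version="1.0"?>'))
as_latin1 = is_well_formed(
    build(b'<?xml version="1.0" encoding="ISO-8859-1"?>'))
```

Under these assumptions the first document is rejected and the second is accepted, which matches the behaviour discussed in this thread.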

the project homepage: http://www.garshol.priv.no/download/software/xmlproc/


And another thing: I opened my XML message in Mozilla again
and selected the "Page info" item in the pop-up menu. It shows me:
Content-Type: text/xml
Encoding: UTF-8


Many thanks for your attention and patience!

-- 
Maksim Kasimov


