just a bug (was: xml.dom.minidom: how to preserve CRLF's inside CDATA?)

Fri May 25 07:03:05 EDT 2007

On 25 May, 12:45, Marc 'BlackJack' Rintsch <bj_... at gmx.net> wrote:
> In <1180082137.329142.45... at p77g2000hsh.googlegroups.com>, sim.sim wrote:
> > Below is the code that tries to parse a well-formed XML document, but
> > it fails with the error message:
> > "not well-formed (invalid token): line 3, column 85"
>
> How did you verify that it is well formed?  `xmllint` barfs on it too.

You can try writing iMessage to a file and opening it in Mozilla
Firefox (the web browser).

>
> > The "problem" is within the CDATA section: it contains part of a UTF-8
> > encoded string which was split (a widely used practice for
> > memory-limited devices).
>
> > When minidom parses the XML string, it fails because it tries to convert
> > the data within the CDATA section into unicode, instead of just returning
> > the value of the section "as is". The conversion contradicts the
> > specification http://www.w3.org/TR/REC-xml/#sec-cdata-sect
>
> An XML document contains unicode characters, and so does the CDATA section.
> CDATA is not meant to put arbitrary bytes into a document.  It must
> contain valid characters of this type
> http://www.w3.org/TR/REC-xml/#NT-Char (linked from the grammar of CDATA
> in your link above).
>
> Ciao,
>         Marc 'BlackJack' Rintsch
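
To illustrate the point quoted above, here is a minimal sketch (not from
the original thread) of how a UTF-8 sequence cut in half inside a CDATA
section affects an expat-based parser such as xml.dom.minidom. The sample
document and the names `whole`, `split` and `parses` are made up for this
illustration, and the code assumes Python 3.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample document: the Cyrillic letter "й" (U+0439) takes
# two bytes in UTF-8, 0xD0 0xB9.
whole = '<?xml version="1.0" encoding="utf-8"?><m><![CDATA[й]]></m>'.encode("utf-8")

# Drop the second byte of the letter, as a naive byte-level split would.
split = whole.replace(b"\xd0\xb9", b"\xd0")


def parses(data):
    """Return True if minidom accepts the bytes as well-formed XML."""
    try:
        xml.dom.minidom.parseString(data)
        return True
    except ExpatError:
        return False


print(parses(whole))
print(parses(split))
```

The parser decodes the whole document, CDATA included, according to the
declared encoding, so the lone lead byte is rejected before any node value
is built.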


My CDATA section contains only symbols in the range specified for
Char:
Char ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]


# range() excludes its upper bound, so 0xD800 is needed to cover #xD7FF
filter(lambda x: ord(x) not in range(0x20, 0xD800), iMessage)
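
A note on the check above: when it is applied to a byte string, ord(x) can
never exceed 255, so it says nothing about whether the bytes form complete
UTF-8 characters. The following is a rough sketch, assuming Python 3 and a
UTF-8 payload, of testing the Char production against decoded characters
instead. The helper names `is_xml_char` and `check_payload` are
hypothetical and do not come from the thread or from any library.

```python
def is_xml_char(ch):
    """Return True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def check_payload(raw):
    """Decode raw bytes as UTF-8 and list the characters outside Char.

    Raises UnicodeDecodeError if the bytes are not complete UTF-8,
    for example when a multi-byte sequence was split.
    """
    text = raw.decode("utf-8")
    return [ch for ch in text if not is_xml_char(ch)]


# Complete UTF-8 text with CRLF: every character is a valid Char.
print(check_payload("привет\r\n".encode("utf-8")))

# A NUL character decodes fine but is not a valid Char.
print(check_payload(b"a\x00b"))
```

A payload ending in a lone lead byte, such as b"\xd0", never reaches the
range test: the decode step raises UnicodeDecodeError first.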