just a bug (was: xml.dom.minidom: how to preserve CRLF's inside CDATA?)

Fri May 25 08:57:51 EDT 2007

On May 25, 12:03 pm, "sim.sim" <Maksim.Kasi... at gmail.com> wrote:
> On 25 May, 12:45, Marc 'BlackJack' Rintsch <bj_... at gmx.net> wrote:
>
> > In <1180082137.329142.45... at p77g2000hsh.googlegroups.com>, sim.sim wrote:
> > > Below is the code that tries to parse a well-formed XML document, but
> > > it fails with the error message:
> > > "not well-formed (invalid token): line 3, column 85"
>
> > How did you verify that it is well-formed?  `xmllint` barfs on it too.
>
> you can try to write iMessage to file and open it using Mozilla
> Firefox (web-browser)
>
> > > The "problem" is within the CDATA section: it contains part of a UTF-8
> > > encoded string which was split (a technique widely used for
> > > memory-limited devices).
>
> > > When minidom parses the XML string, it fails because it tries to convert
> > > the data within the CDATA section into unicode, instead of just returning
> > > the value of the section "as is". The conversion contradicts the
> > > specification http://www.w3.org/TR/REC-xml/#sec-cdata-sect
>
> > An XML document contains unicode characters, and so does the CDATA section.
> > CDATA is not meant to put arbitrary bytes into a document.  It must
> > contain valid characters of this type
> > http://www.w3.org/TR/REC-xml/#NT-Char (linked from the grammar of CDATA
> > in your link above).
>
> > Ciao,
> >         Marc 'BlackJack' Rintsch
>
> my CDATA-section contains only symbols in the range specified for
> Char:
> Char ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
> [#x10000-#x10FFFF]
>
> filter(lambda x: ord(x) not in range(0x20, 0xD7FF), iMessage)

You need to explicitly convert the string of UTF-8 encoded bytes to a
Unicode string before parsing, e.g.
unicodestring = unicode(encodedbytes, 'utf8')
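For anyone reading this on Python 3, where unicode() no longer exists,
the same conversion is spelled bytes.decode(). The sketch below is only
an illustration: the helper name and the sample word are made up, not
taken from the original message. It also shows how a strict decode
reports the offset of a byte sequence that was cut in the middle of a
character, which is one way to locate a stray byte.

```python
def first_bad_utf8_offset(raw: bytes):
    """Return the byte offset where strict UTF-8 decoding fails, or None."""
    try:
        raw.decode("utf-8")  # strict mode raises on malformed input
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# Hypothetical sample data: a Cyrillic word, whole and cut mid-character.
whole = "Привет".encode("utf-8")  # 6 characters, 2 bytes each
split = whole[:-1]                # the last character loses its second byte

# Python 3 spelling of unicode(whole, 'utf8')
text = whole.decode("utf-8")

print(first_bad_utf8_offset(whole))
print(first_bad_utf8_offset(split))
```

A string that was split on a byte boundary has to be reassembled (or the
partial character dropped) before it is decoded and placed in the XML.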

Unless I messed up copying and pasting, your original string had an
erroneous byte immediately before ]]>. With that corrected I was able
to process the string correctly - the CDATA marked section consists
entirely of spaces and Cyrillic characters. As I noted earlier, you
will lose \r characters as part of the basic XML processing.
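
The loss of \r comes from the end-of-line handling in the XML spec: the
parser turns \r\n (and a lone \r) into \n before the application sees
the data, and that applies inside CDATA too. A minimal sketch, written
for Python 3, with a made-up helper name and made-up sample documents;
the character-reference variant is just one possible workaround, and it
only works outside a CDATA section:

```python
from xml.dom import Node, minidom


def element_text(xml_bytes: bytes) -> str:
    """Parse a document and join the character data directly under its root."""
    doc = minidom.parseString(xml_bytes)
    return "".join(
        child.data
        for child in doc.documentElement.childNodes
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    )


# A literal CRLF inside CDATA is normalized by the parser.
crlf_in_cdata = element_text(b"<root><![CDATA[line1\r\nline2]]></root>")

# A carriage return written as a character reference is not normalized.
crlf_as_charref = element_text(b"<root>line1&#13;\nline2</root>")

print(repr(crlf_in_cdata))
print(repr(crlf_as_charref))
```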

HTH

Harvey