just a bug (was: xml.dom.minidom: how to preserve CRLF's inside CDATA?)

sim.sim Maksim.Kasimov at gmail.com
Fri May 25 04:35:37 EDT 2007


On 22 May, 16:45, "sim.sim" <Maksim.Kasi... at gmail.com> wrote:
> Hi all.
> I've run into trouble using minidom:
>
> # I have a string (XML) with a CDATA section, and the section includes
> # "\r\n":
> iInStr = ('<?xml version="1.0"?>\n<Data><![CDATA[BEGIN:VCALENDAR\r\n'
>           'END:VCALENDAR\r\n]]></Data>\n')
>
> # After I create the DOM object, I get the value of "Data" without "\r\n":
>
> from xml.dom import minidom
> iDoc = minidom.parseString(iInStr)
> # The next expression gives u'BEGIN:VCALENDAR\nEND:VCALENDAR\n'
> iDoc.childNodes[0].childNodes[0].data
>
> According to http://www.w3.org/TR/REC-xml/#sec-line-ends
>
> this looks normal, but another part of the specification says that "only
> the CDEnd string is recognized as markup":
> http://www.w3.org/TR/REC-xml/#sec-cdata-sect
>
> so the parser must (IMHO) give the value of the CDATA section "as is"
> (the two parts of the specification do not contradict each other).
>
> How can I get the value of a CDATA section with all the characters
> inside it preserved? (Perhaps use another parser, and if so, which one?)
>
> Many thanks for any help.
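
For reference, the line-end behaviour described in the quoted message can be
reproduced with a short sketch. It is written for Python 3, the helper name
node_text is made up for illustration, and the two workarounds are only
options under stated assumptions, not something the original message proposes.
The first assumes every line end in the payload should be CRLF (as in
iCalendar data); the second assumes the sender can write CR as the character
reference &#13; outside a CDATA section.

```python
from xml.dom import minidom


def node_text(node):
    """Concatenate the data of all text/CDATA children of a node (hypothetical helper)."""
    return "".join(child.data for child in node.childNodes)


# CRLF inside CDATA: the parser normalizes line ends before the
# application sees them, so "\r\n" arrives as "\n".
cdata_xml = ('<?xml version="1.0"?>\n<Data><![CDATA[BEGIN:VCALENDAR\r\n'
             'END:VCALENDAR\r\n]]></Data>\n')
normalized = node_text(minidom.parseString(cdata_xml).documentElement)

# Workaround 1 (assumption: every line end in the payload should be CRLF):
# put the carriage returns back after parsing.
restored = normalized.replace("\n", "\r\n")

# Workaround 2 (assumption: the sender controls the XML): write CR as a
# character reference outside CDATA; character references are not subject
# to line-end normalization.
charref_xml = ('<?xml version="1.0"?>\n<Data>BEGIN:VCALENDAR&#13;\n'
               'END:VCALENDAR&#13;\n</Data>\n')
preserved = node_text(minidom.parseString(charref_xml).documentElement)
```

Character references are not recognized inside a CDATA section, which is why
the second variant has to drop the CDATA wrapper.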


Hi all, I have another problem with minidom, and now it is really
critical.

Below is code that tries to parse a well-formed XML document, but it
fails with the error message:
"not well-formed (invalid token): line 3, column 85"


from xml.dom import minidom

iMessage = "3c3f786d6c2076657273696f6e3d22312e30223f3e0a3c6d657373616\
7653e0a202020203c446174613e3c215b43444154415bd094d0b0d0bdd0bdd18bd0b5\
20d0bfd0bed0bfd183d0bbd18fd180d0bdd18bd18520d0b7d0b0d0bfd180d0bed181d\
0bed0b220d0bcd0bed0b6d0bdd0be20d183d187d0b8d182d18bd0b2d0b0d182d18c20\
d0bfd180d0b820d181d0bed0b1d181d182d0b2d0b5d0bdd0bdd18bd18520d180d0b5d\
0bad0bbd0b0d0bcd0bdd15d5d3e3c2f446174613e0a3c2f6d6573736167653e0a0a".\
decode('hex')

iMsgDom = minidom.parseString(iMessage)


The "problem" is inside the CDATA section: it contains part of a UTF-8
encoded string that was split into pieces (a technique widely used on
memory-limited devices).
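
To make the splitting concrete, here is a minimal Python 3 sketch. The sample
text and the split position are made up for illustration and are not the
payload from the message above. It shows that a chunk cut in the middle of a
multi-byte character is not valid UTF-8 on its own, and that the standard
library's incremental decoder can reassemble the text once all chunks have
arrived.

```python
import codecs

text = "Привет"                   # hypothetical sample, 6 characters
data = text.encode("utf-8")       # 12 bytes, two per character
chunks = [data[:5], data[5:]]     # split in the middle of the third character

# The first chunk alone ends with a lone lead byte, so it cannot be decoded.
try:
    chunks[0].decode("utf-8")
    first_chunk_valid = True
except UnicodeDecodeError:
    first_chunk_valid = False

# An incremental decoder buffers the incomplete trailing byte and
# completes the character when the next chunk is fed in.
decoder = codecs.getincrementaldecoder("utf-8")()
parts = [decoder.decode(chunk) for chunk in chunks]
parts.append(decoder.decode(b"", final=True))
reassembled = "".join(parts)
```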

When minidom parses the XML string, it fails because it tries to
convert the data inside the CDATA section to Unicode, instead of just
returning the value of the section "as is". This conversion contradicts
the specification:
http://www.w3.org/TR/REC-xml/#sec-cdata-sect
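
One possible transport-level workaround, offered only as a sketch and not as
a change to minidom: if the sender can base64-encode each raw chunk before
placing it in the CDATA section, the document contains only ASCII and the
receiver gets the exact bytes back. This assumes both ends can be changed.
The wrap helper, the sample text and the element names are hypothetical.

```python
import base64
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def wrap(payload):
    """Build a small XML document around a bytes payload (hypothetical helper)."""
    return (b'<?xml version="1.0"?>\n<message><Data><![CDATA['
            + payload + b']]></Data></message>\n')


# A chunk cut in the middle of a two-byte character (made-up sample text).
chunk = "Данные".encode("utf-8")[:-1]

# Sent raw, the incomplete byte sequence makes the document unparsable.
try:
    minidom.parseString(wrap(chunk))
    raw_ok = True
except ExpatError:
    raw_ok = False

# Sent as base64, the payload is plain ASCII and survives parsing intact.
doc = minidom.parseString(wrap(base64.b64encode(chunk)))
encoded = doc.getElementsByTagName("Data")[0].firstChild.data
recovered = base64.b64decode(encoded)
```

The cost is roughly a one-third increase in payload size, which may matter on
the memory-limited devices mentioned above.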


So my question is still open:

How can I get the value of a CDATA section with all the characters
inside it preserved? (Perhaps use another parser, and if so, which one?)

Thanks for any help.

Maksim



