xml.dom.minidom: how to preserve CRLF's inside CDATA?

harvey.thomas at informa.com
Tue May 22 11:23:56 EDT 2007


On May 22, 2:45 pm, "sim.sim" <Maksim.Kasi... at gmail.com> wrote:
> Hi all.
> I'm having trouble using minidom:
>
> # I have a string (XML) with a CDATA section, and the section includes "\r\n":
> iInStr = '<?xml version="1.0"?>\n<Data><![CDATA[BEGIN:VCALENDAR\r\nEND:VCALENDAR\r\n]]></Data>\n'
>
> # After I create the DOM object, I get the value of "Data" without the "\r\n":
>
> from xml.dom import minidom
> iDoc = minidom.parseString(iInStr)
> iDoc.childNodes[0].childNodes[0].data  # gives u'BEGIN:VCALENDAR\nEND:VCALENDAR\n'
>
> According to http://www.w3.org/TR/REC-xml/#sec-line-ends
>
> this looks normal, but another part of the specification says that "only
> the CDEnd string is recognized as markup": http://www.w3.org/TR/REC-xml/#sec-cdata-sect
>
> so the parser must (IMHO) return the value of the CDATA section "as is"
> (the two parts of the specification do not contradict each other).
>
> How can I get the value of the CDATA section with all characters
> preserved? (Perhaps by using another parser; which one?)
>
> Many thanks for any help.

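You will lose the \r characters, even inside CDATA: end-of-line
normalization is applied to the input before it is parsed, so it happens
before any CDATA section is recognized. From the document you referred to: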
"""
This section defines some symbols used widely in the grammar.

S (white space) consists of one or more space (#x20) characters,
carriage returns, line feeds, or tabs.

White Space
[3]    S    ::=    (#x20 | #x9 | #xD | #xA)+

Note:

The presence of #xD in the above production is maintained purely for
backward compatibility with the First Edition. As explained in 2.11
End-of-Line Handling, all #xD characters literally present in an XML
document are either removed or replaced by #xA characters before any
other processing is done. The only way to get a #xD character to match
this production is to use a character reference in an entity value
literal.
"""
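
To illustrate the point, here is a minimal sketch, assuming Python 3 and the
standard xml.dom.minidom parser. The helper name data_text and the variable
names are made up for this example. It shows the literal CRLF inside CDATA
being normalized to LF, a &#13; character reference (outside CDATA, since
references are not recognized inside it) surviving as a carriage return, and
a simple post-processing workaround. That workaround assumes every line
ending in the original data was CRLF, as is the case for iCalendar data.

```python
from xml.dom import minidom


def data_text(xml_text):
    """Return the concatenated character data of the root element (hypothetical helper)."""
    doc = minidom.parseString(xml_text)
    # Join all child nodes in case the parser splits the text into pieces.
    return ''.join(node.data for node in doc.documentElement.childNodes)


# Literal CRLF inside a CDATA section, as in the original question.
cdata_xml = ('<?xml version="1.0"?>\n'
             '<Data><![CDATA[BEGIN:VCALENDAR\r\nEND:VCALENDAR\r\n]]></Data>\n')

# Same content as ordinary character data, with each CR written as a
# character reference. This does not work inside CDATA.
charref_xml = ('<?xml version="1.0"?>\n'
               '<Data>BEGIN:VCALENDAR&#13;\nEND:VCALENDAR&#13;\n</Data>\n')

normalized = data_text(cdata_xml)    # the literal \r characters are gone
preserved = data_text(charref_xml)   # the referenced \r characters survive

# Workaround when the document cannot be changed: restore CRLF afterwards.
# Assumption: the original data used CRLF for every line ending.
restored = normalized.replace('\n', '\r\n')

print(repr(normalized))
print(repr(preserved))
print(repr(restored))
```

If you control how the XML is produced, writing the carriage returns as
character references is the only way to have the parser itself hand them
back. Otherwise, converting the line endings after parsing is probably the
simplest option.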



