xml.dom.minidom: how to preserve CRLF's inside CDATA?

Tue May 22 11:22:47 EDT 2007

On May 22, 8:45 am, "sim.sim" <Maksim.Kasi... at gmail.com> wrote:
> Hi all.
> I've run into trouble using minidom:
>
> # I have an XML string with a CDATA section, and the section includes
> # "\r\n":
> iInStr = ('<?xml version="1.0"?>\n'
>           '<Data><![CDATA[BEGIN:VCALENDAR\r\nEND:VCALENDAR\r\n]]></Data>\n')
>
> # After I create the DOM object, the value of "Data" comes back with
> # "\r\n" replaced by "\n"
>
> from xml.dom import minidom
> iDoc = minidom.parseString(iInStr)
> iDoc.childNodes[0].childNodes[0].data
> # gives u'BEGIN:VCALENDAR\nEND:VCALENDAR\n'
>
> According to http://www.w3.org/TR/REC-xml/#sec-line-ends
>
> this looks normal, but another part of the specification says that "only
> the CDEnd string is recognized as markup": http://www.w3.org/TR/REC-xml/#sec-cdata-sect
>
> so the parser must (IMHO) give the value of the CDATA section "as is" (and
> the two parts of the document would then not contradict each other).
>
> How can I get the value of the CDATA section with all characters
> preserved? (Perhaps by using another parser; which one?)
>
> Many thanks for any help.

I think the bare "\n" line ending is the *nix convention. So if you're
running this on Windows, Python will translate it automatically to
"\r\n" when it writes files in text mode. According to Lutz's book,
Programming Python, 3rd Ed., the difference is historical. It says that
most text editors handle text in Unix format, with the exception of
Notepad, which is why some documents are displayed as just one long
line in Notepad (see p. 150 of the book).
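
To make that translation concrete, here is a minimal sketch. It assumes
a current Python 3 (the code quoted above is Python 2), and the helper
name write_text_read_bytes is made up for illustration. Passing
newline="\r\n" to open() mimics what text mode does by default on
Windows, and newline="" switches the translation off.

```python
import os
import tempfile


def write_text_read_bytes(text, newline):
    """Write text in text mode with the given newline setting,
    then return the raw bytes that ended up in the file."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", newline=newline) as f:
            f.write(text)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


# Each "\n" written is translated to "\r\n", as on Windows by default.
windows_style = write_text_read_bytes("a\nb\n", "\r\n")
# newline="" disables translation, so the bytes match the string.
untranslated = write_text_read_bytes("a\nb\n", "")

print(repr(windows_style))
print(repr(untranslated))
```

This only concerns file I/O. It says nothing about what an XML parser
does with line ends inside a document.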

The book goes on to show a script that checks the line-ending
character and fixes it for the platform you're running on. The
following recipe seems to do something along those lines as well.

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/435882
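
Something along those lines might look like the sketch below. This is
not the book's script or the recipe's code, just a hypothetical helper
(convert_line_endings is my own name) that rewrites every line ending
to a chosen one. It is applied here to the CDATA text from the question
to put the "\r\n" back after parsing. It assumes every line break in
the data is meant to be CRLF, which seems reasonable for the VCALENDAR
text above but would not hold for arbitrary data.

```python
import os
import re
from xml.dom import minidom


def convert_line_endings(text, newline=os.linesep):
    """Rewrite every CRLF, lone CR, or lone LF in text as newline."""
    # A function is used as the replacement so that newline is
    # inserted literally, whatever characters it contains.
    return re.sub(r"\r\n|\r|\n", lambda match: newline, text)


xml_text = ('<?xml version="1.0"?>\n'
            '<Data><![CDATA[BEGIN:VCALENDAR\r\nEND:VCALENDAR\r\n]]></Data>\n')

doc = minidom.parseString(xml_text)
parsed = doc.documentElement.firstChild.data

# Whatever line ends the parser handed back, force them to CRLF.
restored = convert_line_endings(parsed, "\r\n")

print(repr(parsed))
print(repr(restored))
```

Calling convert_line_endings(text) with no second argument uses
os.linesep, i.e. whatever the platform you're running on prefers.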

Not exactly helpful, but maybe it'll give you some insight into the
issue.

Mike