Python 3 - xml - crlf handling problem

durumdara durumdara at gmail.com
Fri Dec 2 03:13:43 EST 2011


Dear Stefan!


So: maybe I don't understand this well, but I thought that the parser
drops the "non-data" CRLFs and other whitespace characters (it does not
preserve them).

Then it would not matter whether I read the XML from a file or create it
from code, because both would generate the SAME RESULT.
But Python doesn't do that.
If I build the XML from code, the output has no extra characters.
But Python preserves the parsed CRLF characters somewhere, and they are
also flushed into the result.

Example:

original='''
<?xml version="1.0" encoding="utf-8"?>
<doc a="1">
    <element a="1">
        AnyText
    </element>
</doc>
'''

If I parse this and write it with toxml, the CRLFs remain in the
output, but if I create this document step by step from code, there is
no CRLF, and toxml writes the XML on a single line.
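
A minimal sketch of the difference described above, assuming
xml.dom.minidom is the parser in use (as in the script further down).
The names SOURCE, parsed and built are only illustrative. It shows that
the indentation between tags survives parsing as separate text nodes,
while a document built from code contains only the nodes that were
created explicitly:

```python
from xml.dom.minidom import parseString, getDOMImplementation

SOURCE = """<doc a="1">
    <element a="1">
        AnyText
    </element>
</doc>"""

# Parsed document: the whitespace between the tags is kept as Text nodes.
parsed = parseString(SOURCE)
parsed_children = parsed.documentElement.childNodes
print([node.nodeName for node in parsed_children])

# Document built from code: no whitespace nodes unless we add them ourselves.
built = getDOMImplementation().createDocument(None, "doc", None)
root = built.documentElement
root.setAttribute("a", "1")
element = built.createElement("element")
element.setAttribute("a", "1")
element.appendChild(built.createTextNode("AnyText"))
root.appendChild(element)
built_children = root.childNodes
print(root.toxml())
```

The parsed <doc> has three children (text, element, text), the built
one has a single element child, so toxml has no line breaks to write.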

This also means that if I call toprettyxml to prettify the XML, the
file size grows.

If there is a multi-step processing queue, for example two Python
programs communicating through XML files, the size can grow every time.

Py1 - reads Py2's file, processes it, and writes a result file
Py2 - reads Py1's result file, processes it, and passes it back to Py1
This can grow the file with each call, because the "pretty" CRLFs are
not normalized out of the document.

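import sys
from xml.dom.minidom import parse

original='''
<?xml version="1.0" encoding="utf-8"?>
<doc a="1">
    <element a="1">
        AnyText
    </element>
</doc>
'''

def main():
    # Write the template; strip() keeps the XML declaration at the very start.
    with open('test.0.xml', 'w') as f:
        f.write(original.strip())

    for i in range(1, 10 + 1):
        # Parse the output of the previous round.
        xo = parse('test.%d.xml' % (i - 1))
        de = xo.documentElement
        de.setAttribute('c', str(i))
        t = de.getElementsByTagName('element')[0]
        tn = t.childNodes[0]  # the text node inside <element>
        print(dir(t))
        print(tn)
        print(tn.nodeValue)
        tn.nodeValue = str(i) + '\t' + '\n'
        #s = xo.toxml()
        s = xo.toprettyxml()
        # Write this round's result; the next round reads it back.
        with open('test.%d.xml' % i, 'w') as f:
            f.write(s)

    sys.exit()

if __name__ == '__main__':
    main()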
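
The growth can also be checked in memory, without the test.N.xml files.
This is only a sketch under the same assumption (xml.dom.minidom with
its default toprettyxml settings); round_trip_sizes is a hypothetical
helper name, not part of any library:

```python
from xml.dom.minidom import parseString

SOURCE = """<doc a="1">
    <element a="1">
        AnyText
    </element>
</doc>"""

def round_trip_sizes(xml_text, rounds):
    """Parse and pretty-print repeatedly, recording the output length each round."""
    sizes = []
    for _ in range(rounds):
        # Each pass re-indents the whitespace text nodes kept by the parser.
        xml_text = parseString(xml_text).toprettyxml()
        sizes.append(len(xml_text))
    return sizes

sizes = round_trip_sizes(SOURCE, 5)
print(sizes)
```

Each round wraps the already preserved whitespace nodes in new
indentation and line breaks, so the recorded lengths keep increasing.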

And: because Python does not convert CR to &#13;, I cannot tell the
difference between the CRLFs of a prettied source (loaded from a
template file), the CRLFs of my own pretty-printing (my own toprettyxml
call), and CRLFs that are really part of the data (for example a memo
field's value).

My case is that the processing application (to which I pass the XML
from Python) is sensitive to extra CRLFs in text nodes, so I must do
something with these extra items to avoid errors in the external
program.

I get these templates and input files in prettied format (with
CRLFs), but I must "eat" them to produce XML that is on one line if
possible.
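
One possible way to "eat" the layout whitespace, sketched under the
assumption that xml.dom.minidom is used. strip_layout_whitespace is a
hypothetical helper, not a library function. It removes whitespace-only
text nodes and trims the remaining ones, so it cannot tell layout CRLFs
from CRLFs that belong to the data at the edges of a text node (the
memo field case):

```python
from xml.dom import Node
from xml.dom.minidom import parseString

def strip_layout_whitespace(node):
    """Drop whitespace-only text nodes and trim the other text nodes, recursively."""
    for child in list(node.childNodes):  # copy: we remove while iterating
        if child.nodeType == Node.TEXT_NODE:
            if child.data.strip() == "":
                node.removeChild(child)
            else:
                child.data = child.data.strip()
        elif child.nodeType == Node.ELEMENT_NODE:
            strip_layout_whitespace(child)

PRETTY = """<doc a="1">
    <element a="1">
        AnyText
    </element>
</doc>"""

document = parseString(PRETTY)
strip_layout_whitespace(document)
one_line = document.documentElement.toxml()
print(one_line)

# Stripping again after a pretty-printed round trip gives the same result.
again = parseString(document.toprettyxml())
strip_layout_whitespace(again)
stable = again.documentElement.toxml()
```

Running the helper after every parse keeps the output from growing,
because the pretty-printing added by one program is removed before the
next one writes the file.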

I hope you understand my problem.

Thanks:
   dd


