Problem [bug?] with minidom
Paul Nilsson
python.newsgroup at REMOVE.paulnilsson.com
Fri Feb 7 20:25:38 EST 2003
Hello, I was parsing a rather large XML file when I came across text
nodes being unexpectedly sliced in two.
The problem *only* seems to happen when I load from a file directly
with parse, not if I parse a string or preload the file into a string
manually. Below is some code (and output) which illustrates the
behaviour [this will create a small file called "junk2634.xml" when
run].
I don't know whether this problem is a bug or a legal parse which I
have to look out for. I noticed that the DOM docs with Python
describe a normalize() member function for Node objects, and calling it
seems to fix the problem. However, if this problem is so common that it
requires a member function, I'm surprised the lib doesn't call it
automatically following a parse (is there some reason one wouldn't want
to call it?).
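For illustration, here is a minimal sketch of what normalize() does. It does not reproduce the file-parsing problem itself. Instead it assumes the split can be imitated by appending two adjacent text nodes by hand (11 and 39 characters, chosen to match the 11-character fragment in the output), which the DOM permits. The variable names are made up for this example.

```python
import xml.dom.minidom

# Start from an empty <data/> element and split its text by hand,
# imitating a parser that delivers one text run in two pieces.
doc = xml.dom.minidom.parseString("<root><data/></root>")
data = doc.getElementsByTagName("data")[0]
data.appendChild(doc.createTextNode("a" * 11))
data.appendChild(doc.createTextNode("a" * 39))

before = len(data.childNodes)   # two adjacent text nodes

# normalize() merges adjacent text nodes throughout the subtree.
doc.normalize()

after = len(data.childNodes)    # a single merged text node
text = data.childNodes[0].data

print("%d %d %d" % (before, after, len(text)))
doc.unlink()
```

If this sketch is right, childNodes[0].data is only the complete text once normalize() has run on the node or one of its ancestors.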
I also find it interesting that it only happens when loading files, so I
suspect it may be a buffering problem. However, it's too late for
source diving, maybe tomorrow :)
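Whatever the cause turns out to be, one way to avoid depending on how the parser chunks its input is to join every text child of an element instead of reading only childNodes[0]. The following is a sketch under that assumption, not a diagnosis of the buffering theory. get_text is a hypothetical helper, not part of minidom, and the temporary file stands in for the test file.

```python
import os
import tempfile
import xml.dom.minidom


def get_text(element):
    """Join the data of every direct text child (hypothetical helper)."""
    parts = []
    for node in element.childNodes:
        if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
            parts.append(node.data)
    return "".join(parts)


# Same shape as the test document: 260 <data> elements of 50 characters.
record = "<data>" + "a" * 50 + "</data>"
xml_string = '<?xml version="1.0"?><root>' + record * 260 + "</root>"

fd, path = tempfile.mkstemp(suffix=".xml")
os.close(fd)
try:
    with open(path, "w") as f:
        f.write(xml_string)
    # Parse straight from the file, the case that showed split text nodes.
    doc = xml.dom.minidom.parse(path)
    lengths = [len(get_text(e)) for e in doc.getElementsByTagName("data")]
    doc.unlink()
finally:
    os.remove(path)

error_count = len([n for n in lengths if n != 50])
print("%d errors, %d elements" % (error_count, len(lengths)))
```

Because the helper concatenates all pieces, the reported length should be 50 for every element whether or not the parser split the text.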
Cheers, Paul.
CODE:
import xml.dom.minidom
import string
FILENAME="junk2634.xml"
print "Build"
xmlList = []
xmlList.append("<?xml version=\"1.0\"?>")
xmlList.append("<root>")
for junk in range(0,260):
    xmlList.append(
        "<data>"
        "aaaaaaaaaa"
        "aaaaaaaaaa"
        "aaaaaaaaaa"
        "aaaaaaaaaa"
        "aaaaaaaaaa"
        "</data>")
xmlList.append("</root>")
print "Cat"
xmlString = string.join(xmlList, "")
print "Save"
f = file(FILENAME, "w")
f.write(xmlString)
f.close()
def run(doc, desc):
    print "Extracting [%s]"%desc
    errorCount = 0
    dataCount = 0
    for element in doc.getElementsByTagName("data"):
        str = element.childNodes[0].data
        dataCount += 1
        if len(str) != 50:
            print "Error: %s [%d]"%(str, len(str))
            print "Number of childnodes: %d"%len(element.childNodes)
            print "XML representation: %s"%element.toxml()
            errorCount+=1
    print "Finished [%s] %d errors, %d elements"%(
        desc, errorCount, dataCount)
print "Parsing [string]"
doc = xml.dom.minidom.parseString(xmlString)
run(doc, "string")
doc.unlink()
print "Parsing [file]"
doc = xml.dom.minidom.parse(FILENAME)
run(doc, "file")
doc.unlink()
print "Parsing [preloadedfile]"
f = file(FILENAME, "r")
newstr = f.read()
f.close()
doc = xml.dom.minidom.parseString(newstr)
run(doc, "preloadedfile")
doc.unlink()
OUTPUT:
Build
Cat
Save
Parsing [string]
Extracting [string]
Finished [string] 0 errors, 260 elements
Parsing [file]
Extracting [file]
Error: aaaaaaaaaaa [11]
Number of childnodes: 2
XML representation: <data>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</data>
Finished [file] 1 errors, 260 elements
Parsing [preloadedfile]
Extracting [preloadedfile]
Finished [preloadedfile] 0 errors, 260 elements