Problem [bug?] with minidom

Fri Feb 7 20:25:38 EST 2003

Hello, I was parsing a rather large XML file when I came across text
nodes being unexpectedly sliced in two. 

The problem *only* seems to happen when I load from a file directly
with parse, not if I parse a string or preload the file into a string
manually. Below is some code (and output) which illustrates the
behaviour [this will create a small file called "junk2634.xml" when
run].

I don't know whether this is a bug or a legal parse result which I
have to look out for. I noticed that the Python DOM docs describe a
normalize() method for Node objects, and calling it seems to fix the
problem. However, if this problem is common enough to require a
method, I'm surprised the lib doesn't call it automatically following
a parse (is there some reason one wouldn't want to call it?).
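
For reference, here is a minimal sketch of the normalize() workaround.
It does not reproduce the file-parsing split itself; instead it builds
two adjacent text nodes by hand to imitate it. It is written for
Python 3, unlike the test script further down, and text_of() is a
hypothetical helper name of my own, not part of minidom.

```python
import xml.dom.minidom


def text_of(element):
    """Join the data of all text children (hypothetical helper)."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    )


# Build a <data> element holding two adjacent text nodes, imitating
# the split seen when parsing from a file.
doc = xml.dom.minidom.parseString("<root><data/></root>")
data = doc.getElementsByTagName("data")[0]
data.appendChild(doc.createTextNode("a" * 11))
data.appendChild(doc.createTextNode("a" * 39))

split_count = len(data.childNodes)   # two separate text nodes
joined = text_of(data)               # full text, without normalizing

doc.normalize()                      # merges adjacent text nodes in place
merged_count = len(data.childNodes)
merged_len = len(data.childNodes[0].data)

print(split_count, len(joined), merged_count, merged_len)
```

Either approach avoids relying on childNodes[0] holding the whole
text: joining the children leaves the tree untouched, while
normalize() rewrites it once so later code can index directly.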

I also find it interesting that it only happens when loading files; I
suspect it may be a buffering problem. However, it's too late for
source diving, maybe tomorrow :)
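
A rough way to probe the buffering guess, assuming minidom's default
builder sits on top of expat: feed expat the document in two chunks,
with the cut falling inside a text run, and record what the
character-data handler receives. This is only a sketch in Python 3;
the cut position (23) is arbitrary, and whether the text arrives in
one piece or several may depend on the expat version.

```python
import xml.parsers.expat

pieces = []  # every string handed to the character-data callback

parser = xml.parsers.expat.ParserCreate()
parser.CharacterDataHandler = pieces.append

xml_text = "<root><data>" + "a" * 50 + "</data></root>"

# Cut inside the run of a's, as a read buffer boundary might.
parser.Parse(xml_text[:23], False)
parser.Parse(xml_text[23:], True)

text = "".join(pieces)
print(len(pieces), [len(p) for p in pieces], len(text))
```

If more than one piece shows up, a DOM builder that creates one text
node per callback would produce exactly the kind of split described
above, and the joined text would still be complete.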

Cheers, Paul.

CODE:

import xml.dom.minidom
import string

FILENAME="junk2634.xml"

print "Build"
xmlList = []
xmlList.append("<?xml version=\"1.0\"?>")
xmlList.append("<root>")
for junk in range(0,260):
  xmlList.append(
    "<data>"
    "aaaaaaaaaa"
    "aaaaaaaaaa"
    "aaaaaaaaaa"
    "aaaaaaaaaa"
    "aaaaaaaaaa"
    "</data>")
xmlList.append("</root>")

print "Cat"
xmlString = string.join(xmlList, "")

print "Save"
f = file(FILENAME, "w")
f.write(xmlString)
f.close()

def run(doc, desc):
  print "Extracting [%s]"%desc
  errorCount = 0
  dataCount = 0

  for element in doc.getElementsByTagName("data"):
    # Only the first child is inspected; a split text node shows up
    # as a short first child plus extra siblings.
    text = element.childNodes[0].data
    dataCount += 1
    if len(text) != 50:
      print "Error: %s [%d]"%(text, len(text))
      print "Number of childnodes: %d"%len(element.childNodes)
      print "XML representation: %s"%element.toxml()
      errorCount += 1
  print "Finished [%s] %d errors, %d elements"%(
    desc, errorCount, dataCount)

print "Parsing [string]"
doc = xml.dom.minidom.parseString(xmlString)
run(doc, "string")
doc.unlink()

print "Parsing [file]"
doc = xml.dom.minidom.parse(FILENAME)
run(doc, "file")
doc.unlink()

print "Parsing [preloadedfile]"
f = file(FILENAME, "r")
newstr = f.read()
f.close()
doc = xml.dom.minidom.parseString(newstr)
run(doc, "preloadedfile")
doc.unlink()

OUTPUT:

Build
Cat
Save
Parsing [string]
Extracting [string]
Finished [string] 0 errors, 260 elements
Parsing [file]
Extracting [file]
Error: aaaaaaaaaaa [11]
Number of childnodes: 2
XML representation: <data>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</data>
Finished [file] 1 errors, 260 elements
Parsing [preloadedfile]
Extracting [preloadedfile]
Finished [preloadedfile] 0 errors, 260 elements