xml.dom.minidom.parse() splitting text nodes?-THANKS

Fri Jan 17 12:29:10 EST 2003

thanks martin and gilles, that was exactly the problem!  of course, i spent yesterday afternoon hacking together my own normalizer ... i suspect the existing one is a bit more ~robust~ *grin*
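
for the archives, a minimal sketch of what the built-in normalizer does. the document string and the hand-appended second text node are made up for illustration (a small input like this will not be split by the parser on its own); the only assumption is the standard library's xml.dom.minidom.

```python
from xml.dom import minidom

# Hypothetical tiny document, used only for illustration.
doc = minidom.parseString("<root>hello</root>")
root = doc.documentElement

# Simulate what the parser can do on large inputs: leave two adjacent
# text nodes where the source had one run of text.
root.appendChild(doc.createTextNode(" world"))
before = len(root.childNodes)

# normalize() merges adjacent text nodes in this subtree.
root.normalize()
after = len(root.childNodes)
text = root.firstChild.data

print(before, after, repr(text))
```

calling normalize() once on the document element right after parse() should be enough, since it works recursively down the tree.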

> -----Original Message-----
> From: martin at v.loewis.de [mailto:martin at v.loewis.de]
> Sent: Friday, January 17, 2003 1:09 AM
> To: python-list at python.org
> Subject: Re: xml.dom.minidom.parse() splitting text nodes?
> 
> 
> hawkeye.parker at autodesk.com writes:
> 
> > has anyone else run across this issue?  can you explain it?
> 
> The text nodes are created as the underlying parser (Expat) reports
> chunks of text data.
> 
> Those data are chunked for various reasons:
> - if you have character references or entity references, everything
>   up to the reference will be reported as a chunk, then the referenced
>   data will be reported as a chunk, and everything after it will be
>   reported as a chunk.
> - Expat buffers the input in blocks. Every time the block is exhausted,
>   its data is reported as a chunk.
> 
> You are likely seeing the second case.
> 
> This is, strictly speaking, no bug: the DOM reader is entitled to
> represent the document in such a way. The minidom implementation in
> PyXML will, however, avoid splitting the text nodes if it can.
> 
> In general, this issue is what led to the introduction of the
> .normalize method in the DOM; this merges adjacent text nodes
> throughout the tree.
> 
> Regards,
> Martin
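
the chunking martin describes can also be watched directly at the Expat level. this is only a sketch: text_chunks is a hypothetical helper name and the sample input is made up. it collects whatever pieces the parser hands to CharacterDataHandler, with and without the parser's buffer_text option. exactly how the text gets split is up to Expat, so no particular split is promised here.

```python
import xml.parsers.expat


def text_chunks(xml_bytes, buffer_text=False):
    """Return the pieces of character data Expat reports, in order."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # With buffer_text on, pyexpat coalesces text where it can before
    # calling the handler; with it off, each chunk arrives as reported.
    parser.buffer_text = buffer_text
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_bytes, True)
    return chunks


sample = b"<r>fish &amp; chips</r>"
print(text_chunks(sample))
print(text_chunks(sample, buffer_text=True))
```

either way the pieces join back to the same string, which is why merging adjacent text nodes afterwards is safe.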