xml.dom.minidom.parse() splitting text nodes?-THANKS
hawkeye.parker at autodesk.com
Fri Jan 17 12:29:10 EST 2003
thanks martin and gilles, exactly the problem! of course, i spent the afternoon yesterday hacking my own normalizer . . . i suspect the existing one is a bit more ~robust~ *grin*
> -----Original Message-----
> From: martin at v.loewis.de [mailto:martin at v.loewis.de]
> Sent: Friday, January 17, 2003 1:09 AM
> To: python-list at python.org
> Subject: Re: xml.dom.minidom.parse() splitting text nodes?
>
>
> hawkeye.parker at autodesk.com writes:
>
> > has anyone else run across this issue? can you explain it?
>
> The text nodes are created as the underlying parser (Expat) reports
> chunks of text data.
>
> Those data are chunked for various reasons:
> - if you have character references or entity references, everything
> up to the reference will be reported as a chunk, then the referenced
> data will be reported as a chunk, and everything after it will be
> reported as a chunk.
> - Expat buffers the input in blocks. Every time the block is exhausted,
> its data is reported as a chunk.
>
> You are likely seeing the second case.
>
> This is, strictly speaking, no bug: the DOM reader is entitled to
> represent the document in such a way. The minidom implementation in
> PyXML will, however, avoid splitting the text nodes if it can.
>
> In general, this issue is what led to the introduction of the
> .normalize method in the DOM; this merges adjacent text nodes
> throughout the tree.
>
> Regards,
> Martin
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>
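For anyone finding this thread later, here is a rough sketch of the two points Martin describes: Expat handing character data to its handler in pieces, and .normalize() merging adjacent text nodes. It assumes a current Python with the standard-library xml.parsers.expat and xml.dom.minidom modules; the helper names expat_chunks and merged_text are made up for this illustration. Newer minidom builders may already merge text while parsing, so the second helper creates the adjacent text nodes by hand rather than relying on parse() to split them.

```python
import xml.parsers.expat
from xml.dom import minidom


def expat_chunks(xml_text):
    """Collect the character-data chunks Expat reports for xml_text.

    Hypothetical helper: how many chunks come back depends on the
    input (references, buffer boundaries), so do not rely on the count.
    """
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = False  # the default: chunks are not merged for us
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_text, True)
    return chunks


def merged_text(chunks):
    """Build <a> with one text node per chunk, then call normalize().

    Returns (child count before, child count after, merged text).
    Expects at least one non-empty chunk.
    """
    doc = minidom.Document()
    elem = doc.createElement("a")
    doc.appendChild(elem)
    for chunk in chunks:
        # Adjacent text nodes, similar to what a chunk-by-chunk DOM
        # builder would produce.
        elem.appendChild(doc.createTextNode(chunk))
    before = len(elem.childNodes)
    elem.normalize()  # merges adjacent text nodes below elem
    after = len(elem.childNodes)
    return before, after, elem.firstChild.data


if __name__ == "__main__":
    chunks = expat_chunks("<a>foo &amp; bar</a>")
    print(chunks)
    print(merged_text(chunks))
```

Whatever way the text was chunked, joining the pieces gives back the original character data, and after normalize() the element holds a single text node, so code that reads firstChild.data sees the whole string.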