PyXML, Sax, error in processing external entity reference
Uche Ogbuji
uche at ogbuji.net
Sat Feb 28 01:59:58 EST 2004
David Dorward <dorward at yahoo.com> wrote in message news:<c1j2q8$6pb$1$8300dec7 at news.demon.co.uk>...
> I'm attempting to read an XHTML 1.1 file[1], perform some DOM manipulation,
> then write the results to a different file.
>
> I've found myself rather stuck at the first hurdle.
>
> I have the following:
>
> from xml.dom.ext.reader import Sax2
> reader = Sax2.Reader()
> f = open('dorward.me.uk/sitemap.html', 'r')
> doc = reader.fromStream(f)
>
> (dorward.me.uk/sitemap.html being a local copy of
> http://dorward.me.uk/sitemap.html)
>
> ... which outputs the following:
>
> Traceback (most recent call last):
> File "x.py", line 4, in ?
> doc = reader.fromStream(f)
> File "/usr/lib/python2.3/site-packages/_xmlplus/dom/ext/reader/Sax2.py",
> line 372, in fromStream
> self.parser.parse(s)
> File "/usr/lib/python2.3/site-packages/_xmlplus/sax/expatreader.py", line
> 109, in parse
> xmlreader.IncrementalParser.parse(self, source)
> File "/usr/lib/python2.3/site-packages/_xmlplus/sax/xmlreader.py", line
> 123, in parse
> self.feed(buffer)
> File "/usr/lib/python2.3/site-packages/_xmlplus/sax/expatreader.py", line
> 220, in feed
> self._err_handler.fatalError(exc)
> File "/usr/lib/python2.3/site-packages/_xmlplus/dom/ext/reader/Sax2.py",
> line 340, in fatalError
> raise exception
> xml.sax._exceptions.SAXParseException:
> http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-notations-1.mod:115:0:
> error in processing external entity reference
>
> I'm not sure where I should proceed from here. Is it a bug in my code? In
> PyXML? In the DTD itself? What should I do next?
The bug is with the W3C. Through a chain of parameter entity refs,
http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd references
http://www.w3.org/TR/xhtml-modularization/DTD/xhtml11-model-1.mod,
which gives 404 (and yes XML heads, it is in an INCLUDE section so the
URI must be traversed unless there's a resolution through pubID).
I'm actually rather amazed at such carelessness by the W3C, but I
don't have time to dig further to see if I can figure out how things
got broken.
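For what it's worth, the "resolution through pubID" route can be sketched
with the standard SAX EntityResolver interface. This is only an
illustration: the LOCAL_DTDS table and the local file path in it are
hypothetical, and whether a given parser actually consults the resolver for
the external DTD subset depends on the parser and its feature settings.

```python
from xml.sax.handler import EntityResolver
from xml.sax.xmlreader import InputSource

# Hypothetical catalog: public identifiers mapped to local copies of DTD files.
# The path is a placeholder, not a file shipped with any package.
LOCAL_DTDS = {
    "-//W3C//DTD XHTML 1.1//EN": "dtd/xhtml11.dtd",
}


class LocalDtdResolver(EntityResolver):
    """Prefer a local copy when the public ID is in the catalog."""

    def resolveEntity(self, publicId, systemId):
        local_path = LOCAL_DTDS.get(publicId)
        if local_path is None:
            # Unknown public ID: fall back to the system ID as given.
            return systemId
        source = InputSource(local_path)
        source.setPublicId(publicId)
        return source


resolver = LocalDtdResolver()
known = resolver.resolveEntity(
    "-//W3C//DTD XHTML 1.1//EN",
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd",
)
unknown = resolver.resolveEntity(None, "http://example.org/other.dtd")
```

A resolver like this would be handed to a SAX parser with
parser.setEntityResolver(resolver); every module the DTD pulls in would need
its own catalog entry for the parse to stay off the network.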
I can tell you that minidom is OK with this because it
does not even read the external DTD subset:
>>> from xml.dom import minidom
>>> doc = minidom.parse('sitemap.html')
>>> doc
<xml.dom.minidom.Document instance at 0x400635ec>
>>>
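Since the original goal was to read the file, manipulate the DOM and write
the result elsewhere, here is a rough sketch of that round trip with
minidom. It assumes a current Python with the standard library's
xml.dom.minidom; the inline XHTML_SOURCE string, the added paragraph and the
output file name are made up for illustration. The external DTD named in the
DOCTYPE is not fetched.

```python
from xml.dom import minidom

# Stand-in for the contents of sitemap.html (made up for this sketch).
XHTML_SOURCE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
    '"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Site map</title></head>'
    '<body><p>First</p></body></html>'
)

doc = minidom.parseString(XHTML_SOURCE)

# Simple DOM manipulation: append a new paragraph to <body>.
body = doc.getElementsByTagName("body")[0]
para = doc.createElement("p")
para.appendChild(doc.createTextNode("Added by script"))
body.appendChild(para)

# Serialize; the DOCTYPE declaration is kept in the output.
result = doc.toxml()

# To write to a different file instead (hypothetical file name):
# with open("sitemap-new.html", "w", encoding="utf-8") as out:
#     doc.writexml(out)
```

One caveat with this approach: because the DTD is never read, entities it
defines (such as the XHTML character entities) are not available to the
parser.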
Also, 4Suite's cDomlette makes it easy for you to avoid the DTD
problem:
>>> from Ft.Xml.Domlette import NoExtDtdReader
>>> doc = NoExtDtdReader.parseUri("file:sitemap.html")
>>> doc
<cDocument at 0x0x403ab42c>
>>>
http://4suite.org
http://uche.ogbuji.net/tech/akara/nodes/2003-01-01/domlettes
Good luck.
--Uche
http://uche.ogbuji.net