[XML-SIG] Exceptions on undefined character entities

Frank McIngvale frankm@HiWAAY.net
Fri, 1 Feb 2002 08:37:12 -0600 (CST)


Hi, I stumbled across this while fetching my usual
RDF/RSS files yesterday, and I am hoping someone can
explain what is happening:

newsforge.com gave me a file containing this line:
   <title>University of Osnabr&uuml;ck, Germany</title>

minidom doesn't like it:

Python 2.1.1 (#1, Jan 21 2002, 22:52:28)
[GCC 2.95.3 20010315 (release)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> from xml.dom import minidom
>>> s = "<title>University of Osnabr&uuml;ck, Germany</title>"
>>> minidom.parseString(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.1/xml/dom/minidom.py", line 915, in parseString
    return _doparse(pulldom.parseString, args, kwargs)
  File "/usr/lib/python2.1/xml/dom/minidom.py", line 902, in _doparse
    toktype, rootNode = events.getEvent()
  File "/usr/lib/python2.1/xml/dom/pulldom.py", line 234, in getEvent
    self.parser.feed(buf)
  File "/usr/lib/python2.1/xml/sax/expatreader.py", line 92, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.1/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:27: undefined entity
>>>

Dr. David Mertz pointed out that this works:

>>> s = ("<!DOCTYPE title [<!ENTITY uuml '[fakechar]'>]>"
...      "<title>University of Osnabr&uuml;ck, Germany</title>")
>>> minidom.parseString(s)
<xml.dom.minidom.Document instance at 0x81571c4>
>>>
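
The same workaround could probably be generalized so the caller declares
every HTML named entity, not just uuml. The following is only a rough
sketch, not a statement of how minidom is meant to be used. It assumes a
Python 3 interpreter, where the entity table lives in html.entities (older
releases called the module htmlentitydefs). The helper name
with_html_entities is made up for illustration.

```python
from html.entities import name2codepoint
from xml.dom import minidom

# XML already predefines these five, so they are not redeclared.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def with_html_entities(xml_text, root_name):
    """Prepend a DOCTYPE whose internal subset declares the HTML 4 entities.

    Hypothetical helper: assumes xml_text has no XML declaration and no
    DOCTYPE of its own, and that root_name matches its root element.
    """
    decls = "".join(
        '<!ENTITY %s "&#%d;">' % (name, code)
        for name, code in sorted(name2codepoint.items())
        if name not in PREDEFINED
    )
    return "<!DOCTYPE %s [%s]>%s" % (root_name, decls, xml_text)


s = "<title>University of Osnabr&uuml;ck, Germany</title>"
doc = minidom.parseString(with_html_entities(s, "title"))

# Collect the character data of the root element.
text = "".join(node.data for node in doc.documentElement.childNodes)
print(text)
```

This only covers the names in the HTML 4 table. Any other undeclared
entity would still raise the same "undefined entity" error.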

So my question is: what is the correct way to handle this? Is
minidom supposed to handle it, is the caller supposed to provide
the entity declarations, or is it a bug in the XML file?

thanks!
frank  (please cc: me on replies, I'm not subscribed yet)