trouble with xml.sax and unknown entities

Alan Kennedy alanmk at hotmail.com
Sun Apr 27 11:57:19 EDT 2003


Antony Lesuisse wrote:
> 
> I'm not on the list, please cc: me the answers.
> 
> I'm having trouble parsing the following XML with the default Python xml.sax
> API. I'm using python2.2 on debian unstable powerpc (python2.2-xmlbase).
> 
> '<?xml version="1.0"?><html><body>hello&nbsp;</body></html>'

Your basic problem here is that this is not a "well-formed" XML document,
and as such, parsers are *obliged* to report it as an error. The relevant
section of the XML spec is

http://www.w3.org/TR/REC-xml.html#sec-references

and says:

"In a document without any DTD, a document with only an internal DTD
subset which contains no parameter entity references, or a document
with "standalone='yes'", for an entity reference that does not occur
within the external subset or a parameter entity, the Name given in
the entity reference must match that in an entity declaration that
does not occur within the external subset or a parameter entity,
except that well-formed documents need not declare any of the 
following entities: amp, lt, gt, apos, quot."

The way to solve your problem is to introduce an "internal DTD
subset", like so

"""<?xml version="1.0"?>
<!DOCTYPE html [
<!ENTITY nbsp "&#160;">
]>
<html><body>hello&nbsp;</body></html>
"""
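
A minimal sketch of the difference, written for Python 3 rather than the
2.2 used in this thread. The TextCollector class and the body_text helper
are names made up for illustration, and "&#160;" is assumed as the
replacement text for nbsp:

```python
import xml.sax
import xml.sax.handler


class TextCollector(xml.sax.handler.ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def characters(self, content):
        self.parts.append(content)


def body_text(document):
    # Hypothetical helper: parse a bytes document, return its text content.
    handler = TextCollector()
    xml.sax.parseString(document, handler)
    return "".join(handler.parts)


# No DTD at all, so the reference to nbsp is a well-formedness error.
undeclared = b'<?xml version="1.0"?><html><body>hello&nbsp;</body></html>'

# Same document with nbsp declared in an internal DTD subset.
declared = b"""<?xml version="1.0"?>
<!DOCTYPE html [
<!ENTITY nbsp "&#160;">
]>
<html><body>hello&nbsp;</body></html>
"""

try:
    body_text(undeclared)
    undeclared_error = None
except xml.sax.SAXParseException as exc:
    undeclared_error = exc

declared_text = body_text(declared)
```

With an expat-based parser the first document should be rejected with a
SAXParseException, while the second should parse, with the entity
reference replaced by the character U+00A0.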

> The parser halts on &nbsp; because it doesn't know about this entity. The
> problem is I cannot find a way to tell it what this entity is.
> 
> (1)
> Is there a way to get a callback when the parser arrives at &nbsp;? None of the
> following handler functions (resolveEntity, notationDecl, unparsedEntityDecl) are
> called.
> 
> I thought resolveEntity had to be called in that situation but I probably
> misunderstand the SAX API.

The resolveEntity function is intended only for use with "external
entities", i.e. entities which reside entirely in a separate container.
The resolveEntity function is basically there so that you can control
the resolution of the addresses of XML documents and document fragments,
in relation to the address of the document you're dealing with.
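
To see resolveEntity actually fire, the document has to reference an
external entity. A rough Python 3 sketch, assuming an expat-based parser:
the entity name greeting, the system id greeting.ent, and the
LocalResolver and Collect classes are all hypothetical, and recent Python
versions need feature_external_ges switched on explicitly because external
general entities are no longer processed by default.

```python
import io
import xml.sax
import xml.sax.handler
from xml.sax.xmlreader import InputSource


class Collect(xml.sax.handler.ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


class LocalResolver(xml.sax.handler.EntityResolver):
    """Hypothetical resolver that serves the entity from memory."""

    def __init__(self):
        self.seen = []

    def resolveEntity(self, publicId, systemId):
        # Record which external entity the parser asked for.
        self.seen.append(systemId)
        # Hand back the replacement text ourselves, so nothing is
        # fetched from disk or the network.
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(b"world"))
        return source


doc = b"""<?xml version="1.0"?>
<!DOCTYPE html [
<!ENTITY greeting SYSTEM "greeting.ent">
]>
<html><body>hello &greeting;</body></html>
"""

handler = Collect()
resolver = LocalResolver()

parser = xml.sax.make_parser()
# External general entities are skipped unless this feature is enabled.
parser.setFeature(xml.sax.handler.feature_external_ges, True)
parser.setContentHandler(handler)
parser.setEntityResolver(resolver)
parser.parse(io.BytesIO(doc))
```

If the parser behaves as assumed, resolver.seen ends up holding
"greeting.ent" and the collected text reads "hello world". An undeclared
reference such as &nbsp; never reaches resolveEntity, because there is no
declaration telling the parser it is external.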

> (2)
> Is there a way to register entities before the parsing begins?
> Something like:
>     parser.registerEntity('&nbsp;','blahblah')

Here is the essential difficulty. If anyone else were to parse your
XML document, they would also have to configure their parser to 
"register" the same entity. The correct way to register the entity 
is in the document itself.

> (3)
> Or is there a way to register an external DTD where those entities can be
> defined ?  Something like:
>     parser.registerExternalDTD('xhtml.dtd')

Again, such a document whose dtd was *only* registered in this way,
i.e. it did not contain a doctype declaration, would not be a "valid"
XML document, i.e. its structure could not be checked against the
structure rules declared in the actual DTD: it could only be
"well-formed".

Hmmm. Or maybe not. I just tried to find the relevant section of the
XML spec that states the above, but I couldn't find it. All I could
find was

http://www.w3.org/TR/REC-xml.html#sec-prolog-dtd

"Definition: An XML document is valid if it has an associated document
type declaration and if the document complies with the constraints
expressed in it."

Which, to me, doesn't explicitly state that the doctype has to be
declared inside the document text itself.

Here's my version of your code.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import StringIO,sys,xml.sax,xml.sax.handler

class CHandler(xml.sax.handler.ContentHandler):
    def startElement(self, name, attrs):
        print name
    def characters(self, ch):
        print ch.encode('Latin-1')

class EResolver(xml.sax.handler.EntityResolver):
    def resolveEntity(self,publicId,systemId):
        print " resolveEntity  ",publicId,systemId
        sys.exit()

class DHandler(xml.sax.handler.DTDHandler):
    # Both callbacks are instance methods, so they need "self" first.
    def notationDecl(self, name, publicId, systemId):
        print " notationDecl ",publicId,systemId
        sys.exit()
    def unparsedEntityDecl(self, name, publicId, systemId, ndata):
        print " unparsedEntityDecl ",publicId,systemId,ndata
        sys.exit()

xmlstr = """<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html [
<!ENTITY nbsp "&#160;">
]>
<html><body>hello&nbsp;</body></html>
"""
parser = xml.sax.make_parser()
parser.setContentHandler(CHandler())
parser.setEntityResolver(EResolver())
parser.setDTDHandler(DHandler())
parser.parse(StringIO.StringIO(xmlstr))
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

regards,

-- 
alan kennedy
-----------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan:              http://xhaus.com/mailto/alan



