nonstandard XML character entities?

"Martin v. Löwis" martin at v.loewis.de
Sat Apr 14 03:10:44 EDT 2007


> I'm new to xml mongering so forgive me if there's an obvious
> well-known answer to this.  It's not real obvious from the library
> documentation I've looked at so far.  Basically I have to munch of a
> bunch of xml files which contain character entities like &uacute;
> which are apparently nonstandard.

If they contain such things, and do not contain a document type
definition, they are not well-formed XML files (i.e. can't be
called "XML" in a meaningful sense).

It would have been helpful if you had given an example of such
a document.

> Basically I want to know if there's a way to supply the regular parser
> (preferably xml.etree but I guess I can switch to another one if
> necessary) with some kind of entity table, and/or if the info is
> supposed to be found in the DTD or someplace like that.  Right now I'm
> ignoring the DTD and simply figuring out the doc structure by
> eyeballing the xml files, maybe not a perfectly approved method but
> it seems to be what most people do.

If there is a document type declaration in the document, the best
way is to parse it in a mode where the parser fetches the DTD
while parsing and resolves the entity references itself.

In SAX, you can put an EntityResolver into the parser, and then
return a file-like object from resolveEntity. This can be used
to avoid the network download; the document type declaration
would still have to be present.
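
A minimal sketch of the resolver approach, assuming the entity
declarations are available locally. The LOCAL_DTD content, the class
names and parse_with_local_dtd are made up for illustration. Recent
Python versions do not load external entities by default, so the
sketch turns feature_external_ges on. It returns an InputSource,
which is what the xml.sax documentation describes for resolveEntity.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Hypothetical local stand-in for the DTD the documents refer to.
LOCAL_DTD = b'<!ENTITY uacute "&#250;">\n'


class LocalDTDResolver(EntityResolver):
    """Serve every external entity request from LOCAL_DTD (no network)."""

    def resolveEntity(self, publicId, systemId):
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(LOCAL_DTD))
        return source


class TextCollector(ContentHandler):
    """Collect all character data of the document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_with_local_dtd(data):
    """Parse XML bytes and return the text, using the local DTD."""
    parser = xml.sax.make_parser()
    # Ask the parser to load the external DTD subset at all.
    parser.setFeature(feature_external_ges, True)
    parser.setEntityResolver(LocalDTDResolver())
    handler = TextCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return "".join(handler.chunks)


if __name__ == "__main__":
    doc = b'<!DOCTYPE doc SYSTEM "http://example.invalid/doc.dtd"><doc>&uacute;</doc>'
    print(parse_with_local_dtd(doc))
```

A real resolver would probably look at publicId and systemId and
pick the matching local file instead of always returning the same
declarations.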

Alternatively, you can implement a skippedEntity callback in
the SAX content handler.
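
A sketch of the skippedEntity variant. It assumes the entities in
question are the usual HTML ones, so it borrows the table from
html.entities; EntityAwareHandler and text_of are hypothetical
names. The document still needs a document type declaration that
points at an external DTD, otherwise the parser reports an
undefined entity instead of skipping it.

```python
import io
import xml.sax
from html.entities import name2codepoint
from xml.sax.handler import ContentHandler, feature_external_ges


class EntityAwareHandler(ContentHandler):
    """Collect text, substituting skipped entities from the HTML table."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

    def skippedEntity(self, name):
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            # Unknown name: keep the reference visible in the output.
            self.chunks.append("&%s;" % name)
        else:
            self.chunks.append(chr(codepoint))


def text_of(data):
    """Parse XML bytes and return the text with entities substituted."""
    parser = xml.sax.make_parser()
    # Do not fetch the DTD; undeclared entities are then reported as skipped.
    parser.setFeature(feature_external_ges, False)
    handler = EntityAwareHandler()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return "".join(handler.chunks)


if __name__ == "__main__":
    doc = b'<!DOCTYPE doc SYSTEM "doc.dtd"><doc>caf&eacute; &uacute;</doc>'
    print(text_of(doc))
```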

In ElementTree, the XMLTreeBuilder has an attribute named entity,
a dictionary that maps the entity names used in entity references
to their definitions. Whether you can make the parser download
the DTD itself, I don't know.
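
A sketch of filling that dictionary, assuming a current Python,
where the class is called XMLParser (XMLTreeBuilder was the older
name for it), and assuming the entities are the usual HTML ones.
parse_with_html_entities is a hypothetical helper name. As with
SAX, the document needs a document type declaration that refers to
an external DTD; without one, the undefined entity is a parse error
before the dictionary is ever consulted.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint


def parse_with_html_entities(data):
    """Parse XML text, resolving HTML entity names via parser.entity."""
    parser = ET.XMLParser()
    # Map each entity name to its replacement text.
    for name, codepoint in name2codepoint.items():
        parser.entity[name] = chr(codepoint)
    parser.feed(data)
    return parser.close()  # root element


if __name__ == "__main__":
    doc = '<!DOCTYPE doc SYSTEM "doc.dtd"><doc>caf&eacute; &uacute;</doc>'
    root = parse_with_html_entities(doc)
    print(root.text)
```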

Regards,
Martin



