nonstandard XML character entities?
"Martin v. Löwis"
martin at v.loewis.de
Sat Apr 14 03:10:44 EDT 2007
> I'm new to xml mongering so forgive me if there's an obvious
> well-known answer to this. It's not real obvious from the library
> documentation I've looked at so far. Basically I have to munch of a
> bunch of xml files which contain character entities like &uacute;
> which are apparently nonstandard.
If they contain such things, and do not contain a document type
definition, they are not well-formed XML files (i.e. can't be
called "XML" in a meaningful sense).
It would have been helpful if you had given an example of such
a document.
> Basically I want to know if there's a way to supply the regular parser
> (preferably xml.etree but I guess I can switch to another one if
> necessary) with some kind of entity table, and/or if the info is
> supposed to be found in the DTD or someplace like that. Right now I'm
> ignoring the DTD and simply figuring out the doc structure by
> eyeballing the xml files, maybe not a perfectly approved method but
> it seems to be what most people do.
If there is a document type declaration in the document, the best
way is to run the parser in a mode where it downloads the DTD while
parsing the document and resolves the entity references itself.
In SAX, you can put an EntityResolver into the parser, and then
return a file-like object from resolveEntity. This can be used
to avoid the network download; the document type declaration
would still have to be present.
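A minimal sketch of the EntityResolver approach, assuming the
standard-library xml.sax (expat) parser on a current Python. The
LOCAL_DTD contents, the "doc.dtd" system identifier, and the class and
function names are made up for illustration; a real DTD would normally
be read from a local file. External entities are switched off by
default in recent Python versions, so the sketch enables them
explicitly.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges

# Hypothetical stand-in for a locally stored copy of the DTD.
LOCAL_DTD = b'<!ENTITY uacute "&#250;">\n'


class LocalDTDResolver(EntityResolver):
    def resolveEntity(self, publicId, systemId):
        # Ignore the system identifier and return a file-like object,
        # so nothing is fetched over the network.
        return io.BytesIO(LOCAL_DTD)


class TextCollector(ContentHandler):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # Character data may arrive in several pieces.
        self.chunks.append(content)


def parse_with_local_dtd(data):
    parser = xml.sax.make_parser()
    # Let the parser ask the resolver for the external DTD subset.
    parser.setFeature(feature_external_ges, True)
    parser.setEntityResolver(LocalDTDResolver())
    handler = TextCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return "".join(handler.chunks)


# The document type declaration must be present for this to work.
SAMPLE = b'<!DOCTYPE doc SYSTEM "doc.dtd">\n<doc>Per&uacute;</doc>'
text = parse_with_local_dtd(SAMPLE)
```

Because the resolver answers every request from the local copy, enabling
external entities here does not cause any network access.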
Alternatively, you can implement a skippedEntity callback in
the SAX content handler.
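A rough sketch of the skippedEntity approach, again assuming the
standard-library xml.sax parser. Using html.entities.name2codepoint as
the lookup table is my assumption (it covers names like uacute); the
handler and function names are hypothetical. The callback is only
invoked when the document has a document type declaration whose
external subset the parser did not read.

```python
import io
import xml.sax
from html.entities import name2codepoint
from xml.sax.handler import ContentHandler, feature_external_ges


class EntityAwareHandler(ContentHandler):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

    def skippedEntity(self, name):
        # The name arrives without the leading "&" and trailing ";".
        codepoint = name2codepoint.get(name)
        if codepoint is not None:
            self.chunks.append(chr(codepoint))
        else:
            # Unknown name: keep the reference visible in the output.
            self.chunks.append("&%s;" % name)


def text_with_skipped_entities(data):
    handler = EntityAwareHandler()
    parser = xml.sax.make_parser()
    # Make sure the parser does not try to load the DTD itself.
    parser.setFeature(feature_external_ges, False)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return "".join(handler.chunks)


SAMPLE = b'<!DOCTYPE doc SYSTEM "doc.dtd">\n<doc>Per&uacute;</doc>'
text = text_with_skipped_entities(SAMPLE)
```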
In ElementTree, the XMLTreeBuilder has an attribute named "entity",
a dictionary that maps the entity names used in entity references
to their definitions. Whether you can make the parser download
the DTD itself, I don't know.
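A small sketch of the ElementTree variant. In current Python versions
the class is called XMLParser (XMLTreeBuilder was the older name), and
filling the table from html.entities.name2codepoint is my assumption
about which names are needed. As far as I can tell, the table is only
consulted when the document carries a document type declaration with
an external identifier; otherwise the parser reports an undefined
entity before the table is looked at.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint


def parse_with_entity_table(data):
    parser = ET.XMLParser()
    # Map entity names (without "&" and ";") to replacement text.
    for name, codepoint in name2codepoint.items():
        parser.entity[name] = chr(codepoint)
    return ET.fromstring(data, parser=parser)


SAMPLE = b'<!DOCTYPE doc SYSTEM "doc.dtd">\n<doc>Per&uacute;</doc>'
root = parse_with_entity_table(SAMPLE)
```

A fresh parser object is created for every document because a parser
cannot be reused once it has finished parsing.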
Regards,
Martin