XML (XHTML) character entities and PyXML

Martin v. Loewis martin at v.loewis.de
Tue May 7 14:15:41 EDT 2002


andrew at acooke.org (andrew cooke) writes:

> I'm processing XML (XHTML) in Python and have hit a major problem -
> character entities appear to be silently dropped.  

That is quite possible. Entity references (*) can only be resolved
if the DTD that declares the entities is available and the parser
actually reads it. The standard parser (pyexpat) is non-validating,
and won't read the external DTD.

You have the following options:
- don't use entity references; use character references (&#num;)
  instead, or write the characters directly, e.g. encoded as UTF-8
- include the entity definitions in the document itself, i.e. declare
  them as internal entities in the DOCTYPE.
- use a validating parser, such as xmlproc, which reads the DTD
- implement your own entity resolver, and integrate it into the
  parsing process. This can be done in several ways; one is:
  * implement an EntityResolver. Construct a SAX parser that uses
    this entity resolver. Use the SAX parser to build the DOM tree.
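As a minimal sketch of the second option (internal entities), assuming
the standard xml.dom.minidom module on top of pyexpat: the sample
document, the two entity declarations, and the element_text helper are
made up for illustration and are not part of any library.

```python
from xml.dom import minidom

# Hypothetical sample document: both entities are declared in the
# internal DTD subset, so no external DTD has to be read.
DOC = """<!DOCTYPE p [
  <!ENTITY nbsp "&#160;">
  <!ENTITY eacute "&#233;">
]>
<p>caf&eacute;&nbsp;au lait</p>"""


def element_text(element):
    """Concatenate the data of all text nodes directly below element."""
    return "".join(
        node.data
        for node in element.childNodes
        if node.nodeType == node.TEXT_NODE
    )


dom = minidom.parseString(DOC)
text = element_text(dom.documentElement)
print(repr(text))
```

Because the declarations travel with the document, this should work
with any conforming parser, at the cost of repeating the declarations
in every document that uses the entities.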

HTH,
Martin

(*) There are no "character entities" in XML, only "(parsed)
entities", "entity references", and "character references".
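The practical difference between the two kinds of reference can be
seen in a small sketch, again assuming xml.dom.minidom; the parses
helper and the two sample documents are hypothetical. Neither document
has a DOCTYPE, so the character reference is fine, while the reference
to the undeclared entity is a well-formedness error.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def parses(document):
    """Return True if document parses without an expat error."""
    try:
        minidom.parseString(document)
    except ExpatError:
        return False
    return True


# A character reference names a code point directly; no DTD is needed.
char_ref_ok = parses("<p>caf&#233;</p>")

# An entity reference needs a declaration; none is available here.
entity_ref_ok = parses("<p>caf&eacute;</p>")

print(char_ref_ok, entity_ref_ok)
```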


