[XML-SIG] Using character entities in external DTD without validating.

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Mon, 9 Apr 2001 23:55:00 +0200


> Can anyone suggest a way that I can keep the character entity definitions in
> an external file, AND read the documents without validating them?
> 
> I considered converting all of the documents to ISO-8859-1 encoding, but
> that doesn't solve the problem of the Greek letters in paper abstracts. I really
> don't want to have to define those character entities in the internal subset
> of all these documents.

Did you consider using character references, instead of external
entities?
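
To illustrate what I mean, here is a minimal sketch of a one-off conversion
that rewrites named entity references as numeric character references, so
the documents no longer depend on the DTD. The table ENTITY_CODEPOINTS and
the function to_char_refs are made up for this example; you would have to
fill the table from your own entity declarations. It works on plain text
with a regular expression, so it assumes entity references do not occur
inside CDATA sections or comments that must be left untouched.

```python
import re

# Hypothetical subset of the entity declarations in the external DTD,
# mapping entity names to Unicode code points.
ENTITY_CODEPOINTS = {"alpha": 0x3B1, "beta": 0x3B2, "eacute": 0xE9}

# The five entities predefined by XML itself; these are left alone.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches named references such as &alpha; but not &#945; or &#x3B1;.
_ENTITY_REF = re.compile(r"&([A-Za-z_][\w.\-]*);")


def to_char_refs(text, codepoints=ENTITY_CODEPOINTS):
    """Replace known named entity references by decimal character references."""
    def replace(match):
        name = match.group(1)
        if name in PREDEFINED or name not in codepoints:
            # Unknown or predefined entities are kept unchanged.
            return match.group(0)
        return "&#%d;" % codepoints[name]
    return _ENTITY_REF.sub(replace, text)


print(to_char_refs("&alpha;-decay &amp; &unknown;"))
```

A non-validating parser resolves numeric character references without
reading any DTD, which is why this route avoids the problem altogether.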

If that is also not feasible, I believe that none of the existing
parsers will fit your need exactly. You cannot talk pyexpat into
reading the external subset. With some effort, you might manage to
talk xmlproc (the validating parser) into not producing validation
errors.

The most promising approach might be to use sgmlop. There is currently
no SAX2 sgmlop driver, but there is a SAX1 one; that driver does not
support entity references, though.

So here is a rough outline of what might succeed:
- Extend drv_sgmlop.py to also support entity references. To do that,
  your best bet is to inherit from xml.sax.drivers.drv_sgmlop.Parser and
  add a handle_entityref method. In your code, this method simply has to
  know the entity definitions from your DTD by some other means;
  off-hand, I don't see a way to have sgmlop actually parse the external
  subset as well. Whenever you see an entity reference, invoke

    self.doc_handler.characters(<replacement>,0,len(<replacement>))

- Create an instance of your SAX driver.

- Pass that instance to Sax.From*, as the parser= parameter.
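
The handle_entityref step of the outline above might look roughly like the
following sketch. It is untested against the real driver: it assumes that
sgmlop calls handle_entityref with the bare entity name (no "&" or ";"),
and that the driver keeps its SAX1 document handler in self.doc_handler.
ENTITY_TEXT, EntityRefMixin, RecordingHandler and FakeDriver are names
invented for this example; the last two only stand in for a real document
handler and for xml.sax.drivers.drv_sgmlop.Parser so that the snippet runs
on its own. Passing unknown entities through literally is my own choice,
not something the driver prescribes.

```python
# Hypothetical table standing in for the entity declarations of the DTD.
ENTITY_TEXT = {"alpha": "\u03b1", "beta": "\u03b2", "eacute": "\u00e9"}


class EntityRefMixin:
    """Adds entity-reference support; meant to be combined with the driver class."""

    entity_text = ENTITY_TEXT

    def handle_entityref(self, name):
        # Look up the replacement text; fall back to the literal reference
        # if the entity is not in the table (an assumption of this sketch).
        replacement = self.entity_text.get(name)
        if replacement is None:
            replacement = "&%s;" % name
        # SAX1 signature: characters(data, start, length)
        self.doc_handler.characters(replacement, 0, len(replacement))


class RecordingHandler:
    """Stand-in SAX1 document handler that records character data."""

    def __init__(self):
        self.chunks = []

    def characters(self, data, start, length):
        self.chunks.append(data[start:start + length])


class FakeDriver(EntityRefMixin):
    """Stand-in for a subclass of the real sgmlop driver."""

    def __init__(self, doc_handler):
        self.doc_handler = doc_handler


handler = RecordingHandler()
driver = FakeDriver(handler)
driver.handle_entityref("alpha")
driver.handle_entityref("nosuch")
print(handler.chunks)
```

With the real driver, the class would instead inherit from both
EntityRefMixin and xml.sax.drivers.drv_sgmlop.Parser, and an instance of it
is what gets passed as the parser= parameter.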

Hope this helps,
Martin