[XML-SIG] Using character entities in external DTD without validating.

"Walter Dörwald" walter@livinglogic.de
Tue, 10 Apr 2001 11:16:12 +0200


On 09.04.01 at 23:55 Martin v. Loewis wrote:

> > Can anyone suggest a way that I can keep the character entity
> > definitions in an external file, AND read the documents without
> > validating them?
> > 
> > I considered converting all of the documents to ISO-8859-1 encoding,
> > but that doesn't solve the problem of the Greek letters in paper
> > abstracts. I really don't want to have to define those character
> > entities in the internal subset of all these documents.
> 
> Did you consider using character references, instead of external
> entities?
> 
> If that is also not feasible, I believe that none of the existing
> parsers will exactly fit your need. You cannot talk pyexpat into
> reading the external subset. With some effort, you might manage to
> talk xmlproc (the validating parser) into not producing validation
> errors.
> 
> The most promising approach might be to use sgmlop. There is currently
> no SAX2 sgmlop driver, 

I do have a rough, untested SAX2 driver for sgmlop, which could
be used as the base for a real SAX2 driver. If there is interest
I can post it.

> but there is a SAX1 one; this does not support
> entity references, though.
> 
> So here is a rough outline of what might succeed:
> - extend drv_sgmlop.py to also support entity references. To do that,
>   you had best inherit from xml.sax.drivers.drv_sgmlop.Parser and add a
>   handle_entityref method. In your code, this method should magically
>   know your DTD; off-hand, I don't see a way to have sgmlop actually
>   parse the external subset as well. Whenever you see an entity
>   reference, invoke
> 
>     self.doc_handler.characters(<replacement>,0,len(<replacement>))
> 
> - Create an instance of your SAX driver.
> 
> - Pass that to Sax.From*, as the parser= parameter.
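
To make the handle_entityref step concrete, here is a minimal sketch of
just that callback. It is not the real driver: EntityAwareParser,
RecordingHandler and the GREEK table are hypothetical names made up for
illustration, and the parser class is a plain stand-in rather than a
subclass of xml.sax.drivers.drv_sgmlop.Parser, so the snippet runs
without sgmlop installed. The only thing taken from the outline above is
the characters(data, start, length) call on the document handler.

```python
# Hypothetical entity table standing in for the external DTD subset.
GREEK = {
    "Alpha": "\u0391",  # greek capital letter alpha
    "beta": "\u03b2",   # greek small letter beta
}


class RecordingHandler:
    """Minimal stand-in for a SAX1 document handler (assumption)."""

    def __init__(self):
        self.chunks = []

    def characters(self, data, start, length):
        # SAX1 passes the buffer plus an offset and a length.
        self.chunks.append(data[start:start + length])


class EntityAwareParser:
    """Stand-in for a driver subclass; only the new callback is shown."""

    def __init__(self, doc_handler, entities):
        self.doc_handler = doc_handler
        self.entities = entities

    def handle_entityref(self, name):
        # Look the entity up in the table that "knows" the DTD.
        try:
            replacement = self.entities[name]
        except KeyError:
            # Unknown entity: pass the reference through unchanged.
            replacement = "&%s;" % name
        self.doc_handler.characters(replacement, 0, len(replacement))


handler = RecordingHandler()
parser = EntityAwareParser(handler, GREEK)
parser.handle_entityref("Alpha")
parser.handle_entityref("nosuch")
```

Passing unknown references through unchanged is only one possible
choice; raising an error would be just as reasonable if every entity is
expected to be in the table.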

Alternatively, you could try using XIST 
(ftp://titan.bnbt.de/pub/livinglogic/xist/),
which is based on sgmlop and does exactly what Martin suggested. It
"automagically" knows the character entities, so you can type 
	&Alpha;
to get the character:
	greek capital letter alpha, U+0391
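
If you would rather follow Martin's first suggestion and use character
references, the rewriting can probably be automated. The following is
only a sketch, and it has nothing to do with how XIST works internally:
it assumes a current Python 3, whose standard library has an HTML entity
table in html.entities, and to_char_refs is a hypothetical helper name.
A plain regular expression like this ignores CDATA sections and
comments, so treat it as a starting point.

```python
import re
from html.entities import name2codepoint

# Matches simple named references such as &Alpha; (not &#913; or &#x391;).
_ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

# The five entities every XML parser knows; leave these untouched.
_XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def to_char_refs(text):
    """Replace known named entity references by decimal character references."""

    def replace(match):
        name = match.group(1)
        if name in _XML_PREDEFINED or name not in name2codepoint:
            # Predefined or unknown: keep the reference exactly as written.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return _ENTITY_REF.sub(replace, text)


converted = to_char_refs("&Alpha; &amp; &beta;")
```

After this, converted should no longer need any entity declarations,
because numeric character references are resolved by non-validating
parsers such as pyexpat without looking at the DTD.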

And if you need a new entity you can simply add one Python class to
define it:

class Spam(xsc.Entity):
	"the spam character, U+4242"
	codepoint = 0x4242

HTH

Bye,
   Walter Dörwald

-- 
Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de