[XML-SIG] 2 Qs: encoding & entities with xmlproc

Lars Marius Garshol larsga@ifi.uio.no
08 Jun 1999 08:15:07 +0200


* Dan Libby
| 
| 1) The version of xmlproc I have does not appear to support any
| encoding other than "iso-8859-1". 

This is correct. (Well, US-ASCII will also work, as will anything else
that is based on US-ASCII as long as you don't try to use funny
characters in names or name tokens.)

| [...]  Before, when we were using xmllib, it simply called
| handle_xml(), where we were able to look at the encoding value and
| make appropriate decisions at the application level.  Does xmlproc
| have any equivalent functionality, perhaps in a more recent version?

The functionality is there, but not used at the moment. If you look at
the charconv module you'll see that it contains conversion code for
various encodings as well as a registry object for converters.

If you want, I can easily add the hooks that would let you use it.
I haven't done so yet because there seemed to be no demand for it.
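
To make the idea of a converter registry concrete, here is a minimal
sketch in present-day Python. It is not the charconv module's actual
API: the class name ConverterRegistry and its register/convert methods
are hypothetical, and the conversions are delegated to the standard
codecs module rather than to xmlproc's own conversion code.

```python
import codecs


class ConverterRegistry:
    """Hypothetical registry mapping encoding names to converter functions."""

    def __init__(self):
        self._converters = {}

    def register(self, name, func):
        # XML encoding names are matched case-insensitively.
        self._converters[name.lower()] = func

    def convert(self, name, data):
        try:
            func = self._converters[name.lower()]
        except KeyError:
            raise LookupError("no converter registered for %r" % name)
        return func(data)


registry = ConverterRegistry()
for enc in ("iso-8859-1", "utf-8", "us-ascii"):
    # Bind enc as a default argument so each lambda keeps its own encoding.
    registry.register(enc, lambda data, enc=enc: codecs.decode(data, enc))

text = registry.convert("ISO-8859-1", b"bl\xe5b\xe6r")
```

A parser hook would then only need to read the encoding name from the
XML declaration and pass the raw bytes through registry.convert before
parsing.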
 
| 2) Explanation: I need to preserve XML/HTML entities.  For example,
| if the document contains &#37; then I want to print that out
| exactly, not the parsed/converted value.  If I don't do this, then
| any random person can embed html markup, etc, which could break an
| HTML page.

Hmmm. The cleanest solution to this (from an XML/SGML point of view)
is probably to use string.replace to escape all '<'s in character data
when it is passed to you from the parser. That would also let you
retain parser independence and is cleaner in the sense that it becomes
more obvious what you're really doing.
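
A minimal sketch of that escaping step, written with the str.replace
method of present-day Python rather than the string module function.
The name escape_char_data is made up for illustration, and escaping
'&' and '>' in addition to '<' is an assumption about what an HTML
page needs, not something the parser requires.

```python
def escape_char_data(text):
    """Escape markup-significant characters in parsed character data."""
    # Ampersands go first; otherwise the '&' introduced by the later
    # replacements would itself be escaped a second time.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


print(escape_char_data("<b>bold</b> & more"))
# prints: &lt;b&gt;bold&lt;/b&gt; &amp; more
```

The function would be called on each chunk of character data the
parser hands to the application, just before it is written into the
HTML page. The standard library's xml.sax.saxutils.escape performs the
same three replacements.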

| However, with xmlproc, it doesn't seem to call any callback that I
| can find, it simply looks up the entity in its map and returns it,
| or else spits out an error 3021: Undeclared Entity.  So my question
| is: Is there a suggested workaround?

If you don't like the solution above you may want to subclass
XMLProcessor in xmlproc.py and write your own versions of
parse_charref and parse_ent_ref.
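
The override pattern can be sketched as follows. ToyProcessor is a
stand-in written only for this illustration; it is not xmlproc's
XMLProcessor, and the real parse_charref and parse_ent_ref almost
certainly have different signatures, so check xmlproc.py before
adapting this. Only the two method names are taken from xmlproc.

```python
import re

# Numeric character references (&#37;, &#x25;) and entity references (&lt;).
_REF = re.compile(r"&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z_][\w.-]*);")


class ToyProcessor:
    """Stand-in parser that expands references (NOT xmlproc's class)."""

    entities = {"lt": "<", "gt": ">", "amp": "&"}

    def parse_charref(self, ref):
        # ref looks like '#37' (decimal) or '#x25' (hexadecimal).
        if ref[1] == "x":
            return chr(int(ref[2:], 16))
        return chr(int(ref[1:]))

    def parse_ent_ref(self, name):
        # Raises KeyError for undeclared entities.
        return self.entities[name]

    def expand(self, text):
        def repl(match):
            ref = match.group(1)
            if ref.startswith("#"):
                return self.parse_charref(ref)
            return self.parse_ent_ref(ref)
        return _REF.sub(repl, text)


class PreservingProcessor(ToyProcessor):
    """Re-emits every reference exactly as written instead of expanding it."""

    def parse_charref(self, ref):
        return "&" + ref + ";"

    def parse_ent_ref(self, name):
        return "&" + name + ";"


sample = "100&#37; &lt;b&gt;"
expanded = ToyProcessor().expand(sample)
preserved = PreservingProcessor().expand(sample)
```

Here expanded holds the converted text, while preserved is identical
to the input, so the references reach the application untouched.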

Instead of rewriting parse_ent_ref you could also just declare the
entities you need in the DTD, and break into the entity hashtable and
modify the value of '&lt;'. (I can show you how.)

If you don't like any of these solutions, let me know, and we'll think
of something.

Also: do you need an option to disallow element and attribute
declarations in the internal subset? 

--Lars M.