[XML-SIG] Handling of character entity references

Tamito KAJIYAMA kajiyama@grad.sccs.chukyo-u.ac.jp
Tue, 27 May 2003 17:52:39 +0900


Mike Brown <mike@skew.org> writes:
|
| You said you're using SAX to produce HTML from XML, so I assume the XML 
| parser is calling the event handler methods in your ContentHandler. When 
| ContentHandler.characters() is called by the parser to notify your 
| application about character data, a Unicode string is passed as the 
| content argument (as long as expat is your underlying parser). This is 
| probably not how it worked when your application was originally written, 
| prior to the omnipresence of Unicode in Python.
| 
| Whatever mechanism you are using to produce HTML (I'm not going to guess 
| how you're doing that) will be running the Unicode string through an 
| encoder, perhaps just using the built-in encode() method on the Unicode 
| string object, to produce EUC-JP or ISO-2022-JP byte strings for output.
| 
| Of course this isn't automatic, but my point is that (hopefully) your 
| HTML-producing SAX application will be written (by you) such that it 
| does do the encoding (at the last step before output, preferably), and 
| will be smart enough (because you wrote it that way) to write character 
| references when the codec doesn't handle a particular Unicode character. 

Hmm, I'm still missing the point.  Do you mean that there is a
codec with an error handling scheme that translates undefined
characters into appropriate character references?  My SAX
application has (of course) no mechanism that would "write
character references when the codec doesn't handle a particular
Unicode character," since it does not rely on Unicode support
at all.

Let me make the discussion a bit more concrete.  The following
script effectively reproduces the problem that I encountered.
(I rewrote the same conversion logic in SAX2.)

--------------------------------------------------------------
import StringIO, string

from xml.sax import saxutils, sax2exts

class MyHandler(saxutils.DefaultHandler):
    def characters(self, content):
        # Skip whitespace-only chunks; echo everything else as received.
        if string.strip(content):
            print "DEBUG:", repr(content)
            print content

DOC = """\
<?xml version="1.0" encoding="EUC-JP" ?>
<!DOCTYPE doc [
<!ENTITY eacute "&#233;">
]>
<doc>
<p>Isto &eacute; uma caneta.</p>
<p>\244\263\244\354\244\317\245\332\245\363\244\307\244\271\241\243</p>
</doc>
"""

parser = sax2exts.make_parser(["xml.sax.drivers2.drv_xmlproc"])
parser.setContentHandler(MyHandler())
parser.parse(StringIO.StringIO(DOC))
--------------------------------------------------------------

In Python 1.5.2, the method ContentHandler.characters() ends up
receiving byte strings in both EUC-JP and Latin-1 encodings.
That's why I had to reinvent the wheel (namely, the "char" tag).
On the other hand, in Python 2.x, the script will raise an error
like this:

UnicodeEncodeError: 'ascii' codec can't encode character '\ue9' in position 0: ordinal not in range(128)
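
Both symptoms can be reproduced without a parser.  The following is
only a minimal sketch, written in Python 3 syntax (an assumption;
the script above targets Python 1.5.2 and 2.x), and the names
decodes_as_eucjp, latin1_line, eucjp_line and ascii_error are
illustrative, not taken from the original script.

```python
# Python 3 sketch; all names here are illustrative assumptions.

# The two text lines of the sample document as byte strings:
# the first in Latin-1, the second in EUC-JP (same octal bytes as in DOC).
latin1_line = "Isto \u00e9 uma caneta.".encode("latin-1")
eucjp_line = b"\244\263\244\354\244\317\245\332\245\363\244\307\244\271\241\243"


def decodes_as_eucjp(data):
    """Return True if the byte string is valid EUC-JP."""
    try:
        data.decode("euc_jp")
        return True
    except UnicodeDecodeError:
        return False


# Symptom 1 (Python 1.5.2): a Latin-1 byte in output that is supposed
# to be EUC-JP makes that output invalid as EUC-JP.
print(decodes_as_eucjp(eucjp_line))
print(decodes_as_eucjp(latin1_line))

# Symptom 2 (Python 2.x): encoding a non-ASCII character with the
# "ascii" codec and the default "strict" error handling raises.
try:
    "\u00e9".encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc
print(ascii_error)
```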

AFAIK, the standard "ascii" codec has no such nifty feature that
would automatically translate unknown characters into appropriate
character references.  So, I can't see what you meant in the last
paragraph quoted above.  Could you please elaborate on the
assumption behind it?  (What I'm afraid of is that I might be
missing something new and important in recent versions of Python
and PyXML.)
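
One mechanism that would fit the description, where it is available,
is the "xmlcharrefreplace" error handler for encode(), introduced by
PEP 293 in Python 2.3.  The following is only a sketch under stated
assumptions, not the approach of either application discussed here:
it is written in Python 3 syntax, it uses the standard expat-based
xml.sax parser rather than xmlproc, it feeds the document as UTF-8 to
sidestep the question of which input encodings the parser accepts,
and the class name CharRefHandler is hypothetical.  ISO-2022-JP is
chosen as the output encoding because it cannot represent e-acute.

```python
import xml.sax


class CharRefHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collect Unicode text, encode only at the end."""

    def __init__(self, encoding):
        super().__init__()
        self.encoding = encoding
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces; keep them all.
        self.chunks.append(content)

    def output(self):
        # Encode at the last step before output. Characters the target
        # encoding cannot represent become numeric character references.
        text = "".join(self.chunks)
        return text.encode(self.encoding, "xmlcharrefreplace")


# Same content as the sample document, but supplied as UTF-8 (assumption).
DOC = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE doc [\n<!ENTITY eacute "&#233;">\n]>\n'
    "<doc><p>Isto &eacute; uma caneta.</p>"
    "<p>\u3053\u308c\u306f\u30da\u30f3\u3067\u3059\u3002</p></doc>"
).encode("utf-8")

handler = CharRefHandler("iso2022_jp")
xml.sax.parseString(DOC, handler)
result = handler.output()
print(repr(result))
```

The Japanese text comes out as ordinary ISO-2022-JP, and only the
character that the encoding lacks is written as a reference.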

Thanks,

-- 
KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>