[XML-SIG] Handling of character entity references

Mike Brown mike@skew.org
Tue, 27 May 2003 03:40:00 -0600


Tamito KAJIYAMA wrote:

>Mike Brown <mike@skew.org> writes:
>|
>| You said you're using SAX to produce HTML from XML, so I assume the XML 
>| parser is calling the event handler methods in your ContentHandler. When 
>| ContentHandler.characters() is called by the parser to notify your 
>| application about character data, a Unicode string is passed as the 
>| content argument (as long as expat is your underlying parser). This is 
>| probably not how it worked when your application was originally written, 
>| prior to the omnipresence of Unicode in Python.
>| 
>| Whatever mechanism you are using to produce HTML (I'm not going to guess 
>| how you're doing that) will be running the Unicode string through an 
>| encoder, perhaps just using the built-in encode() method on the Unicode 
>| string object, to produce EUC-JP or ISO-2022-JP byte strings for output.
>| 
>| Of course this isn't automatic, but my point is that (hopefully) your 
>| HTML-producing SAX application will be written (by you) such that it 
>| does do the encoding (at the last step before output, preferably), and 
>| will be smart enough (because you wrote it that way) to write character 
>| references when the codec doesn't handle a particular Unicode character. 
>
>Hmm, I've still missed the point.  Do you mean that there is a
>codec with an error handling scheme that translates undefined
>characters into appropriate character references?
>

No.

>  The SAX
>application of mine does not have (of course) such a mechanism
>that would "write character references when the codec doesn't
>handle a particular Unicode character," since it does not rely
>on Unicode support at all.
>  
>
And that's a problem now, as your script below demonstrates.

>Let me make the discussion a bit more concrete.  The following
>script effectively reproduces the problem that I encountered.
>(I rewrote the same conversion logic in SAX2.)
>
>--------------------------------------------------------------
>import StringIO, string
>
>from xml.sax import saxutils, sax2exts
>
>class MyHandler(saxutils.DefaultHandler):
>    def characters(self, content):
>        if string.strip(content):
>            print "DEBUG:", repr(content)
>            print content
>
>DOC = """\
><?xml version="1.0" encoding="EUC-JP" ?>
><!DOCTYPE doc [
><!ENTITY eacute "&#233;">
>]>
><doc>
><p>Isto &eacute; uma caneta.</p>
><p>\244\263\244\354\244\317\245\332\245\363\244\307\244\271\241\243</p>
></doc>
>"""
>
>parser = sax2exts.make_parser(["xml.sax.drivers2.drv_xmlproc"])
>parser.setContentHandler(MyHandler())
>parser.parse(StringIO.StringIO(DOC))
>--------------------------------------------------------------
>
>In Python 1.5.2, the method ContentHandler.characters() ends up
>receiving byte strings in both EUC-JP and Latin-1 encodings.
>That's why I had to reinvent the wheel (namely, the "char" tag).
>  
>
Right.

>On the other hand, in Python 2.x, the script will raise an error
>like this:
>
>UnicodeEncodeError: 'ascii' codec can't encode character '\ue9' in position 0: ordinal not in range(128)
>
>AFAIK, the standard "ascii" codec does not have such a nifty
>feature that would automatically translate unknown characters
>into appropriate character references.  So, I can't see what you
>meant in the last paragraph quoted above.  Could you please
>elaborate your assumption in the last paragraph?  (What I'm
>afraid is that I might miss something new and important in
>recent versions of Python and PyXML.)
>  
>
I am trying to say that your application does not have to rely on your 
'char' tag hack under Python 2.x because you are now *able* to write it 
in such a way that it doesn't do something foolish like "print content" 
when content is a Unicode string and sys.stdout is an ASCII console. :)

For example, if you change that print to

print ''.join([c.encode('ascii', 'ignore') or "&#%d;" % ord(c)
               for c in content])

then you will at least be able to see it on your terminal, serialized 
with all non-ASCII characters represented by NCRs.

If you were writing to a file rather than sys.stdout, you would want to 
change the 'ascii' in that line to 'EUC-JP' or whatever encoding the 
file should use.
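
The one-liner above can be spelled out as a small function. This is only 
a sketch: the name encode_with_ncrs is made up for illustration, and it 
is written to run on a current Python (where text is str and encoded 
output is bytes) rather than on the 2.x interpreter discussed in this 
thread. Python 2.3 and later also accept 'xmlcharrefreplace' as the 
errors argument to encode(), which should give the same result in a 
single call; the last check compares the two.

```python
def encode_with_ncrs(text, encoding="ascii"):
    """Encode text, writing &#N; for any character the codec rejects.

    Hypothetical helper, not part of PyXML or the standard library.
    """
    chunks = []
    for ch in text:
        try:
            # Try the target encoding first, one character at a time.
            chunks.append(ch.encode(encoding))
        except UnicodeError:
            # Fall back to a numeric character reference (decimal).
            chunks.append(("&#%d;" % ord(ch)).encode("ascii"))
    return b"".join(chunks)


sample = u"Isto \xe9 uma caneta."
print(encode_with_ncrs(sample))  # b'Isto &#233; uma caneta.'

# The built-in error handler should agree with the hand-written loop.
assert encode_with_ncrs(sample) == sample.encode("ascii", "xmlcharrefreplace")
```

Encoding one character at a time keeps the logic obvious, but it is 
wasteful for a stateful encoding such as ISO-2022-JP, where each call 
emits its own escape sequences. For real output the built-in error 
handler is probably the better choice.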