[XML-SIG] Handling of character entity references

Walter Dörwald walter@livinglogic.de
Tue, 27 May 2003 15:25:23 +0200


Tamito KAJIYAMA wrote:

> [...]
> 
> In Python 1.5.2, the method ContentHandler.characters() ends up
> receiving byte strings in both EUC-JP and Latin-1 encodings.
> That's why I had to reinvent the wheel (namely, the "char" tag).
> On the other hand, in Python 2.x, the script will raise an error
> like this:
> 
> UnicodeEncodeError: 'ascii' codec can't encode character '\ue9' in position 0: ordinal not in range(128)
> 
> AFAIK, the standard "ascii" codec does not have such a nifty
> feature that would automatically translate unknown characters
> into appropriate character references.

The "ascii" codec doesn't, but Python 2.3 (which you seem to be
using) will introduce codec error callbacks (see PEP 293),
so Mike's example:

     ''.join([c.encode('ascii', 'ignore') or "&#%d;" % ord(c) \
                 for c in u'\u8314äöüß?abc'])

could be shortened to:

     u'\u8314äöüß?abc'.encode('ascii', 'xmlcharrefreplace')

with Python 2.3.
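
To make the equivalence concrete, here is a minimal sketch that compares a hand-written replacement loop with the built-in handler. It assumes a current Python 3 interpreter rather than 2.3, so string literals are already Unicode and encode() returns bytes; the variable names are only for illustration.

```python
# Sketch only: assumes Python 3, where str is Unicode and encode() gives bytes.
text = '\u8314äöüß?abc'

# Manual approach: keep ASCII characters and replace everything else
# with a decimal character reference of the form &#N;.
manual = ''.join(
    c if ord(c) < 128 else '&#%d;' % ord(c)
    for c in text
)

# Built-in error handler introduced by PEP 293.
builtin = text.encode('ascii', 'xmlcharrefreplace').decode('ascii')

print(builtin)  # &#33556;&#228;&#246;&#252;&#223;?abc
```

Both approaches should give the same string; the built-in handler avoids the per-character Python loop.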

> So, I can't see what you
> meant in the last paragraph quoted above.  Could you please
> elaborate your assumption in the last paragraph?  (What I'm
> afraid is that I might miss something new and important in
> recent versions of Python and PyXML.)

You might want to read the PEP at:
    http://www.python.org/peps/pep-0293.html
and maybe the test script:
    http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Lib/test/test_codeccallbacks.py
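
As a rough idea of what the PEP makes possible beyond the predefined handlers, the following sketch registers a custom error callback that emits hexadecimal character references. The handler name "hexcharref" is made up for this example, and the code assumes a current Python 3 interpreter.

```python
import codecs

def hexcharref(exc):
    # "hexcharref" is a hypothetical name chosen for this sketch.
    # Only encoding errors are handled; anything else is re-raised.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # exc.object[exc.start:exc.end] is the run of unencodable characters.
    bad = exc.object[exc.start:exc.end]
    replacement = ''.join('&#x%X;' % ord(c) for c in bad)
    # Return the replacement text and the position where encoding resumes.
    return replacement, exc.end

codecs.register_error('hexcharref', hexcharref)

result = 'caf\xe9'.encode('ascii', 'hexcharref')
print(result)  # b'caf&#xE9;'
```

The callback receives the exception object and decides what to substitute, which is the same mechanism that 'xmlcharrefreplace' uses internally.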

Bye,
    Walter Dörwald