[XML-SIG] Handling of character entity references
Walter Dörwald
walter@livinglogic.de
Tue, 27 May 2003 15:25:23 +0200
Tamito KAJIYAMA wrote:
> [...]
>
> In Python 1.5.2, the method ContentHandler.characters() ends up
> receiving byte strings in both EUC-JP and Latin-1 encodings.
> That's why I had to reinvent the wheel (namely, the "char" tag).
> On the other hand, in Python 2.x, the script will raise an error
> like this:
>
> UnicodeEncodeError: 'ascii' codec can't encode character '\ue9' in position 0: ordinal not in range(128)
>
> AFAIK, the standard "ascii" codec does not have such a nifty
> feature that would automatically translate unknown characters
> into appropriate character references.
The "ascii" codec doesn't, but Python 2.3 (which you seem to be
using) will introduce codec error callbacks (see PEP 293),
so Mike's example:
   ''.join([c.encode('ascii', 'ignore') or "&#%d;" % ord(c)
            for c in u'\u8314äöüß?abc'])
could be shortened to:
   u'\u8314äöüß?abc'.encode('ascii', 'xmlcharrefreplace')
with Python 2.3.
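To make the equivalence concrete, here is a minimal sketch comparing a hand-written loop with the built-in 'xmlcharrefreplace' handler. The helper name manual() is only illustrative, and the non-ASCII characters are written as escapes so the snippet does not depend on the source file encoding. The loop is a version-neutral rewrite of the idea in Mike's one-liner, not his exact code.

```python
# Sample string from the example above, with non-ASCII characters escaped.
text = u'\u8314\xe4\xf6\xfc\xdf?abc'

def manual(s):
    """Hypothetical helper: replace each non-ASCII character by hand."""
    parts = []
    for c in s:
        if ord(c) < 128:
            parts.append(c)                    # plain ASCII passes through
        else:
            parts.append(u'&#%d;' % ord(c))    # decimal character reference
    return u''.join(parts).encode('ascii')

# The built-in error handler introduced by PEP 293 does the same in one call.
shortcut = text.encode('ascii', 'xmlcharrefreplace')

assert manual(text) == shortcut
```

Both produce the ASCII-only result &#33556;&#228;&#246;&#252;&#223;?abc for this input.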
> So, I can't see what you
> meant in the last paragraph quoted above. Could you please
> elaborate your assumption in the last paragraph? (What I'm
> afraid is that I might miss something new and important in
> recent versions of Python and PyXML.)
You might want to read the PEP at:
http://www.python.org/peps/pep-0293.html
and maybe the test script:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Lib/test/test_codeccallbacks.py
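As a rough illustration of what the PEP allows beyond the built-in handlers, the sketch below registers a custom error callback with codecs.register_error(). The handler name 'hexcharref' and the function hexcharref() are made up for this example; only register_error() and the (replacement, resume position) return convention come from the PEP.

```python
import codecs

def hexcharref(exc):
    """Hypothetical callback: emit hexadecimal character references."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc                      # only encoding errors are handled here
    bad = exc.object[exc.start:exc.end]
    repl = u''.join([u'&#x%x;' % ord(c) for c in bad])
    # Return the replacement text and the position at which to resume encoding.
    return repl, exc.end

# Make the callback available under a name usable as the 'errors' argument.
codecs.register_error('hexcharref', hexcharref)

text = u'\u8314\xe4\xf6\xfc\xdf?abc'
builtin = text.encode('ascii', 'xmlcharrefreplace')  # decimal references
custom = text.encode('ascii', 'hexcharref')          # hexadecimal references
```

Here builtin uses decimal references (&#33556; and so on), while custom yields &#x8314;&#xe4;&#xf6;&#xfc;&#xdf;?abc.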
Bye,
Walter Dörwald