Unicode error handler

Tue Jan 30 11:28:52 EST 2007

Rares Vernica wrote:
> Hi,
> 
> Does anyone know of any Unicode encode/decode error handler that does a 
> better replace job than the default replace error handler?
> 
> For example I have an iso-8859-1 string that has an 'e' with an accent 
> (you know, the French 'e's). When I use s.encode('ascii', 'replace') the 
> 'e' will be replaced with '?'. I would prefer it to be replaced with an 'e' 
> even if I know it is not 100% correct.
> 
> If this letter were the only problem I would do it manually, but 
> there is an entire set of letters that need to be replaced with their 
> closest ascii letter.
> 
> Is there an encode/decode error handler that can replace all the 
> non-ASCII letters from iso-8859-1 with their closest ASCII letter?

You might try the following:

# -*- coding: iso-8859-1 -*-

import unicodedata, codecs

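def transliterate(exc):
	# This handler only makes sense for encoding errors.
	if not isinstance(exc, UnicodeEncodeError):
		raise TypeError("don't know how to handle %r" % exc)
	# Decompose the offending character (NFD), keep only its base
	# character, and resume encoding right after it.
	return (unicodedata.normalize("NFD", exc.object[exc.start])[:1],
		exc.start+1)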

codecs.register_error("transliterate", transliterate)

print u"Frédéric Chopin".encode("ascii", "transliterate")

Running this script gives you:
$ python transliterate.py
Frederic Chopin
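
One caveat: some iso-8859-1 letters (for example 'ø', U+00F8) have no
canonical decomposition, so NFD returns them unchanged and the handler
above would hand a non-ASCII replacement back to the codec, which then
fails. Below is a minimal sketch of a variant that falls back to '?' in
that case. It assumes Python 3 (the script above is Python 2), and the
names transliterate_with_fallback and "transliterate_sketch" are made up
for this illustration:

```python
import codecs
import unicodedata

def transliterate_with_fallback(exc):
    # Hypothetical variant of the handler above, written for Python 3.
    if not isinstance(exc, UnicodeEncodeError):
        raise TypeError("don't know how to handle %r" % exc)
    pieces = []
    # exc.start:exc.end covers the run of characters the codec rejected.
    for ch in exc.object[exc.start:exc.end]:
        base = unicodedata.normalize("NFD", ch)[:1]
        # Keep the base character only if it is plain ASCII;
        # otherwise fall back to '?', like the 'replace' handler.
        pieces.append(base if ord(base) < 128 else u"?")
    return (u"".join(pieces), exc.end)

# "transliterate_sketch" is an arbitrary name chosen for this example.
codecs.register_error("transliterate_sketch", transliterate_with_fallback)

# \xe9 is 'e' with an acute accent, written as an escape so the
# example does not depend on the source file encoding.
result = u"Fr\xe9d\xe9ric Chopin".encode("ascii", "transliterate_sketch")
print(result)
```

Letters such as 'ß' or 'æ' would also end up as '?' with this sketch;
mapping them to "ss" or "ae" would need an explicit lookup table.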

Hope that helps.

Servus,
   Walter