Unicode error handler

Robert Kern robert.kern at gmail.com
Fri Jan 26 15:29:23 EST 2007


Rares Vernica wrote:
> Is there an encode/decode error handler that can replace all the 
> not-ascii letters from iso-8859-1 with their closest ascii letter?

No, but IBM's ICU library can transform one script to another in very flexible
and capable ways. One such configuration can do what you ask.

  http://www-306.ibm.com/software/globalization/icu/index.jsp
  http://icu.sourceforge.net/userguide/Transform.html

Unfortunately, I don't think any of the available ICU bindings for Python have
exposed this functionality. If you wanted to contribute such a binding, you
might want to start with PyICU. It seems to be the most actively developed of
the bindings.

  http://pyicu.osafoundation.org/

Of course, that's overkill for this problem. Those transformations can handle
such things as this:

  Αλφαβητικός Κατάλογος  ->  Alphabētikós Katálogos

The number of characters in iso-8859-1 that you would want to transliterate is
not all that large. You could spend a little bit of time going through the
character set and making a translation map for str.translate().
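
A minimal sketch of such a map, assuming Python 3, where str.translate()
accepts a dict keyed by code point. The groupings below and the names
LATIN1_TO_ASCII and to_ascii are illustrative choices of mine, not a standard
table, and the "closest" letter for some characters (Ø, Ð, Þ) is a matter of
taste:

```python
# Hypothetical, hand-built groups: ASCII replacement -> Latin-1 letters it covers.
# Adjust to taste; this is not an official or complete transliteration table.
GROUPS = {
    "A": "ÀÁÂÃÄÅ", "a": "àáâãäå",
    "C": "Ç", "c": "ç",
    "E": "ÈÉÊË", "e": "èéêë",
    "I": "ÌÍÎÏ", "i": "ìíîï",
    "N": "Ñ", "n": "ñ",
    "O": "ÒÓÔÕÖØ", "o": "òóôõöø",
    "U": "ÙÚÛÜ", "u": "ùúûü",
    "Y": "Ý", "y": "ýÿ",
    "AE": "Æ", "ae": "æ",
    "ss": "ß",
    "D": "Ð", "d": "ð",
    "Th": "Þ", "th": "þ",
}

# str.translate() wants {code point: replacement string}.
LATIN1_TO_ASCII = {
    ord(ch): ascii_text
    for ascii_text, latin1_chars in GROUPS.items()
    for ch in latin1_chars
}


def to_ascii(text):
    """Replace the mapped Latin-1 letters; everything else passes through."""
    return text.translate(LATIN1_TO_ASCII)


if __name__ == "__main__":
    print(to_ascii("Crème brûlée"))
```

Characters that are not in the map (punctuation such as « » or symbols such as
£) are left untouched, so you would still need to decide what to do with them
before encoding to ASCII.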

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco



