Unicode: matching a word and unaccenting characters

rndblnch at gmail.com
Wed Nov 14 20:33:00 EST 2007


On Nov 15, 1:21 am, Jeremie Le Hen <jere... at le-hen.org> wrote:
> (Mail resent with the proper subject.)
>
> Hi list,
>
> (Please Cc: me when replying, as I'm not subscribed to this list.)
I don't know your email address, so I hope you will come back and check
the list archive...

> I'm working with Unicode strings to handle accented characters but I'm
> experiencing a few problems.

[skipped first question]

> Secondly, I need to translate accented characters to their unaccented
> form.  I've written this function (sorry if the code isn't as efficient
> as possible, I'm not a long-time Python programmer, feel free to correct
> me, I'll be glad to learn anything):
>
> % def unaccent(s):
> %         """
> %         """
> %
> %         if not isinstance(s, types.UnicodeType):
> %                 return s
> %         singleletter_re = re.compile(r'(?:^|\s)([A-Z])(?:$|\s)')
> %         result = ''
> %         for l in s:
> %                 desc = unicodedata.name(l)
> %                 m = singleletter_re.search(desc)
> %                 if m is None:
> %                         result += str(l)
> %                         continue
> %                 result += m.group(1).lower()
> %         return result
> %
>
> But I don't feel comfortable with it.  It strongly depends on the UCD
> file format, and names that don't contain a single letter obviously
> cannot all be converted to ASCII.  How would you implement this
> function?
my 2 cents:

<unaccent.py>
# -*- coding: utf-8 -*-
import unicodedata

def unaccent(s):
   u"""
   >>> unaccent(u"Ça crée déjà l'évènement")
   "Ca cree deja l'evenement"
   """

   s = unicodedata.normalize('NFD',
                             unicode(s.encode("utf-8"), encoding="utf-8"))
   return "".join(b for b in s.encode("utf-8") if ord(b) < 128)

def _test():
   import doctest
   doctest.testmod()

if __name__ == "__main__":
   import sys
   sys.exit(_test())
</unaccent.py>
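
For readers on Python 3, where the `unicode` built-in is gone and iterating
over bytes yields integers, the same idea could be sketched roughly as
follows. This is only a port under assumptions, not the code posted above:
the name `unaccent_py3` is hypothetical, and the byte filter is replaced by
an ASCII encode that ignores whatever it cannot represent.

```python
import unicodedata

def unaccent_py3(s):
    # Hypothetical Python 3 port of the idea above (name is illustrative).
    # NFD splits each accented character into a base letter followed by
    # one or more combining marks.
    decomposed = unicodedata.normalize("NFD", s)
    # Encoding to ASCII with errors="ignore" drops every non-ASCII code
    # point, which includes the combining marks left by the decomposition.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(unaccent_py3("\u00c7a cr\u00e9e d\u00e9j\u00e0 l'\u00e9v\u00e8nement"))
# prints: Ca cree deja l'evenement
```

Note that characters without a canonical decomposition (for example "ø")
are dropped entirely rather than mapped to a base letter, just as with the
byte filter.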

> Thank you for your help.
you are welcome.

(left to the reader:
- why does it work?
- why does doctest work?)
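
A hint for the first question, as a small Python 3 sketch (the variable
names are only illustrative): after NFD normalization an accented letter
becomes a plain base letter plus a combining mark, and every UTF-8 byte of
that mark is 128 or above, so a filter that keeps only values below 128
discards it.

```python
import unicodedata

# "\u00e9" is e-acute stored as a single precomposed code point.
decomposed = unicodedata.normalize("NFD", "\u00e9")
base, mark = decomposed  # NFD yields exactly two code points here

print(unicodedata.name(base))      # LATIN SMALL LETTER E
print(unicodedata.name(mark))      # COMBINING ACUTE ACCENT
print(list(mark.encode("utf-8")))  # [204, 129]
```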

renaud

> Regards,
> --
> Jeremie Le Hen
> < jlehen at clesys dot fr >
>
> ----- End forwarded message -----



