Unicode: matching a word and unaccenting characters

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Wed Nov 14 20:27:25 EST 2007


On Wed, 14 Nov 2007 21:21:55 -0300, Jeremie Le Hen <jeremie at le-hen.org>
wrote:

> (Please Cc: me when replying, as I'm not subscribed to this list.)

Not a good thing. *I* may CC you now, but any further replies and comments
from other people may leave the CC out. You can always browse this
newsgroup at Google http://groups.google.com/group/comp.lang.python or
Gmane http://dir.gmane.org/gmane.comp.python.general

> The first one is about regular expressions: I want to match a word
> composed of alphabetic characters only.  One can easily use '[a-zA-Z]+'
> when working in ASCII, but unfortunately there is no equivalent when
> working with unicode strings: the latter doesn't match accented
> characters.  The only means the re package provides is '\w' along with
> the re.UNICODE flag, but unfortunately it also matches digits and
> underscore.  It appears there is no suitable solution for this
> currently.  Am I right?

I think you're right, unfortunately.
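
There is no dedicated "letters only" class in re, but a workaround that may
be good enough is a double negation: "anything that is not a non-word
character, not a digit and not an underscore". A minimal sketch, written
for Python 3 (where str patterns are Unicode-aware by default, so the
re.UNICODE flag is kept only to make the intent explicit); the names LETTERS
and find_words are made up for this example:

```python
import re

# "\w minus digits minus underscore", expressed as a negated class:
# [^\W\d_] matches any character that is a word character (\w) but is
# neither a digit (\d) nor an underscore.
LETTERS = re.compile(r'[^\W\d_]+', re.UNICODE)

def find_words(text):
    """Return every run of letters in text, accented letters included."""
    return LETTERS.findall(text)
```

Called on a string containing "déjà vu 123 foo_bar" it should pick up
"déjà", "vu", "foo" and "bar" and skip the digits and the underscore. What
counts as a letter is whatever the re module considers alphanumeric, so
check it against your own data before relying on it.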

> Secondly, I need to translate accented characters to their unaccented
> form.  I've written this function (sorry if the code isn't as efficient
> as possible, I'm not a long-time Python programmer, feel free to correct
> me, I'll be glad to learn anything):

It's hard to do it right - this is another version:  
http://www.effbot.org/zone/unicode-convert.htm
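
For comparison, a common approach (not necessarily what the page above
does) is to decompose each character with the standard unicodedata module
and drop the combining marks. A minimal sketch in Python 3; the function
name strip_accents is made up for this example:

```python
import unicodedata

def strip_accents(text):
    """Return text with combining accents removed.

    NFD normalization splits a precomposed character such as U+00E9
    (e with acute) into a base letter plus a combining mark; the marks
    are then filtered out.
    """
    decomposed = unicodedata.normalize('NFD', text)
    # unicodedata.combining() returns 0 for base characters and a
    # non-zero combining class for accents and similar marks.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
```

This is part of why it is hard to do right: letters that have no canonical
decomposition, such as "ø" or "ß", pass through unchanged, so they would
still need an explicit replacement table if you want plain ASCII output.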

-- 
Gabriel Genellina



