Unicode: matching a word and unaccenting characters

Thu Nov 15 18:08:40 EST 2007

On Nov 15, 12:21 am, Jeremie Le Hen <jere... at le-hen.org> wrote:
> (Mail resent with the proper subject.)
>
> Hi list,
>
> (Please Cc: me when replying, as I'm not subscribed to this list.)
>
> I'm working with Unicode strings to handle accented characters, but I'm
> experiencing a few problems.
>
> The first one is with regular expressions.  Suppose I want to match a
> word composed of characters only.  One can easily use '[a-zA-Z]+' when
> working in ASCII, but unfortunately there is no equivalent when working
> with Unicode strings: that pattern doesn't match accented characters.
> The only means the re package provides is '\w' along with the re.UNICODE
> flag, but unfortunately it also matches digits and underscores.  It
> appears there is no suitable solution for this currently.  Am I right?
>
[snip]
You can match a single character with '\w' and then ensure that it
isn't a digit or underscore with a negative lookbehind '(?<![\d_])'.
The two have to be grouped so that the '+' repeats both together, so
to match only words consisting of characters (in the sense you mean),
use '(?:\w(?<![\d_]))+' together with the re.UNICODE flag.
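
Here is a minimal sketch of that pattern in use. The function name
find_words and the sample text are only for illustration. On Python 3,
'\w' is already Unicode-aware for str patterns, so the re.UNICODE flag
is redundant there, though it is still accepted.

```python
import re

# \w matches one word character; the negative lookbehind then looks back
# at the character just consumed and rejects it if it is a digit or an
# underscore. The non-capturing group lets '+' repeat the pair as a unit.
WORD_RE = re.compile(r'(?:\w(?<![\d_]))+', re.UNICODE)

def find_words(text):
    """Return every run of letters found in text (hypothetical helper)."""
    return WORD_RE.findall(text)

if __name__ == '__main__':
    print(find_words('caf\u00e942 a_b na\u00efve'))
```

Digits and underscores act as separators here, so 'a_b' comes back as
two words, 'a' and 'b'. I believe the negated class '[^\W\d_]+' does
the same job without a lookbehind, if you find that easier to read.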