Regular Expression for words (with umlauts, without numbers)

Tim Chon devchon at gmail.com
Fri May 13 12:14:47 EDT 2011


Hallo Jens,

With the current Python re module, you have to do something like:

((?!\d|_)\w)+

which uses a negative lookahead to match every word character except
digits and the underscore. Of course, if you turn on the Unicode flag re.U,
or use it inline as (?u), this will also match the umlauts you want.
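
A minimal sketch of the lookahead approach, assuming the intended pattern
is ((?!\d|_)\w)+ written with a non-capturing group so that findall()
returns whole matches. The names WORD and find_words are only
illustrative. On Python 3, str patterns are already Unicode-aware, so
re.UNICODE is redundant there but harmless:

```python
import re

# \w minus digits and underscore: at each position the lookahead rejects
# a digit or "_" before \w is allowed to consume one character.
WORD = re.compile(r"(?:(?!\d|_)\w)+", re.UNICODE)


def find_words(text):
    """Return every run of letters in text, umlauts included."""
    return WORD.findall(text)


if __name__ == "__main__":
    print(find_words("Über 3 Äpfel und foo_bar2"))
```

Note that digits and underscores act as separators here, so "foo_bar2"
yields the two fragments "foo" and "bar" rather than being skipped as a
whole.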

However, if you have an extra 20 minutes, I'd actually recommend using
Regexp 2.7:
http://bugs.python.org/issue2636

It's a much-needed improvement over F. Lundh's re implementation (from 1999!)
and it's 40% faster. Moreover, you can do exactly what you are asking for,
like so:

(?u)[[:alpha:]]+
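
A hedged sketch of how this might be used, assuming the module from the
issue above is installed under the name regex (as it is distributed on
PyPI). The fallback to the standard re module with [^\W\d_]+ is my own
addition and only roughly equivalent; ALPHA_WORD and alpha_words are
illustrative names:

```python
try:
    # Third-party module (assumed installed); supports POSIX classes.
    import regex as re_impl
    ALPHA_WORD = re_impl.compile(r"(?u)[[:alpha:]]+")
except ImportError:
    # Standard library fallback: word characters minus digits and "_".
    import re as re_impl
    ALPHA_WORD = re_impl.compile(r"[^\W\d_]+")


def alpha_words(text):
    """Return every run of alphabetic characters in text."""
    return ALPHA_WORD.findall(text)


if __name__ == "__main__":
    print(alpha_words("Grüße an 2 Mädchen"))
```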

cheers,
--tim

On Fri, May 13, 2011 at 9:01 AM, Jens Lechtenboerger <
lechten at helios.uni-muenster.de> wrote:

> Dear experts,
>
> I'm looking for a regular expression to recognize natural language
> words with umlauts but without numbers.  While \w with re.U does
> recognize words with umlauts, it also matches numbers, which I do
> not want.
>
> Is there a better way than an exhaustive enumeration such as
> [-a-zàáâãäåæ...]?
>
> I guess there should be a better way as \w appears to know about
> alphabetical characters...
>
> Thanks in advance
> Jens
> --
> http://mail.python.org/mailman/listinfo/python-list
>