Regex similar to "^(?u)\w$", but without digits?

Mark Tolonen metolone+gmane at gmail.com
Mon Apr 13 00:21:27 EDT 2009


"Andreas Pfrengle" <a.pfrengle at gmail.com> wrote in message 
news:26d3bec3-8329-4432-a680-05c17f930a6a at 3g2000yqk.googlegroups.com...
> On 12 Apr., 02:31, "Mark Tolonen" <metolone+gm... at gmail.com> wrote:
>> "Andreas" <a.pfren... at gmail.com> wrote in message
>>
>> news:f953c845-3660-4bb5-8ba7-00b93989cd20 at b1g2000vbc.googlegroups.com...
>>
>> > Hello,
>>
>> > I'd like to create a regex that captures any unicode character, but
>> > not the underscore and the digits 0-9. "^(?u)\w$" captures them also.
>> > Is there a possibility to restrict an expression like "\w" to "\w
>> > without [0-9_]"?
>>
>> '(?u)[^\W0-9_]' removes 0-9_ from \w.
>>
>> -Mark
>
> Hello Mark,
>
> haven't tried it yet, but it looks good!
> @John: Sorry for being imprecise, I meant *letters*, not *characters*,
> so requirement 2 fits my needs.

Note that \w matches alphanumeric Unicode characters.  If you only want 
letters, keep in mind that superscripts (¹²³), fractions (¼½¾), and other 
characters also count as numbers in Unicode, so removing 0-9 alone does not 
exclude them.  See the unicodedata.category function and 
http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values.
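
To make the point about numeric characters concrete, here is a small sketch 
that prints the general category of a few sample characters next to whether 
'[^\W0-9_]' accepts them.  It assumes Python 3, where str patterns are 
Unicode-aware by default, so the (?u) flag is dropped; the sample characters 
are my own choice for illustration.

```python
import re
import unicodedata

# Assumption: Python 3, where \w is Unicode-aware for str patterns without (?u).
pattern = re.compile(r'[^\W0-9_]')

# a, e-acute, 7, underscore, superscript one, one half
samples = ['a', '\u00e9', '7', '_', '\u00b9', '\u00bd']

for ch in samples:
    category = unicodedata.category(ch)           # e.g. 'Ll', 'Nd', 'No'
    matched = pattern.fullmatch(ch) is not None   # True if the class accepts ch
    print(repr(ch), category, matched)
```

Characters whose category starts with 'N' but that are not 0-9 (such as the 
superscript and the fraction) are the ones to watch for in the output.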

If you only want letters as defined by the Unicode standard, something like 
this would give you a character class of only Unicode letters in the Basic 
Multilingual Plane (it could be optimized to list ranges of characters):

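import unicodedata as ud
# Character class of every BMP code point whose general category starts with 'L'
u'(?u)[' + u''.join(unichr(n) for n in xrange(65536) if 
ud.category(unichr(n))[0]=='L') + u']'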
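
The range optimization mentioned above could look roughly like the following.  
This is only a sketch, written for Python 3 (chr and range instead of unichr 
and xrange); the function name letter_class and its limit parameter are 
hypothetical names chosen for this example, and the default limit keeps the 
same 65536 code point bound as the expression above.

```python
import re
import unicodedata

def letter_class(limit=0x10000):
    """Return a regex character class of letter ranges below `limit`.

    Hypothetical helper: collapses runs of consecutive code points whose
    general category starts with 'L' into 'lo-hi' ranges.
    """
    parts = []
    start = None
    # Iterate one step past the end so a run touching the limit is closed.
    for n in range(limit + 1):
        is_letter = n < limit and unicodedata.category(chr(n)).startswith('L')
        if is_letter and start is None:
            start = n                      # a new run of letters begins
        elif not is_letter and start is not None:
            end = n - 1                    # the run ended at the previous code point
            if start == end:
                parts.append(chr(start))
            else:
                parts.append(chr(start) + '-' + chr(end))
            start = None
    # Letters are never regex metacharacters, so no escaping is needed here.
    return '[' + ''.join(parts) + ']'

letters = re.compile(letter_class())

for ch in ['a', '\u03bb', '7', '_', '\u00bd']:   # a, lambda, 7, underscore, one half
    print(repr(ch), letters.fullmatch(ch) is not None)
```

Because the class is built from the category data directly, digits, the 
underscore, and numeric characters such as fractions fall outside it without 
being listed explicitly.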

Hmm, maybe Python 3.0 with its default Unicode strings needs a regex 
extension to specify the Unicode category to match.

-Mark




