[Tutor] regex: matching unicode

Sat Dec 22 23:54:17 CET 2012

On Sat, Dec 22, 2012 at 9:53 PM, Albert-Jan Roskam <fomcl at yahoo.com> wrote:

> Hi,
>
> Is the code below the only/shortest way to match unicode characters? I
> would like to match whatever is defined as a character in the unicode
> reference database. So letters in the broadest sense of the word, but not
> digits, underscore or whitespace. Until just now, I was convinced that the
> re.UNICODE flag generalized the [a-z] class to all unicode letters, and
> that the absence of re.U was an implicit 're.ASCII'. Apparently that mental
> model was *wrong*.
> But [^\W\s\d_]+ is kind of hard to read/write.
>
> import re
> s = unichr(956)  # mu sign
> m = re.match(ur"[^\W\s\d_]+", s, re.I | re.U)
>
>
One thought would be to rely on the general category of the character, as
listed in the Unicode database. unicodedata.category will give you what you
need. Here is a list of the categories in the Unicode standard:

http://www.fileformat.info/info/unicode/category/index.htm

So, if you wanted only letters, you could say:

import unicodedata

def is_unicode_character(c):
    assert len(c) == 1
    # Letter categories (Lu, Ll, Lt, Lm, Lo) all start with 'L'
    return unicodedata.category(c).startswith('L')

If the Letter category alone gets you what you need, this is pretty
simple, but if you also need symbols and marks or something, it will start
to get more complicated.
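
To give an idea of what "more complicated" might look like, here is a rough
sketch that accepts a configurable set of major categories. It is written for
Python 3 (an assumption; the code in this thread is Python 2, where unichr
would replace chr), and the helper name in_categories and its prefixes
parameter are made up for illustration.

```python
import unicodedata

def in_categories(c, prefixes=("L",)):
    """Return True if the single character c belongs to one of the
    major Unicode categories named by prefixes.

    Hypothetical helper: "L" covers letters, "M" marks, "S" symbols.
    """
    if len(c) != 1:
        raise ValueError("expected a single character")
    # unicodedata.category returns a two-letter code such as 'Ll' or 'Mn';
    # the first letter is the major category.
    return unicodedata.category(c).startswith(tuple(prefixes))

mu = chr(956)  # GREEK SMALL LETTER MU, category 'Ll'
letters_only = in_categories(mu)
letters_and_marks = in_categories("\u0301", ("L", "M"))  # combining acute accent
```

Each extra category is then just one more prefix, at the cost of checking
the string one character at a time instead of with a single regex.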

Another thought is to match against two separate regexes, one being \w for
alphanumeric (which also lets the underscore through) and the other being
[^\d_] to leave you only with alpha. Not exactly ideal either.
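
A minimal sketch of that two-pass idea, assuming Python 3, where str patterns
are Unicode-aware without re.U. The names WORD, NO_DIGITS and is_alpha_only
are made up for illustration.

```python
import re

WORD = re.compile(r"\w+")            # letters, digits and underscore
NO_DIGITS = re.compile(r"[^\d_]+")   # anything except decimal digits and underscore

def is_alpha_only(s):
    """Return True only if the whole string passes both patterns."""
    return WORD.fullmatch(s) is not None and NO_DIGITS.fullmatch(s) is not None
```

A string has to pass both checks, so whitespace fails the first pattern and
digits or underscores fail the second.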

The last option is to just go with the regex, make sure you write it only
once, and leave a nice comment. That's not too bad.
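
For instance, something along these lines. This is only a sketch, assuming
Python 3 (in Python 2 you would keep the ur"" prefix and pass re.U), and the
constant name LETTERS is made up for illustration. The \s from the original
pattern is left out because whitespace already counts as non-word (\W).

```python
import re

# One or more "letters" in the Unicode sense: start from word characters (\w)
# and exclude decimal digits and the underscore. The double negation is hard
# to read, which is why it lives here, once, behind a name.
LETTERS = re.compile(r"[^\W\d_]+")

m = LETTERS.match(chr(956))  # GREEK SMALL LETTER MU
first_word = LETTERS.match("h\u00e9llo w\u00f6rld").group()
```

Note that \d only covers decimal digits, so if other numeric characters
matter to you it is worth testing them against this pattern.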

Hugo