[Tutor] regex: matching unicode

Sat Dec 22 21:53:56 CET 2012

Hi,

Is the code below the only/shortest way to match unicode characters? I would like to match whatever is defined as a character in the unicode reference database. So letters in the broadest sense of the word, but not digits, underscore or whitespace. Until just now, I was convinced that the re.UNICODE flag generalized the [a-z] class to all unicode letters, and that the absence of re.U was an implicit 're.ASCII'. Apparently that mental model was *wrong*.
But [^\W\s\d_]+ is kind of hard to read/write.

import re
s = unichr(956)  # mu sign
m = re.match(ur"[^\W\s\d_]+", s, re.I | re.U)

Regards,
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a 
fresh water system, and public health, what have the Romans ever done for us?

 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~