Python and Cyrillic characters in regular expression

Fri Sep 5 13:43:14 EDT 2008

phasma wrote:

> string = u"Привет"
> (u'\u041f\u0440\u0438\u0432\u0435\u0442',)
> 
> string = u"Hi.Привет"
> (u'Hi',)

the [\w\s] pattern you used matches letters, digits, underscore, and 
whitespace.  "." doesn't fall into that category, so the "match" method, 
which only matches from the start of the string, stops when it gets to 
that character.
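
a minimal sketch of that behaviour, assuming the original pattern was something like ([\w\s]+) compiled with the UNICODE flag (the exact pattern isn't quoted above, so "pattern" here is a guess, and the names "text", "matched" and "rest" are just for illustration):

```python
import re

# assumed reconstruction of the original pattern; only the [\w\s]
# character class is known from the post
pattern = re.compile(r"([\w\s]+)", re.UNICODE)

# u"Hi." followed by the Cyrillic word, written with escapes so the
# source file needs no encoding declaration
text = u"Hi.\u041f\u0440\u0438\u0432\u0435\u0442"

m = pattern.match(text)   # match() only tries at the start of the string
matched = m.group(1)      # the run of word/space characters before "."
rest = text[m.end():]     # everything the match never reached

print(repr(matched))
print(repr(rest))
```

the match ends right before the ".", so the Cyrillic part is left in "rest" even though every character in it would have matched on its own.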

maybe you could use re.sub or re.findall?

 >>> import re
 >>> # replace all non-alphanumerics with the empty string
 >>> re.sub(r"(?u)\W+", "", string)
u'Hi\u041f\u0440\u0438\u0432\u0435\u0442'

 >>> # find runs of alphanumeric characters
 >>> re.findall(r"(?u)\w+", string)
[u'Hi', u'\u041f\u0440\u0438\u0432\u0435\u0442']
 >>> "".join(re.findall(r"(?u)\w+", string))
u'Hi\u041f\u0440\u0438\u0432\u0435\u0442'

(the "sub" example expects you to specify what characters you want to 
skip, while "findall" expects you to specify what you want to keep.)
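
to make the skip/keep distinction concrete, here's a rough sketch that strips only the "." from a sample string in both styles. the sample text and the names "skip", "keep", "via_sub" and "via_findall" are made up for illustration, and the "keep" pattern reuses the [\w\s] class from the original question:

```python
import re

# made-up sample: "Hi. " followed by the Cyrillic word, using escapes
text = u"Hi. \u041f\u0440\u0438\u0432\u0435\u0442"

# "sub" style: describe what to skip (here, just literal dots)
skip = re.compile(r"\.")
via_sub = skip.sub(u"", text)

# "findall" style: describe what to keep (word characters and whitespace)
keep = re.compile(r"[\w\s]+", re.UNICODE)
via_findall = u"".join(keep.findall(text))

print(repr(via_sub))
print(repr(via_findall))
```

both give the same result for this input, but they can differ on others: the "sub" version leaves anything that isn't a dot untouched, while the "findall" version drops everything outside the class, including punctuation you didn't think of.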

</F>