Python and Cyrillic characters in regular expression

Fredrik Lundh fredrik at pythonware.com
Fri Sep 5 13:43:14 EDT 2008


phasma wrote:

> string = u"Привет"
> (u'\u041f\u0440\u0438\u0432\u0435\u0442',)
> 
> string = u"Hi.Привет"
> (u'Hi',)

the [\w\s] pattern you used matches letters, digits, underscores, and 
whitespace.  "." isn't in any of those classes, so the "match" method, 
which only matches from the start of the string, stops when it reaches 
that character.
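to see where the match ends, here is a minimal sketch.  the exact pattern from the original post isn't quoted above, so `([\w\s]+)` is only a guess at it, and the code assumes Python 3, where str patterns are Unicode-aware without the (?u) flag.

```python
import re

text = "Hi.Привет"

# hypothetical reconstruction of the pattern; the original is not shown
pattern = re.compile(r"([\w\s]+)")

# match() anchors at the start of the string and consumes characters
# only while they belong to the class, so it ends just before "."
m = pattern.match(text)
print(m.groups())  # ('Hi',)
print(m.end())     # 2, the index of the "." that ended the match

# with no punctuation in the way, the whole string is matched
print(pattern.match("Привет").group(1))  # Привет
```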

maybe you could use re.sub or re.findall?

 >>> import re
 >>> # replace runs of non-word characters with the empty string
 >>> re.sub(r"(?u)\W+", "", string)
u'Hi\u041f\u0440\u0438\u0432\u0435\u0442'

 >>> # find runs of word characters (letters, digits, underscore)
 >>> re.findall(r"(?u)\w+", string)
[u'Hi', u'\u041f\u0440\u0438\u0432\u0435\u0442']
 >>> "".join(re.findall(r"(?u)\w+", string))
u'Hi\u041f\u0440\u0438\u0432\u0435\u0442'

(the "sub" example expects you to specify what characters you want to 
skip, while "findall" expects you to specify what you want to keep.)
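the two approaches can be put side by side as small helpers.  this is only a sketch: the function names `strip_non_word` and `keep_word_runs` are made up for illustration, and the code assumes Python 3, where (?u) is implied for str patterns.

```python
import re

def strip_non_word(text):
    # "sub" approach: name what to skip (runs of non-word characters)
    return re.sub(r"\W+", "", text)

def keep_word_runs(text):
    # "findall" approach: name what to keep (runs of word characters)
    return "".join(re.findall(r"\w+", text))

text = "Hi.Привет"
print(strip_non_word(text))  # HiПривет
print(keep_word_runs(text))  # HiПривет
```

for these two patterns the results should agree, since \W is the exact complement of \w; the approaches only start to differ once the "skip" and "keep" sets stop being complements of each other.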

</F>



