Python and Cyrillic characters in regular expression

MRAB google at mrabarnett.plus.com
Fri Sep 5 10:28:12 EDT 2008


On Sep 5, 12:28 pm, phasma <xpa... at gmail.com> wrote:
> string = u"Привет"

All the characters are letters.

> (u'\u041f\u0440\u0438\u0432\u0435\u0442',)
>
> string = u"Hi.Привет"

The third character (".") isn't a letter and isn't whitespace, so the
match stops there.

> (u'Hi',)
>
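
To make the point concrete, here is a minimal sketch of the two cases. It
assumes Python 3, where str patterns are Unicode-aware by default (the
re.UNICODE flag is kept only to mirror the original post). The names s1,
s2, m1, m2 and runs are illustrative, and using findall() to collect every
run is just one possible approach, not necessarily what the original
poster needs.

```python
import re

# Same character class as in the original post: word characters and whitespace.
reg = re.compile(r'([\w\s]+)', re.UNICODE)

s1 = u"\u041f\u0440\u0438\u0432\u0435\u0442"  # the Cyrillic word from the post
s2 = u"Hi." + s1                              # the same word after "Hi."

# match() anchors at the start and stops at the first character outside the
# class, so the "." ends the match in the second string.
m1 = reg.match(s1).groups()   # the whole Cyrillic word
m2 = reg.match(s2).groups()   # ('Hi',)

# findall() keeps scanning after the "." and returns every run.
runs = reg.findall(s2)        # two runs: 'Hi' and the Cyrillic word
print(m1, m2, runs)
```

If the goal is to drop the punctuation and keep everything else, joining
the runs with "".join(runs) would give the letters and whitespace only.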

> On Sep 4, 9:53 pm, Fredrik Lundh <fred... at pythonware.com> wrote:
>
> > phasma wrote:
> > > Hi, I'm trying to extract all alphabetic characters from a string.
>
> > > reg = re.compile('(?u)([\w\s]+)', re.UNICODE)
> > > buf = reg.match(string)
>
> > > But it doesn't work. If the string starts with a Cyrillic character,
> > > everything works fine. But if the string starts with a Latin
> > > character, match returns only the Latin characters.
>
> > can you provide a few sample strings that show this behaviour?
>


