re module & problem with utf-8

Tue Jan 14 18:38:03 EST 2003

I have found that the re module in Python 2.2 has serious problems
with UTF-8 characters. Look at the results it produces.

First, I create an example string of cp1250 characters separated
by spaces:

>>> s = 'ą ć ę ł ń ó ś ż ź'
>>> s
'\xb9 \xe6 \xea \xb3 \xf1 \xf3 \x9c \xbf \x9f'

Next, I convert it to a UTF-8 string:

>>> u = unicode(s, 'cp1250').encode('utf-8')
>>> u
'\xc4\x85 \xc4\x87 \xc4\x99 \xc5\x82 \xc5\x84 \xc3\xb3 \xc5\x9b
\xc5\xbc \xc5\xba'
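
A quick check of what the conversion produced may help here. This is
only a sketch: it assumes an interpreter newer than 2.2 (the b''
literal needs Python 2.6 or later), and the names n_chars and n_bytes
are made up for illustration.

```python
# The UTF-8 bytes shown in the output above.
u = (b'\xc4\x85 \xc4\x87 \xc4\x99 \xc5\x82 \xc5\x84 \xc3\xb3 '
     b'\xc5\x9b \xc5\xbc \xc5\xba')

# Nine characters plus eight separating spaces.
n_chars = len(u.decode('utf-8'))

# Each of the nine characters occupies two bytes in UTF-8.
n_bytes = len(u)

print(n_chars, n_bytes)
```

So the string holds 17 characters but 26 bytes.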

Now, I try to extract my separated characters:

>>> import re
>>> re.compile(r'(\w)', re.U).findall(u)
['\xc4', '\xc4', '\xc4', '\xc5', '\xc5', '\xc3', '\xb3', '\xc5',
'\xc5', '\xbc', '\xc5', '\xba']

Failed! Python did not recognize that this is a UTF-8 string, even
though I set the re.U flag. :(

I tried another regex pattern:

>>> re.compile(r'(\S)', re.U).findall(u)
['\xc4', '\xc4', '\x87', '\xc4', '\x99', '\xc5', '\x82', '\xc5',
'\x84', '\xc3', '\xb3', '\xc5', '\x9b', '\xc5', '\xbc', '\xc5',
'\xba']

Failed again! It seems the re module is buggy: it does not extract
UTF-8 characters correctly. Bad news. :(
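
For comparison, here is a minimal sketch of the same extraction run on
the decoded text instead of on the UTF-8 bytes. It assumes an
interpreter newer than 2.2 (the b'' literal needs Python 2.6 or later),
and the names raw, text, chars and utf8_chars are only for
illustration.

```python
import re

# The same cp1250 bytes as in the example string.
raw = b'\xb9 \xe6 \xea \xb3 \xf1 \xf3 \x9c \xbf \x9f'

# Decode to a Unicode string first, so that re sees characters, not bytes.
text = raw.decode('cp1250')

# With re.U, \w is tested against whole Unicode characters.
chars = re.compile(r'(\w)', re.U).findall(text)

# Encode each match to UTF-8 only when bytes are needed for output.
utf8_chars = [c.encode('utf-8') for c in chars]

print(len(chars))
```

Run this way, findall returns nine matches, one per character, and
the first one encodes to '\xc4\x85'.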

-- 
JZ