re module & problem with utf-8
Jarosław Zabiełło
webmaster at watchtowerDOTorg.pl
Tue Jan 14 18:38:03 EST 2003
I found that Python 2.2 has big problems with utf-8 characters in the re
module. Look at the strange results it generates.
First, I create an example string with some cp1250 characters separated
by spaces:
>>> s = 'ą ć ę ł ń ó ś ż ź'
>>> s
'\xb9 \xe6 \xea \xb3 \xf1 \xf3 \x9c \xbf \x9f'
Next, I convert it to a utf-8 string:
>>> u = unicode(s, 'cp1250').encode('utf-8')
>>> u
'\xc4\x85 \xc4\x87 \xc4\x99 \xc5\x82 \xc5\x84 \xc3\xb3 \xc5\x9b \xc5\xbc \xc5\xba'
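To make the sizes involved visible, here is a minimal sketch of the same conversion. It is not part of the original session: the names `text`, `cp1250` and `utf8` are made up for illustration, and the letters are written as escapes so that the snippet should run on both later Python 2 releases and Python 3 (an assumption beyond the Python 2.2 used here).

```python
# Illustrative sketch; 'text', 'cp1250' and 'utf8' are hypothetical names.
# The nine Polish letters, separated by single spaces, as a unicode string.
text = u'\u0105 \u0107 \u0119 \u0142 \u0144 \xf3 \u015b \u017c \u017a'

cp1250 = text.encode('cp1250')  # one byte per letter
utf8 = text.encode('utf-8')     # two bytes per letter for these characters

print(len(text))    # 17: nine letters plus eight spaces
print(len(cp1250))  # 17
print(len(utf8))    # 26
```

So the utf-8 string holds 26 bytes for 17 characters, and anything that walks over it one byte at a time sees each letter as two separate items.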
Now, I try to extract my separated characters:
>>> import re
>>> re.compile(r'(\w)', re.U).findall(u)
['\xc4', '\xc4', '\xc4', '\xc5', '\xc5', '\xc3', '\xb3', '\xc5',
'\xc5', '\xbc', '\xc5', '\xba']
Failed! Python did not understand that it is a utf-8 string, although I
set the re.U flag. :(
I tried another regex pattern:
>>> re.compile(r'(\S)', re.U).findall(u)
['\xc4', '\xc4', '\x87', '\xc4', '\x99', '\xc5', '\x82', '\xc5',
'\x84', '\xc3', '\xb3', '\xc5', '\x9b', '\xc5', '\xbc', '\xc5',
'\xba']
Failed again! It seems the re module is buggy. It does not extract
utf-8 characters correctly. Bad news. :(
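For comparison, a sketch of the same search run on a decoded unicode object instead of the utf-8 byte string. This is only an assumption about what may be going on, not a confirmed diagnosis: re.U appears to change how \w classifies characters, not how the input bytes are decoded. The names `utf8`, `text` and `letters` are hypothetical, and the `b'...'` literal needs a newer Python than 2.2.

```python
import re

# The same utf-8 bytes as in the session above (hypothetical variable names).
utf8 = b'\xc4\x85 \xc4\x87 \xc4\x99 \xc5\x82 \xc5\x84 \xc3\xb3 \xc5\x9b \xc5\xbc \xc5\xba'

# Decode the bytes into a unicode string first, then search that string.
text = utf8.decode('utf-8')
letters = re.compile(r'(\w)', re.U).findall(text)

print(len(letters))  # 9: one match per letter
```

If the byte string is needed again afterwards, each match can be encoded back with `.encode('utf-8')`.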
--
JZ