re module & problem with utf-8

Tue Jan 14 19:12:54 EST 2003

Jarosław Zabiełło wrote:
> I found, Python 2.2 has big problems with utf-8 characters in re
> module. Look, what stupid results it generates.
> 
> First, I create example string with some cp1250 characters separated
> with a space:
> 
> >>> s = 'ą ć ę ł ń ó ś ż ź'
> >>> s
> '\xb9 \xe6 \xea \xb3 \xf1 \xf3 \x9c \xbf \x9f'
> 
> Next, I convert it to a utf-8 string:
> 
> >>> u = unicode(s, 'cp1250').encode('utf-8')
> >>> u
> '\xc4\x85 \xc4\x87 \xc4\x99 \xc5\x82 \xc5\x84 \xc3\xb3 \xc5\x9b
> \xc5\xbc \xc5\xba'
> 
> Now, I try to extract my separated characters:
> 
> >>> re.compile(r'(\w)', re.U).findall(u)
> ['\xc4', '\xc4', '\xc4', '\xc5', '\xc5', '\xc3', '\xb3', '\xc5',
> '\xc5', '\xbc', '\xc5', '\xba']
> 
> Failed! Python did not understand, that it is utf-8 string although I
> set up re.U flag. :(

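I don't think it's Python that failed.

UTF-8 is an encoding (that's why you call the 'encode' method
on a unicode object to get it).  cp1250 is another encoding: both
are byte representations of the same unicode characters.  An encoded
string is just a sequence of bytes, and the re module looks at it one
byte at a time, so it cannot know that two bytes belong to a single
character.  If you pass the 're.UNICODE' flag to the re module, then
you should work with unicode objects, e.g.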

>>> u = unicode(s, 'cp1250')
>>> u
u'\u0105 \u0107 \u0119 \u0142 \u0144 \xf3 \u015b \u017c \u017a'
>>> re.compile(r'(\w)', re.U).findall(u)
[u'\u0105', u'\u0107', u'\u0119', u'\u0142', u'\u0144', u'\xf3',
u'\u015b', u'\u017c', u'\u017a']

which is perfectly sensible. 

>>> " ".join([x.encode('cp1250') for x in _])
'\xb9 \xe6 \xea \xb3 \xf1 \xf3 \x9c \xbf \x9f'

and voila, you have your cp1250-encoded string back.
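
For anyone reading this on Python 3 (an assumption, since the session
above is Python 2.2), the same decode, match, encode pattern can be
sketched roughly as follows.  The variable names are only illustrative,
and the byte-pattern result differs from the 2.2 output quoted above,
because Python 3 bytes patterns match ASCII word characters only.

```python
import re

# The nine cp1250 bytes from the original post.
raw = b'\xb9 \xe6 \xea \xb3 \xf1 \xf3 \x9c \xbf \x9f'

# Decode first, so the regex engine sees characters rather than bytes.
text = raw.decode('cp1250')

# With a str pattern, \w is Unicode-aware by default in Python 3.
letters = re.findall(r'\w', text)

# Matching on the UTF-8 bytes instead works byte by byte.  In Python 3
# a bytes pattern only treats ASCII letters, digits and '_' as \w, so
# none of these non-ASCII bytes match.
utf8 = text.encode('utf-8')
byte_matches = re.findall(rb'\w', utf8)

# Encode again only at the end, when bytes are needed.
round_trip = ' '.join(letters).encode('cp1250')
```

The point is the same as above: do the matching on decoded text and
treat the encoding as something that happens only at the edges.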

> Failed again! It seems, re module is buggy. It does not extract
> correctly utf-8 characters. Bad news. :(

Don't you have the slightest doubt that you might be wrong,
rather than an important module being fundamentally broken?

regards,

    holger