re module & problem with utf-8

Tue Jan 14 19:05:42 EST 2003

> Failed again! It seems, re module is buggy. It does not extract
> correctly utf-8 characters. Bad news. :(

The re module is not at fault here; the pattern has to be applied to a
Unicode object, not to an encoded byte string. Try:

>>> s = '\xb9 \xe6 \xea \xb3 \xf1 \xf3 \x9c \xbf \x9f'
>>> u = unicode(s, 'cp1250')
>>> import re
>>> re.compile(r'(\w)', re.U).findall(u)
[u'\u0105', u'\u0107', u'\u0119', u'\u0142', u'\u0144', u'\xf3', 
 u'\u015b', u'\u017c', u'\u017a']

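When a type is specified as "Unicode" in Python, it doesn't mean UTF-8
or any other encoded byte string; it means a Unicode object. You get one
by decoding the bytes with the codec they were written in, which is
cp1250 for the string above.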
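
For anyone on a newer interpreter, here is a rough sketch of the same
idea in Python 3 syntax. That version is an assumption on my part; the
session above is Python 2. The names raw, text, letters and ascii_only
are only illustrative. It decodes the cp1250 bytes first and then
matches \w against the resulting text, and for contrast runs a bytes
pattern over the undecoded data.

```python
import re

# The same nine cp1250-encoded bytes as in the session above.
raw = b'\xb9 \xe6 \xea \xb3 \xf1 \xf3 \x9c \xbf \x9f'

# Decode first: the result is a text (Unicode) string, not encoded bytes.
text = raw.decode('cp1250')

# With a str pattern, Python 3 treats \w as Unicode-aware by default,
# so no re.U flag is needed here.
letters = re.findall(r'\w', text)

# A bytes pattern over the undecoded data only recognises ASCII word
# characters, so none of the accented letters match.
ascii_only = re.findall(rb'\w', raw)

print(letters)
print(ascii_only)
```

The second findall is the failure mode from the original question:
matching on encoded bytes rather than on decoded text.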

Cheers,
Brian