re and locale/unicode

Mon Sep 20 23:57:21 EDT 2010

On 21/09/2010 04:21, Jerry Fleming wrote:
> Hi,
>
> Having the following python code:
>
>
> import locale
> import re
>
> locale.setlocale(locale.LC_ALL, 'zh_CN.utf8')
> re.findall('(?uL)\s+', u'\u2001\u3000\x20', re.U|re.L)
> re.findall('\s+', u'\u2001\u3000\x20', re.U|re.L)
> re.findall('(?uL)\s+', u'\u2001\u3000\x20')
>
>
> I was wondering why it doesn't find the Unicode space chars \u2001 and
> \u3000. The Python docs for the re module say:
>
> When the LOCALE and UNICODE flags are not specified, matches any
> whitespace character; this is equivalent to the set [ \t\n\r\f\v]. With
> LOCALE, it will match this set plus whatever characters are defined as
> space for the current locale. If UNICODE is set, this will match the
> characters [ \t\n\r\f\v] plus whatever is classified as space in the
> Unicode character properties database.
>
> which doesn't seem to work. Any ideas?

Use the regex module? ;-)

     http://pypi.python.org/pypi/regex

BTW, LOCALE is for locale-specific bytestrings.

Basically the choice is between ASCII (bytestring, the default in
Python 2), LOCALE (bytestring) and UNICODE (Unicode string), so don't
bother combining them. For a Unicode string like yours, drop re.L and
use re.U on its own.
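
To illustrate the point about picking one mode, here is a minimal sketch
using the test string from the question. The variable names are mine and
purely illustrative. It assumes the stock re module and is written to run
on both Python 2.6+ and Python 3.3+: the first call uses UNICODE alone on
a Unicode string, the second uses a bytestring pattern with no flags,
which is the ASCII case.

```python
import re

# Test string from the question: EM QUAD, IDEOGRAPHIC SPACE, ASCII space.
text = u'\u2001\u3000\x20'

# UNICODE on its own: \s follows the Unicode character database,
# so all three characters are matched as a single run.
unicode_matches = re.findall(r'\s+', text, re.UNICODE)

# Bytestring pattern with no flags: \s covers only ASCII whitespace,
# so in the UTF-8 encoded data only the trailing 0x20 byte matches.
ascii_matches = re.findall(br'\s+', text.encode('utf-8'))

print(repr(unicode_matches))
print(repr(ascii_matches))
```

Note that recent Python 3 releases refuse re.L on a str pattern
altogether, so there the mistake shows up as an error rather than as a
silent non-match.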