regular expressions and the LOCALE flag

Baz Walter bazwal at ftml.net
Tue Aug 3 15:40:01 EDT 2010


On 03/08/10 19:40, MRAB wrote:
> Baz Walter wrote:
>> the python docs say that re.LOCALE makes certain character classes
>> "dependent on the current locale".
>
> re.LOCALE just passes the character to the underlying C library. It
> really only works on bytestrings which have 1 byte per character.

the re docs don't say that re.LOCALE is limited to 8-bit encodings: they
just refer to the 'current locale'.

> And, BTW, none of your examples pass a UTF-8 bytestring to re.findall:
> all those string literals starting with the 'u' prefix are Unicode
> strings!

not sure what you mean by this: even if the string were encoded as utf-8,
'\w' still wouldn't match any of the non-ascii characters.
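
to illustrate the point, here is a minimal sketch. it assumes Python 3
(where bytes and str are separate types) and pins LC_CTYPE to the "C"
locale so the result is repeatable; the sample word is only an
illustration, not one of the earlier examples:

```python
import locale
import re

# assumption: Python 3, where str is Unicode and bytes is a separate type
text = "café"
utf8_bytes = text.encode("utf-8")  # b'caf\xc3\xa9': the 'é' becomes two bytes

# pin LC_CTYPE to the portable "C" locale so the LOCALE result is repeatable
locale.setlocale(locale.LC_CTYPE, "C")

# bytes pattern, no flags: \w covers only ASCII letters, digits and '_'
plain_bytes = re.findall(rb"\w+", utf8_bytes)

# bytes pattern with re.LOCALE: each byte is classified on its own,
# so the two bytes of the encoded 'é' are never seen as one character
locale_bytes = re.findall(rb"\w+", utf8_bytes, re.LOCALE)

# str pattern: \w follows Unicode rules and matches the whole word
unicode_str = re.findall(r"\w+", text)

print(plain_bytes)   # [b'caf']
print(locale_bytes)  # [b'caf']
print(unicode_str)   # ['café']
```

under a different locale the re.LOCALE result may differ, since the
classification is left to the platform's C library.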

> Locale encodings are more trouble than they're worth. Unicode is better.
> :-)

yes, i'm really just trying to decide whether i should offer 'locale' as 
an option in my program. given the unintuitive way re.LOCALE works, i'm 
not sure that i should.
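
for reference, the kind of option i have in mind could be sketched
roughly like this. the helper name and the mode strings are hypothetical
(they are not from my program), and it assumes Python 3, where re.ASCII
is available and the 'locale' mode has to work on bytes:

```python
import re


def compile_word_pattern(mode="unicode"):
    """Hypothetical helper: build a word-matching pattern for a given mode.

    'unicode' and 'ascii' return str patterns; 'locale' returns a bytes
    pattern, so the caller must pass it encoded (bytes) input.
    """
    if mode == "unicode":
        return re.compile(r"\w+")               # Unicode-aware \w
    if mode == "ascii":
        return re.compile(r"\w+", re.ASCII)     # ASCII-only \w
    if mode == "locale":
        return re.compile(rb"\w+", re.LOCALE)   # \w per current C locale
    raise ValueError("unknown mode: %r" % (mode,))


unicode_words = compile_word_pattern("unicode").findall("café au lait")
ascii_words = compile_word_pattern("ascii").findall("café au lait")
locale_words = compile_word_pattern("locale").findall(b"abc def")

print(unicode_words)  # ['café', 'au', 'lait']
print(ascii_words)    # ['caf', 'au', 'lait']
print(locale_words)   # [b'abc', b'def']
```

the awkward part is visible right away: the 'locale' mode needs bytes
input while the other two need str, which is one reason i hesitate to
offer it.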

are you saying that it only really makes sense for *bytestrings* to be 
used with re.LOCALE?

if so, the re docs certainly don't make that clear.
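
a quick way to check this is sketched below. it assumes Python 3.6 or
later, where, as far as i can tell, the re module itself refuses
re.LOCALE on a str pattern; older versions may simply accept the flag:

```python
import re

# bytes pattern: re.LOCALE is accepted
bytes_pattern = re.compile(rb"\w+", re.LOCALE)

# str pattern: on Python 3.6+ this is expected to raise ValueError
try:
    re.compile(r"\w+", re.LOCALE)
    str_error = None
except ValueError as exc:
    str_error = str(exc)

print(bytes_pattern.findall(b"abc def"))  # [b'abc', b'def']
print(str_error)
```

if str_error comes back as None, the interpreter in use predates that
check and the flag is silently accepted on str patterns.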


