python3, regular expression and bytes text

MRAB python at mrabarnett.plus.com
Sat Oct 12 16:15:26 EDT 2019


On 2019-10-12 20:48, Serhiy Storchaka wrote:
> 12.10.19 21:08, Eko palypse wrote:
>> So how can I make it work with utf8 encoded text?
> 
> You cannot. First, \w in re.LOCALE works only when the text is encoded
> with the locale encoding (cp1252 in your case). Second, re.LOCALE
> supports only 8-bit charsets. So even if you set the utf-8 locale, it
> would not help.
> 
> Regular expressions with re.LOCALE are slow. It may be more efficient to
> decode the text and use a Unicode regular expression.
> 
+1

It's best to treat re.LOCALE as being only for legacy encodings that
use 8 bits per character. Wherever possible, decode to Unicode and
work with that instead.
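
For illustration, here is a minimal sketch of the decode-first approach.
It assumes the input really is UTF-8 encoded bytes; the sample string and
the variable names are made up for this example and are not from the
original question.

```python
import re

# Hypothetical sample data: UTF-8 encoded bytes with non-ASCII letters.
raw = "Grüße aus Köln".encode("utf-8")

# A bytes pattern (without re.LOCALE) treats only ASCII as word characters,
# so \w+ breaks each word apart at the multi-byte UTF-8 sequences.
bytes_words = re.findall(rb"\w+", raw)

# Decode first, then use a str pattern. In Python 3, \w in a str pattern
# is Unicode-aware by default, so no flag is needed.
text = raw.decode("utf-8")
str_words = re.findall(r"\w+", text)

print(bytes_words)
print(str_words)
```

If the result has to go back out as bytes, encode each match again with
the same encoding, e.g. word.encode("utf-8").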


