python3, regular expression and bytes text

Sat Oct 12 14:08:34 EDT 2019

What needs to be set so that an re search works within
utf8 encoded bytes?

My test, run on a Windows PC with a cp1252 locale, looks like this:

import re
import locale

cp1252 = 'Ärger im Paradies'.encode('cp1252')
utf8 = 'Ärger im Paradies'.encode('utf-8')

print('cp1252:', cp1252)
print('utf8  :', utf8)
print('-'*80)
print("search for 'Ärger'.encode('cp1252') in cp1252 encoded text")
for m in re.finditer('Ärger'.encode('cp1252'), cp1252):
    print(m)

print('-'*80)
print("search for 'Ärger'.encode('') in utf8 encoded text")
for m in re.finditer('Ärger'.encode(), utf8):
    print(m)

print('-'*80)
print("search for '\\w+'.encode('cp1252') in cp1252 encoded text")
for m in re.finditer('\\w+'.encode('cp1252'), cp1252):
    print(m)

print('-'*80)
print("search for '\\w+'.encode('') in utf8 encoded text")
for m in re.finditer('\\w+'.encode(), utf8):
    print(m)

locale.setlocale(locale.LC_ALL, '')
print('-'*80)
print("search for '\\w+'.encode('cp1252') using re.LOCALE in cp1252 encoded text")
for m in re.finditer('\\w+'.encode('cp1252'), cp1252, re.LOCALE):
    print(m)

print('-'*80)
print("search for '\\w+'.encode('') using ??? in utf8 encoded text")
for m in re.finditer('\\w+'.encode(), utf8):
    print(m)

If you run this, you will get something like:

cp1252: b'\xc4rger im Paradies'
utf8  : b'\xc3\x84rger im Paradies'
--------------------------------------------------------------------------------
search for 'Ärger'.encode('cp1252') in cp1252 encoded text
<re.Match object; span=(0, 5), match=b'\xc4rger'>
--------------------------------------------------------------------------------
search for 'Ärger'.encode('') in utf8 encoded text
<re.Match object; span=(0, 6), match=b'\xc3\x84rger'>
--------------------------------------------------------------------------------

These two are OK, but the results for \w+ differ:

search for '\w+'.encode('cp1252') in cp1252 encoded text
<re.Match object; span=(1, 5), match=b'rger'>
<re.Match object; span=(6, 8), match=b'im'>
<re.Match object; span=(9, 17), match=b'Paradies'>
--------------------------------------------------------------------------------
search for '\w+'.encode('') in utf8 encoded text
<re.Match object; span=(2, 6), match=b'rger'>
<re.Match object; span=(7, 9), match=b'im'>
<re.Match object; span=(10, 18), match=b'Paradies'>
--------------------------------------------------------------------------------

It doesn't find the Ä, which is expected according to the documentation:
for bytes patterns, \w only matches ASCII alphanumerics and the underscore
unless re.LOCALE is used. So let's use the locale flag; the results are:

search for '\w+'.encode('cp1252') using re.LOCALE in cp1252 encoded text
<re.Match object; span=(0, 5), match=b'\xc4rger'>
<re.Match object; span=(6, 8), match=b'im'>
<re.Match object; span=(9, 17), match=b'Paradies'>
--------------------------------------------------------------------------------

This works for cp1252 but does not work for utf8:

search for '\w+'.encode('') using ??? in utf8 encoded text
<re.Match object; span=(2, 6), match=b'rger'>
<re.Match object; span=(7, 9), match=b'im'>
<re.Match object; span=(10, 18), match=b'Paradies'>
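
The behaviour in the outputs above can be reproduced in isolation. The
following is only a minimal sketch and does not depend on any locale
setting: it relies on the documented rule that, for bytes patterns without
re.LOCALE, \w is equivalent to [a-zA-Z0-9_]. It also shows that Ä becomes
two non-ASCII bytes in utf8, so a per-byte character class cannot see it
as one letter. The variable names are mine and purely illustrative.

```python
import re

text = 'Ärger im Paradies'
as_cp1252 = text.encode('cp1252')   # Ä is the single byte 0xC4
as_utf8 = text.encode('utf-8')      # Ä is the two bytes 0xC3 0x84

# Without re.LOCALE, \w in a bytes pattern only matches [a-zA-Z0-9_].
word = re.compile(rb'\w+')
ascii_only_cp1252 = word.findall(as_cp1252)
ascii_only_utf8 = word.findall(as_utf8)

print(ascii_only_cp1252)  # the leading 0xC4 byte is skipped
print(ascii_only_utf8)    # both bytes of the encoded Ä are skipped
print(len(as_cp1252), len(as_utf8))  # the utf8 form is one byte longer
```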

So how can I make it work with utf8 encoded text?
Note that decoding the bytes to a string is not preferred, because it
would mean allocating the buffer a second time, and a buffer might be
several hundred MB or even GBs in size.
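
One direction that might avoid the decode, as a rough sketch only: spell
out the utf8 multi-byte sequences in the bytes pattern itself. The name
UTF8_WORD is hypothetical. The sketch assumes the buffer is valid utf8,
and it treats every non-ASCII code point as a word character, which is
broader than what \w matches on a decoded str.

```python
import re

# Hypothetical pattern: ASCII word characters, or one utf8 multi-byte
# sequence (lead byte 0xC2-0xF4 followed by continuation bytes 0x80-0xBF).
# Assumption: the buffer is valid utf8, and every non-ASCII code point
# counts as a word character, which is broader than str-mode \w.
UTF8_WORD = re.compile(rb'(?:\w|[\xc2-\xf4][\x80-\xbf]+)+')

utf8 = 'Ärger im Paradies'.encode('utf-8')
matches = [(m.span(), m.group()) for m in UTF8_WORD.finditer(utf8)]
for span, found in matches:
    print(span, found)
```

With this approach only the matched slices would need decoding, for
example m.group().decode('utf-8'), rather than the whole buffer.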

Thank you
Eren