python3, regular expression and bytes text

Sat Oct 12 16:14:30 EDT 2019

On 10/12/19 3:46 PM, Eko palypse wrote:
> Thank you very much for your answer.
>
>> You have to be able to match bytes, not strings.
> May I ask you to elaborate on this, sorry non-native English speaker.
> The buffer I receive is a byte-like buffer.
>
>> I don't think you'll be able to 100% reliably match bytes in this way.
>> You're asking it to make analysis of multiple bytes and to interpret
>> them according to which character they would represent if decoded from
>> UTF-8.
>>
>> My recommendation: Even if your buffer is multiple gigabytes, just
>> decode it anyway. Maybe you can decode your buffer in chunks, but
>> otherwise, just bite the bullet and do the decode. You may be
>> pleasantly surprised at how little you suffer as a result; Python is
>> quite decent at memory management, and even if you DO get pushed into
>> the swapper by this, it's still likely to be faster than trying to
>> code around all the possible problems that come from mismatching your
>> text search.
>>
>> ChrisA
> That's what I was afraid of. 
> It would be nice if the "world" could commit itself to one standard, 
> but I'm afraid that won't happen in my life anymore, I guess. :-(
>
> Thx
> Eren

Current 'best practice' is, in my opinion, to convert data (if needed)
to some form of Unicode (UTF-8, UTF-16, or UCS-4) at input and process
it in that domain. You do need to be prepared to run into files that are
encoded in some locally defined 8-bit code page. In Python 3, strings
are Unicode, and you don't need to worry about the details of which
encoding is used internally; Python will deal with that itself.
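
A rough sketch of what I mean, in case it helps. The helper name
decode_input and the cp1252 fallback are assumptions made up for
illustration; substitute whatever code page your files actually use. It
shows a bytes pattern splitting a word at its non-ASCII bytes, and the
same search working once the data has been decoded to str:

```python
import re


def decode_input(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode raw bytes as UTF-8, else as an assumed 8-bit code page."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is in `fallback`.
        # errors="replace" covers bytes the code page leaves undefined.
        return raw.decode(fallback, errors="replace")


raw_utf8 = "größe = 42".encode("utf-8")
raw_legacy = "größe = 42".encode("cp1252")

# Bytes pattern: \w only matches ASCII word characters, so the word is
# broken up at the bytes that encode the non-ASCII letters.
bytes_words = re.findall(rb"\w+", raw_utf8)

# Str pattern on decoded text: \w matches Unicode word characters.
str_words = re.findall(r"\w+", decode_input(raw_utf8))
legacy_words = re.findall(r"\w+", decode_input(raw_legacy))

print(bytes_words)   # [b'gr', b'e', b'42']
print(str_words)     # ['größe', '42']
print(legacy_words)  # ['größe', '42']
```

Trying UTF-8 first and only then falling back is a heuristic, not a
guarantee: some legacy-encoded files happen to be valid UTF-8 too, so if
you know the encoding of a file, pass it explicitly instead.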

-- 
Richard Damon