python3, regular expression and bytes text

Chris Angelico rosuav at gmail.com
Sat Oct 12 15:29:50 EDT 2019


On Sun, Oct 13, 2019 at 5:11 AM Eko palypse <ekopalypse at gmail.com> wrote:
>
> What needs to be set in order to be able to use a re search within
> utf8 encoded bytes?

You have to be able to match bytes, not strings.
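
As a minimal sketch of what matching bytes looks like (the sample data here is made up for illustration): the pattern itself must be a bytes object, and the easiest way to search for non-ASCII text is to encode the literal to UTF-8 first.

```python
import re

# Hypothetical sample: UTF-8 encoded bytes, standing in for a large buffer.
data = "naïve café".encode("utf-8")

# Encode the text being searched for, and escape it so that no byte is
# treated as a regex metacharacter. The result is a bytes pattern.
needle = re.escape("café".encode("utf-8"))
m = re.search(needle, data)

# The offsets are byte offsets into data, not character offsets.
span = m.span()
found = m.group().decode("utf-8")
```

Using a str pattern against a bytes object raises TypeError, so the two types can't be mixed by accident.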

> So how can I make it work with utf8 encoded text?
> Note, decoding it to a string isn't preferred as this would mean
> allocating the bytes buffer a 2nd time and it might be that a
> buffer is several 100MBs, even GBs.

I don't think you'll be able to match bytes 100% reliably in this way.
You're asking the regex engine to analyse multiple bytes and interpret
them according to the character they would represent if decoded from
UTF-8.
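
To illustrate the mismatch, here is a small sketch (the sample character is arbitrary) comparing the same searches on encoded bytes and on the decoded string. In a bytes pattern, "." consumes one byte rather than one character, and \w only knows about ASCII.

```python
import re

data = "é".encode("utf-8")      # two bytes: b'\xc3\xa9'
text = data.decode("utf-8")     # one character

# "." in a bytes pattern matches a single byte, so it splits the character.
one_byte = re.match(rb".", data).group()

# \w in a bytes pattern is ASCII-only, so neither byte of "é" matches.
word_in_bytes = re.search(rb"\w", data)

# On the decoded string, both behave the way a text search expects.
one_char = re.match(r".", text).group()
word_in_str = re.search(r"\w", text)
```

Here one_byte is only the first half of the encoded character and word_in_bytes is None, while the str versions match the whole "é".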

My recommendation: Even if your buffer is multiple gigabytes, just
decode it anyway. Maybe you can decode your buffer in chunks, but
otherwise, just bite the bullet and do the decode. You may be
pleasantly surprised at how little you suffer as a result; Python is
quite decent at memory management, and even if you DO get pushed into
the swapper by this, it's still likely to be faster than trying to
code around all the possible problems that come from mismatching your
text search.
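
One possible reading of "decode in chunks", as a rough sketch only: wrap the binary data in io.TextIOWrapper and search line by line, so only one decoded line is held as str at a time. The function name grep_utf8 and the sample data are made up for illustration, and this assumes that no match needs to span a line break.

```python
import io
import re

def grep_utf8(binary_stream, pattern):
    """Yield (line_number, match) for each matching line of a UTF-8 stream."""
    regex = re.compile(pattern)
    # TextIOWrapper decodes incrementally and copes with multi-byte
    # characters that straddle its internal read boundaries.
    text = io.TextIOWrapper(binary_stream, encoding="utf-8")
    for lineno, line in enumerate(text, start=1):
        m = regex.search(line)
        if m:
            yield lineno, m

# Hypothetical in-memory stand-in for a large file or buffer.
buf = io.BytesIO("first line\nsecond café line\n".encode("utf-8"))
hits = [(n, m.group()) for n, m in grep_utf8(buf, r"caf\w")]
```

Because the pattern is a str pattern applied to decoded text, \w matches the "é" here. If matches can cross line breaks, this approach is not enough and a full decode is the simpler option.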

ChrisA


