[Python-ideas] Support parsing stream with `re`

Sun Oct 7 21:40:47 EDT 2018

On Sun, Oct 7, 2018 at 5:54 PM, Nathaniel Smith <njs at pobox.com> wrote:
> Are you imagining something roughly like this? (Ignoring chunk
> boundary handling for the moment.)
>
> def find_double_line_end(buf):
>     start = 0
>     while True:
>         next_idx = buf.index(b"\n", start)
>         if (buf[next_idx - 1:next_idx] == b"\n"
>                 or buf[next_idx - 3:next_idx] == b"\r\n\r"):
>             return next_idx
>         start = next_idx + 1
>
> That's much more complicated than using re.search, and on some random
> HTTP headers I have lying around it benchmarks ~70% slower too. Which
> makes sense, since we're basically trying to replicate the re engine's
> work by hand in a slower language.
>
> BTW, if we only want to find a fixed string like b"\r\n\r\n", then
> re.search and bytearray.index are almost identical in speed. If you
> have a problem that can be expressed as a regular expression, then
> regular expression engines are actually pretty good at solving those
> :-)

Though... here's something strange.

Here's another way to search for the first appearance of either
\r\n\r\n or \n\n in a bytearray:

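def find_double_line_end_2(buf):
    idx1 = buf.find(b"\r\n\r\n")
    # Only look for b"\n\n" before idx1. If idx1 is -1, search the whole
    # buffer (passing -1 as the end would skip the last byte).
    idx2 = buf.find(b"\n\n", 0, idx1 if idx1 != -1 else len(buf))
    if idx1 == -1:
        return idx2
    elif idx2 == -1:
        return idx1
    else:
        return min(idx1, idx2)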

So this is essentially equivalent to our regex (notice they both pick
out position 505 as the end of the headers):

In [52]: find_double_line_end_2(sample_headers)
Out[52]: 505

In [53]: double_line_end_re = re.compile(b"\r\n\r\n|\n\n")

In [54]: double_line_end_re.search(sample_headers)
Out[54]: <_sre.SRE_Match object; span=(505, 509), match=b'\r\n\r\n'>
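
One way to sanity-check the equivalence on more than a single input is
to compare the two approaches on a few small buffers. This is only a
sketch: the buffers are made up for illustration (they are not the
sample_headers used here), and find_either and regex_start are
hypothetical helper names. find_either is a self-contained variant of
the two-find approach that searches the whole buffer when b"\r\n\r\n"
is absent.

```python
import re

double_line_end_re = re.compile(b"\r\n\r\n|\n\n")

def find_either(buf):
    # Hypothetical helper: earliest offset of b"\r\n\r\n" or b"\n\n", or -1.
    idx1 = buf.find(b"\r\n\r\n")
    # b"\n\n" only wins if it starts before idx1; with no idx1, search it all.
    end = idx1 if idx1 != -1 else len(buf)
    idx2 = buf.find(b"\n\n", 0, end)
    if idx2 != -1:
        return idx2
    return idx1

def regex_start(buf):
    # Start offset of the first regex match, or -1 to mirror bytearray.find.
    m = double_line_end_re.search(buf)
    return m.start() if m else -1

# Made-up inputs, not the real headers from this post.
cases = [
    bytearray(b"Host: example.com\r\nAccept: */*\r\n\r\nbody"),
    bytearray(b"Host: example.com\nAccept: */*\n\nbody"),
    bytearray(b"a\n\nb\r\n\r\nc"),
    bytearray(b"incomplete\r\n"),
    bytearray(b"ends with blank line\n\n"),
]
results = [(find_either(c), regex_start(c)) for c in cases]
for got_find, got_re in results:
    assert got_find == got_re
```

The last two cases cover a buffer with no terminator yet and a bare
b"\n\n" sitting at the very end of the buffer, which are the inputs
most likely to expose an off-by-one in the search bounds.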

But, the Python function that calls bytearray.find twice is about 3x
faster than the re module:

In [55]: %timeit find_double_line_end_2(sample_headers)
1.18 µs ± 40 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [56]: %timeit double_line_end_re.search(sample_headers)
3.3 µs ± 23.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The regex module is even slower:

In [57]: double_line_end_regex = regex.compile(b"\r\n\r\n|\n\n")

In [58]: %timeit double_line_end_regex.search(sample_headers)
4.95 µs ± 76.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
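
To reproduce this kind of comparison outside IPython, something like
the following sketch with the stdlib timeit module should work. The
header block is synthetic (it is not the sample_headers timed above),
two_finds and bench are names invented for this example, and the
lambda wrapper adds the same call overhead to both measurements.
Absolute numbers will vary by machine and Python version, so treat
them as indicative only. The third-party regex module is left out so
the script needs nothing beyond the stdlib.

```python
import re
import timeit

# Synthetic stand-in for a header block; NOT the sample_headers timed above.
headers = bytearray(
    b"GET / HTTP/1.1\r\n"
    + b"".join(b"X-Header-%d: some value\r\n" % i for i in range(20))
    + b"\r\nbody"
)

double_line_end_re = re.compile(b"\r\n\r\n|\n\n")

def two_finds(buf):
    # Earliest offset of b"\r\n\r\n" or b"\n\n", or -1 if neither occurs.
    idx1 = buf.find(b"\r\n\r\n")
    idx2 = buf.find(b"\n\n", 0, idx1 if idx1 != -1 else len(buf))
    if idx1 == -1:
        return idx2
    elif idx2 == -1:
        return idx1
    else:
        return min(idx1, idx2)

def bench(func, number=20_000, repeat=5):
    # Best-of-N average time per call, in seconds.
    timer = timeit.repeat(lambda: func(headers), number=number, repeat=repeat)
    return min(timer) / number

t_find = bench(two_finds)
t_re = bench(double_line_end_re.search)
print(f"two finds: {t_find * 1e6:.2f} µs per call")
print(f"re.search: {t_re * 1e6:.2f} µs per call")
```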

-n

-- 
Nathaniel J. Smith -- https://vorpus.org