[Python-ideas] Support parsing stream with `re`

Sun Oct 7 20:09:02 EDT 2018

On 10/6/2018 5:00 PM, Nathaniel Smith wrote:
> On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum <ram at rachum.com> wrote:
>> I'd like to use the re module to parse a long text file, 1GB in size. I wish
>> that the re module could parse a stream, so I wouldn't have to load the
>> whole thing into memory. I'd like to iterate over matches from the stream
>> without keeping the old matches and input in RAM.
>>
>> What do you think?
> 
> This has frustrated me too.
> 
> The case where I've encountered this is parsing HTTP/1.1. We have data
> coming in incrementally over the network, and we want to find the end
> of the headers. To do this, we're looking for the first occurrence of
> b"\r\n\r\n" OR b"\n\n".
> 
> So our requirements are:
> 
> 1. Search a bytearray for the regex b"\r\n\r\n|\n\n"

I believe that re is both overkill and slow for this particular problem.
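For an O(n) search, look forward for \n with s.index('\n') (or s.find,
which returns -1 instead of raising ValueError when nothing is found;
bytes and bytearray have the same methods).
[I assume that this searches forward faster than
for i, c in enumerate(s):
    if c == '\n': break
and leave you to test this.]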
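
A rough way to run that test is sketched below. The names with_find
and with_loop and the input size are made up for illustration, and the
timings will vary by machine and Python version, so none are quoted.

```python
import timeit

# Hypothetical test input: a long line with a single newline at the very end.
s = "x" * 100_000 + "\n"

def with_find(s):
    """Locate the first newline with the built-in search."""
    return s.find("\n")

def with_loop(s):
    """Locate the first newline with an explicit Python-level loop."""
    for i, c in enumerate(s):
        if c == "\n":
            break
    else:
        i = -1  # mirror str.find when there is no newline
    return i

if __name__ == "__main__":
    for fn in (with_find, with_loop):
        seconds = timeit.timeit(lambda: fn(s), number=20)
        print(fn.__name__, seconds)
```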

If not found, continue with next chunk of data.
If found, look back for \r to determine whether to look forward for \n
or \r\n *whenever there is enough data to do so*.
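
The steps above might look something like the following. This is only
a sketch: the function name find_header_end and its (end, resume)
return convention are invented here, and it assumes the data sits in a
bytes or bytearray object.

```python
def find_header_end(buf, start=0):
    """Return (end, resume).

    end is the index just past the blank-line terminator, or -1 if no
    complete terminator has arrived yet; resume is where the next
    search should start once more data has been appended to buf.
    """
    pos = start
    while True:
        i = buf.find(b"\n", pos)
        if i == -1:
            return -1, len(buf)        # no newline yet: resume at the new data
        nxt = buf[i + 1:i + 3]         # up to two bytes after this newline
        if nxt[:1] == b"\n":
            return i + 2, i + 2        # matched b"\n\n"
        after_cr = i > 0 and buf[i - 1:i] == b"\r"
        if after_cr and nxt == b"\r\n":
            return i + 3, i + 3        # matched b"\r\n\r\n"
        if len(nxt) == 0 or (after_cr and nxt == b"\r"):
            return -1, i               # not enough data: retry from this newline
        pos = i + 1                    # ordinary line ending: keep looking
```

When end is -1, append the next chunk to buf and call the function
again with start=resume, so earlier bytes are not searched twice.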

> 2. If there's no match yet, wait for more data to arrive and try again
> 3. When more data arrives, start searching again *where the last
> search left off*

s.index (and s.find) take an optional start parameter, so the search
can resume where it left off.  Keep the chunks in a list until you
have a match, then join them all at once.
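
For the list-of-chunks idea, here is one possible sketch. The names
split_headers and TERMINATORS are made up for illustration, and chunks
is assumed to be any iterable of bytes objects. Instead of a start
index into one big buffer, it carries over the last 3 bytes of what it
has seen, so a terminator split across two chunks is still found and
nothing older is searched again.

```python
TERMINATORS = (b"\r\n\r\n", b"\n\n")

def split_headers(chunks):
    """Return (headers, leftover) once a terminator is seen.

    If the chunks run out first, return (None, all_data_seen).
    """
    kept = []    # chunks seen so far; joined only once, at the end
    tail = b""   # last 3 bytes seen, in case a terminator straddles chunks
    for chunk in chunks:
        window = tail + chunk
        ends = [window.find(t) + len(t) for t in TERMINATORS if t in window]
        if ends:
            cut = min(ends) - len(tail)   # end of the terminator within chunk
            kept.append(chunk[:cut])
            return b"".join(kept), chunk[cut:]
        kept.append(chunk)
        tail = window[-3:]
    return None, b"".join(kept)
```

The tail can never hold a complete terminator itself, since the
previous window was already searched, so cut is always at least 1.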

-- 
Terry Jan Reedy