Re: [Tutor] multiline regular expressions on endless data stream

Magnus Lycka magnus at thinkware.se
Wed Apr 28 14:48:01 EDT 2004


Duncan Gibson wrote:
> I have a Perl utility which needs to be rewritten in something readable :-)
 
Hey, that's what originally brought me to Python in 1996! :)

> What I'm currently doing is:
> 	read a line into a string
> 	if line contains start of regular expression marker
> 		read more lines until end of regular expression marker
> 		join lines into string
> 		extract data from string
> 	repeat
> 
> Unfortunately this piecemeal approach to matching the regular expression
> makes the logic very messy, which in Perl means that it's very, very messy.

Would it be impossible to load all the data into the program before
you start working on it, instead of using a line-by-line approach?

In other words: are you working with more than a few tens of MB, or
with a stream of data that you need to start processing before all
of it has been read?

With "normal" files, you could do something like this:

Let's make a file first.

>>> x = """bla bla
START
in first block
STOP
outside block
START
in second block
STOP
bla bla"""
>>> open('tmp.txt', 'w').write(x)

Now, we read the file and process it:

>>> data = open('tmp.txt').read()
>>> import re
>>> pat = r"(?ms)(^START$.*?^STOP$)"
>>> re.findall(pat, data)
['START\nin first block\nSTOP', 'START\nin second block\nSTOP']

If you can't read all of the data at once, I guess you could use
an approach where you read, let's say, 1 MB at a time, extract all
relevant data as above, and finally check for a start marker after
your last processed block. (You might not want to use findall
then, but rather something that lets you keep track of where in
your data your last end marker was.)

If you didn't find a new start marker after your processed blocks,
you can throw away all the data read so far. If there is a start
marker (i.e. you stopped reading input inside a relevant block),
you only remove the data before that start marker. Then you read
in more data and continue processing as before.

At least that should be faster than doing one line at a time.

Ouch! I just realized that you might have read in half a new
start marker. I guess you'd better always keep at least as much
of your old data as the length of a start marker - 1. :)
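
To make that bookkeeping concrete, here is a minimal sketch,
assuming plain-string START/STOP markers as in the example above.
The names process_chunks and chunk_size are my own inventions:

import re

START = 'START'
block_pat = re.compile(r'(?ms)^START$(.*?)^STOP$')

def process_chunks(stream, chunk_size=1024*1024):
    '''Yield the text between START/STOP pairs, one chunk at a time.'''
    buf = ''
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        last_end = 0
        for mo in block_pat.finditer(buf):
            yield mo.group(1)    # The text between the markers
            last_end = mo.end()  # Position after the last complete block
        dangling = buf.find(START, last_end)
        if dangling >= 0:
            # We stopped reading inside a block: keep everything
            # from its start marker onwards
            buf = buf[dangling:]
        else:
            # Keep len(START) - 1 characters in case a marker is
            # split across two reads
            buf = buf[-(len(START) - 1):]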
 
> Is there a simpler way of doing this in Python without just duplicating
> the messy logic? I've been searching through the books and archives but
> haven't quite seen what I want, which is more along the lines of:
> 
> 	discard lines until multiline regular expression matches
> 	extract data from regular expression
> 	repeat
> 
> All the examples either read a line and try to match within the line, or
> read the whole file and ignore end of line when matching. I can't slurp
> the whole data into memory using read() because it might be a data stream
> rather than a fixed length file.

Aha. I should have read all the way through before responding. :)
As I said, you can use something like .read(1000000) and work on
that. Or can you? Is there any urgency in processing data if the
stream is open but not providing more data quickly? Oh well, you
can experiment with how many bytes to read at a time...

> The regular expression is relatively simple. It's just the buffering
> around it and writing a full parser/scanner that I'm trying to avoid. 
> 
> Does anyone have any hints or tips?
> Will I kick myself for overlooking the obvious?

I don't think so, but it doesn't seem very difficult either...

Assuming you have compiled RE objects for start and stop
patterns, you can use a function like this:

def getBlocks(data, start_pat, stop_pat):
    '''
    Feed with a text and two compiled patterns for the start
    and stop markers of a block. Returns the data that might
    need further processing, and a list with the texts found
    between pairs of start/stop markers.
    '''
    blocks = []
    pos = 0 # Determines where we'll cut off the remaining data
    while True:
        mo = start_pat.search(data, pos)
        if not mo:
            return data[pos:], blocks
        pos = mo.start()   # If the block is cut off, keep its start marker
        start = mo.end()   # Beginning of a data block
        mo = stop_pat.search(data, start)
        if not mo:
            return data[pos:], blocks
        pos = mo.end()     # Resume searching after the stop marker
        stop = mo.start()  # End of a data block
        blocks.append(data[start:stop])
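
For instance, with the tmp.txt file from the first example (a
quick interactive check, with simple patterns of my own choosing):

>>> start = re.compile(r'(?m)^START$')
>>> stop = re.compile(r'(?m)^STOP$')
>>> rest, blocks = getBlocks(open('tmp.txt').read(), start, stop)
>>> blocks
['\nin first block\n', '\nin second block\n']
>>> rest
'\nbla bla'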

You would use it like this:

start = re.compile(...)
stop = re.compile(...)
text = ''

while True:
    more_text = your_data_stream.read(1000) # Whatever...
    if not more_text:
        break
    text += more_text
    text, blocks = getBlocks(text, start, stop)
    for block in blocks:
        ...do something ...
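
One caveat, tying back to the half-marker note above: when no block
is pending, getBlocks returns everything after the last stop marker,
so text will keep growing through long stretches without markers.
Assuming the start marker matches a fixed string such as 'START',
you could trim inside the loop, right after the getBlocks call:

    text, blocks = getBlocks(text, start, stop)
    if not start.search(text):
        # No block pending: keep just enough to catch a start
        # marker that is split across two reads
        text = text[-(len('START') - 1):]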

Would this work?

-- 
Magnus Lycka, Thinkware AB
Alvans vag 99, SE-907 50 UMEA, SWEDEN
phone: int+46 70 582 80 65, fax: int+46 70 612 80 65
http://www.thinkware.se/  mailto:magnus at thinkware.se


