Multiline regex

Steven D'Aprano steve-REMOVE-THIS at cybersource.com.au
Wed Jul 21 20:25:58 EDT 2010


On Wed, 21 Jul 2010 10:06:14 -0500, Brandon Harris wrote:

> what do you mean by slurp the entire file? I'm trying to use regular
> expressions because line by line parsing will be too slow. An example
> file would have somewhere in the realm of 6 million lines of code.

And you think trying to run a regex over all 6 million lines at once will 
be faster? I think you're going to be horribly, horribly disappointed.


And then on Wed, 21 Jul 2010 10:42:11 -0500, Brandon Harris wrote:

> I could make it that simple, but that is also incredibly slow and on a
> file with several million lines, it takes somewhere in the league of
> half an hour to grab all the data. I need this to grab data from many,
> many files and return the data quickly.

What do you mean "grab" all the data? If all you mean is read the file, 
then 30 minutes to read ~ 100MB of data is incredibly slow and you're 
probably doing something wrong, or you're reading it over a broken link 
with very high packet loss, or something.
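As a rough sanity check, here is one way to time a plain sequential read. The file, its path, and its size are invented for the demo (the script builds a throwaway ~9 MB file so it is self-contained); on any ordinary disk the read itself finishes in a fraction of a second, nowhere near 30 minutes:

```python
import os
import tempfile
import time

# Build a throwaway file (~9 MB) so the timing demo is self-contained.
path = os.path.join(tempfile.gettempdir(), 'read_timing_demo.txt')
with open(path, 'w') as f:
    for i in range(200000):
        f.write('setAttr ".cuvs" -type "string" "map%d";\n' % i)

t0 = time.time()
with open(path) as f:
    data = f.read()
elapsed = time.time() - t0

print('Read %.1f MB in %.3f seconds' % (len(data) / 1e6, elapsed))
os.remove(path)
```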

If you mean read the data AND parse it, then whether that is "incredibly 
slow" or "amazingly fast" depends entirely on how complicated your parser 
needs to be.

If *all* you mean is "read the file and group the lines, for later 
processing", then I would expect it to take well under a minute to group 
millions of lines. Here's a simulation I ran, using 2001000 lines of text 
based on the examples you gave. It grabs the blocks, as required, but 
does no further parsing of them.


def merge(lines):
    """Join multiple lines into a single block."""
    accumulator = []
    for line in lines:
        if line.lower().startswith('createnode'):
            if accumulator:
                yield ''.join(accumulator)
                accumulator = []
        accumulator.append(line)
    if accumulator:
        yield ''.join(accumulator)
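For illustration, here is merge() applied to a handful of in-memory lines (the node names and attributes are invented sample data, not from Brandon's files):

```python
def merge(lines):
    """Join multiple lines into a single block."""
    accumulator = []
    for line in lines:
        if line.lower().startswith('createnode'):
            if accumulator:
                yield ''.join(accumulator)
                accumulator = []
        accumulator.append(line)
    if accumulator:
        yield ''.join(accumulator)

sample = [
    'createNode transform -n "node1";\n',
    '    setAttr ".v" 0;\n',
    'createNode mesh -n "shape1";\n',
    '    setAttr ".io" 1;\n',
]
blocks = list(merge(sample))
print(len(blocks))  # → 2
print(blocks[0])    # first block: the createNode line plus its setAttr line
```

Each block starts at a line beginning with "createNode" (case-insensitive) and runs up to, but not including, the next such line.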


def test():
    import time
    t = time.time()
    count = 0
    with open('/steve/test.junk') as f:
        for block in merge(f):
            # do some make-work
            n = sum(1 for c in block if c in '1234567890')
            count += 1
    print("Processed %d blocks from 2M+ lines." % count)
    print("Time taken:", time.time() - t, "seconds")


And the result on a low-end PC:

>>> test()
Processed 1000 blocks from 2M+ lines.
Time taken: 17.4497909546 seconds
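For comparison, the same grouping can be sketched with a multiline regex (the pattern is my assumption, not something from the thread). Note the regex engine still has to scan every byte of the input, so there is no reason to expect it to beat the generator above:

```python
import re

# Invented sample text in the same shape as the blocks discussed above.
text = (
    'createNode transform -n "node1";\n'
    '    setAttr ".v" 0;\n'
    'createNode mesh -n "shape1";\n'
    '    setAttr ".io" 1;\n'
)

# Split before each line beginning with "createNode" (case-insensitive).
# The zero-width lookahead keeps the delimiter line attached to its block.
# (Splitting on an empty match requires Python 3.7+.)
pattern = re.compile(r'^(?=createnode)', re.IGNORECASE | re.MULTILINE)
blocks = [b for b in pattern.split(text) if b]
print(len(blocks))  # → 2
```

The catch is that this approach needs the whole file in memory at once, where the generator version streams it line by line.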



-- 
Steven


