regex over files

Peter Otten __peter__ at web.de
Fri Apr 29 03:10:41 EDT 2005


Robin Becker wrote:

> #sscan1.py thanks to Skip
> import sys, time, mmap, os, re
> fn = sys.argv[1]
> fh=os.open(fn,os.O_BINARY|os.O_RDONLY)
> s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
> l=n=0
> t0 = time.time()
> for mat in re.split("XXXXX", s):

re.split() returns a list, not a generator, and this list may consume a lot
of memory.
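To make the difference concrete, here is a small sketch comparing the two calls. It assumes a current Python 3 interpreter, and the sample string and variable names are made up for illustration. re.split() hands back the finished list, while re.finditer() only produces matches as you ask for them:

```python
import re

data = "aXXXXXbbXXXXXccc"  # made-up sample, not the real file

# re.split() builds the complete list of pieces before it returns.
pieces = re.split("XXXXX", data)

# re.finditer() returns an iterator; each match is produced on demand.
matches = re.finditer("XXXXX", data)
first = next(matches)  # only the first separator has been located so far
```

With a large mmap'ed file the list holds a copy of every piece at once, which is where the memory goes.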

>     n += 1
>     l += len(mat)
> t1 = time.time()
> 
> print "fn=%s n=%d l=%d time=%.2f" % (fn, n, l, (t1-t0))

I wrote a generator replacement for re.split(), but as you might expect, its
performance is nowhere near that of re.split(). For your large data it might
still help somewhat because of its smaller memory footprint.

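import re

def splititer(regex, data):
    # Like re.split(), but never yields the separators, even if the
    # pattern contains capturing groups.
    if not hasattr(regex, "finditer"):
        # Accept a pattern string as well as a compiled regex.
        regex = re.compile(regex)
    start = 0
    for match in regex.finditer(data):
        end, new_start = match.span()
        yield data[start:end]   # text before this separator
        start = new_start       # continue after the separator
    yield data[start:]          # remainder after the last separator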
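
A quick way to check the generator against re.split() is sketched below. It assumes a current Python 3 interpreter, and the sample string and result names are invented for the example. The function is repeated only so the snippet runs on its own:

```python
import re

def splititer(regex, data):
    # Same generator as above, repeated so this snippet is self-contained.
    if not hasattr(regex, "finditer"):
        regex = re.compile(regex)
    start = 0
    for match in regex.finditer(data):
        end, new_start = match.span()
        yield data[start:end]
        start = new_start
    yield data[start:]

sample = "aXXXXXbbXXXXXccc"  # made-up test data

# Without capturing groups both approaches give the same pieces.
same_as_split = list(splititer("XXXXX", sample)) == re.split("XXXXX", sample)

# With a capturing group re.split() also returns the separators ...
with_seps = re.split("(XXXXX)", sample)
# ... while splititer() still yields only the text between them.
without_seps = list(splititer("(XXXXX)", sample))
```

In the timing script the loop would then read "for mat in splititer("XXXXX", s):", so that only one piece is held in memory at a time.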

Peter


