regex over files

Wed Apr 27 22:39:45 EDT 2005

    Robin> I implemented a simple scanning algorithm in two ways. First buffered scan 
    Robin> tscan0.py; second mmapped scan tscan1.py.

    ...

    Robin> C:\code\reportlab\demos\gadflypaper>\tmp\tscan0.py dingo.dat
    Robin> len=139583265 w=103 time=110.91

    Robin> C:\code\reportlab\demos\gadflypaper>\tmp\tscan1.py dingo.dat
    Robin> len=139583265 w=103 time=140.53

I'm not sure why the mmap() solution is so much slower for you.  Perhaps on
some systems files opened for reading are mmap'd under the covers.  I'm sure
it's highly platform-dependent.  (My results on Mac OS X, shown below, are
somewhat better.)

Let me return to your original problem though, doing regex operations on
files.  I modified your two scripts slightly:

tscan0.py:

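    import sys, time, re
    fn = sys.argv[1]
    f = open(fn, 'rb')
    n = 0
    t0 = time.time()
    while 1:
        # read the file in fixed 4096-byte chunks
        buf = f.read(4096)
        if not buf: break
        # each chunk is split on its own, so a separator that
        # straddles two chunks is never seen
        for i in re.split(b"XXXXX", buf):
            n += 1
    t1 = time.time()

    print("n=%d time=%.2f" % (n, (t1-t0)))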

tscan1.py:

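    import sys, time, mmap, os, re
    fn = sys.argv[1]
    fh = os.open(fn, os.O_RDONLY)
    # map the whole file read-only; a length of 0 means "the entire file"
    s = mmap.mmap(fh, 0, access=mmap.ACCESS_READ)
    t0 = time.time()
    n = 0
    # the regex engine scans the mapping directly, so there are
    # no chunk boundaries to worry about
    for mat in re.split(b"XXXXX", s):
        n += 1
    t1 = time.time()

    print("n=%d time=%.2f" % (n, (t1-t0)))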

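The mmap version is almost obviously correct, assuming what we want to do is
split the file on "XXXXX".  The buffered read version is almost certainly
incorrect, since corner cases lurk at buffer boundaries: a separator that
straddles two 4096-byte reads is never matched, and a piece that spans a
read boundary is counted once on each side of it.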
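
To make the boundary problem concrete, here is a small sketch that splits
the same bytes once as a whole and once chunk by chunk.  The helper names
and the toy data are mine, not taken from either script, and it is written
for Python 3:

```python
import re

SEP = b"XXXXX"

def count_pieces_whole(data):
    """Split the whole byte string at once, as the mmap script does."""
    return len(re.split(SEP, data))

def count_pieces_chunked(data, chunk_size):
    """Split each fixed-size chunk on its own, as the buffered script does."""
    n = 0
    for start in range(0, len(data), chunk_size):
        n += len(re.split(SEP, data[start:start + chunk_size]))
    return n

# toy data (an assumption for illustration): three pieces, two separators
data = b"aaXXXXXbbXXXXXcc"

print(count_pieces_whole(data))        # whole-buffer count
print(count_pieces_chunked(data, 8))   # boundary falls inside the "bb" piece
print(count_pieces_chunked(data, 4))   # no chunk holds a complete separator
```

With a chunk size of 8 both separators are still found, but "bb" is cut in
two and counted twice.  With a chunk size of 4 no chunk ever contains a
complete separator, so each chunk counts as one piece.  Either way the
chunked count is 4 where the whole-buffer count is 3.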

I took the file from Bengt Richter's example and replicated it a bunch of
times to get a 122MB file.  I then ran the above two programs against it:

    % python tscan1.py splitX
    n=2112001 time=8.88
    % python tscan0.py splitX
    n=2139845 time=10.26

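So the mmap'd version runs within 15% of the buffered read version (on this
machine it is actually the faster of the two, 8.88s against 10.26s) and we
don't have to solve the problem of any corner cases (note the different
values of n).  Even where mmap turns out somewhat slower, as in your
measurements, I'm happy to take the extra runtime in exchange for simpler
code.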
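
For comparison, this is roughly what solving the corner cases in the
buffered version would involve.  It is only a sketch under a strong
assumption: the separator is a fixed literal, so carrying over the last
len(SEP)-1 bytes of the unsplit remainder is enough to catch a separator
that straddles two reads.  The function name count_pieces_buffered and the
carry-over scheme are my own illustration (Python 3), not something from
Robin's script, and a pattern with variable-length matches would need more
care than this:

```python
import io
import re

SEP = b"XXXXX"

def count_pieces_buffered(f, chunk_size=4096):
    """Count the pieces re.split(SEP, whole_file) would return,
    while reading the binary file object f in chunks."""
    keep = len(SEP) - 1   # bytes that could begin a straddling separator
    n = 1                 # a split always yields one more piece than separators
    tail = b""
    while True:
        buf = f.read(chunk_size)
        if not buf:
            break
        # prepend the carried-over bytes so a separator cut by the
        # previous read boundary becomes visible in this chunk
        pieces = re.split(SEP, tail + buf)
        n += len(pieces) - 1          # number of separators found
        # carry over only the end of the unsplit remainder
        tail = pieces[-1][-keep:] if keep else b""
    return n

# usage on an in-memory "file" (toy data, an assumption for illustration)
data = b"aaXXXXXbbXXXXXcc"
print(count_pieces_buffered(io.BytesIO(data), chunk_size=4))
print(len(re.split(SEP, data)))
```

Both lines print the same count here, but the bookkeeping is exactly the
kind of code the mmap version lets us avoid writing.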

Skip