regex over files
Robin Becker
robin at reportlab.com
Thu Apr 28 09:22:13 EDT 2005
Skip Montanaro wrote:
.....
>
> Let me return to your original problem though, doing regex operations on
> files. I modified your two scripts slightly:
>
.....
> Skip
I'm sure my results depend on something other than the coding style;
I suspect the file/disk cache and paging are at work here. Note that we now agree on
total match length and split count. However, when the Windows VM goes into
paging mode the mmap version falls off the world, as I would expect for a thrashing
system.
e.g. small memory (relatively)
C:\code\reportlab\demos\gadflypaper>\tmp\sscan0.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=3.55
C:\code\reportlab\demos\gadflypaper>\tmp\sscan1.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=8.25
C:\code\reportlab\demos\gadflypaper>\tmp\sscan1.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=9.77
C:\code\reportlab\demos\gadflypaper>\tmp\sscan0.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=5.09
C:\code\reportlab\demos\gadflypaper>\tmp\sscan1.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=6.17
C:\code\reportlab\demos\gadflypaper>\tmp\sscan0.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=4.64
and large memory
C:\code\reportlab\demos\gadflypaper>\tmp\sscan0.py xxx_200mb.dat
fn=xxx_200mb.dat n=3797470 l=181012689 time=20.16
C:\code\reportlab\demos\gadflypaper>\tmp\sscan1.py xxx_200mb.dat
fn=xxx_200mb.dat n=3797470 l=181012689 time=136.42
At the end of this run I had to wait quite a long time for other things to
become responsive (i.e. things were entirely paged out).
Here I've implemented slightly modified versions of the scanners that you put
forward.
eg
#sscan0.py thanks to Bengt
import sys, time, re
fn = sys.argv[1]
rxo = re.compile('XXXXX')

def frxsplit(path, rxo, chunksize=4096):
    buffer = ''
    for chunk in iter((lambda f=open(path,'rb'): f.read(chunksize)),''):
        buffer += chunk
        pieces = rxo.split(buffer)
        for piece in pieces[:-1]: yield piece
        buffer = pieces[-1]
    yield buffer

l=n=0
t0 = time.time()
for mat in frxsplit(fn,rxo):
    n += 1
    l += len(mat)
t1 = time.time()

print "fn=%s n=%d l=%d time=%.2f" % (fn, n, l, (t1-t0))
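
For anyone wanting to check that the chunked scanner really does agree with a whole-buffer split, here is a rough Python 3 sketch of the same idea. It is not the script I timed: the sample data, the tiny chunksize and the tally() helper are made up for illustration. It relies on the pattern being a fixed literal with no groups; a pattern whose match could change once more data arrives (say X+) would need more care at chunk boundaries.

```python
import os
import re
import tempfile

def frxsplit(path, rxo, chunksize=4096):
    """Yield the pieces of the file at path, split on the compiled bytes regex rxo."""
    buffer = b""
    with open(path, "rb") as f:
        # iter(callable, sentinel) keeps reading until read() returns b"" at EOF
        for chunk in iter(lambda: f.read(chunksize), b""):
            buffer += chunk
            pieces = rxo.split(buffer)
            # every piece but the last is followed by a delimiter, so it is complete
            for piece in pieces[:-1]:
                yield piece
            # the last piece may still grow, or may end in a partial delimiter
            buffer = pieces[-1]
    yield buffer

def tally(pieces):
    """Hypothetical helper: return (piece count, total piece length)."""
    n = total = 0
    for piece in pieces:
        n += 1
        total += len(piece)
    return n, total

# Made-up sample data; a chunksize of 7 forces delimiters to straddle chunk boundaries.
data = b"alphaXXXXXbetaXXXXXXXgammaXXXXX" * 1000
rxo = re.compile(b"XXXXX")
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
try:
    chunked = tally(frxsplit(tmp.name, rxo, chunksize=7))
    whole = tally(rxo.split(data))
finally:
    os.remove(tmp.name)

print("chunked=%r whole=%r agree=%r" % (chunked, whole, chunked == whole))
```

The unfinished tail is carried over in buffer, so a delimiter cut in half by a read is still seen whole on the next pass.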
#sscan1.py thanks to Skip
import sys, time, mmap, os, re
fn = sys.argv[1]
fh=os.open(fn,os.O_BINARY|os.O_RDONLY)
s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
l=n=0
t0 = time.time()
for mat in re.split("XXXXX", s):
    n += 1
    l += len(mat)
t1 = time.time()

print "fn=%s n=%d l=%d time=%.2f" % (fn, n, l, (t1-t0))
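
One thing that may add to the memory pressure in the mmap version (I have not measured this) is that re.split builds the complete list of pieces before the loop sees the first one, so copies of nearly the whole file are held on top of the mapped pages. Below is a rough Python 3 sketch of a lazy alternative that walks the mapping with finditer and yields one piece at a time. The names split_iter and tally and the sample data are invented for illustration, and I make no claim about how it would time on the files above.

```python
import mmap
import os
import re
import tempfile

def split_iter(rxo, buf):
    """Yield the pieces of buf between matches of rxo, one at a time.

    Intended to produce the same pieces as rxo.split(buf) for a pattern
    without groups that never matches the empty string.
    """
    start = 0
    for match in rxo.finditer(buf):
        yield buf[start:match.start()]   # slicing an mmap copies out a bytes object
        start = match.end()
    yield buf[start:]

def tally(pieces):
    """Hypothetical helper: return (piece count, total piece length)."""
    n = total = 0
    for piece in pieces:
        n += 1
        total += len(piece)
    return n, total

# Made-up sample data written to a temporary file so that it can be mapped.
data = b"alphaXXXXXbetaXXXXXXXgammaXXXXX" * 1000
rxo = re.compile(b"XXXXX")
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
try:
    with open(tmp.name, "rb") as f:
        # length 0 maps the whole file (mmap raises ValueError for an empty file)
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            lazy = tally(split_iter(rxo, m))
            eager = tally(rxo.split(m))
finally:
    os.remove(tmp.name)

print("lazy=%r eager=%r agree=%r" % (lazy, eager, lazy == eager))
```

Only one piece is alive at a time here, though the OS still has to page the mapping through memory as the scan advances.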
--
Robin Becker