regex over files

Robin Becker robin at SPAMREMOVEjessikat.fsnet.co.uk
Thu Apr 28 16:33:04 EDT 2005


Skip Montanaro wrote:
>...
> 
> I'm not sure why the mmap() solution is so much slower for you.  Perhaps on
> some systems files opened for reading are mmap'd under the covers.  I'm sure
> it's highly platform-dependent.  (My results on MacOSX - see below - are
> somewhat better.)
> 
> Let me return to your original problem though, doing regex operations on
> files.  I modified your two scripts slightly:
> 

I'm sure my results depend on something other than the coding style; I
suspect file/disk cache and paging come into play here. Note that we
now agree on total match length and split count. However, when the
Windows VM goes into paging mode the mmap version falls off the world,
as I would expect for a thrashing system.

e.g. with relatively small memory pressure (100MB file)
C:\code\reportlab\demos\gadflypaper>\tmp\sscan0.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=3.55

C:\code\reportlab\demos\gadflypaper>\tmp\sscan1.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=8.25

C:\code\reportlab\demos\gadflypaper>\tmp\sscan1.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=9.77

C:\code\reportlab\demos\gadflypaper>\tmp\sscan0.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=5.09

C:\code\reportlab\demos\gadflypaper>\tmp\sscan1.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=6.17

C:\code\reportlab\demos\gadflypaper>\tmp\sscan0.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=4.64

and with large memory pressure (200MB file)
C:\code\reportlab\demos\gadflypaper>\tmp\sscan0.py xxx_200mb.dat
fn=xxx_200mb.dat n=3797470 l=181012689 time=20.16

C:\code\reportlab\demos\gadflypaper>\tmp\sscan1.py xxx_200mb.dat
fn=xxx_200mb.dat n=3797470 l=181012689 time=136.42

At the end of this run I had to wait quite a long time for other things
to become responsive (i.e. everything else had been paged out).

As another data point, with sscan0/1.py (slight mods of your code) I get
this for a 200MB file on FreeBSD 4.9:

/usr/RL_HOME/users/robin/sstest:
$ python sscan0.py xxx_200mb.dat
fn=xxx_200mb.dat n=3797470 l=181012689 time=7.37
/usr/RL_HOME/users/robin/sstest:
$ python sscan1.py xxx_200mb.dat
fn=xxx_200mb.dat n=3797470 l=181012689 time=129.65
/usr/RL_HOME/users/robin/sstest:

i.e. the FreeBSD VM seems to thrash just as nastily as XP :(


####################################################################
Here are the slightly modified versions of the scanners that you put
forward:

#sscan0.py thanks to Bengt
import sys, time, re
fn = sys.argv[1]
rxo = re.compile('XXXXX')

def frxsplit(path, rxo, chunksize=4096):
    # Read the file in chunks, splitting on the regex as we go.  The last
    # (possibly incomplete) piece of each split is carried forward so a
    # delimiter straddling a chunk boundary is still found.
    buffer = ''
    for chunk in iter((lambda f=open(path,'rb'): f.read(chunksize)), ''):
        buffer += chunk
        pieces = rxo.split(buffer)
        # every piece but the last is known to be complete
        for piece in pieces[:-1]:
            yield piece
        buffer = pieces[-1]
    # whatever is left after the final chunk is the last piece
    yield buffer

l = n = 0
t0 = time.time()
for mat in frxsplit(fn, rxo):
    n += 1
    l += len(mat)
t1 = time.time()

print "fn=%s n=%d l=%d time=%.2f" % (fn, n, l, (t1-t0))

#sscan1.py thanks to Skip
import sys, time, mmap, os, re
fn = sys.argv[1]
# os.O_BINARY only exists on Windows; fall back to 0 elsewhere (FreeBSD)
fh = os.open(fn, getattr(os, 'O_BINARY', 0) | os.O_RDONLY)
# map the whole file read-only and let re.split scan it in one pass
s = mmap.mmap(fh, 0, access=mmap.ACCESS_READ)
l = n = 0
t0 = time.time()
for mat in re.split("XXXXX", s):
    n += 1
    l += len(mat)
t1 = time.time()

print "fn=%s n=%d l=%d time=%.2f" % (fn, n, l, (t1-t0))
-- 
Robin Becker


