regex over files
Skip Montanaro
skip at pobox.com
Wed Apr 27 22:39:45 EDT 2005
Robin> I implemented a simple scanning algorithm in two ways. First buffered scan
Robin> tscan0.py; second mmapped scan tscan1.py.
...
Robin> C:\code\reportlab\demos\gadflypaper>\tmp\tscan0.py dingo.dat
Robin> len=139583265 w=103 time=110.91
Robin> C:\code\reportlab\demos\gadflypaper>\tmp\tscan1.py dingo.dat
Robin> len=139583265 w=103 time=140.53
I'm not sure why the mmap() solution is so much slower for you. Perhaps on
some systems files opened for reading are mmap'd under the covers. I'm sure
it's highly platform-dependent. (My results on MacOSX - see below - are
somewhat better.)
Let me return to your original problem though, doing regex operations on
files. I modified your two scripts slightly:
tscan0.py:
import sys, time, re
fn = sys.argv[1]
f=open(fn,'rb')
n=0
t0 = time.time()
while 1:
buf = f.read(4096)
if not buf: break
for i in re.split("XXXXX", buf):
n += 1
t1 = time.time()
print "n=%d time=%.2f" % (n, (t1-t0))
tscan1.py:
import sys, time, mmap, os, re
fn = sys.argv[1]
fh=os.open(fn,os.O_RDONLY)
s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
t0 = time.time()
n = 0
for mat in re.split("XXXXX", s):
n += 1
t1 = time.time()
print "n=%d time=%.2f" % (n, (t1-t0))
The mmap version is almost obviously correct, assuming what we want to do is
split the file on "XXXXX". The buffered read version is almost certainly
incorrect, given our understanding that corner cases lurk at buffer
boundaries.
I took the file from Bengt Richter's example and replicated it a bunch of
times to get a 122MB file. I then ran the above two programs against it:
% python tscan1.py splitX
n=2112001 time=8.88
% python tscan0.py splitX
n=2139845 time=10.26
So the mmap'd version is within 15% of the performance of the buffered read
version and we don't have to solve the problem of any corner cases (note the
different values of n). I'm happy to take the extra runtime in exchange for
simpler code.
Skip
More information about the Python-list
mailing list