regex over files

Robin Becker robin at SPAMREMOVEjessikat.fsnet.co.uk
Fri Apr 29 03:51:25 EDT 2005


Peter Otten wrote:
> Robin Becker wrote:
> 
> 
>>#sscan1.py thanks to Skip
>>import sys, time, mmap, os, re
>>fn = sys.argv[1]
>>fh=os.open(fn,os.O_BINARY|os.O_RDONLY)
>>s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
>>l=n=0
>>t0 = time.time()
>>for mat in re.split("XXXXX", s):
> 
> 
> re.split() returns a list, not a generator, and this list may consume a lot
> of memory.
> 
> 
That would certainly be the case, and it may explain why the simple way does so
badly as memory use grows. I'll have a go at this experiment as well. My original
intention was to find the start of each match, as a scanner would, and this would
certainly do that. However, my observation with the trivial byte scan seems to
imply that just scanning the file causes VM problems (at least on XP). I suppose
it's hard to explain to the OS that I actually only need the relevant few pages.
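
For the record, here is a rough sketch of the lazy variant I have in mind. It
is not the sscan1.py above: it assumes a current Python 3 (where the pattern
must be bytes and mmap objects work as context managers), and the name
match_offsets is just something I made up for illustration. re.finditer()
hands back one match object at a time, so no list of pieces is ever built,
although the OS still has to page in every byte that gets scanned.

```python
import mmap
import re
import sys


def match_offsets(path, pattern=b"XXXXX"):
    """Yield the start offset of each match in the file, one at a time.

    Hypothetical helper for illustration; the default pattern is the
    separator used in the quoted script.
    """
    with open(path, "rb") as fh:
        # Length 0 maps the whole file; note that an empty file cannot be mapped.
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as s:
            # finditer() is lazy, unlike re.split(), which returns a full list.
            for mat in re.finditer(pattern, s):
                yield mat.start()


if __name__ == "__main__" and len(sys.argv) > 1:
    n = 0
    for offset in match_offsets(sys.argv[1]):
        n += 1
    print(n, "matches")
```

This only avoids the big intermediate list; it does nothing about the paging
behaviour of the scan itself.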
-- 
Robin Becker


