regex over files

Wed Apr 27 05:29:06 EDT 2005

Jeremy Bowers wrote:
> On Tue, 26 Apr 2005 20:54:53 +0000, Robin Becker wrote:
> 
> 
>>Skip Montanaro wrote:
>>...
>>
>>>If I mmap() a file, it's not slurped into main memory immediately, though as
>>>you pointed out, it's charged to my process's virtual memory.  As I access
>>>bits of the file's contents, it will page in only what's necessary.  If I
>>>mmap() a huge file, then print out a few bytes from the middle, only the
>>>page containing the interesting bytes is actually copied into physical
>>>memory.
>>
>>....
>>my simple rather stupid experiment indicates that windows mmap at least 
>>will reserve 25Mb of paged file for a linear scan through a 25Mb file. I 
>>probably only need 4096b to scan. That's a lot less than even the page 
>>table requirement. This isn't rocket science just an old style observation.
> 
> 
> Are you trying to claim Skip is wrong, or what? There's little value in
> saying that by mapping a file of 25MB into VM pages, you've increased your
> allocated paged file space by 25MB. That's effectively tautological. 
> 
> If you are trying to claim Skip is wrong, you *do not understand* what you
> are talking about. Talk less, listen and study more. (This is my best
> guess, as like I said, observing that allocating things increases the
> number of things that are allocated isn't worth posting so my thought is
> you think you are proving something. If you really are just posting
> something tautological, my apologies and disregard this paragraph but,
> well, it's certainly not out of line at this point.)

Well, I obviously don't understand, so perhaps you can explain these results.

I implemented a simple scanning algorithm in two ways: first as a buffered
scan (tscan0.py), second as an mmapped scan (tscan1.py).

For small file sizes the times are comparable.

C:\code\reportlab\demos\gadflypaper>\tmp\tscan0.py bingo.pdf
len=27916653 w=103 time=22.13

C:\code\reportlab\demos\gadflypaper>\tmp\tscan1.py bingo.pdf
len=27916653 w=103 time=22.20

For large file sizes, when paging becomes of interest, the buffered scan wins
even though it has to execute a lot more Python statements. If this were coded
in C the results would be plainer still. As I said, this isn't about right or
wrong; it's an observation. If I inspect the performance monitor, tscan0 is at
100%, but tscan1 is at 80-90% and all of memory gets used up, so paging is
important. This may be an effect of the poor design of XP; if so, perhaps it
won't hold for other OSes.

C:\code\reportlab\demos\gadflypaper>\tmp\tscan0.py dingo.dat
len=139583265 w=103 time=110.91

C:\code\reportlab\demos\gadflypaper>\tmp\tscan1.py dingo.dat
len=139583265 w=103 time=140.53
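
As a rough check on the four timings above, the short sketch below converts
them to throughput. The byte counts and times are copied from the runs; the
helper name throughput is only illustrative. It suggests the buffered scan
holds roughly the same rate on both files, while the mmapped scan drops on
the larger one.

```python
# (label, bytes scanned, seconds) copied from the console output above
runs = [
    ("tscan0 bingo.pdf", 27916653, 22.13),
    ("tscan1 bingo.pdf", 27916653, 22.20),
    ("tscan0 dingo.dat", 139583265, 110.91),
    ("tscan1 dingo.dat", 139583265, 140.53),
]


def throughput(nbytes, seconds):
    """Bytes scanned per second (illustrative helper)."""
    return nbytes / seconds


for label, nbytes, seconds in runs:
    print("%-18s %.2f MB/s" % (label, throughput(nbytes, seconds) / 1e6))

# How much longer the mmapped scan took than the buffered one on the large file
slowdown = 140.53 / 110.91
print("mmap/buffered time ratio on dingo.dat: %.2f" % slowdown)
```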

C:\code\reportlab\demos\gadflypaper>cat \tmp\tscan0.py
import sys, time
fn = sys.argv[1]
f=open(fn,'rb')
n=0
w=0
t0 = time.time()
while 1:
     buf = f.read(4096)
     lb = len(buf)
     if not lb: break
     n += lb
     for i in xrange(lb):
         w ^= ord(buf[i])
t1 = time.time()

print "len=%d w=%d time=%.2f" % (n, w, (t1-t0))

C:\code\reportlab\demos\gadflypaper>cat \tmp\tscan1.py
import sys, time, mmap, os
fn = sys.argv[1]
fh=os.open(fn,os.O_BINARY|os.O_RDONLY)
s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
n=len(s)
w=0
t0 = time.time()
for i in xrange(n):
     w ^= ord(s[i])
t1 = time.time()

print "len=%d w=%d time=%.2f" % (n, w, (t1-t0))
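
For anyone wanting to repeat the comparison on a current interpreter, here is
a minimal sketch of the same two scans in Python 3. It is an adaptation, not
the scripts that produced the numbers above: the function names, the timed
helper and the 1 MB temporary test file are assumptions made for illustration,
and the timings it prints will not be comparable with the 2005 runs.

```python
import mmap
import os
import tempfile
import time


def buffered_scan(path, bufsize=4096):
    """XOR every byte of the file, reading it in bufsize chunks."""
    n = 0
    w = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(bufsize)
            if not buf:
                break
            n += len(buf)
            for b in buf:  # iterating bytes yields ints in Python 3
                w ^= b
    return n, w


def mmap_scan(path):
    """XOR every byte of the file by indexing a read-only memory map."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; this raises ValueError on an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as s:
            n = len(s)
            w = 0
            for i in range(n):
                w ^= s[i]  # indexing an mmap yields an int in Python 3
    return n, w


def timed(scan, path):
    """Run one scan and return (length, xor, elapsed seconds)."""
    t0 = time.perf_counter()
    n, w = scan(path)
    return n, w, time.perf_counter() - t0


if __name__ == "__main__":
    # Hypothetical 1 MB test file; the original runs used much larger files.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(os.urandom(1 << 20))
        for scan in (buffered_scan, mmap_scan):
            print(scan.__name__, "len=%d w=%d time=%.2f" % timed(scan, path))
    finally:
        os.remove(path)
```

To see any paging effect the test file would need to be large relative to
physical memory, as dingo.dat apparently was on the machine used above.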

-- 
Robin Becker