regex over files

Robin Becker robin at
Wed Apr 27 05:29:06 EDT 2005

Jeremy Bowers wrote:
> On Tue, 26 Apr 2005 20:54:53 +0000, Robin Becker wrote:
>>Skip Montanaro wrote:
>>>If I mmap() a file, it's not slurped into main memory immediately, though as
>>>you pointed out, it's charged to my process's virtual memory.  As I access
>>>bits of the file's contents, it will page in only what's necessary.  If I
>>>mmap() a huge file, then print out a few bytes from the middle, only the
>>>page containing the interesting bytes is actually copied into physical
>>>memory.
>>my simple rather stupid experiment indicates that windows mmap at least 
>>will reserve 25Mb of paged file for a linear scan through a 25Mb file. I 
>>probably only need 4096b to scan. That's a lot less than even the page 
>>table requirement. This isn't rocket science just an old style observation.
> Are you trying to claim Skip is wrong, or what? There's little value in
> saying that by mapping a file of 25MB into VM pages, you've increased your
> allocated paged file space by 25MB. That's effectively tautological. 
> If you are trying to claim Skip is wrong, you *do not understand* what you
> are talking about. Talk less, listen and study more. (This is my best
> guess, as like I said, observing that allocating things increases the
> number of things that are allocated isn't worth posting so my thought is
> you think you are proving something. If you really are just posting
> something tautological, my apologies and disregard this paragraph but,
> well, it's certainly not out of line at this point.)

Well, I obviously don't understand, so perhaps you can explain these results.

I implemented a simple scanning algorithm in two ways: first as a buffered scan, second as an mmapped scan.

For small file sizes the times are comparable.

C:\code\reportlab\demos\gadflypaper>\tmp\ bingo.pdf
len=27916653 w=103 time=22.13

C:\code\reportlab\demos\gadflypaper>\tmp\ bingo.pdf
len=27916653 w=103 time=22.20

For large file sizes, when paging becomes of interest, the buffered scan wins even
though it has to execute a lot more Python statements. If this were coded in C the
results would be plainer still. As I said, this isn't about right or wrong; it's
an observation. If I inspect the performance monitor, tscan0 is at 100%, but
tscan1 is at 80-90% and all of memory gets used up, so paging is important. This
may be an effect of the poor design of XP; if so, perhaps it won't hold for other
systems.

C:\code\reportlab\demos\gadflypaper>\tmp\ dingo.dat
len=139583265 w=103 time=110.91

C:\code\reportlab\demos\gadflypaper>\tmp\ dingo.dat
len=139583265 w=103 time=140.53
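
To put the four timings above on a common scale, here is a small sketch that turns them into throughput figures. It assumes that in each pair of runs the first is the buffered scan and the second the mmapped scan (the order in which the two are introduced); the names RUNS and throughput_mb_per_s are made up for this illustration. Note that the large file is exactly five times the size of the small one, so a scan that scales linearly should take about five times as long.

```python
# (label, file size in bytes, elapsed seconds), copied from the runs above.
# Assumption: the first run of each pair is the buffered scan, the second mmap.
RUNS = [
    ("buffered, small file", 27916653, 22.13),
    ("mmap,     small file", 27916653, 22.20),
    ("buffered, large file", 139583265, 110.91),
    ("mmap,     large file", 139583265, 140.53),
]


def throughput_mb_per_s(size, seconds):
    """Bytes scanned per second, in units of 10**6 bytes."""
    return size / seconds / 1e6


if __name__ == "__main__":
    for label, size, seconds in RUNS:
        print(f"{label}: {throughput_mb_per_s(size, seconds):.3f} MB/s")

    # How much longer the second run took than the first on the large file.
    print(f"large-file ratio: {RUNS[3][2] / RUNS[2][2]:.2f}")
```

On these numbers the buffered scan keeps roughly the same rate on both files, while the second scan drops noticeably on the large one.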

C:\code\reportlab\demos\gadflypaper>cat \tmp\
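import sys, time
fn = sys.argv[1]
f = open(fn, 'rb')       # assumed: the open call is missing from the archived copy
n = w = 0                # byte count and running XOR of every byte
t0 = time.time()
while 1:
     buf = f.read(4096)  # assumed read size; 4096 is the figure mentioned earlier
     lb = len(buf)
     if not lb: break
     n += lb
     for i in xrange(lb):
         w ^= ord(buf[i])
t1 = time.time()

print "len=%d w=%d time=%.2f" % (n, w, (t1-t0))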

C:\code\reportlab\demos\gadflypaper>cat \tmp\
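import sys, time, mmap, os
fn = sys.argv[1]
fh = os.open(fn, os.O_BINARY|os.O_RDONLY)
# assumed: the mapping call is missing from the archived copy
s = mmap.mmap(fh, 0, access=mmap.ACCESS_READ)
n = len(s)
w = 0                    # running XOR of every byte
t0 = time.time()
for i in xrange(n):
     w ^= ord(s[i])
t1 = time.time()

print "len=%d w=%d time=%.2f" % (n, w, (t1-t0))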
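
For anyone who wants to repeat the comparison on a current interpreter, here is a rough Python 3 sketch of the same two scans. It is not the code that produced the timings above: the function names buffered_xor and mmap_xor, the 4096-byte default buffer and the temporary-file demo are all assumptions made for illustration. In Python 3, iterating over bytes and indexing an mmap both yield ints, so ord() is not needed.

```python
import mmap
import os
import tempfile
import time


def buffered_xor(path, bufsize=4096):
    """XOR every byte of the file, reading bufsize bytes at a time."""
    n = w = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(bufsize)
            if not buf:
                break
            n += len(buf)
            for b in buf:  # each b is an int in Python 3
                w ^= b
    return n, w


def mmap_xor(path):
    """XOR every byte of the file through a read-only memory map."""
    w = 0
    with open(path, "rb") as f:
        # Length 0 maps the whole file; an empty file cannot be mapped.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as s:
            n = len(s)
            for i in range(n):
                w ^= s[i]
    return n, w


def timed(func, path):
    """Run func(path) and return (n, w, elapsed seconds)."""
    t0 = time.perf_counter()
    n, w = func(path)
    return n, w, time.perf_counter() - t0


if __name__ == "__main__":
    # Small throwaway file so the demo finishes quickly; paging effects
    # only show up with files that are large relative to physical memory.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, os.urandom(64 * 1024))
        os.close(fd)
        for func in (buffered_xor, mmap_xor):
            print("%s: len=%d w=%d time=%.2f" % ((func.__name__,) + timed(func, path)))
    finally:
        os.remove(path)
```

Both functions should report the same len and w for any non-empty file; only the elapsed time is expected to differ.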

Robin Becker
