Scanning a file

Paul Watson pwatson at redlinepy.com
Sun Oct 30 20:53:17 EST 2005


Fredrik Lundh wrote:
> Paul Watson wrote:
> 
>>This is Cygwin on Windows XP.
> 
> using cygwin to analyze performance characteristics of portable API:s
> is a really lousy idea.

OK, I agree.  That was just what I had at hand.  Here are some other
numbers, to which due diligence has also not been applied.  Source code
for both the file-read and mmap versions is at the bottom.  I would
welcome suggestions on what I could improve.

$ python -V
Python 2.4.1

$ uname -a
Linux ruth 2.6.13-1.1532_FC4 #1 Thu Oct 20 01:30:08 EDT 2005 i686

$ cat /proc/meminfo|head -2
MemTotal:       514232 kB
MemFree:         47080 kB

$ time ./scanfile.py
16384

real    0m0.06s
user    0m0.03s
sys     0m0.01s

$ time ./scanfilemmap.py
16384

real    0m0.10s
user    0m0.06s
sys     0m0.00s

Now using a ~250 MB file, which is less than half of physical memory:

$ time ./scanfile.py
16777216

real    0m11.19s
user    0m10.98s
sys     0m0.17s

$ time ./scanfilemmap.py
16777216

real    0m55.09s
user    0m43.12s
sys     0m11.92s
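
The figures above come from the shell's time, so they include
interpreter startup.  For an in-process comparison, here is a minimal
sketch for a current Python 3 interpreter (not what produced the numbers
above).  The helper name timed() is hypothetical; it relies only on
os.times() and time.perf_counter() from the standard library.

```python
import os
import time


def timed(func, *args, **kwargs):
    """Run func once and return (result, real, user, sys) in seconds.

    Hypothetical helper: real is wall-clock time, user and sys are the
    CPU times of this process as reported by os.times().
    """
    cpu_before = os.times()
    wall_before = time.perf_counter()
    result = func(*args, **kwargs)
    wall_after = time.perf_counter()
    cpu_after = os.times()
    return (
        result,
        wall_after - wall_before,
        cpu_after.user - cpu_before.user,
        cpu_after.system - cpu_before.system,
    )
```

Calling timed() on each scanning function in turn, against the same
file, keeps startup cost out of the comparison.  A single run is still
only a rough indication, since the file may or may not already be in
the page cache.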

==============================

$ cat scanfile.py
#!/usr/bin/env python

import sys

fn = 't.dat'
#ss = '\x00\x00\x01\x00'   # not used in this run
ss = 'time'

be = len(ss) - 1        # length of overlap to check; assumes len(ss) >= 2
blocksize = 64 * 1024   # need to ensure that blocksize > overlap

fp = open(fn, 'rb')
b = fp.read(blocksize)
count = 0
while len(b) > be:
    count += b.count(ss)
    # carry the last 'be' bytes over so a match split across blocks is seen
    b = b[-be:] + fp.read(blocksize)
fp.close()

print count
sys.exit(0)
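
For anyone rerunning this on a current interpreter, here is a minimal
Python 3 sketch of the same block-read-with-overlap idea.  It is not the
script that was timed above.  The function name count_chunked() and its
parameters are hypothetical.  Like the script, it uses count(), which
counts non-overlapping matches within each buffer, so for a search
string that can overlap itself (such as 'aa') the total may differ from
a count over the whole file.  'time' cannot overlap itself, so that does
not matter here.

```python
def count_chunked(path, needle, blocksize=64 * 1024):
    """Count occurrences of needle (bytes) in the file at path.

    Reads blocksize bytes at a time and keeps the last len(needle) - 1
    bytes of each buffer, so a match split across two reads is found.
    """
    overlap = len(needle) - 1
    if not needle or blocksize <= overlap:
        raise ValueError("need a non-empty needle and blocksize > overlap")

    count = 0
    tail = b""
    with open(path, "rb") as fp:
        while True:
            chunk = fp.read(blocksize)
            if not chunk:
                break
            buf = tail + chunk
            count += buf.count(needle)
            # The tail is shorter than the needle, so it can never hold
            # a complete match and nothing is counted twice.
            tail = buf[-overlap:] if overlap else b""
    return count
```

Unlike the script, this version also handles a one-byte search string,
because the tail is set to empty rather than sliced with [-0:].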

===================================

$ cat scanfilemmap.py
#!/usr/bin/env python

import sys
import os
import mmap

fn = 't.dat'
#ss = '\x00\x00\x01\x00'   # not used in this run
ss = 'time'

fp = open(fn, 'rb')
b = mmap.mmap(fp.fileno(), os.stat(fp.name).st_size,
              access=mmap.ACCESS_READ)

count = 0
foundpoint = b.find(ss, 0)
while foundpoint != -1 and (foundpoint + 1) < b.size():
    #print foundpoint
    count += 1
    # resume one byte past the hit, so overlapping matches are counted too
    foundpoint = b.find(ss, foundpoint + 1)
b.close()

print count

fp.close()
sys.exit(0)
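
A comparable Python 3 sketch of the mmap loop, again not the script that
was timed.  The name count_mmap() and the overlapping flag are
hypothetical additions.  With overlapping=True it steps one byte past
each hit, as the script does.  With overlapping=False it steps past the
whole match, which gives the same answer as count() on the full data.

```python
import mmap


def count_mmap(path, needle, overlapping=True):
    """Count occurrences of needle (bytes) in the file at path via mmap."""
    if not needle:
        raise ValueError("needle must not be empty")

    # Step 1 byte to allow overlapping hits, or a full needle length to
    # count only non-overlapping ones.
    step = 1 if overlapping else len(needle)
    count = 0
    with open(path, "rb") as fp:
        # Length 0 maps the whole file; mmap raises ValueError if the
        # file is empty.
        with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as m:
            pos = m.find(needle)
            while pos != -1:
                count += 1
                pos = m.find(needle, pos + step)
    return count
```

Each hit costs one find() call from Python, so a file with many matches
means many round trips through the interpreter.  Whether that accounts
for the gap in the timings above has not been measured here.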


