Scanning a file

Sun Oct 30 20:37:18 EST 2005

In article <1130637600.659212.66140 at g43g2000cwa.googlegroups.com>,
 netvaibhav at gmail.com wrote:

> Steve Holden wrote:
> > Indeed, but reading one byte at a time is about the slowest way to
> > process a file, in Python or any other language, because it fails to
> > amortize the overhead cost of function calls over many characters.
> >
> > Buffering wasn't invented because early programmers had nothing better
> > to occupy their minds, remember :-)
> 
> Buffer, and then read one byte at a time from the buffer.

Have you measured it?

#!/usr/bin/python
'''Time some file scanning.
'''

import sys, time

f = open(sys.argv[1], 'rb')
t = time.time()
while True:
    b = f.read(256*1024)
    if not b:
        break
print 'initial read', time.time() - t
f.close()

f = open(sys.argv[1], 'rb')
t = time.time()
while True:
    b = f.read(256*1024)
    if not b:
        break
print 'second read', time.time() - t
f.close()

if 1:
    f = open(sys.argv[1], 'rb')
    t = time.time()
    while True:
        b = f.read(256*1024)
        if not b:
            break
        for c in b:
            pass
    print 'third chars', time.time() - t
    f.close()

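# Fourth pass: count occurrences of a byte pattern, one chunk at a time.
f = open(sys.argv[1], 'rb')
t = time.time()
n = 0
srch = '\x00\x00\x01\x00'
laplen = len(srch)-1    # a match can straddle a boundary by at most this much
lap = ''                # tail of the previous chunk
while True:
    b = f.read(256*1024)
    if not b:
        break
    # matches that straddle the boundary between the previous chunk and this one
    n += (lap+b[:laplen]).count(srch)
    # matches that lie entirely inside this chunk
    n += b.count(srch)
    lap = b[-laplen:]
print 'fourth scan', time.time() - t, n
f.close()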

On my (old) system, with a 512 MB file (big enough that it won't all
stay in the buffer cache), the second run gives:

initial read 14.513395071
second read 14.8771388531
third chars 178.250257969
fourth scan 26.1602909565 1
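
For anyone who wants to reuse the fourth loop, here is a rough sketch of
the same chunk-and-overlap idea wrapped in a function. It is not the
script that produced the timings above. It assumes Python 3 (bytes
literals, io.BytesIO), and the name count_pattern and its chunk_size
parameter are made up for illustration. Unlike the loop above, it keeps
the carried-over tail correct even when a read returns fewer bytes than
the overlap. Like str.count, it counts non-overlapping matches within
each piece, so a pattern that can overlap itself may be counted
differently than in a single whole-file count.

```python
import io


def count_pattern(stream, pattern, chunk_size=256 * 1024):
    """Count occurrences of `pattern` in a binary stream, reading in chunks.

    Hypothetical helper for illustration; assumes `stream.read(n)` returns
    bytes and an empty value at end of file.
    """
    overlap = len(pattern) - 1  # a match can straddle a boundary by this much
    tail = b""                  # last `overlap` bytes seen so far
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        if overlap:
            # Matches that start in earlier data and end in this chunk.
            total += (tail + chunk[:overlap]).count(pattern)
        # Matches that lie entirely inside this chunk.
        total += chunk.count(pattern)
        if overlap:
            # Keep only the last `overlap` bytes, even after a short read.
            tail = (tail + chunk[-overlap:])[-overlap:]
    return total


if __name__ == "__main__":
    pat = b"\x00\x00\x01\x00"
    data = b"ab" + pat + b"xyz" + pat + b"q"
    # A tiny chunk size forces both matches to straddle chunk boundaries.
    print(count_pattern(io.BytesIO(data), pat, chunk_size=5))
```

The point stays the same as in the timings: the per-byte work happens
inside count(), in C, rather than in a Python-level loop over characters.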
________________________________________________________________________
TonyN.:'                        *firstname*nlsnews at georgea*lastname*.com
      '                                  <http://www.georgeanelson.com/>