Scanning a file

Fri Oct 28 23:03:17 EDT 2005

Mike Meyer <mwm at mired.org> wrote:
   ...
> Except if you can't read the file into memory because it's too large,
> there's a pretty good chance you won't be able to mmap it either.  To
> deal with huge files, the only option is to read the file in
> chunks, count the occurrences in each chunk, and then do some fiddling
> to deal with the pattern landing on a boundary.

That's the kind of thing generators are for...:

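def byblocks(f, blocksize, overlap):
    # Each block after the first starts with the last `overlap`
    # bytes of the previous one; assumes blocksize > overlap.
    block = f.read(blocksize)
    yield block
    while True:
        fresh = f.read(blocksize - overlap)
        if not fresh:
            break  # EOF: stop rather than re-yield the tail forever
        # block[-0:] would be the whole block, so special-case 0
        tail = block[-overlap:] if overlap else block[:0]
        block = tail + fresh
        yield block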

Now, to count the occurrences of a substring subst in a binary file:

f = open(whatever, 'rb')
count = 0
for block in byblocks(f, 1024*1024, len(subst)-1):
    count += block.count(subst)
f.close()

not much "fiddling" needed, as you can see, and what little "fiddling"
is needed is entirely encompassed by the generator...
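
For a quick sanity check, here is a minimal self-contained sketch that
runs the same idea against an in-memory stream, with a tiny block size
so that both matches straddle a block boundary.  The byblocks here is a
variant that stops at end of file; count_in_stream, the sample data and
the block size of 4 are made up for illustration, and the code assumes
blocksize is larger than overlap:

```python
import io

def byblocks(f, blocksize, overlap):
    # Variant of the generator that stops cleanly at end of file.
    # Assumption: blocksize > overlap, so every read makes progress.
    block = f.read(blocksize)
    yield block
    while True:
        fresh = f.read(blocksize - overlap)
        if not fresh:
            break
        # Carry over the last `overlap` bytes (none when overlap is 0).
        tail = block[-overlap:] if overlap else block[:0]
        block = tail + fresh
        yield block

def count_in_stream(f, subst, blocksize=1024 * 1024):
    # Hypothetical helper: sum the per-block counts, overlapping blocks
    # by len(subst) - 1 so a match on a boundary is seen exactly once.
    overlap = len(subst) - 1
    return sum(block.count(subst) for block in byblocks(f, blocksize, overlap))

# Illustrative data: both b"abc" matches cross a 4-byte read boundary.
data = b"xxabcxxxabcx"
blocks = list(byblocks(io.BytesIO(data), 4, 2))
chunked = count_in_stream(io.BytesIO(data), b"abc", blocksize=4)
whole = data.count(b"abc")
print(blocks)
print(chunked, whole)
```

One caveat: count() only counts non-overlapping matches, so for a
pattern that can overlap itself (say 'aa') the per-block total may
differ from a single count() over the whole file.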

Alex