Scanning a file

Sat Oct 29 14:12:40 EDT 2005

On Sat, 29 Oct 2005 10:34:24 +0200, Peter Otten <__peter__ at web.de> wrote:

>Bengt Richter wrote:
>
>> On Fri, 28 Oct 2005 20:03:17 -0700, aleaxit at yahoo.com (Alex Martelli)
>> wrote:
>> 
>>>Mike Meyer <mwm at mired.org> wrote:
>>>   ...
>>>> Except if you can't read the file into memory because it's too large,
>>>> there's a pretty good chance you won't be able to mmap it either.  To
>>>> deal with huge files, the only option is to read the file in
>>>> chunks, count the occurrences in each chunk, and then do some fiddling
>>>> to deal with the pattern landing on a boundary.
>>>
>>>That's the kind of thing generators are for...:
>>>
>>>def byblocks(f, blocksize, overlap):
>>>    block = f.read(blocksize)
>>>    yield block
>>>    while block:
>>>        block = block[-overlap:] + f.read(blocksize-overlap)
>>>        if block: yield block
>>>
>>>Now, to look for a substring of length N in an open binary file f:
>>>
>>>f = open(whatever, 'rb')
>>>count = 0
>>>for block in byblocks(f, 1024*1024, len(subst)-1):
>>>    count += block.count(subst)
>>>f.close()
>>>
>>>not much "fiddling" needed, as you can see, and what little "fiddling"
>>>is needed is entirely encompassed by the generator...
>>>
>> Do I get a job at google if I find something wrong with the above? ;-)
>
>Try it with a subst of length 1. Seems like you missed an opportunity :-)
>
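To spell out the length-1 case: len(subst)-1 is then 0, and block[-0:] is
the same slice as block[0:], i.e. the whole block rather than an empty tail.
A small sketch (the sample data is made up for illustration):

```python
block = b"abcdef"   # made-up sample data

# With a positive overlap, the negative slice is the tail of the block.
assert block[-2:] == b"ef"

# With overlap == 0, -0 is just 0, so the "tail" is the entire block.
overlap = 0
tail = block[-overlap:]
assert tail == block

# An explicit guard gives the intended empty tail of the same type.
safe_tail = block[-overlap:] if overlap else block[:0]
assert safe_tail == b""
```

So with a one-character subst, each new block would start with the whole
previous block, and its matches would be counted again.
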
I was thinking this was an example a la Alex's previous discussion
of interviewee code challenges ;-)

What struck me was

 >>> import StringIO
 >>> gen = byblocks(StringIO.StringIO('no'),1024,len('end?')-1)
 >>> [gen.next() for i in xrange(10)]
 ['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no']
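
The loop never ends because, at end of file, f.read() returns an empty
string but block[-overlap:] does not, so the new block is never empty.
One possible repair is sketched below. It is written for present-day
Python 3, not the Python 2 of this thread. It is only a sketch: the helper
count_in_file and the ValueError check are my own additions, and it assumes
that f.read(n) returns fewer than n items only at end of file.

```python
import io

def byblocks(f, blocksize, overlap):
    """Yield blocks from f; each block after the first starts with the
    last `overlap` items of the previous block."""
    if not 0 <= overlap < blocksize:
        raise ValueError("need 0 <= overlap < blocksize")
    block = f.read(blocksize)
    while block:
        yield block
        fresh = f.read(blocksize - overlap)
        if not fresh:
            # End of file: stop here instead of re-yielding the tail forever.
            break
        # block[-0:] would be the whole block, so handle overlap == 0 explicitly.
        tail = block[-overlap:] if overlap else block[:0]
        block = tail + fresh

def count_in_file(f, subst, blocksize=1024 * 1024):
    """Hypothetical helper: count subst in f, reading block by block."""
    return sum(b.count(subst) for b in byblocks(f, blocksize, len(subst) - 1))

# The case above now terminates after a single block.
short = list(byblocks(io.StringIO('no'), 1024, len('end?') - 1))

# A one-character pattern (overlap == 0) is no longer double counted.
single = count_in_file(io.BytesIO(b'banana'), b'a', blocksize=4)
```

Here short is ['no'] and single is 3. One caveat: count() counts
non-overlapping matches, so for a pattern that can overlap itself (such as
'aa') the per-block total may differ from a single count() over the whole
file.
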

Regards,
Bengt Richter