Scanning a file

Sun Oct 30 19:23:41 EST 2005

On Sat, 29 Oct 2005 21:10:11 +0100, Steve Holden <steve at holdenweb.com> wrote:

>Peter Otten wrote:
>> Bengt Richter wrote:
>> 
>> 
>>>What struck me was
>>>
>>>
>>>>>> gen = byblocks(StringIO.StringIO('no'),1024,len('end?')-1)
>>>>>> [gen.next() for i in xrange(10)]
>>>
>>>['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no']
>> 
>> 
>> Ouch. Seems like I spotted the subtle cornercase error and missed the big
>> one.
>> 
>
>No, you just realised subconsciously that we'd all spot the obvious one 
>and decided to point out the bug that would remain after the obvious one 
>had been fixed.
>
I still smelled a bug in the counting of substring in the overlap region,
and you motivated me to find it (obvious in hindsight, but aren't most ;-)

A substring can get over-counted if the "overlap" region joins infelicitously with the next
input. E.g., try counting 'xx' in 10*'xx' with a read chunk of 4 instead of 1024*1024:

Assuming corrections so far posted as I understand them:

 >>> def byblocks(f, blocksize, overlap):
 ...     block = f.read(blocksize)
 ...     yield block
 ...     if overlap>0:
 ...         while True:
 ...             next = f.read(blocksize-overlap)
 ...             if not next: break
 ...             block = block[-overlap:] + next
 ...             yield block
 ...     else:
 ...         while True:
 ...             next = f.read(blocksize)
 ...             if not next: break
 ...             yield next
 ...
 >>> def countsubst(f, subst, blksize=1024*1024):
 ...     count = 0
 ...     for block in byblocks(f, blksize, len(subst)-1):
 ...         count += block.count(subst)
 ...     f.close()
 ...     return count
 ...

 >>> from StringIO import StringIO as S
 >>> countsubst(S('xx'*10), 'xx',  4)
 13
 >>> ('xx'*10).count('xx')
 10
 >>> list(byblocks(S('xx'*10), 4, len('xx')-1))
 ['xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xx']

Of course, a large read chunk will make the problem either
go away

 >>> countsubst(S('xx'*10), 'xx',  1024)
 10

or might make it low probability depending on the data.

Regards,
Bengt Richter