Scanning a file
Bengt Richter
bokr at oz.net
Sun Oct 30 19:23:41 EST 2005
On Sat, 29 Oct 2005 21:10:11 +0100, Steve Holden <steve at holdenweb.com> wrote:
>Peter Otten wrote:
>> Bengt Richter wrote:
>>
>>
>>>What struck me was
>>>
>>>
>>>>>> gen = byblocks(StringIO.StringIO('no'),1024,len('end?')-1)
>>>>>> [gen.next() for i in xrange(10)]
>>>
>>>['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no']
>>
>>
>> Ouch. Seems like I spotted the subtle cornercase error and missed the big
>> one.
>>
>
>No, you just realised subconsciously that we'd all spot the obvious one
>and decided to point out the bug that would remain after the obvious one
>had been fixed.
>
I still smelled a bug in the counting of substring in the overlap region,
and you motivated me to find it (obvious in hindsight, but aren't most ;-)
A substring can get over-counted if the "overlap" region joins infelicitously with the next
input. E.g., try counting 'xx' in 10*'xx' with a read chunk of 4 instead of 1024*1024:
Assuming corrections so far posted as I understand them:
>>> def byblocks(f, blocksize, overlap):
... block = f.read(blocksize)
... yield block
... if overlap>0:
... while True:
... next = f.read(blocksize-overlap)
... if not next: break
... block = block[-overlap:] + next
... yield block
... else:
... while True:
... next = f.read(blocksize)
... if not next: break
... yield next
...
>>> def countsubst(f, subst, blksize=1024*1024):
... count = 0
... for block in byblocks(f, blksize, len(subst)-1):
... count += block.count(subst)
... f.close()
... return count
...
>>> from StringIO import StringIO as S
>>> countsubst(S('xx'*10), 'xx', 4)
13
>>> ('xx'*10).count('xx')
10
>>> list(byblocks(S('xx'*10), 4, len('xx')-1))
['xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xx']
Of course, a large read chunk will make the problem either
go away
>>> countsubst(S('xx'*10), 'xx', 1024)
10
or might make it low probability depending on the data.
Regards,
Bengt Richter
More information about the Python-list
mailing list