Scanning a file
Peter Otten
__peter__ at web.de
Mon Oct 31 03:19:10 EST 2005
Bengt Richter wrote:
> I still smelled a bug in the counting of substring in the overlap region,
> and you motivated me to find it (obvious in hindsight, but aren't most ;-)
>
> A substring can get over-counted if the "overlap" region joins
> infelicitously with the next input. E.g., try counting 'xx' in 10*'xx'
> with a read chunk of 4 instead of 1024*1024:
>
> Assuming corrections so far posted as I understand them:
>
> >>> def byblocks(f, blocksize, overlap):
> ... block = f.read(blocksize)
> ... yield block
> ... if overlap>0:
> ... while True:
> ... next = f.read(blocksize-overlap)
> ... if not next: break
> ... block = block[-overlap:] + next
> ... yield block
> ... else:
> ... while True:
> ... next = f.read(blocksize)
> ... if not next: break
> ... yield next
> ...
> >>> def countsubst(f, subst, blksize=1024*1024):
> ... count = 0
> ... for block in byblocks(f, blksize, len(subst)-1):
> ... count += block.count(subst)
> ... f.close()
> ... return count
> ...
>
> >>> from StringIO import StringIO as S
> >>> countsubst(S('xx'*10), 'xx', 4)
> 13
> >>> ('xx'*10).count('xx')
> 10
> >>> list(byblocks(S('xx'*10), 4, len('xx')-1))
> ['xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xx']
>
> Of course, a large read chunk will make the problem either
> go away
>
> >>> countsubst(S('xx'*10), 'xx', 1024)
> 10
>
> or might make it low probability depending on the data.
[David Rasmussen]
> First of all, this isn't a text file, it is a binary file. Secondly,
> substrings can overlap. In the sequence 0010010 the substring 0010
> occurs twice.
Coincidentally the "always overlap" case seems the easiest to fix. It
suffices to replace the count() method with
def count_overlap(s, token):
pos = -1
n = 0
while 1:
try:
pos = s.index(token, pos+1)
except ValueError:
break
n += 1
return n
Or so I hope, without the thorough tests that are indispensable as we should
have learned by now...
Peter
More information about the Python-list
mailing list