Scanning a file

Mon Oct 31 03:19:10 EST 2005

Bengt Richter wrote:

> I still smelled a bug in the counting of substring in the overlap region,
> and you motivated me to find it (obvious in hindsight, but aren't most ;-)
> 
> A substring can get over-counted if the "overlap" region joins
> infelicitously with the next input. E.g., try counting 'xx' in 10*'xx'
> with a read chunk of 4 instead of 1024*1024:
> 
> Assuming corrections so far posted as I understand them:
> 
>  >>> def byblocks(f, blocksize, overlap):
>  ...     block = f.read(blocksize)
>  ...     yield block
>  ...     if overlap>0:
>  ...         while True:
>  ...             next = f.read(blocksize-overlap)
>  ...             if not next: break
>  ...             block = block[-overlap:] + next
>  ...             yield block
>  ...     else:
>  ...         while True:
>  ...             next = f.read(blocksize)
>  ...             if not next: break
>  ...             yield next
>  ...
>  >>> def countsubst(f, subst, blksize=1024*1024):
>  ...     count = 0
>  ...     for block in byblocks(f, blksize, len(subst)-1):
>  ...         count += block.count(subst)
>  ...     f.close()
>  ...     return count
>  ...
> 
>  >>> from StringIO import StringIO as S
>  >>> countsubst(S('xx'*10), 'xx',  4)
>  13
>  >>> ('xx'*10).count('xx')
>  10
>  >>> list(byblocks(S('xx'*10), 4, len('xx')-1))
>  ['xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xx']
> 
> Of course, a large read chunk will make the problem either
> go away
> 
>  >>> countsubst(S('xx'*10), 'xx',  1024)
>  10
> 
> or might make it low probability depending on the data.

[David Rasmussen]

> First of all, this isn't a text file, it is a binary file. Secondly,
> substrings can overlap. In the sequence 0010010 the substring 0010
> occurs twice.

Coincidentally the "always overlap" case seems the easiest to fix. It
suffices to replace the count() method with

def count_overlap(s, token):
    pos = -1
    n = 0
    while 1:
        try:
            pos = s.index(token, pos+1)
        except ValueError:
            break
        n += 1
    return n

Or so I hope, without the thorough tests that are indispensable as we should
have learned by now...

Peter