Scanning a file

Mon Oct 31 12:02:18 EST 2005

On Mon, 31 Oct 2005 09:19:10 +0100, Peter Otten <__peter__ at web.de> wrote:

>Bengt Richter wrote:
>
>> I still smelled a bug in the counting of substring in the overlap region,
>> and you motivated me to find it (obvious in hindsight, but aren't most ;-)
>> 
>> A substring can get over-counted if the "overlap" region joins
>> infelicitously with the next input. E.g., try counting 'xx' in 10*'xx'
>> with a read chunk of 4 instead of 1024*1024:
>> 
>> Assuming corrections so far posted as I understand them:
>> 
>>  >>> def byblocks(f, blocksize, overlap):
>>  ...     block = f.read(blocksize)
>>  ...     yield block
>>  ...     if overlap>0:
>>  ...         while True:
>>  ...             next = f.read(blocksize-overlap)
>>  ...             if not next: break
>>  ...             block = block[-overlap:] + next
>>  ...             yield block
>>  ...     else:
>>  ...         while True:
>>  ...             next = f.read(blocksize)
>>  ...             if not next: break
>>  ...             yield next
>>  ...
>>  >>> def countsubst(f, subst, blksize=1024*1024):
>>  ...     count = 0
>>  ...     for block in byblocks(f, blksize, len(subst)-1):
>>  ...         count += block.count(subst)
>>  ...     f.close()
>>  ...     return count
>>  ...
>> 
>>  >>> from StringIO import StringIO as S
>>  >>> countsubst(S('xx'*10), 'xx',  4)
>>  13
>>  >>> ('xx'*10).count('xx')
>>  10
>>  >>> list(byblocks(S('xx'*10), 4, len('xx')-1))
>>  ['xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xx']
>> 
>> Of course, a large read chunk will make the problem either
>> go away
>> 
>>  >>> countsubst(S('xx'*10), 'xx',  1024)
>>  10
>> 
>> or might make it low probability depending on the data.
>
>[David Rasmussen]
>
>> First of all, this isn't a text file, it is a binary file. Secondly,
>> substrings can overlap. In the sequence 0010010 the substring 0010
>> occurs twice.
The OP didn't reply to my post re the above for some reason
   http://groups.google.com/group/comp.lang.python/msg/dd4125bf38a54b7c?hl=en&

>
>Coincidentally the "always overlap" case seems the easiest to fix. It
>suffices to replace the count() method with
>
>def count_overlap(s, token):
>    pos = -1
>    n = 0
>    while 1:
>        try:
>            pos = s.index(token, pos+1)
>        except ValueError:
>            break
>        n += 1
>    return n
>
>Or so I hope, without the thorough tests that are indispensable as we should
>have learned by now...
>
Unfortunately, there is such a thing as a correct implementation of an incorrect spec ;-)
I have some doubts about the OP's really wanting to count overlapping patterns as above,
which is what I asked about in the above referenced post. Elsewhere he later reveals:

[David Rasmussen]
>> If you must know, the above one-liner actually counts the number of 
>> frames in an MPEG2 file. I want to know this number for a number of 
>> files for various reasons. I don't want it to take forever.

In which case I doubt whether he wants to count as above. Scanning for the
particular 4 bytes would assume that non-frame-marker data is escaped
one way or another so it can't contain the marker byte sequence.
(If it did, you'd want to skip it, not count it, I presume). Robust streaming video
format would presumably be designed for unambigous re-synching, meaning
the data stream can't contain the sync mark. But I don't know if that
is guaranteed in conversion from file to stream a la HDLC or some link packet protocol
or whether it is actually encoded with escaping in the file. If framing in the file is with
length-specifying packet headers and no data escaping, then the filebytes.count(pattern)
approach is not going to do the job reliably, as Lasse was pointing out.

Requirements, requirements ;-)

Regards,
Bengt Richter