Scanning a file
Bengt Richter
bokr at oz.net
Mon Oct 31 12:02:18 EST 2005
On Mon, 31 Oct 2005 09:19:10 +0100, Peter Otten <__peter__ at web.de> wrote:
>Bengt Richter wrote:
>
>> I still smelled a bug in the counting of substring in the overlap region,
>> and you motivated me to find it (obvious in hindsight, but aren't most ;-)
>>
>> A substring can get over-counted if the "overlap" region joins
>> infelicitously with the next input. E.g., try counting 'xx' in 10*'xx'
>> with a read chunk of 4 instead of 1024*1024:
>>
>> Assuming corrections so far posted as I understand them:
>>
>> >>> def byblocks(f, blocksize, overlap):
>> ... block = f.read(blocksize)
>> ... yield block
>> ... if overlap>0:
>> ... while True:
>> ... next = f.read(blocksize-overlap)
>> ... if not next: break
>> ... block = block[-overlap:] + next
>> ... yield block
>> ... else:
>> ... while True:
>> ... next = f.read(blocksize)
>> ... if not next: break
>> ... yield next
>> ...
>> >>> def countsubst(f, subst, blksize=1024*1024):
>> ... count = 0
>> ... for block in byblocks(f, blksize, len(subst)-1):
>> ... count += block.count(subst)
>> ... f.close()
>> ... return count
>> ...
>>
>> >>> from StringIO import StringIO as S
>> >>> countsubst(S('xx'*10), 'xx', 4)
>> 13
>> >>> ('xx'*10).count('xx')
>> 10
>> >>> list(byblocks(S('xx'*10), 4, len('xx')-1))
>> ['xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xx']
>>
>> Of course, a large read chunk will make the problem either
>> go away
>>
>> >>> countsubst(S('xx'*10), 'xx', 1024)
>> 10
>>
>> or might make it low probability depending on the data.
>
>[David Rasmussen]
>
>> First of all, this isn't a text file, it is a binary file. Secondly,
>> substrings can overlap. In the sequence 0010010 the substring 0010
>> occurs twice.
The OP didn't reply to my post re the above for some reason
http://groups.google.com/group/comp.lang.python/msg/dd4125bf38a54b7c?hl=en&
>
>Coincidentally the "always overlap" case seems the easiest to fix. It
>suffices to replace the count() method with
>
>def count_overlap(s, token):
> pos = -1
> n = 0
> while 1:
> try:
> pos = s.index(token, pos+1)
> except ValueError:
> break
> n += 1
> return n
>
>Or so I hope, without the thorough tests that are indispensable as we should
>have learned by now...
>
Unfortunately, there is such a thing as a correct implementation of an incorrect spec ;-)
I have some doubts about the OP's really wanting to count overlapping patterns as above,
which is what I asked about in the above referenced post. Elsewhere he later reveals:
[David Rasmussen]
>> If you must know, the above one-liner actually counts the number of
>> frames in an MPEG2 file. I want to know this number for a number of
>> files for various reasons. I don't want it to take forever.
In which case I doubt whether he wants to count as above. Scanning for the
particular 4 bytes would assume that non-frame-marker data is escaped
one way or another so it can't contain the marker byte sequence.
(If it did, you'd want to skip it, not count it, I presume). Robust streaming video
format would presumably be designed for unambigous re-synching, meaning
the data stream can't contain the sync mark. But I don't know if that
is guaranteed in conversion from file to stream a la HDLC or some link packet protocol
or whether it is actually encoded with escaping in the file. If framing in the file is with
length-specifying packet headers and no data escaping, then the filebytes.count(pattern)
approach is not going to do the job reliably, as Lasse was pointing out.
Requirements, requirements ;-)
Regards,
Bengt Richter
More information about the Python-list
mailing list