Scanning a file

Fri Oct 28 23:27:41 EDT 2005

"Paul Watson" <pwatson at redlinepy.com> wrote in message 
news:3sf070Fo0klqU1 at individual.net...
> <pinkfloydhomer at gmail.com> wrote in message 
> news:1130497567.764104.125110 at g44g2000cwa.googlegroups.com...
>>I want to scan a file byte for byte for occurrences of the four-byte
>> pattern 0x00000100. I've tried with this:
>>
>> # start
>> import sys
>>
>> numChars = 0
>> startCode = 0
>> count = 0
>>
>> inputFile = sys.stdin
>>
>> while True:
>>    ch = inputFile.read(1)
>>    numChars += 1
>>
>>    if len(ch) < 1: break
>>
>>    startCode = ((startCode << 8) & 0xffffffff) | (ord(ch))
>>    if numChars < 4: continue
>>
>>    if startCode == 0x00000100:
>>        count = count + 1
>>
>> print count
>> # end
>>
>> But it is very slow. What is the fastest way to do this? Using some
>> native call? Using a buffer? Using whatever?
>>
>> /David

Here is a better one that counts the occurrences of the substring instead 
of just detecting it.  This is -much- faster than using mmap, especially for 
a large file that may cause paging to start.  Using mmap can be -very- slow.

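#!/usr/bin/env python
import sys

fn = 't2.dat'
ss = b'\x00\x00\x01\x00'    # byte pattern to count (bytes, to match 'rb' mode)

be = len(ss) - 1        # length of overlap to check
blocksize = 64 * 1024    # need to ensure that blocksize > overlap

fp = open(fn, 'rb')
b = fp.read(blocksize)
count = 0
while len(b) > be:
    # count() reports the non-overlapping matches in the current buffer
    count += b.count(ss)
    # carry the last `be` bytes forward so that a match split across
    # two blocks is still seen on the next pass
    b = b[-be:] + fp.read(blocksize)
fp.close()

print(count)
sys.exit(0)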
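
One caveat about the script above: count() only reports non-overlapping 
matches, and this particular pattern can overlap itself by one byte (its last 
byte is the same as its first).  The byte-by-byte loop in the original 
question counts such overlapping matches, so the two can disagree on input 
like 00 00 01 00 00 01 00.  If that matters, here is a rough sketch of the 
same block-reading idea wrapped in a function that steps through find() one 
position at a time.  The name count_pattern and its parameters are my own 
invention for illustration, not anything from the standard library, and it 
assumes a file object opened in binary mode.

```python
import io


def count_pattern(fileobj, pattern, blocksize=64 * 1024):
    """Count occurrences of `pattern` (bytes) in a binary file object.

    Hypothetical helper: overlapping occurrences and occurrences that
    straddle a block boundary are both counted.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")
    overlap = len(pattern) - 1   # bytes to carry over between blocks
    count = 0
    tail = b""
    while True:
        block = fileobj.read(blocksize)
        if not block:            # end of file
            break
        buf = tail + block
        pos = buf.find(pattern)
        while pos != -1:
            count += 1
            # advance by one byte only, so overlapping matches are found
            pos = buf.find(pattern, pos + 1)
        # the tail is shorter than the pattern, so a match already counted
        # can never be found a second time in the next buffer
        tail = buf[-overlap:] if overlap else b""
    return count
```

To use it on a real file, something like count_pattern(open('t2.dat', 'rb'), 
b'\x00\x00\x01\x00') should do.  I have not timed it against the plain 
count() version; the inner find() loop is presumably a bit slower when there 
are many matches.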