Scanning a file
Paul Watson
pwatson at redlinepy.com
Fri Oct 28 23:27:41 EDT 2005
"Paul Watson" <pwatson at redlinepy.com> wrote in message
news:3sf070Fo0klqU1 at individual.net...
> <pinkfloydhomer at gmail.com> wrote in message
> news:1130497567.764104.125110 at g44g2000cwa.googlegroups.com...
>> I want to scan a file byte for byte for occurrences of the four-byte
>> pattern 0x00000100. I've tried with this:
>>
>> # start
>> import sys
>>
>> numChars = 0
>> startCode = 0
>> count = 0
>>
>> inputFile = sys.stdin
>>
>> while True:
>>     ch = inputFile.read(1)
>>     numChars += 1
>>
>>     if len(ch) < 1: break
>>
>>     startCode = ((startCode << 8) & 0xffffffff) | (ord(ch))
>>     if numChars < 4: continue
>>
>>     if startCode == 0x00000100:
>>         count = count + 1
>>
>> print count
>> # end
>>
>> But it is very slow. What is the fastest way to do this? Using some
>> native call? Using a buffer? Using whatever?
>>
>> /David
Here is a better one that counts the occurrences of the substring rather
than just detecting it. This is -much- faster than using mmap, especially
for a large file that may cause paging to start. Using mmap can be -very- slow.
#!/usr/bin/env python
import sys

fn = 't2.dat'
ss = '\x00\x00\x01\x00'
be = len(ss) - 1        # length of overlap to check
blocksize = 64 * 1024   # need to ensure that blocksize > overlap

fp = open(fn, 'rb')
b = fp.read(blocksize)
count = 0
while len(b) > be:
    count += b.count(ss)
    # keep the last `be` bytes so a match spanning two blocks is still seen
    b = b[-be:] + fp.read(blocksize)
fp.close()

print count
sys.exit(0)
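
For anyone running this on Python 3, the same block-plus-overlap idea might
be sketched roughly as below. This is not the script above. The names
count_pattern and count_overlapping are made up for illustration, and the
pattern has to be a bytes object. One deliberate difference: count() only
counts non-overlapping matches, and this pattern starts and ends with 0x00,
so two matches can share a byte. The sketch steps one byte past each hit
with find() instead, which should agree with the byte-for-byte scanner in
the question.

```python
import io


def count_overlapping(buf, pattern):
    """Count every position in buf where pattern starts (overlaps included)."""
    n = 0
    i = buf.find(pattern)
    while i != -1:
        n += 1
        i = buf.find(pattern, i + 1)  # step one byte, not len(pattern)
    return n


def count_pattern(fp, pattern, blocksize=64 * 1024):
    """Count occurrences of pattern in a binary file object, block by block.

    Hypothetical helper: fp is anything with a read(n) method that returns
    bytes, such as open(name, 'rb') or io.BytesIO.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")
    overlap = len(pattern) - 1
    if blocksize <= overlap:
        raise ValueError("blocksize must be larger than len(pattern) - 1")

    count = 0
    buf = fp.read(blocksize)
    while len(buf) > overlap:
        count += count_overlapping(buf, pattern)
        # Carry the last `overlap` bytes so a match that straddles two
        # blocks is seen exactly once. Slicing from len(buf) - overlap
        # (rather than -overlap) also works when overlap is 0.
        buf = buf[len(buf) - overlap:] + fp.read(blocksize)
    return count


if __name__ == "__main__":
    # Two matches that share the middle 0x00 byte, read in tiny blocks.
    sample = io.BytesIO(b"\x00\x00\x01\x00\x00\x01\x00")
    print(count_pattern(sample, b"\x00\x00\x01\x00", blocksize=4))  # prints 2
```

For a real file, something like count_pattern(open('t2.dat', 'rb'),
b'\x00\x00\x01\x00') would be the equivalent call. If overlapping matches
do not matter for your data, buf.count(pattern) in place of
count_overlapping is simpler and probably a bit faster.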