Scanning a file

Sat Oct 29 01:38:21 EDT 2005

"Paul Watson" <pwatson at redlinepy.com> writes:
> Here is a better one that counts, and not just detects, the substring.  This 
> is -much- faster than using mmap; especially for a large file that may cause 
> paging to start.  Using mmap can be -very- slow.
>
> #!/usr/bin/env python
> import sys
>
> fn = 't2.dat'
> ss = '\x00\x00\x01\x00'
>
> be = len(ss) - 1        # length of overlap to check
> blocksize = 64 * 1024    # need to ensure that blocksize > overlap
>
> fp = open(fn, 'rb')
> b = fp.read(blocksize)
> count = 0
> while len(b) > be:
>     count += b.count(ss)
>     b = b[-be:] + fp.read(blocksize)
> fp.close()
>
> print count
> sys.exit(0)

Did you time it against mmap? Because strings are immutable, every
pass through the loop copies data to handle the overlap: once for the
slice, and again when it is concatenated with the next block. That
looks like a loss, and makes me wonder how it could be faster than
mmap in general.
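
One way to settle it would be to time both approaches on the same file.
What follows is only a rough sketch, written for Python 3 rather than
the Python 2 of the quoted script. The helper names (make_sample,
count_chunked, count_mmap) and the sample data are made up for
illustration, and the mmap version assumes a simple find() loop is a
fair stand-in for count(). Real numbers will depend heavily on file
size, available memory, and whether the file is already in the OS cache.

```python
import mmap
import os
import tempfile
import timeit


def make_sample(path, needle, repeats=50):
    """Write a hypothetical test file: filler bytes followed by the needle."""
    with open(path, "wb") as fp:
        for _ in range(repeats):
            fp.write(b"abc" * 1000 + needle)


def count_chunked(path, needle, blocksize=64 * 1024):
    """Count needle by reading blocks and carrying len(needle) - 1 bytes over."""
    overlap = len(needle) - 1
    count = 0
    tail = b""
    with open(path, "rb") as fp:
        while True:
            block = fp.read(blocksize)
            if not block:
                break
            buf = tail + block          # copies the block once more
            count += buf.count(needle)
            # Keep only the bytes that could start a match spanning blocks.
            tail = buf[-overlap:] if overlap else b""
    return count


def count_mmap(path, needle):
    """Count non-overlapping occurrences of needle through a read-only mmap."""
    count = 0
    with open(path, "rb") as fp:
        with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            while pos != -1:
                count += 1
                pos = mm.find(needle, pos + len(needle))
    return count


if __name__ == "__main__":
    needle = b"\x00\x00\x01\x00"
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        make_sample(path, needle)
        for func in (count_chunked, count_mmap):
            seconds = timeit.timeit(lambda: func(path, needle), number=5)
            print(func.__name__, func(path, needle), "%.4f s" % seconds)
    finally:
        os.remove(path)
```

Both functions count non-overlapping matches, so they agree as long as
occurrences of the search string do not overlap one another in the
file. A sample this small fits in memory, so it says nothing about the
paging case; a file larger than RAM would be needed for that.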

     Thanks,
     <mike
-- 
Mike Meyer <mwm at mired.org>			http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.