Scanning a file
Mike Meyer
mwm at mired.org
Sat Oct 29 01:38:21 EDT 2005
"Paul Watson" <pwatson at redlinepy.com> writes:
> Here is a better one that counts, and not just detects, the substring. This
> is -much- faster than using mmap; especially for a large file that may cause
> paging to start. Using mmap can be -very- slow.
>
> #!/usr/bin/env python
> import sys
>
> fn = 't2.dat'
> ss = '\x00\x00\x01\x00'
>
> be = len(ss) - 1 # length of overlap to check
> blocksize = 64 * 1024 # need to ensure that blocksize > overlap
>
> fp = open(fn, 'rb')
> b = fp.read(blocksize)
> count = 0
> while len(b) > be:
>     count += b.count(ss)
>     b = b[-be:] + fp.read(blocksize)
> fp.close()
>
> print count
> sys.exit(0)
>
>
Did you do timings on it vs. mmap? Having to copy the data multiple
times to deal with the overlap - thanks to strings being immutable -
would seem to be a lose, and makes me wonder how it could be faster
than mmap in general.
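
One rough way to settle it is to time both approaches on the same file.
The following is only a sketch in Python 3, not something taken from
either post: the names count_chunked, count_mmap and compare are made up
for illustration. It assumes a non-empty file, and the numbers will
depend heavily on the OS, the file size and whether the file is already
in the page cache. Both functions count non-overlapping matches. For a
pattern that can overlap itself, the block-based count may differ from
the mmap count at block boundaries.

```python
import mmap
import timeit


def count_chunked(path, needle, blocksize=64 * 1024):
    """Count matches by reading blocks and carrying a small overlap."""
    if not needle:
        raise ValueError("needle must not be empty")
    overlap = len(needle) - 1
    count = 0
    with open(path, 'rb') as fp:
        buf = fp.read(blocksize)
        while len(buf) > overlap:
            count += buf.count(needle)
            # Keep the tail so a match straddling two blocks is still seen.
            # buf[-0:] would be the whole buffer, so handle overlap == 0.
            tail = buf[-overlap:] if overlap else b''
            buf = tail + fp.read(blocksize)
    return count


def count_mmap(path, needle):
    """Count matches by mapping the file and calling find() repeatedly."""
    if not needle:
        raise ValueError("needle must not be empty")
    count = 0
    with open(path, 'rb') as fp:
        # Length 0 maps the whole file; this fails on an empty file.
        with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            while pos != -1:
                count += 1
                pos = mm.find(needle, pos + len(needle))
    return count


def compare(path, needle, repeat=3):
    """Print the best wall-clock time of each approach over a few runs."""
    for fn in (count_chunked, count_mmap):
        best = min(timeit.repeat(lambda: fn(path, needle),
                                 number=1, repeat=repeat))
        print('%-14s %.4f s' % (fn.__name__, best))
```

With the file and pattern from the quoted script, this would be run as
compare('t2.dat', b'\x00\x00\x01\x00'). Runs after the first mostly
measure cached reads, so a cold-cache comparison needs a file larger
than RAM or a cache flush between runs.
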
Thanks,
<mike
--
Mike Meyer <mwm at mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.