Scanning a file

Fri Oct 28 14:39:04 EDT 2005

Andrew McCarthy <a_mccarthy at hotmail.com> writes:

> On 2005-10-28, pinkfloydhomer at gmail.com <pinkfloydhomer at gmail.com> wrote:
>> I'm now down to:
>>
>> f = open("filename", "rb")
>> s = f.read()
>> sub = "\x00\x00\x01\x00"
>> count = s.count(sub)
>> print count
>>
>> Which is quite fast. The only problem is that the file might be huge.
>> I really have no need for reading the entire file into a string as I am
>> doing here. All I want is to count occurrences of this substring. Can I
>> somehow count occurrences in a file without reading it into a string
>> first?
>
> Yes - use memory mapping (the mmap module). An mmap object is like a
> cross between a file and a string, but the data is only read into RAM
> when, and for as long as, necessary. An mmap object doesn't have a
> count() method, but you can just use find() in a while loop instead.

Except that if you can't read the file into memory because it's too large,
there's a pretty good chance you won't be able to mmap it either, since
the mapping still has to fit in your process's address space.  To
deal with huge files, the only option is to read the file in
chunks, count the occurrences in each chunk, and then do some fiddling
to deal with the pattern landing on a chunk boundary.
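
A rough sketch of the chunked approach, to show what the boundary
fiddling might look like.  The function name count_in_chunks and the
chunk_size default are made up for illustration, and it assumes a
Python recent enough to have bytes literals.  The idea is to carry the
unmatched tail of each buffer (at most len(sub) - 1 bytes) over to the
next read, so a pattern split across two chunks is still seen whole.

```python
def count_in_chunks(path, sub, chunk_size=1 << 20):
    """Count non-overlapping occurrences of sub (bytes) in the file at path.

    Hypothetical helper: reads chunk_size bytes at a time instead of
    loading the whole file.
    """
    if not sub:
        raise ValueError("sub must not be empty")
    count = 0
    tail = b""  # bytes carried over from the previous chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf = tail + chunk
            end = 0  # index just past the last match found in buf
            pos = buf.find(sub)
            while pos != -1:
                count += 1
                end = pos + len(sub)
                pos = buf.find(sub, end)
            # Keep only bytes that could start a match straddling the
            # boundary: nothing before `end` (already counted), and at
            # most the last len(sub) - 1 bytes of the buffer.
            keep_from = max(end, len(buf) - (len(sub) - 1))
            tail = buf[keep_from:]
    return count
```

Used as count_in_chunks("filename", b"\x00\x00\x01\x00"), this should
give the same answer as s.count(sub) on the whole file, while holding
only about one chunk in memory at a time.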

      <mike
-- 
Mike Meyer <mwm at mired.org>			http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.