Scanning a file

Mon Oct 31 05:16:41 EST 2005

David Rasmussen wrote:

> Steven D'Aprano wrote:
> 
>> On Fri, 28 Oct 2005 06:22:11 -0700, pinkfloydhomer at gmail.com wrote:
>>
>>> Which is quite fast. The only problem is that the file might be huge.
>>
>>
>> What *you* call huge and what *Python* calls huge may be very different
>> indeed. What are you calling huge?
>>
> 
> I'm not saying that it is too big for Python. I am saying that it is too 
> big for the systems it is going to run on. These files can be 22 MB or 5 
> GB or ..., depending on the situation. It might not be okay to run a 
> tool that claims that much memory, even if it is available.

If your files can reach multiple gigabytes, you will 
definitely need an algorithm that avoids reading the 
entire file into memory at once.

[snip]

> print file("filename", "rb").count("\x00\x00\x01\x00")
> 
> (or something like that)
> 
> instead of the original
> 
> print file("filename", "rb").read().count("\x00\x00\x01\x00")
> 
> it would be exactly what I am after. 

I think I can say, without risk of contradiction, that 
there is no built-in method to do that.

> What is the conceptual difference?
> The first solution should be at least as fast as the second. I have to 
> read and compare the characters anyway. I just don't need to store them 
> in a string. In essence, I should be able to use the "count occurrences" 
> functionality on more things, such as a file, or even better, a file 
> read through a buffer with a size specified by me.

Of course, if you feel like coding the algorithm and 
submitting it to be included in the next release of 
Python... :-)

I can't help feeling that a generator with a buffer is 
the way to go, but I just can't *quite* deal with the 
case where the pattern overlaps the boundary... it is 
very annoying.

But not half as annoying as it must be to you :-)
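
For what it is worth, here is a minimal sketch of one way the 
boundary case might be handled: keep the last len(pattern) - 1 
unmatched bytes of each buffer and prepend them to the next one. 
It is written for a current Python 3 with byte strings rather than 
the file() idiom quoted above, and the name count_in_stream and 
the chunk_size default are made up for illustration. It is meant 
to agree with what read().count() would return (non-overlapping 
matches), not to be a tuned implementation.

```python
import io

def count_in_stream(stream, pattern, chunk_size=64 * 1024):
    """Count non-overlapping occurrences of pattern in a binary stream.

    Hypothetical helper: only about chunk_size + len(pattern) bytes
    are held in memory at any one time.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")
    keep = len(pattern) - 1   # longest prefix of a match a buffer can end with
    total = 0
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf = tail + chunk
        pos = 0
        # Scan left to right, skipping past each match like bytes.count().
        while True:
            i = buf.find(pattern, pos)
            if i < 0:
                break
            total += 1
            pos = i + len(pattern)
        # Carry over only bytes that were not part of a counted match and
        # that could still begin a match completed by the next chunk.
        tail = buf[max(pos, len(buf) - keep):]
    return total
```

Something like count_in_stream(open("filename", "rb"), 
b"\x00\x00\x01\x00") would then stand in for the read().count() 
call, and io.BytesIO makes it easy to try on small test data.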

However, there may be a simpler solution *fingers 
crossed* -- you are searching for a sub-string 
"\x00\x00\x01\x00", which as a big-endian 32-bit word 
is hex 0x100. Surely you don't want any old occurrence 
of "\x00\x00\x01\x00", but only the ones which align 
on word boundaries?

So "ABCD\x00\x00\x01\x00" would match (in hex, it is 
0x41424344 0x100), but "AB\x00\x00\x01\x00CD" should 
not, because that is 0x41420000 0x1004344 in hex.

If that is the case, your problem is simpler: you don't 
have to worry about the pattern crossing a boundary, so 
long as your buffer size is a multiple of four bytes.
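
If the word-aligned reading is what you want, a rough sketch 
might look like the following. Again this is Python 3 with byte 
strings, and count_aligned_words and words_per_chunk are 
hypothetical names. It assumes the data starts on a word boundary 
and that read() returns short results only at end of file, which 
holds for ordinary disk files but not necessarily for pipes or 
sockets.

```python
import io

def count_aligned_words(stream, word=b"\x00\x00\x01\x00", words_per_chunk=16384):
    """Count occurrences of word that start at a multiple of len(word).

    Hypothetical helper: each read is a whole number of words, so an
    aligned match can never straddle two buffers.
    """
    size = len(word)
    total = 0
    while True:
        chunk = stream.read(size * words_per_chunk)
        if not chunk:
            break
        # Step through the buffer one word at a time; a trailing partial
        # word at end of file is ignored.
        for offset in range(0, len(chunk) - size + 1, size):
            if chunk[offset:offset + size] == word:
                total += 1
    return total
```

With the two examples above, "ABCD\x00\x00\x01\x00" gives a count 
of 1 and "AB\x00\x00\x01\x00CD" gives 0.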

-- 
Steven.