Scanning a file

Sat Oct 29 18:55:32 EDT 2005

Steven D'Aprano wrote:
> On Fri, 28 Oct 2005 06:22:11 -0700, pinkfloydhomer at gmail.com wrote:
> 
>>Which is quite fast. The only problem is that the file might be huge.
> 
> What *you* call huge and what *Python* calls huge may be very different
> indeed. What are you calling huge?
> 

I'm not saying that it is too big for Python. I am saying that it is too 
big for the systems it is going to run on. These files can be 22 MB or 5 
GB or ..., depending on the situation. It might not be okay to run a 
tool that claims that much memory, even if it is available.

> 
>>I really have no need for reading the entire file into a string as I am
>>doing here. All I want is to count occurrences of this substring. Can I
>>somehow count occurrences in a file without reading it into a string
>>first?
> 
> Magic?
> 

That would be nice :)

But you misunderstand me...

> You have to read the file into memory at some stage, otherwise how can you
> see what value the bytes are? 

I haven't said that I would like to scan the file without reading it. I 
am just saying that the .count() functionality implemented for strings
could just as well be applied to some abstraction such as a stream (I
come from C++). In C++, the count() functionality would be separated as
much as possible from any concrete datatype (such as a string),
precisely because it is a concept that is applicable at a more abstract
level. I should be able to say "count the occurrences of this substring
in this stream" or "using this iterator" or something to that effect. If
I could say

print file("filename", "rb").count("\x00\x00\x01\x00")

(or something like that)

instead of the original

print file("filename", "rb").read().count("\x00\x00\x01\x00")

it would be exactly what I am after. What is the conceptual difference? 
The first solution should be at least as fast as the second. I have to 
read and compare the characters anyway. I just don't need to store them 
in a string. In essence, I should be able to use the "count occurrences"
functionality on more things, such as a file, or even better, a file 
read through a buffer with a size specified by me.
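
For illustration, the buffered counting described above could be sketched
roughly as follows. This is only a sketch under a few assumptions: it is
written for present-day Python 3 (bytes objects, open() instead of file()),
and the name count_in_stream and its chunk_size parameter are made up for
this example; nothing like it is a built-in method of file objects. It reads
the stream in fixed-size chunks and carries over the last few bytes of each
chunk so that a match straddling a chunk boundary is still found. Like
str.count(), it counts non-overlapping occurrences.

```python
def count_in_stream(stream, pattern, chunk_size=1 << 20):
    """Count non-overlapping occurrences of `pattern` (bytes) in a binary
    stream, holding roughly one chunk in memory at a time.

    Hypothetical helper for illustration; not part of the standard library.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")

    count = 0
    tail = b""  # unconsumed bytes carried over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf = tail + chunk

        # Count every match that lies completely inside the buffer.
        pos = 0
        while True:
            idx = buf.find(pattern, pos)
            if idx < 0:
                break
            count += 1
            pos = idx + len(pattern)

        # Keep at most len(pattern) - 1 trailing bytes, and never bytes
        # that already belong to a counted match, so a match that spans
        # the chunk boundary can be completed by the next read.
        keep_from = max(pos, len(buf) - (len(pattern) - 1))
        tail = buf[keep_from:]
    return count
```

With a file opened in binary mode as f, the call would then be
count_in_stream(f, b"\x00\x00\x01\x00"), optionally with a smaller or larger
chunk_size. Peak memory use is about one chunk plus a few bytes, regardless
of the size of the file.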

> 
> Here is another thought. What are you going to do with the count when you
> are done? That sounds to me like a pretty pointless result: "Hi user, the
> file XYZ has 27 occurrences of bitpattern \x00\x00\x01\x00. Would you like
> to do another file?"
> 

It might sound pointless to you, but it is not pointless for my purposes :)

If you must know, the above one-liner actually counts the number of 
frames in an MPEG2 file. I want to know this number for a number of 
files for various reasons. I don't want it to take forever.

> If you are planning to use this count to do something, perhaps there is a
> more efficient way to combine the two steps into one -- especially
> valuable if your files really are huge.
> 

Of course, but I don't need to do anything else in this case.

/David