Scanning a file
David Rasmussen
david.rasmussen at gmx.net
Sat Oct 29 18:55:32 EDT 2005
Steven D'Aprano wrote:
> On Fri, 28 Oct 2005 06:22:11 -0700, pinkfloydhomer at gmail.com wrote:
>
>>Which is quite fast. The only problem is that the file might be huge.
>
> What *you* call huge and what *Python* calls huge may be very different
> indeed. What are you calling huge?
>
I'm not saying that it is too big for Python. I am saying that it is too
big for the systems it is going to run on. These files can be 22 MB or 5
GB or ..., depending on the situation. It might not be okay to run a
tool that claims that much memory, even if it is available.
>
>>I really have no need for reading the entire file into a string as I am
>>doing here. All I want is to count occurrences of this substring. Can I
>>somehow count occurrences in a file without reading it into a string
>>first?
>
> Magic?
>
That would be nice :)
But you misunderstand me...
> You have to read the file into memory at some stage, otherwise how can you
> see what value the bytes are?
I haven't said that I would like to scan the file without reading it. I
am just saying that the .count() functionality implemented for strings
could just as well be applied to some abstraction such as a stream (I
come from C++). In C++, the count() functionality would be separated as
much as possible from any concrete datatype (such as a string),
precisely because it is a concept that is applicable at a more abstract
level. I should be able to say "count the occurrences of this substring
in this stream" or "using this iterator" or something to that effect. If
I could say

    print file("filename", "rb").count("\x00\x00\x01\x00")

(or something like that) instead of the original

    print file("filename", "rb").read().count("\x00\x00\x01\x00")

it would be exactly what I am after. What is the conceptual difference?
The first solution should be at least as fast as the second. I have to
read and compare the characters anyway. I just don't need to store them
in a string. In essence, I should be able to use the "count occurrences"
functionality on more things, such as a file, or even better, a file
read through a buffer with a size specified by me.
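
To make the idea concrete, here is a minimal sketch of what such a
buffered count could look like. It is only an illustration under some
assumptions: it is written for Python 3 (bytes literals, open() instead
of file()), and the name count_in_stream and its chunk_size parameter
are made up for this example, not an existing file method. It counts
non-overlapping matches the same way the string .count() does, while
keeping at most one chunk plus a short tail in memory.

```python
def count_in_stream(stream, pattern, chunk_size=1 << 20):
    """Count non-overlapping occurrences of `pattern` in a binary stream.

    Hypothetical helper: reads `chunk_size` bytes at a time, so memory use
    stays near chunk_size regardless of how large the file is.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")
    count = 0
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        pos = 0
        # Count every complete match in the current buffer.
        while True:
            hit = buf.find(pattern, pos)
            if hit < 0:
                break
            count += 1
            pos = hit + len(pattern)
        # Keep only the unconsumed tail that could still be the start of
        # a match continuing into the next chunk.
        keep_from = max(pos, len(buf) - (len(pattern) - 1))
        buf = buf[keep_from:]
    return count
```

With a real file this would be called along the lines of
count_in_stream(open("filename", "rb"), b"\x00\x00\x01\x00"), with the
buffer size passed as the third argument.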
>
> Here is another thought. What are you going to do with the count when you
> are done? That sounds to me like a pretty pointless result: "Hi user, the
> file XYZ has 27 occurrences of bitpattern \x00\x00\x01\x00. Would you like
> to do another file?"
>
It might sound pointless to you, but it is not pointless for my purposes :)
If you must know, the above one-liner actually counts the number of
frames in an MPEG2 file. I want to know this number for a number of
files for various reasons. I don't want it to take forever.
> If you are planning to use this count to do something, perhaps there is a
> more efficient way to combine the two steps into one -- especially
> valuable if your files really are huge.
>
Of course, but I don't need to do anything else in this case.
/David