Optimizing a text statistics function

Terry Reedy tjreedy at udel.edu
Thu Apr 22 11:16:33 EDT 2004


"Nickolay Kolev" <nmkolev at uni-bonn.de> wrote in message
news:c67kmc$utu$1 at f1node01.rhrz.uni-bonn.de...
>  From my understanding, for line in file('...'): is only useful if I
> want to make checks on the lines and then do something with them:

It is also useful for limiting the amount of data held in RAM at one
time.  The main downside would be if the units you want to analyse, in
this case words, were split across line endings.  But split() on the
whole file splits on all whitespace, including newlines, so it would
break such units at line endings anyway.
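
As a rough sketch of the line-by-line version (the file name
'corpus.txt' and the tallying code here are only illustrative, not
your actual function):

counts = {}
for line in open('corpus.txt'):
    # split() with no argument splits on any run of whitespace, so each
    # line contributes the same words it would to a whole-file split
    # (barring words that wrap across a line ending).
    for word in line.split():
        counts[word] = counts.get(word, 0) + 1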

> Both methods should produce the same amount of memory usage as all words
> are stored in a list.

It seems to me that a 300 meg file chunk + a 400 meg list/dict is
larger than a small file chunk + the same 400 meg list/dict, but maybe
I misunderstand what you meant.
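
For comparison, the read-everything version keeps the entire file text
and the full word list in memory at the same time as the counts
(again only a sketch, with made-up names):

text = open('corpus.txt').read()   # the whole file held in memory
words = text.split()               # plus a list object for every word
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1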

> Reading a file line by line should be slower, as
> Python would have to check where the newline characters are.

'for line in file' has now been optimized to run quickly.  Behind the
scenes, sensibly sized chunks (such as 64K) are read from disk into a
memory buffer.  Lines are then doled out one at a time.  This is done
with compiled C.  So I suspect the slowdown compared to read-all and
split is minimal.
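
If you want to check the relative speed on your own data, something
like the following rough benchmark would do; 'corpus.txt' stands in
for the real file, and the numbers will depend on its size and on the
disk cache:

import timeit

line_by_line = timeit.Timer(
    "for line in open('corpus.txt'): line.split()")
read_all = timeit.Timer(
    "open('corpus.txt').read().split()")

# Each statement reads and splits the file ten times.
print(line_by_line.timeit(number=10))
print(read_all.timeit(number=10))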

Terry J. Reedy

