Optimizing a text statistics function

Thu Apr 22 01:20:42 EDT 2004

Scott David Daniels wrote:

> Better not to do several huge string allocs above (I suspect).
> This method lets you work on files too large to read into memory:

... code ...

Now this is something I was thinking about when I started considering 
options for this task.

From my understanding, for line in file('...'): is only useful if I 
want to check each line and then do something with the ones that pass:

myDesiredLines = []  # collects only the lines that pass the check
for line in file('...'):
    if line.startswith('XXX'):
        myDesiredLines.append(line)

This avoids creating a huge list of all lines and then filtering it to 
another list with only the desired ones.
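
To make the contrast concrete, here is a minimal sketch of both variants. The function names, the 'XXX' prefix and the temporary sample file are made up for illustration, and it uses open() with a with-block (as in current Python) rather than file().

```python
import os
import tempfile

def filter_while_iterating(path, prefix):
    """Keep matching lines as they are read; only the matches are stored."""
    wanted = []
    with open(path) as f:
        for line in f:
            if line.startswith(prefix):
                wanted.append(line)
    return wanted

def filter_after_readlines(path, prefix):
    """Build the full list of lines first, then filter it into a second list."""
    with open(path) as f:
        all_lines = f.readlines()  # every line is held in memory at this point
    return [line for line in all_lines if line.startswith(prefix)]

# Hypothetical sample data, written to a temporary file for the demo.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("XXX first\nskip me\nXXX second\n")

streamed = filter_while_iterating(sample_path, "XXX")
listed = filter_after_readlines(sample_path, "XXX")
os.remove(sample_path)
```

Both calls give the same result; the difference is only that the second one holds every line of the file in a list before filtering.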

In my case I want to get all the lines regardless of any condition. I 
do not even need them as a list of lines; a single huge string is 
completely sufficient, since it will be split on whitespace later anyway.
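
What I have in mind is roughly the following sketch. The helper name and the sample text are invented for illustration, and io.StringIO stands in for a real file object so the snippet runs without touching the disk.

```python
import io

def words_from_whole_text(f):
    """Read everything into one string, then split it on any whitespace."""
    text = f.read()      # one big string, newlines included
    return text.split()  # split() with no argument treats newlines as whitespace

# Hypothetical sample input standing in for the real file.
sample = io.StringIO("one two\nthree\n\nfour  five\n")
words = words_from_whole_text(sample)
```

Since split() without arguments treats runs of spaces and newlines alike, the line structure never has to be looked at separately.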

Both methods should use about the same amount of memory, as all words 
end up stored in a list either way. Reading a file line by line should 
be slower, as Python would have to check where the newline characters 
are. Please comment and correct me; I am making assumptions here. 
Clarification from someone who knows how these things are implemented 
internally would be nice.
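
A rough way to test the speed assumption would be to time both variants with timeit. This is only a sketch: the sample text and function names are made up, and it reads from an in-memory io.StringIO rather than a real file, so disk I/O and buffering are not measured.

```python
import io
import timeit

# Made-up sample data: 2000 short lines, 9 words per repetition.
TEXT = "the quick brown fox\njumps over the lazy dog\n" * 1000

def words_read_all(text):
    """Read the whole input as one string and split it once."""
    return io.StringIO(text).read().split()

def words_line_by_line(text):
    """Iterate over the lines and split each one separately."""
    words = []
    for line in io.StringIO(text):
        words.extend(line.split())
    return words

# Both variants should end up with exactly the same list of words.
same_result = words_read_all(TEXT) == words_line_by_line(TEXT)

t_all = timeit.timeit(lambda: words_read_all(TEXT), number=20)
t_lines = timeit.timeit(lambda: words_line_by_line(TEXT), number=20)
print("read().split(): %.4f s, line by line: %.4f s" % (t_all, t_lines))
```

The timings will vary by machine and Python version, so this shows how to measure the difference, not what the answer is.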

Nicky