Optimizing a text statistics function
Nickolay Kolev
nmkolev at uni-bonn.de
Thu Apr 22 01:20:42 EDT 2004
Scott David Daniels wrote:
> Better not to do several huge string allocs above (I suspect).
> This method lets you work on files too large to read into memory:
... code ...
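(The quoted code is elided above; as a guess at the kind of line-at-a-time approach being described, a word-frequency counter might look like this. The name word_counts is mine, not from the quoted post.)

```python
# Sketch of a line-at-a-time word counter: only one line is held in
# memory at any moment, so arbitrarily large files are fine.
def word_counts(path):
    counts = {}
    for line in open(path):
        for word in line.split():  # split on any run of whitespace
            counts[word] = counts.get(word, 0) + 1
    return counts
```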
Now this is something I was thinking about when I started considering
options for this task.
From my understanding, for line in file('...'): is only useful if I
want to make checks on the lines and then do something with them:
for line in file('...'):
    if line.startswith('XXX'):
        myDesiredLines.append(line)
This avoids creating a huge list of all lines and then filtering it to
another list with only the desired ones.
In my case I want to get all the lines regardless of any condition. I
also do not even need them as a list of lines, a single huge string is
completely sufficient. It will be split on whitespace later anyway.
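Since str.split() with no arguments splits on any whitespace, newlines included, both routes should yield the same word list. A small sketch to illustrate (the helper names and path handling are mine):

```python
def words_whole(path):
    # One big read, then split on any whitespace (newlines included).
    return open(path).read().split()

def words_by_line(path):
    # Same result, built up line by line without the big string.
    words = []
    for line in open(path):
        words.extend(line.split())
    return words
```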
Both methods should end up using about the same amount of memory, since
all the words are stored in a list either way. Reading the file line by
line should be slower, though, as Python has to scan for the newline
characters. Please comment and correct me; I am making assumptions here.
Clarification from someone who knows how these things are implemented
internally would be nice.
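Rather than guessing, the two approaches can be timed directly. A rough micro-benchmark sketch (the helper names are mine; the file path would be supplied by the caller):

```python
import time

def bench(fn, path):
    # Return (word_count, elapsed_seconds) for one run of fn on path.
    t0 = time.time()
    n = fn(path)
    return n, time.time() - t0

def count_whole(path):
    # Whole file as one string, then split on whitespace.
    return len(open(path).read().split())

def count_lines(path):
    # Line by line; the newline scanning happens in buffered C code,
    # so the per-line overhead may be smaller than expected.
    total = 0
    for line in open(path):
        total += len(line.split())
    return total
```

Both counters should return identical totals, so any timing difference between them is down to the read strategy alone.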
Nicky