reading in lines from a file -FAST!

Wed Jul 30 14:25:25 EDT 2003

Rajarshi Guha <rajarshi at presidency.com> writes:

>  I have a file containing 168092 lines (each line a single word) and when
> I use
> 
> for line in f:
>   s = s + line
> 
> it takes for ages to read it all in - so long in fact that it makes the

Others have explained better ways to do it, but the *reason* your way
is so slow is that Python strings are immutable.

Say all your lines have length N.  Since strings are immutable, each +
operation has to build a brand-new string and copy both operands into
it -- i.e. once per line.  So, the number of characters copied for i
lines is N + 2N + 3N + 4N + ... + iN = N*i*(i+1)/2, which (to a first
approximation) is proportional to i**2.  That gets bad quickly as i
gets bigger!

Illustrative list comprehensions, taking N = 65:

>>> [sum([j*65 for j in range(1, i+1)]) for i in range(1,20)]
[65, 195, 390, 650, 975, 1365, 1820, 2340, 2925, 3575, 4290, 5070, 5915, 6825, 7800, 8840, 9945, 11115, 12350]
>>> [65*i*(i+1)/2 for i in range(1, 20)]
[65, 195, 390, 650, 975, 1365, 1820, 2340, 2925, 3575, 4290, 5070, 5915, 6825, 7800, 8840, 9945, 11115, 12350]
>>> [(65/2)*(i**2) for i in range(1, 20)]
[32, 128, 288, 512, 800, 1152, 1568, 2048, 2592, 3200, 3872, 4608, 5408, 6272, 7200, 8192, 9248, 10368, 11552]
>>> [65*i*(i+1)/2 for i in [168092]]
[918290378070L]
>>>
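
The same bookkeeping can be written as a small script.  This is only a
sketch of the cost model described above: it counts characters under
the assumption that every + builds a fresh string, and it does not time
a real interpreter.  The names chars_copied and closed_form are made up
for the illustration, and the code is written for a current Python
rather than the one in the session above.

```python
def chars_copied(line_lengths):
    """Count characters copied if every `s = s + line` builds a fresh string."""
    total = 0
    current = 0  # length of the accumulated string so far
    for n in line_lengths:
        current += n      # the new string holds the old contents plus this line
        total += current  # and every character of it has to be copied
    return total


def closed_form(n, i):
    """N + 2N + ... + iN, written as N * i * (i + 1) / 2."""
    return n * i * (i + 1) // 2


lines = 168092
# The step-by-step count and the closed form should agree.
print(chars_copied([65] * lines) == closed_form(65, lines))
print(closed_form(65, lines))
```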

So, don't do that...
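
For completeness, one common way to avoid the repeated copying is to
collect the pieces in a list and join them once at the end, so each
character is copied a roughly constant number of times.  This is a
minimal sketch, not necessarily what the other replies suggested;
read_all is a hypothetical helper name, and io.StringIO stands in for
the real file.

```python
import io


def read_all(f):
    """Collect the lines in a list and join them once at the end."""
    parts = []
    for line in f:
        parts.append(line)  # appending to a list does not copy earlier lines
    return "".join(parts)   # one pass to build the final string


# io.StringIO is a stand-in for the opened file in this illustration.
sample = io.StringIO("alpha\nbeta\ngamma\n")
text = read_all(sample)
print(len(text))
```

If the whole file is wanted as one string anyway, calling f.read() on
the open file does the same job in a single call.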

John