Memory issues when storing as List of Strings vs List of Lists

OW Ghim Siong owgs at bii.a-star.edu.sg
Tue Nov 30 05:29:35 EST 2010


Hi all,

I have a big file, 1.5 GB in size, with about 6 million lines of 
tab-delimited data. I have to perform some filtering on the data and 
keep only the good rows; after filtering, about 5.5 million lines 
remain. As you might have guessed, I have to read the file in batches, 
which I do using .readlines(100000000). After reading each batch, I 
split each line into a list using .split("\t"), check several 
conditions, and if they are all satisfied, append the list to a matrix.

The code is as follows:
-----Start------
a = open("bigfile")
matrix = []
while True:
    # readlines() with a sizehint reads roughly 100 MB of lines per batch
    lines = a.readlines(100000000)
    for line in lines:
        data = line.split("\t")
        # placeholder for the actual filtering checks
        if several_conditions_are_satisfied:
            matrix.append(data)
    print "Number of lines read:", len(lines), "matrix.__sizeof__:", matrix.__sizeof__()
    if len(lines) == 0:
        break
-----End-----

Results:
Number of lines read: 461544 matrix.__sizeof__: 1694768
Number of lines read: 449840 matrix.__sizeof__: 3435984
Number of lines read: 455690 matrix.__sizeof__: 5503904
Number of lines read: 451955 matrix.__sizeof__: 6965928
Number of lines read: 452645 matrix.__sizeof__: 8816304
Number of lines read: 448555 matrix.__sizeof__: 9918368

Traceback (most recent call last):
MemoryError

The peak memory usage shown in the task manager exceeds 2 GB, which 
results in the MemoryError.

However, if I modify the code to store a list of strings rather than 
a list of lists, by changing the append statement above to 
matrix.append("\t".join(data)), then I do not run out of memory.

Results:
Number of lines read: 461544 matrix.__sizeof__: 1694768
Number of lines read: 449840 matrix.__sizeof__: 3435984
Number of lines read: 455690 matrix.__sizeof__: 5503904
Number of lines read: 451955 matrix.__sizeof__: 6965928
Number of lines read: 452645 matrix.__sizeof__: 8816304
Number of lines read: 448555 matrix.__sizeof__: 9918368
Number of lines read: 453455 matrix.__sizeof__: 12552984
Number of lines read: 432440 matrix.__sizeof__: 14122132
Number of lines read: 432921 matrix.__sizeof__: 15887424
Number of lines read: 464259 matrix.__sizeof__: 17873376
Number of lines read: 450875 matrix.__sizeof__: 20107572
Number of lines read: 458552 matrix.__sizeof__: 20107572
Number of lines read: 453261 matrix.__sizeof__: 22621044
Number of lines read: 413456 matrix.__sizeof__: 22621044
Number of lines read: 166464 matrix.__sizeof__: 25448700
Number of lines read: 0 matrix.__sizeof__: 25448700

In this case, the peak memory according to the task manager is about 1.5 GB.

Does anyone know why there is such a big difference in memory usage 
between storing the matrix as a list of lists and storing it as a list 
of strings? According to __sizeof__, the values are the same in both 
cases. Is there any way I can store all the information as a list of 
lists? I have tried creating a matrix of equivalent size and it only 
used about 35 MB of memory, but I am not sure why, with the code above, 
the memory usage shoots up so quickly and exceeds 2 GB.
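
One detail I noticed while checking this, which may explain why the two 
__sizeof__ figures match: __sizeof__ on a list only counts the list 
object itself (its internal array of pointers), not the strings or 
inner lists it points to. A minimal sketch of what I mean, with made-up 
sample data just for illustration:

-----Start------
import sys

row = ["field_a", "field_b", "field_c"]
as_lists   = [list(row) for i in range(1000)]        # list of lists
as_strings = ["\t".join(row) for i in range(1000)]   # list of joined strings

# both outer lists hold 1000 pointers, so their shallow sizes are identical
print as_lists.__sizeof__(), as_strings.__sizeof__()

# the per-row objects differ: an inner list plus one object per field,
# versus a single string per row
list_row_bytes = sys.getsizeof(as_lists[0]) + sum(sys.getsizeof(f) for f in as_lists[0])
str_row_bytes = sys.getsizeof(as_strings[0])
print list_row_bytes, str_row_bytes
-----End-----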

Any advice is greatly appreciated.

Regards,
Jinxiang


