Memory issues when storing as List of Strings vs List of Lists

Peter Otten __peter__ at web.de
Tue Nov 30 07:22:40 EST 2010


OW Ghim Siong wrote:

> Hi all,
> 
> I have a big file, 1.5 GB in size, with about 6 million lines of
> tab-delimited data. I have to perform some filtration on the data and
> keep the good data. After filtration, about 5.5 million lines remain.
> As you might have already guessed, I have to read the file in batches,
> and I did so using .readlines(100000000). After reading each batch, I
> split each line (a string) into a list using .split("\t") and then
> check several conditions; if all of them are satisfied, I store the
> list in a matrix.
> 
> The code is as follows:
> -----Start------
> a=open("bigfile")
> matrix=[]
> while True:
>     lines = a.readlines(100000000)
>     for line in lines:
>         data=line.split("\t")
>         if several_conditions_are_satisfied:
>             matrix.append(data)
>     print "Number of lines read:", len(lines), "matrix.__sizeof__:", matrix.__sizeof__()
>     if len(lines)==0:
>         break
> -----End-----

As Ulrich says, don't use readlines(), use

for line in a:
   ... 

That way you have only one line in memory at a time instead of the huge 
lines list.
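
Putting it together, a minimal sketch of the streaming version (keeping 
your placeholder condition; the rstrip("\n") is there because split("\t") 
would otherwise leave the newline on the last field of each row):

matrix = []
with open("bigfile") as a:
    for line in a:  # the file object yields one line at a time
        data = line.rstrip("\n").split("\t")
        if several_conditions_are_satisfied:
            matrix.append(data)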

> Results:
> Number of lines read: 461544 matrix.__sizeof__: 1694768
> Number of lines read: 449840 matrix.__sizeof__: 3435984
> Number of lines read: 455690 matrix.__sizeof__: 5503904
> Number of lines read: 451955 matrix.__sizeof__: 6965928
> Number of lines read: 452645 matrix.__sizeof__: 8816304
> Number of lines read: 448555 matrix.__sizeof__: 9918368
> 
> Traceback (most recent call last):
> MemoryError
> 
> The peak memory usage shown in the task manager is > 2 GB, which
> results in the memory error.
> 
> However, if I modify the code to store the data as a list of strings
> rather than a list of lists, by changing the append statement above to
> "matrix.append("\t".join(data))", then I do not run out of memory.
> 
> Results:
> Number of lines read: 461544 matrix.__sizeof__: 1694768
> Number of lines read: 449840 matrix.__sizeof__: 3435984
> Number of lines read: 455690 matrix.__sizeof__: 5503904
> Number of lines read: 451955 matrix.__sizeof__: 6965928
> Number of lines read: 452645 matrix.__sizeof__: 8816304
> Number of lines read: 448555 matrix.__sizeof__: 9918368
> Number of lines read: 453455 matrix.__sizeof__: 12552984
> Number of lines read: 432440 matrix.__sizeof__: 14122132
> Number of lines read: 432921 matrix.__sizeof__: 15887424
> Number of lines read: 464259 matrix.__sizeof__: 17873376
> Number of lines read: 450875 matrix.__sizeof__: 20107572
> Number of lines read: 458552 matrix.__sizeof__: 20107572
> Number of lines read: 453261 matrix.__sizeof__: 22621044
> Number of lines read: 413456 matrix.__sizeof__: 22621044
> Number of lines read: 166464 matrix.__sizeof__: 25448700
> Number of lines read: 0 matrix.__sizeof__: 25448700
> 
> In this case, the peak memory according to the task manager is about
> 1.5 GB.
> 
> Does anyone know why there is such a big difference in memory usage
> between storing the matrix as a list of lists and storing it as a list
> of strings? According to __sizeof__, though, the values are the same
> whether it is stored as a list of lists or as a list of strings. Is

sizeof gives you the "shallow" size of the list, basically the memory 
needed to hold C pointers to the items in the list. A better 
approximation for the total size of a list of lists of strings is

>>> from sys import getsizeof as sizeof
>>> matrix = [["alpha", "beta"], ["gamma", "delta"]]
>>> sizeof(matrix), sum(sizeof(row) for row in matrix), sum(sizeof(entry) for row in matrix for entry in row)
(88, 176, 179)
>>> sum(_)
443

As you can see, the outer list accounts for only a small portion of the 
total memory, and its relative share will shrink as the matrix grows.
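
If you want that calculation as a reusable helper, here is a rough 
sketch (assuming a list of lists of strings, nothing more general; it 
counts shared objects once per occurrence, so it can overestimate):

from sys import getsizeof as sizeof

def total_size(matrix):
    # shallow size of the outer list (its array of row pointers)
    size = sizeof(matrix)
    for row in matrix:
        # shallow size of each inner list ...
        size += sizeof(row)
        # ... plus the string objects it points to
        size += sum(sizeof(entry) for entry in row)
    return size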

The above calculation may still be off, because some of the strings could 
be identical objects and would then be counted more than once. Collapsing 
identical strings into a single object is also a way to save memory if 
you have a significant number of repetitions. Try

matrix = []
with open(...) as f:
    for line in f:
        data = line.split("\t")
        if ...:
            # intern() returns one shared object for equal strings
            matrix.append(map(intern, data))

to see whether it sufficiently reduces the amount of memory needed.
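
To see what interning buys, note that two equal strings built 
independently are normally distinct objects, while intern() maps them 
to a single shared one (intern is a builtin in Python 2, which this 
code targets):

>>> a = "".join(["sp", "am"])
>>> b = "".join(["sp", "am"])
>>> a is b
False
>>> intern(a) is intern(b)
True

With millions of rows, a column that holds only a few distinct values 
then stores each value once, and the rows keep only pointers to it.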

> there any method by which I can store all the info in a list of lists?
> I have tried creating such a matrix of equivalent size and it uses only
> 35 MB of memory, but I am not sure why, with the code above, the memory
> usage shoots up so fast and exceeds 2 GB.
> 
> Any advice is greatly appreciated.
> 
> Regards,
> Jinxiang



