Memory issues when storing as List of Strings vs List of Lists

Ben Finney ben+python at benfinney.id.au
Tue Nov 30 15:19:32 EST 2010


OW Ghim Siong <owgs at bii.a-star.edu.sg> writes:

> I have a big file 1.5GB in size, with about 6 million lines of
> tab-delimited data. I have to perform some filtration on the data and
> keep the good data. After filtration, I have about 5.5 million data
> left remaining. As you might already guessed, I have to read them in
> batches and I did so using .readlines(100000000).

Why do you need to handle the batching in your code? Perhaps you're not
aware that a file object is already an iterator for the lines of text in
the file.
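
A quick way to see that for yourself (an illustrative session; the line
content shown is made up)::

    >>> infile = open("bigfile")
    >>> iter(infile) is infile
    True
    >>> next(infile)    # returns the next line, newline included
    'field1\tfield2\tfield3\n'
    >>> infile.close()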

> After reading each batch, I will split the line (in string format) to
> a list using .split("\t") and then check several conditions, after
> which if all conditions are satisfied, I will store the list into a
> matrix.

As I understand it, you don't need a line once you've moved on to the
next one. So there's no need to maintain a manual buffer of lines at
all; please explain if something else requires holding a huge buffer of
input lines.

> The code is as follows:
> -----Start------
> a=open("bigfile")
> matrix=[]
> while True:
>    lines = a.readlines(100000000)
>    for line in lines:
>        data=line.split("\t")
>        if several_conditions_are_satisfied:
>            matrix.append(data)
>    print "Number of lines read:", len(lines), "matrix.__sizeof__:",
> matrix.__sizeof__()
>    if len(lines)==0:
>        break
> -----End-----

Using the file's native line iterator::

    infile = open("bigfile")
    matrix = []
    for line in infile:
        record = line.split("\t")
        # ‘several_conditions_are_satisfied’ stands in for your checks.
        if several_conditions_are_satisfied:
            matrix.append(record)

> Results:
> Number of lines read: 461544 matrix.__sizeof__: 1694768
> Number of lines read: 449840 matrix.__sizeof__: 3435984
> Number of lines read: 455690 matrix.__sizeof__: 5503904
> Number of lines read: 451955 matrix.__sizeof__: 6965928
> Number of lines read: 452645 matrix.__sizeof__: 8816304
> Number of lines read: 448555 matrix.__sizeof__: 9918368
>
> Traceback (most recent call last):
> MemoryError

If you still get a MemoryError, you can use the ‘pdb’ module
<URL:http://docs.python.org/library/pdb.html> to debug it interactively.
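
For example, a minimal sketch (‘main_processing’ is a hypothetical name
for a function wrapping the loop above)::

    import pdb
    import sys

    try:
        main_processing()
    except MemoryError:
        # Drop into the debugger at the point the error was raised.
        pdb.post_mortem(sys.exc_info()[2])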

Another option is to catch the MemoryError and construct a diagnostic
message similar to the one you had above::

    import sys

    infile = open("bigfile")
    matrix = []
    for line in infile:
        record = line.split("\t")
        if several_conditions_are_satisfied:
            try:
                matrix.append(record)
            except MemoryError:
                matrix_len = len(matrix)
                sys.stderr.write(
                    "len(matrix): %(matrix_len)d\n" % vars())
                raise
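
Bear in mind that ‘matrix.__sizeof__()’ only measures the list object
itself (essentially its array of pointers), not the millions of string
objects it refers to, so it badly understates the real usage. On Unix
you can report a process-level figure instead; a sketch using the
‘resource’ module (the units of ‘ru_maxrss’ vary by platform:
kilobytes on Linux)::

    import resource
    import sys

    # Peak resident set size of the current process so far.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    sys.stderr.write("peak RSS: %(peak)d\n" % vars())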

> I have tried creating such a matrix of equivalent size and it only
> uses 35mb of memory but I am not sure why when using the code above,
> the memory usage shot up so fast and exceeded 2GB.
>
> Any advice is greatly appreciated.

With a data set this size, most of the memory goes into the roughly 5.5
million per-line lists and the tens of millions of short string objects
they contain, each of which carries its own per-object overhead. For
data like this, and the manipulation and computation you will likely
want to perform, it's probably time to consider the NumPy library
<URL:http://numpy.scipy.org/>, which has much more powerful array types
and underpins the SciPy library <URL:http://www.scipy.org/>.
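
If the fields are (or can be reduced to) numbers, a sketch along these
lines loads them straight into a compact homogeneous array. The call
below reads every column as a float, which may or may not suit your
data; ‘numpy.genfromtxt’ also accepts ‘usecols’ and ‘dtype’ arguments
to control that::

    import numpy as np

    # Each value is stored as a fixed-size machine number rather than a
    # separate Python object, so the array's memory use is predictable.
    data = np.genfromtxt("bigfile", delimiter="\t")

Your filtering conditions can then usually be expressed as boolean
indexing over whole columns, rather than a per-line Python loop.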

-- 
 \        “[It's] best to confuse only one issue at a time.” —Brian W. |
  `\  Kernighan, Dennis M. Ritchie, _The C programming language_, 1988 |
_o__)                                                                  |
Ben Finney


