Memory efficient tuple storage

Fri Mar 13 12:49:51 EDT 2009

On Fri, 2009-03-13 at 08:59 -0700, psaffrey at googlemail.com wrote:
> I'm reading in some rather large files (28 files each of 130MB). Each
> file is a genome coordinate (chromosome (string) and position (int))
> and a data point (float). I want to read these into a list of
> coordinates (each a tuple of (chromosome, position)) and a list of
> data points.
> 
> This has taught me that Python lists are not memory efficient, because
> if I use lists it gets through 100MB a second until it hits the swap
> space and I have 8GB physical memory in this machine. I can use Python
> or numpy arrays for the data points, which is much more manageable.
> However, I still need the coordinates. If I don't keep them in a list,
> where can I keep them?

If you have just one list of objects then it's actually relatively
efficient; it's when you have lots of lists that it becomes inefficient.
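
As a rough illustration of that point (a sketch only: the sizes are
CPython-specific, and n and the "chr1" name are made up for the
example), sys.getsizeof can compare one long list against many small
containers. Tuples are used here because that is what the question
stores, but the same per-object overhead applies to small lists.

```python
import sys

n = 1000  # arbitrary illustrative size

# One list holding n references: a single header plus one pointer per slot.
flat = [None] * n
list_bytes = sys.getsizeof(flat)

# n separate two-item tuples: every tuple pays for its own object header.
pairs = [("chr1", i) for i in range(n)]
# getsizeof counts only each tuple itself, not the string or int it points to.
tuple_bytes = sum(sys.getsizeof(p) for p in pairs)

print("one list with", n, "slots:", list_bytes, "bytes")
print(n, "two-item tuples:", tuple_bytes, "bytes (containers only)")
```

The exact numbers depend on the interpreter version and platform, so
treat them as indicative rather than precise.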

I can't be certain without seeing your code, and my biology isn't good
enough to answer this question myself:

How many unique chromosome strings do you have (unique by value)?

If the same chromosome string is being used multiple times then you may
find it more efficient to reference a single shared string, so you don't
hold multiple copies of the same string in memory. Each value parsed
from a file is normally a new string object, even when its contents
match one already read, and that may be what is taking up the space.

For example, something like this (written verbosely):

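# Maps each chromosome string to the first copy of it that was seen.
reference_dict = {}
list_of_coordinates = []
# my_file is assumed to yield (chromosome, posn) pairs.
for chromosome, posn in my_file:
    # Reuse the stored string object if an equal one already exists.
    chromosome = reference_dict.setdefault(chromosome, chromosome)
    list_of_coordinates.append((chromosome, posn))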

(or something like that)
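
A slightly fuller sketch of the same idea, runnable on its own. The
tab-separated layout, the sample data and the read_coordinates name are
assumptions made for illustration, since the real file format isn't
shown in the question.

```python
import io

# Hypothetical input: chromosome, position and value, tab-separated per line.
sample = io.StringIO("chr1\t100\t0.5\nchr1\t200\t0.7\nchr2\t50\t0.1\n")


def read_coordinates(lines):
    """Return (coordinates, values), sharing one string per chromosome name."""
    seen = {}  # chromosome name -> the single shared string object
    coordinates = []
    values = []
    for line in lines:
        chromosome, posn, value = line.split("\t")
        # setdefault returns the stored string if an equal key exists,
        # otherwise it stores this one and returns it.
        chromosome = seen.setdefault(chromosome, chromosome)
        coordinates.append((chromosome, int(posn)))
        values.append(float(value))
    return coordinates, values


coordinates, values = read_coordinates(sample)
# Check whether both chr1 rows refer to the same string object.
print(coordinates[0][0] is coordinates[1][0])
```

In Python 3, sys.intern offers similar sharing for strings without
managing the dictionary yourself; the explicit dictionary just makes it
obvious what is being shared.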

Tim Wintle