Memory efficient tuple storage

Kurt Smith kwmsmith at gmail.com
Fri Mar 13 12:33:31 EDT 2009


On Fri, Mar 13, 2009 at 10:59 AM, psaffrey at googlemail.com
<psaffrey at googlemail.com> wrote:
> I'm reading in some rather large files (28 files each of 130MB). Each
> file is a genome coordinate (chromosome (string) and position (int))
> and a data point (float). I want to read these into a list of
> coordinates (each a tuple of (chromosome, position)) and a list of
> data points.
>
> This has taught me that Python lists are not memory efficient, because
> if I use lists it gets through 100MB a second until it hits the swap
> space and I have 8GB physical memory in this machine. I can use Python
> or numpy arrays for the data points, which is much more manageable.
> However, I still need the coordinates. If I don't keep them in a list,
> where can I keep them?

Assuming your data is in a plaintext file something like
'genomedata.txt' below, the following will load it into a numpy array
with a customized dtype.  You can access the different fields by name
('chromo', 'position', and 'dpoint' -- change these to your liking).  I
haven't tested this against your data, but it might be worth a try.

===============================================

[186]$ cat genomedata.txt
gene1 120189 5.34849
gene2 84040 903873.1
gene3 300822 -21002.2020

[187]$ cat g2arr.py
import numpy as np

def g2arr(fname):
    # the 'S100' should be modified to be large enough for your string field.
    dt = np.dtype({'names': ['chromo', 'position', 'dpoint'],
                   'formats': ['S100', int, float]})
    return np.loadtxt(fname, delimiter=' ', dtype=dt)

if __name__ == '__main__':
    arr = g2arr('genomedata.txt')
    print(arr)
    print(arr['chromo'])
    print(arr['position'])
    print(arr['dpoint'])

=================================================

Take a look at the np.loadtxt and np.dtype documentation.
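
Since memory is the main concern, here is a quick way to check the
footprint of the resulting array -- just a rough sketch that reuses the
g2arr() function from g2arr.py above and only standard numpy attributes:

from g2arr import g2arr  # the function defined in g2arr.py above

arr = g2arr('genomedata.txt')
# bytes per record: 100 for the 'S100' chromo field plus one int and one float
print(arr.dtype.itemsize)
# total bytes held by the array's data buffer (no per-element Python objects)
print(arr.nbytes)

Shrinking 'S100' down to the longest chromosome name you actually have
cuts the per-record size accordingly, which adds up over 28 files of
130MB each.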

Kurt


