optimizing memory utilization
Skip Montanaro
skip at pobox.com
Tue Sep 14 01:08:22 EDT 2004
anon> [[<Alb1ID#>, '<Alb1Artist>', '<Alb1Title>', '<Alb1Genre>','<Alb1Year>',
anon> [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]],
anon> [<Alb2ID#>, '<Alb2Artist>', '<Alb2Title>', '<Alb2Genre>','<Alb2Year>',
anon> [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]],
anon> ...
anon> [<AlbNID#>, '<AlbNArtist>', '<AlbNTitle>', '<AlbNGenre>','<AlbNYear>',
anon> [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]]]]
anon> So the problem I'm having is that I want to load it all in memory
anon> (the two files total about 250MB of raw data) but just loading the
anon> first 50,000 lines of tracks (about 25MB of raw data) consumes
anon> 75MB of RAM. If the approximation is fairly accurate, I'd need
anon> >750MB of available RAM just to load my in-memory database.
anon> The bottom line is, is there a more memory efficient way to load
anon> all this arbitrary field length and count type data into RAM?
Sure, assuming you know what your keys are, store them in a db file. Let's
assume you want to search by artist. Do your csv thing, but store the
records in a shelve keyed by the AlbNArtist field:
import shelve
import csv

reader = csv.reader(open("file1.csv", "rb"))
db = shelve.open("file1.db")
for row in reader:
    stuff = db.get(row[1], [])
    stuff.append(row)
    db[row[1]] = stuff
db.close()
I'm not sure I've interpreted your sample csv quite right, but I think
you'll get the idea. You can of course have multiple db files, each keyed
by a different field (or part of a field).
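As a minimal sketch of that multiple-index idea (written in modern Python 3 syntax, with illustrative file names and an assumed column layout — artist in column 1, genre in column 3):

```python
import csv
import shelve

def build_index(csv_path, db_path, key_col):
    """Build a shelve file mapping the value in column key_col
    to the list of csv rows that share it."""
    db = shelve.open(db_path)
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            stuff = db.get(row[key_col], [])
            stuff.append(row)
            db[row[key_col]] = stuff
    db.close()

# One db file per search field, e.g.:
#   build_index("file1.csv", "by_artist.db", 1)
#   build_index("file1.csv", "by_genre.db", 3)
```

Each lookup then opens only the index you need, so at most one artist's (or one genre's) rows are in memory at a time.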
Obviously, using a db file will be slower than an in-memory dictionary, but
if memory is a bottleneck, this will likely help. You can also avoid
initializing the db file on subsequent program runs if the csv file is older
than the db file, probably resulting in faster startup.
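That staleness check is just a file-timestamp comparison; a small helper along these lines would do (path names are illustrative):

```python
import os

def needs_rebuild(csv_path, db_path):
    """Return True if the shelve file is missing or older than
    the csv it was built from, i.e. the index must be rebuilt."""
    if not os.path.exists(db_path):
        return True
    return os.path.getmtime(csv_path) > os.path.getmtime(db_path)
```

On most runs the csv hasn't changed, so the program skips the rebuild and opens the existing shelve immediately.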
Skip