optimizing memory utilization
Skip Montanaro
skip at pobox.com
Tue Sep 14 01:08:22 EDT 2004
anon> [[<Alb1ID#>, '<Alb1Artist>', '<Alb1Title>', '<Alb1Genre>','<Alb1Year>',
anon> [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]],
anon> [<Alb2ID#>, '<Alb2Artist>', '<Alb2Title>', '<Alb2Genre>','<Alb2Year>',
anon> [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]],
anon> ...
anon> [<AlbNID#>, '<AlbNArtist>', '<AlbNTitle>', '<AlbNGenre>','<AlbNYear>',
anon> [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]]]]
anon> So the problem I'm having is that I want to load it all in memory
anon> (the two files total about 250MB of raw data) but just loading the
anon> first 50,000 lines of tracks (about 25MB of raw data) consumes
anon> 75MB of RAM. If the approximation is fairly accurate, I'd need
anon> >750MB of available RAM just to load my in-memory database.
anon> The bottom line is, is there a more memory efficient way to load
anon> all this arbitrary field length and count type data into RAM?
Sure, assuming you know what your keys are, store them in a db file. Let's
assume you want to search by artist. Do your csv thing, but store the
records in a shelve keyed by the AlbNArtist field:
import shelve
import csv

reader = csv.reader(open("file1.csv", "rb"))
db = shelve.open("file1.db")
for row in reader:
    stuff = db.get(row[1], [])
    stuff.append(row)
    db[row[1]] = stuff
db.close()
I'm not sure I've interpreted your sample csv quite right, but I think
you'll get the idea. You can of course have multiple db files, each keyed
by a different field (or part of a field).
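As a minimal sketch of that multiple-index idea (written in modern Python 3 syntax, with illustrative file names and an assumed column layout — artist in column 1, genre in column 3):

```python
import csv
import shelve

def build_index(csv_path, db_path, key_col):
    """Build a shelve file mapping the value in column key_col
    to the list of csv rows that share it."""
    db = shelve.open(db_path)
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            stuff = db.get(row[key_col], [])
            stuff.append(row)
            db[row[key_col]] = stuff
    db.close()

# One db file per search field, e.g.:
#   build_index("file1.csv", "by_artist.db", 1)
#   build_index("file1.csv", "by_genre.db", 3)
```

Each lookup then opens only the index you need, so at most one artist's (or one genre's) rows are in memory at a time.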
Obviously, using a db file will be slower than an in-memory dictionary, but
if memory is a bottleneck, this will likely help. You can also avoid
initializing the db file on subsequent program runs if the csv file is older
than the db file, probably resulting in faster startup.
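That staleness check is just a file-timestamp comparison; a small helper along these lines would do (path names are illustrative):

```python
import os

def needs_rebuild(csv_path, db_path):
    """Return True if the shelve file is missing or older than
    the csv it was built from, i.e. the index must be rebuilt."""
    if not os.path.exists(db_path):
        return True
    return os.path.getmtime(csv_path) > os.path.getmtime(db_path)
```

On most runs the csv hasn't changed, so the program skips the rebuild and opens the existing shelve immediately.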
Skip