Memory Error while constructing Compound Dictionary

Jeremy Bowers jerf at jerf.org
Wed Sep 8 02:42:28 EDT 2004


On Tue, 07 Sep 2004 20:46:18 -0700, Benjamin Scott wrote:

> The Original Problem:
> 
> The data set:  3 Columns and at least 100,000 rows.  However, it can
> be up to 1,000,000 rows.
> 
> For the purpose of illustration let's suppose that the first column
> has the name of 1,000 "Factories", i.e. there are 1,000 unique symbols
> in the first column.  Likewise, suppose the second column contains a
> "production date" or just a date; there are 250 unique dates in the
> second column.  Finally, suppose the third column contains a
> description of a "widget type"; there are 500 unique widget
> descriptions.

Based on these numbers, aren't the majority of your lists empty? In that
case, key off a tuple, creating entries only as you need them. For some
"factory", "date", "widget" and "whatever" (the data, which you don't
specify):

Dict.setdefault((factory, date, widget), []).append(whatever)

(You may actually benefit from changing the setdefault around to some
"try"-based scheme; you'd have to benchmark the various choices if it
matters to you. Also, depending on your data, consider Psyco.)
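A minimal sketch of both schemes, using invented sample rows (the column
values and the "amount" payload are made up for illustration):

```python
# Sketch of the two accumulation schemes, assuming rows of
# (factory, date, widget, whatever) tuples; the sample data is invented.
rows = [
    ("F001", "2004-09-01", "sprocket", 12),
    ("F001", "2004-09-01", "sprocket", 7),
    ("F002", "2004-09-02", "cog", 3),
]

# Variant 1: setdefault creates the empty list on first sight of a key,
# but builds a throwaway [] on every call.
d1 = {}
for factory, date, widget, whatever in rows:
    d1.setdefault((factory, date, widget), []).append(whatever)

# Variant 2: try/except skips the throwaway list, which can win when
# most lookups find an existing key.
d2 = {}
for factory, date, widget, whatever in rows:
    key = (factory, date, widget)
    try:
        d2[key].append(whatever)
    except KeyError:
        d2[key] = [whatever]

assert d1 == d2
```

Which variant is faster depends on how often keys repeat in your data,
so benchmark on a realistic sample before committing to either.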

No extra memory. Some potentially significant slowdown due to tuple
construction and empty list construction.

(Depending on what you are doing, C++ and the STL can be not *that* much
harder to program in, and it should support hash tables with tuple keys,
although I don't know the magic incantations off the top of my head. If
you're just adding some numbers or something, C++ is also likely to be
enough faster that it may even be faster to develop in: since the program
will almost certainly execute much, much faster, the
write-compile-test-debug cycle can still beat Python's. Of course, if
this is a prelude to complicated meta-hack programming of the type Python
makes easy, never mind; I'm guessing you're just storing numbers on the
grounds that you can't store much else in the machine at once. If that's
a bad assumption, you are in some trouble: even C++ can only be
100%-minus-epsilon efficient, and it will likely hit the same memory
limit, in which case you either need a larger machine or a database.

Actually, you probably should go to a database now; these problems have a
habit of doubling or tripling in size when someone decides it'd be neat to
add Just One More attribute and you already pretty much can't afford that.)
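If it does come to a database, even an embedded one like SQLite handles
this shape of data; a sketch using Python's sqlite3 module, with
invented table and column names:

```python
import sqlite3

# Minimal sqlite3 sketch of the same data set; the table name, column
# names, and rows are all invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE production
                (factory TEXT, date TEXT, widget TEXT, amount INTEGER)""")
conn.executemany("INSERT INTO production VALUES (?, ?, ?, ?)",
                 [("F001", "2004-09-01", "sprocket", 12),
                  ("F001", "2004-09-01", "sprocket", 7),
                  ("F002", "2004-09-02", "cog", 3)])

# The (factory, date, widget) key of the dict approach becomes a
# GROUP BY, and adding Just One More attribute is just another column.
for row in conn.execute("""SELECT factory, date, widget, SUM(amount)
                           FROM production
                           GROUP BY factory, date, widget"""):
    print(row)
```

The rows then live on disk rather than in RAM, so the million-row case
stops being a memory problem at all.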


