Memory Error while constructing Compound Dictionary

Wed Sep 8 03:16:04 EDT 2004

Jeremy Bowers <jerf at jerf.org> wrote:
   ...
> Based on these numbers, aren't the majority of your lists empty? In that
> case, key off of a tuple as you need them. For some "factory", "date",
> "widget" and "whatever" (the data, which you don't specify):
> 
> Dict[(factory, date, widget)].setdefault([]).append(whatever)

Well, he does specify a '1' as the datum, which is why I suggested
keeping the matrix dense (minimal change to his program) and just using
numbers instead of lists (he only needs the length of the lists, he
might as well be keeping the counts differently).

But, yes, indexing with the tuple would probably cut down memory
consumption further, since he's got no more than a million entries; each
entry may take a few tens of bytes (if it needs to uniquely keep a tuple
and the strings in it) so his memory consumption would plummet.  Also,
he would save the long initialization phase entirely.

The parentheses are not needed, and for the "keeping counts" approach he
should be updating the counts with:

  Dict[factory, date, widget] = 1 + Dict.get((factory, date, widget), 0)

Also, Dict.get((factory, date, widget), 0) is what he would be using to
check his Dict at the end.  It's no doubt worth making this prettier...:

class SparseMatrix(dict):
    def __getitem__(self, key): return self.get(key, 0)

so the update becomes:
   Dict[factory, date, widget] += 1 
and the check at the end just Dict[factory, date, widget].

> (You may actually benefit from changing the setdefault around to some
> "try"-based scheme; you'd have to benchmark the various choices if it
> matters to you. Also, depending on your data, consider Psyco.)

If you're trying to SAVE memory, do NOT, repeat NOT, consider Psyco,
which eats it for breakfast;-).

> No extra memory. Some potentially significant slowdown due to tuple
> construction and empty list construction.

Nah, he's saving 125 M operations at initialization, and doing at most
about 1M updates, no way this will slow things down.  But then it
appears he doesn't need the lists.

> (Depending on what you are doing, C++ and the STL can be not *that* much
> harder to program in, and it should support hash tables with tuple keys,
> although I don't know the magic incantations off the top of my head. If

I doubt they'll beat Python dict's, which are a wonder of speed.

> you're just adding some numbers or something, this is also likely to be
> enough faster that it may even be faster to develop in, since the program
> will almost certainly execute much, much faster, potentially making the

I think you're wrong here.  C++ in terms of standard only has
order-based containers, whose performance just can't compare to hash
tables.  I do believe the STL, which has been evolving independently
from C++ for many years now, has added hash tables (and surely you can
find good ones on www.boost.org, you can find EVERYTHING there and if
you're programming in C++ it should be your second home).

> Actually, you probably should go to a database now; these problems have a
> habit of doubling or tripling in size when someone decides it'd be neat to
> add Just One More attribute and you already pretty much can't afford that.)

With a tuple-indexed SparseMatrix of numbers, his program will be taking
up a tiny fraction of the memory it's burning now, and he will have room
to spare for enhancements.  And adding an attribute will cost little,
say a few tens of megabytes if he has a million entries and the new
attribute is typically a string a few tens of bytes long.

 Relational databases are cool because you can make impromptu queries
(if you speak SQL) and with clever indexing the cost of reading in the
dataset is only paid once.  But for his problem as stated he may well be
happier staying in Python and just using the tricks suggested in these
posts.

Suppose he did need lists and mere counts would not do.  A fuller Null
Object DP can help.  E.g., consider:

In [5]: class Nullo:
   ...:     def __iadd__(self, wot): return wot
   ...:     def __len__(self): return 0
   ...:     def __iter__(self): return iter(())
   ...:     

In [6]: na = Nullo() 

In [7]: alist = [na] * 33

In [8]: alist[7] += [23]

In [9]: alist[7] += [45]

In [10]: alist[12] += [3]

In [12]: [(i,x) for i,x in enumerate(alist) if x]
Out[12]: [(7, [23, 45]), (12, [3])]

Some might consider this an abuse of __iadd__ (shouldn't it "always"
return self?!) but I consider it a case of "practicality beats purity".

Use one (ONE) instance of this Nullo class as the default object
returned by the above-shown class SparseMatrix, and you have a sparse
matrix of "lists" (most of them are that one Nullo instance, which
behaves, ducktyping-like, as an empty list, for the requested uses; the
non-empty ones are bona fide lists)...

Alex