Memory Error while constructing Compound Dictionary

Wed Sep 8 02:48:45 EDT 2004

Benjamin Scott <mynewjunkaccount at hotmail.com> wrote:

> Thanks for the replies.
> 
> First I will make a minor correction to the code I originally posted
> and then I will describe the original problem I am trying to solve,
> per Alex's request.
> 
> Correction:
> 
> for s in Lst:
>      for t in nuerLst:
>           for r in nuestLst:
>                Dict[s][t][r]={}
> 
> ...should actually be...
> 
> for s in Lst:
>      for t in nuerLst:
>           for r in nuestLst:
>                Dict[s][t][r]=[]
> 
> That is, the object accessed by 3 keys is a list, not a 4th
> dictionary.

OK, unfortunately that doesn't change memory requirements, as 16 bytes
is still a minimum allocation for an object.

> 
> 
> The Original Problem:
> 
> The data set:  3 Columns and at least 100,000 rows.  However, it can
> be up to 1,000,000 rows.

Aha --  a sparse 3D matrix, VERY sparse, no more than 1 million "true"
entries out of 125 million slots, and all the rest just
"placeholders"...

> 
> For the purpose of illustration let's suppose that the first column
> has the name of 1,000 "Factories", i.e. there are 1,000 unique symbols
> in the first column.  Likewise, suppose the second column contains a
> "production date" or just a date; there are 250 unique dates in the
> second column.  Finally, suppose the third column contains a
> description of a "widget type"; there are 500 unique widget
> descriptions.

Sure, quite clear.

> 
> *** i.e. each row contains the name of one factory which produced one
> widget type on a particular date.  If a factory produced more than one
> widget on a given date it is reflected in the data as a new row. ***
> 
> The motivation to construct the mentioned compound dictionary comes
> from the fact that I need quick access to the following data sets:
   ...
> len(Lst[n])=3
> 
> Lst[n][0]="Factory"
> Lst[n][1]="date"
> Lst[n][2]="WidgetType"
> 
> for s in Lst:
>      Dict[s[0]][s[1]][s[2]].append('1')
> .
> .
> .
> 
> len(Dict["Factory"]["date"]["WidgetType"]) = #Widgets of some type
> produced at a Factory on a given date.
> 
> The idea here is that I will be graphing a handful of the data sets at
> a time; they will then be discarded and a new handful will be
> graphed... etc.
> 
> What I might attempt next is to construct the required data in R (or
> NumPy) since an array object seems better suited for the task. 
> However, I'm not sure this will avert the memory error.  So, does

When you represent a sparse matrix as if it was a dense one, you hit a
typical wall of combinatorial explosion: you need memory proportional to
the product of all the dimensions, and for a matrix of high
dimensionality it's a horrid prospect.

> anyone know how to increase the RAM limit for a process?  Other

With a 32-bit CPU you're SOL, IMHO.  One option is to change machines:
Apple has just announced a very reasonably priced iMac G5, a 64-bit
machine intended for the home; or, you can do as I did, and look for a
little-used, well-reconditioned, guaranteed PowerMac G5 -- the latter
can use 8 GB of physical RAM and more importantly the address space is
only bounded by the amount of disk available, so a few hundred GBs may
be handled if you're in no hurry.  While these are wonderful machines,
however, I think you can do better.  Consider...:

> suggestions are also welcome.

The Null Object Design Pattern is more likely to be what you want ( a
fancy name for what in this case is quite simple, read on...):

Start by placing in each slot of the compound dictionary the SAME
object, which is just a placeholder.  So you'll still have 125 million
slots, but all initially will point at the same placeholder: so you're
spending only 125 million times the size of a SLOT, about 4 bytes, for a
total of 500 megabytes -- plus something because dictionaries being hash
table are always "overdimensioned" a bit, but you should fit very
comfortably in your 2GB anyway.

Now, as the data come in, you ADD 1 instead of APPENDING a string of '1'
to the appropriate slot.  THEN and only then, for those relatively very
few cells of the 3D matrix take up space for a new object,

Moreover with the operations you appear to need you don't need to make a
special null object, I think: just the integer 0 will do, and you will
not call len() at the end since the integer is already stored in the
cell.  If you wanted to store more info in each cell or somehow keep
track more directly of what cells are non-empty, etc etc, then you would
go for a more complete Null Object DP.  But for your problem as stated,
the following might suffice:

1. initialize your dictionary with:

 for s in Lst:
      for t in nuerLst:
           for r in nuestLst:
                Dict[s][t][r] = 0

2. update it on each incoming datum with:

for s in Lst:
     Dict[s[0]][s[1]][s[2]] += 1

3 consult it when done with:

Dict["Factory"]["date"]["WidgetType"] = #Widgets of some type
produced at a Factory on a given date.

Hope this helps -- if you do need a bit more, write about it here and
I'll happily show you a richer Null Object Design Pattern variant!

Alex