Hashed lookups for tabular data

Mon Jan 19 12:17:57 EST 2015

On Tue, Jan 20, 2015 at 4:09 AM, Joseph L. Casale
<jcasale at activenetwerx.com> wrote:
>> row = (foo, bar, quux) # there could be duplicate quuxes but not foos or bars
>> foo_dict = {}
>> bar_dict = {}
>> quux_dict = collections.defaultdict(set)
>>
>> foo_dict[row[0]] = row
>> bar_dict[row[1]] = row
>> quux_dict[row[2]].add(row)
>
> This is actually far simpler than I had started imagining, however the row data
> is duplicated. I am hacking away at an attempt with references to one copy of
> the row.

As long as you start with an actual tuple that you then assign in that
way, they will all be references to one copy of the row. Here's a
concrete example:

$ python3
Python 3.5.0a0 (default:1c51f1650c42+, Dec 29 2014, 02:29:06)
[GCC 4.7.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> row = ("abc","def","ghi")
>>> foo_dict = {}
>>> bar_dict = {}
>>> import collections
>>> quux_dict = collections.defaultdict(set)
>>> foo_dict[row[0]] = row
>>> bar_dict[row[1]] = row
>>> quux_dict[row[2]].add(row)
>>> id(foo_dict["abc"])
140127396717840
>>> id(bar_dict["def"])
140127396717840
>>> id(next(iter(quux_dict["ghi"])))
140127396717840

The IDs of the objects prove that they're actually all the same
object. The memory requirement for this is just what the dictionaries
themselves require; their keys and values are all shared with other
usage.

ChrisA