Hashed lookups for tabular data

Chris Angelico rosuav at gmail.com
Mon Jan 19 09:26:46 EST 2015


On Tue, Jan 20, 2015 at 1:13 AM, Joseph L. Casale
<jcasale at activenetwerx.com> wrote:
> No surprise the data originates from a database however the data is utilized in
> a recursive processing loop where the user accessing the data benefits from a
> simplified and quick means to access it. Currently its done with classes that
> resemble named tuples equating to a row, but it goes pear shaped with the outer
> container. Grouping these similar objects and providing the sql like query into
> the collection in some cases is O(n)...

So presumably your data's small enough to fit into memory, right? If
it isn't, going back to the database every time would be the best
option. But if it is, can you simply keep three dictionaries in sync?

row = (foo, bar, quux) # there could be duplicate quuxes but not foos or bars
foo_dict = {}
bar_dict = {}
quux_dict = collections.defaultdict(set)

foo_dict[row[0]] = row
bar_dict[row[1]] = row
quux_dict[row[2]].add(row)

You can now look up anything by any key. If you use the quux_dict,
you'll get back a list (zero elements if you've never seen one,
otherwise all elements ever seen, in arbitrary order). Adding, in this
way, is idempotent.

Finding an item by foo or bar is easy - use either of:
row = foo_dict["asdf"] # will raise KeyError if it isn't there
row = foo_dict.get("qwer") # will return None if it isn't there

Finding items by quux is a matter of iteration:
for row in quux_dict["zxcv"]:
    do something

It's up to you to decide whether it's an error to get back an empty set or not.

Does this give a basic idea of what you can do?

The memory requirements wouldn't be much more than your existing data.
You have three lookup dictionaries, but the actual keys and values
would be shared. The main consideration is that you'd need to update
in lots of places if you remove a row. If you're working with stable
data across one run of the program, though, that should be safe.

ChrisA



More information about the Python-list mailing list