Best way to structure data for efficient searching

John Nagle nagle at animats.com
Tue Apr 3 12:38:50 EDT 2012


On 3/28/2012 11:39 AM, Larry.Martell at gmail.com wrote:
> I have the following use case:
>
> I have a set of data that contains 3 fields, K1, K2 and a
> timestamp. There are duplicates in the data set, and they all have to
> be processed.
>
> Then I have another set of data with 4 fields: K3, K4, K5, and a
> timestamp. There are also duplicates in that data set, and they also
> all have to be processed.
>
> I need to find all the items in the second data set where K1==K3 and
> K2==K4 and the 2 timestamps are within 20 seconds of each other.
>
> I have this working, but the way I did it seems very inefficient - I
> simply put the data in 2 arrays (as tuples) and then walked through
> the entire second data set once for each item in the first data set,
> looking for matches.
>
> Is there a better, more efficient way I could have done this?

    How big are the data sets?  Millions of entries?  Billions?
Trillions?  Will all the data fit in memory, or will this need
files or a database?

    In-memory, it's not hard.  First, decide which data set is smaller.
That one gets a dictionary keyed by K1 or K3, with each entry being
a list of tuples.  Then go through the other data set linearly.
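    A minimal sketch of that dictionary approach, assuming both sets
fit in memory, the first set is the smaller one, and the timestamps
are plain numeric seconds (the function and variable names here are
invented for the example):

from collections import defaultdict

def match_by_dict(small_set, large_set, window=20):
    # small_set: iterable of (K1, K2, timestamp) tuples
    # large_set: iterable of (K3, K4, K5, timestamp) tuples
    # Index the smaller set by K1; each bucket holds (K2, timestamp, record).
    index = defaultdict(list)
    for rec in small_set:
        k1, k2, ts = rec
        index[k1].append((k2, ts, rec))

    matches = []
    for rec in large_set:          # one linear pass over the larger set
        k3, k4, _k5, ts_b = rec
        for k2, ts_a, small_rec in index.get(k3, ()):
            if k2 == k4 and abs(ts_a - ts_b) <= window:
                matches.append((small_rec, rec))
    return matches

Building the index is a single pass over the smaller set, and each
record in the larger set is only compared against the handful of
entries that share its key.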

    You can also sort one data set by K1 and the other by K3, and
match.  Then take the matches, sort them by K2 and K4, and match
again.  Finally, sort the remaining matches by timestamp and pull
the ones within the 20-second threshold.
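    A sketch of that sort-and-match idea; here the successive sorts
are folded into a single sort on the composite key (K1, K2) versus
(K3, K4), a slight variation on the steps above but the same shape
(again, the names and tuple layouts are just for illustration):

def match_by_sorting(set_a, set_b, window=20):
    # Sort both sets on their composite key, then walk them in step
    # like a merge join, comparing timestamps only within equal-key runs.
    a = sorted(set_a, key=lambda r: (r[0], r[1], r[2]))   # (K1, K2, ts)
    b = sorted(set_b, key=lambda r: (r[0], r[1], r[3]))   # (K3, K4, K5, ts)

    matches = []
    i = j = 0
    while i < len(a) and j < len(b):
        key_a = (a[i][0], a[i][1])
        key_b = (b[j][0], b[j][1])
        if key_a < key_b:
            i += 1
        elif key_a > key_b:
            j += 1
        else:
            # Find the run of records sharing this key in each set.
            i2 = i
            while i2 < len(a) and (a[i2][0], a[i2][1]) == key_a:
                i2 += 1
            j2 = j
            while j2 < len(b) and (b[j2][0], b[j2][1]) == key_a:
                j2 += 1
            # Compare timestamps within the two (usually short) runs.
            for ra in a[i:i2]:
                for rb in b[j:j2]:
                    if abs(ra[2] - rb[3]) <= window:
                        matches.append((ra, rb))
            i, j = i2, j2
    return matches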

    Or you can load all the data into a database with a query
optimizer, like MySQL, and let it figure out, based on the
index sizes, how to do the join.
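    A self-contained sketch of the database route, using sqlite3 from
the standard library in place of MySQL just so it runs without a
server (the table and column names, and the sample rows, are invented
for the example):

import sqlite3

# Tiny illustrative rows; in practice these come from the real data sets.
rows_a = [("x", "y", 100.0), ("x", "y", 500.0)]
rows_b = [("x", "y", "extra", 110.0), ("x", "z", "extra", 105.0)]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE set_a (k1 TEXT, k2 TEXT, ts REAL);
    CREATE TABLE set_b (k3 TEXT, k4 TEXT, k5 TEXT, ts REAL);
    CREATE INDEX idx_b ON set_b (k3, k4, ts);
""")
conn.executemany("INSERT INTO set_a VALUES (?, ?, ?)", rows_a)
conn.executemany("INSERT INTO set_b VALUES (?, ?, ?, ?)", rows_b)

matches = conn.execute("""
    SELECT a.k1, a.k2, a.ts AS ts_a, b.k5, b.ts AS ts_b
    FROM set_a AS a
    JOIN set_b AS b
      ON b.k3 = a.k1
     AND b.k4 = a.k2
     AND b.ts BETWEEN a.ts - 20 AND a.ts + 20
""").fetchall()
print(matches)   # -> [('x', 'y', 100.0, 'extra', 110.0)]

With an index covering (k3, k4, ts), the planner can usually turn the
timestamp condition into an index range scan rather than a nested
loop over everything.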

    All of these approaches are roughly O(N log N) (the dictionary
version is closer to O(N) on average), which beats the O(N^2)
scan you have now.

				John Nagle


