Best way to structure data for efficient searching

Larry.Martell@gmail.com larry.martell at gmail.com
Wed Mar 28 16:05:41 EDT 2012


On Mar 28, 1:52 pm, Jon Clements <jon... at googlemail.com> wrote:
> On Wednesday, 28 March 2012 19:39:54 UTC+1, Larry.... at gmail.com  wrote:
> > I have the following use case:
>
> > I have a set of data that is contains 3 fields, K1, K2 and a
> > timestamp. There are duplicates in the data set, and they all have to
> > processed.
>
> > Then I have another set of data with 4 fields: K3, K4, K5, and a
> > timestamp. There are also duplicates in that data set, and they also
> > all have to be processed.
>
> > I need to find all the items in the second data set where K1==K3 and
> > K2==K4 and the 2 timestamps are within 20 seconds of each other.
>
> > I have this working, but the way I did it seems very inefficient - I
> > simply put the data in 2 arrays (as tuples) and then walked through
> > the entire second data set once for each item in the first data set,
> > looking for matches.
>
> > Is there a better, more efficient way I could have done this?
>
> It might not be more *efficient* but others might find it more readable, and it'd be easier to change later. Try an in-memory SQL DB (such as sqlite3) and query as (untested)
>
> select t2.* from t1 join t2 on k1=k3 and k2=k4 where abs(t1.timestamp - t2.timestamp) < 20

This is part of django app, and the data came from mysql. Through a
mechanism I can't change at this time (it would take a lot of code
changes and this new functionality is needed ASAP) I get all the data
at once and have to winnow it down.


> Failing that, two (default)dicts with a tuple as the pair, then use that as your base.

Since there are duplicates, I can't use a dict. And if I have any
extraneous data in the keys (i.e. something to make them unique) then
I still have to walk through the entire dict to find the matches.



More information about the Python-list mailing list