efficient way to process data

Larry Martell larry.martell at gmail.com
Sun Jan 12 22:25:57 EST 2014


On Sun, Jan 12, 2014 at 5:43 PM, Dennis Lee Bieber
<wlfraed at ix.netcom.com> wrote:
> On Sun, 12 Jan 2014 14:23:17 -0500, Larry Martell <larry.martell at gmail.com>
> declaimed the following:
>
>>I have an python app that queries a MySQL DB. The query has this form:
>>
>>SELECT a, b, c, d, AVG(e), STD(e), CONCAT(x, ',', y) as f
>>FROM t
>>GROUP BY a, b, c, d, f
>>
>>x and y are numbers (378.18, 2213.797 or 378.218, 2213.949 or
>>10053.490, 2542.094).
>>
>
>         Decimal (Numeric) or floating/real. If the latter, the internal storage
> may not be exact (378.1811111111 and 378.179999999 may both "display" as
> 378.18, but will not match for grouping).

In the database they are decimal. They are being converted to char by
the CONCAT(x, ',', y).
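(For what it's worth, the concern about floats is easy to demonstrate
if the columns ever were REAL/FLOAT -- two values that display
identically at two decimal places are still unequal for grouping
purposes. A minimal illustration, using the sample values Dennis gave:)

```python
# Two floats that both display as 378.18 at 2 decimal places are
# nevertheless distinct values, so a GROUP BY on the raw column
# would put them in different groups.
a, b = 378.1811111111, 378.179999999
print(f"{a:.2f}")  # 378.18
print(f"{b:.2f}")  # 378.18
print(a == b)      # False
```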

>>The business issue is that if either x or y in 2 rows that are in the
>>same a, b, c, d group are within 1 of each other then they should be
>>grouped together. And to make it more complicated, the tolerance is
>>applied as a rolling continuum. For example, if the x and y in a set
>>of grouped rows are:
>>
>         As I understand group by, it will first group by "a", WITHIN the "a"
> groups it will then group by "b"... Probably not a matter germane to the
> problem as you are concerning yourself with the STRING representation of
> "x" and "y" with a comma delimiter -- which is only looked at if the
> "a,b,c,d" are equal... Thing is, a string comparison is going to operate
> strictly left to right -- it won't even see your "y" value unless all the
> "x" value is equal.

Yes, that is correct. The original requirement was to group by (x, y),
so the CONCAT(x, ',', y) was correct and working. Then the requirement
was changed to apply the tolerance as I described.

>
>         You may need to operate using subselects... So that you can specify
> something like
>
>         where   abs(s1.x -s2.x) < tolerance or abs(s1.y-s2.y) < tolerance
>                 and (s1.a = s2.a ... s1.d = s2.d)
>
> s1/s2 are the subselects (you may need a primary key <> primary key to
> avoid having it output a record where the two subselects are for the SAME
> record -- or maybe not, since you /do/ want that record also output). Going
> to be a costly query since you are basically doing
>
>         foreach r1 in s1
>                 foreach r2 in s2
>                         emit r2 when...
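That foreach sketch, translated into Python, would look something like
the following -- just the O(n^2) pairwise pass, not the full grouping,
and the row layout and names (`rows`, `tol`) are my assumptions, not
anything from the actual app:

```python
# O(n^2) pairwise pass: emit every pair of rows that share the same
# (a, b, c, d) key and whose x or y values are within the tolerance.
def close_pairs(rows, tol=1.0):
    pairs = []
    for i, r1 in enumerate(rows):
        for r2 in rows[i + 1:]:
            same_key = ((r1['a'], r1['b'], r1['c'], r1['d']) ==
                        (r2['a'], r2['b'], r2['c'], r2['d']))
            if same_key and (abs(r1['x'] - r2['x']) <= tol or
                             abs(r1['y'] - r2['y']) <= tol):
                pairs.append((r1, r2))
    return pairs
```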

Speed is an issue here, and while the current query performs well, in
my experience subqueries and self joins do not. I'm going to try to
do it all in Python and see how it performs. The other option is to
pre-process the data on the way into the database. Doing that would
eliminate some of the data partitioning, as all of the data that could
be joined will be in the same input file. I'm just not sure if it will
be OK to actually munge the data. I'll find that out tomorrow.
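For the Python attempt, one way to get the rolling-continuum behaviour
is a union-find over each (a, b, c, d) bucket: any two rows within
tolerance get linked, and linkage is transitive, so a chain of rows
each within 1 of the next ends up in a single group. A sketch -- the
row layout, the names, and the tolerance of 1 are all assumptions from
the thread, and MySQL's STD() is a population standard deviation, hence
pstdev:

```python
from collections import defaultdict
from statistics import mean, pstdev

def group_with_tolerance(rows, tol=1.0):
    """Group rows sharing (a, b, c, d) whose x or y values chain
    together within tol, then aggregate e per group the way the
    original query's AVG(e)/STD(e) did."""
    buckets = defaultdict(list)
    for r in rows:
        buckets[(r['a'], r['b'], r['c'], r['d'])].append(r)

    results = []
    for key, members in buckets.items():
        parent = list(range(len(members)))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path compression
                i = parent[i]
            return i

        # Link every pair within tolerance; transitivity gives the
        # "rolling continuum" -- A~B and B~C puts A, B, C together.
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                if (abs(members[i]['x'] - members[j]['x']) <= tol or
                        abs(members[i]['y'] - members[j]['y']) <= tol):
                    parent[find(i)] = find(j)

        clusters = defaultdict(list)
        for i, r in enumerate(members):
            clusters[find(i)].append(r)

        for rs in clusters.values():
            es = [r['e'] for r in rs]
            results.append((key, mean(es), pstdev(es), len(rs)))
    return results
```

The pairwise linking is still O(n^2) per bucket, but it runs only
within each (a, b, c, d) bucket rather than as a self join over the
whole table, which may be enough for the data sizes involved.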


