efficient way to process data

Sun Jan 12 18:27:02 EST 2014

On Mon, Jan 13, 2014 at 6:23 AM, Larry Martell <larry.martell at gmail.com> wrote:
> I have an python app that queries a MySQL DB. The query has this form:
>
> SELECT a, b, c, d, AVG(e), STD(e), CONCAT(x, ',', y) as f
> FROM t
> GROUP BY a, b, c, d, f
>
> x and y are numbers (378.18, 2213.797 or 378.218, 2213.949 or
> 10053.490, 2542.094).
>
> The business issue is that if either x or y in 2 rows that are in the
> same a, b, c, d group are within 1 of each other then they should be
> grouped together. And to make it more complicated, the tolerance is
> applied as a rolling continuum. For example, if the x and y in a set
> of grouped rows are:
>
> row 1: 1.5, 9.5
> row 2: 2.4, 20.8
> row 3: 3.3, 40.6
> row 4: 4.2, 2.5
> row 5: 5.1, 10.1
> row 6: 6.0, 7.9
> row 7: 8.0, 21.0
> row 8: 100, 200
>
> 1 through 6 get combined because all their X values are within the
> tolerance of some other X in the set that's been combined. 7's Y value
> is within the tolerance of 2's Y, so that should be combined as well.
> 8 is not combined because neither the X or Y value is within the
> tolerance of any X or Y in the set that was combined.

Trying to get my head around this a bit more. Are columns a/b/c/d
treated as a big category (eg type, brand, category, model), such that
nothing will ever be grouped that has any difference in those four
columns? If so, we can effectively ignore them and pretend we have a
table with exactly one set (eg stick a WHERE clause onto the query
that stipulates their values). Then what you have is this:

* Aggregate based on proximity of x and y
* Emit results derived from e

Is that correct?

So here's my way of writing it.

* Subselect: List all values for x, in order, and figure out which
ones are less than the previous value plus one
* Subselect: Ditto, for y.
* Outer select: Somehow do an either-or group. I'm not quite sure how
to do that part, actually!

A PGSQL window function would cover the two subselects - at least, I'm
fairly sure it would. I can't quite get the whole thing, though; I can
get a true/false flag that says whether it's near to the previous one
(that's easy), and creating a grouping column value should be possible
from that but I'm not sure how.

But an either-or grouping is a bit trickier. The best I can think of
is to collect all the y values for each group of x values, and then if
any two groups 'overlap' (ie have points within 1.0 of each other),
merge the groups. That's going to be seriously tricky to do in SQL, I
think, so you may have to go back to Python on that one.

My analysis suggests that, whatever happens, you're going to need
every single y value somewhere. So it's probably not worth trying to
do any grouping/aggregation in SQL, since you need to further analyze
all the individual data points. I can't think of any way better than
just leafing through the whole table (either in Python or in a stored
procedure - if you can run your script on the same computer that's
running the database, I'd do that, otherwise consider a stored
procedure to reduce network transfers) and building up mappings.

Of course, "I can't think of a way" does not equate to "There is no
way". There may be some magic trick that I didn't think of, or some
arcane incantation that gets what you want. Who knows? If you can
produce an ASCII art Mandelbrot set [1] in pure SQL, why not this!

ChrisA

[1] http://wiki.postgresql.org/wiki/Mandelbrot_set