efficient way to process data

Sun Jan 12 22:35:47 EST 2014

On Sun, Jan 12, 2014 at 6:27 PM, Chris Angelico <rosuav at gmail.com> wrote:
> On Mon, Jan 13, 2014 at 6:23 AM, Larry Martell <larry.martell at gmail.com> wrote:
>> I have an python app that queries a MySQL DB. The query has this form:
>>
>> SELECT a, b, c, d, AVG(e), STD(e), CONCAT(x, ',', y) as f
>> FROM t
>> GROUP BY a, b, c, d, f
>>
>> x and y are numbers (378.18, 2213.797 or 378.218, 2213.949 or
>> 10053.490, 2542.094).
>>
>> The business issue is that if either x or y in 2 rows that are in the
>> same a, b, c, d group are within 1 of each other then they should be
>> grouped together. And to make it more complicated, the tolerance is
>> applied as a rolling continuum. For example, if the x and y in a set
>> of grouped rows are:
>>
>> row 1: 1.5, 9.5
>> row 2: 2.4, 20.8
>> row 3: 3.3, 40.6
>> row 4: 4.2, 2.5
>> row 5: 5.1, 10.1
>> row 6: 6.0, 7.9
>> row 7: 8.0, 21.0
>> row 8: 100, 200
>>
>> 1 through 6 get combined because all their X values are within the
>> tolerance of some other X in the set that's been combined. 7's Y value
>> is within the tolerance of 2's Y, so that should be combined as well.
>> 8 is not combined because neither the X or Y value is within the
>> tolerance of any X or Y in the set that was combined.
>
> Trying to get my head around this a bit more. Are columns a/b/c/d
> treated as a big category (eg type, brand, category, model), such that
> nothing will ever be grouped that has any difference in those four
> columns? If so, we can effectively ignore them and pretend we have a
> table with exactly one set (eg stick a WHERE clause onto the query
> that stipulates their values). Then what you have is this:
>
> * Aggregate based on proximity of x and y
> * Emit results derived from e
>
> Is that correct?

There will be multiple groups of a/b/c/d. I simplified the query for
the purposes of posting my question. There is a where clause with
values that come from user input. None, any, or all of a, b, c, or d
could be in the where clause.

> So here's my way of writing it.
>
> * Subselect: List all values for x, in order, and figure out which
> ones are less than the previous value plus one
> * Subselect: Ditto, for y.
> * Outer select: Somehow do an either-or group. I'm not quite sure how
> to do that part, actually!
>
> A PGSQL window function would cover the two subselects - at least, I'm
> fairly sure it would. I can't quite get the whole thing, though; I can
> get a true/false flag that says whether it's near to the previous one
> (that's easy), and creating a grouping column value should be possible
> from that but I'm not sure how.
>
> But an either-or grouping is a bit trickier. The best I can think of
> is to collect all the y values for each group of x values, and then if
> any two groups 'overlap' (ie have points within 1.0 of each other),
> merge the groups. That's going to be seriously tricky to do in SQL, I
> think, so you may have to go back to Python on that one.
>
> My analysis suggests that, whatever happens, you're going to need
> every single y value somewhere. So it's probably not worth trying to
> do any grouping/aggregation in SQL, since you need to further analyze
> all the individual data points. I can't think of any way better than
> just leafing through the whole table (either in Python or in a stored
> procedure - if you can run your script on the same computer that's
> running the database, I'd do that, otherwise consider a stored
> procedure to reduce network transfers) and building up mappings.
>
> Of course, "I can't think of a way" does not equate to "There is no
> way". There may be some magic trick that I didn't think of, or some
> arcane incantation that gets what you want. Who knows? If you can
> produce an ASCII art Mandelbrot set [1] in pure SQL, why not this!
>
> ChrisA
>
> [1] http://wiki.postgresql.org/wiki/Mandelbrot_set

Thanks for the reply. I'm going to take a stab at removing the group
by and doing it all in python. It doesn't look too hard, but I don't
know how it will perform.