Finding subsets for a robust regression

Mon Sep 29 16:04:48 EDT 2008

tkp... at hotmail.com:

> My code follows, and it seems a bit clumsy - is there a cleaner way to do it?

The code doesn't look bad. I can suggest few things:
- When you have "paragraphs" of code that do something definite then
the comment before them can be written without indentation, to denote
it regards the whole little block. Or you can even call a subfunction
(a function inside another function/method)
- Finding good names for variables is important.

So this code:

>     d = {}                  #identify unique instances of x and y
>     for xx,yy in zip(x,y):
>         if xx in d:
>             d[xx].append(yy)
>         else:
>             d[xx] = [yy]

May become (untested):

>     # identify unique instances of x and y
>     d = {}
>     for xx, yy in zip(x, y):
>         if xx in d:
>             d[xx].append(yy)
>         else:
>             d[xx] = [yy]

Or better:

>     # identify unique instances of seqx and seqx
>     d = defaultdict(list)
>     for x, y in izip(seqx, seqx):
>         d[x].append(y)

Or even something like:

>     def unique_x(seqx, seqx):
>         """identify unique instances of seqx and seqx"""
>         result = defaultdict(list)
>         for x, y in izip(seqx, seqx):
>             result[x].append(y)
>         return result
>     ...
>     d = unique_x(x, y)

The following line is too much heavy, it looks like a misuse of the
inline if syntax:

>             y1.append( (yy[l//2-1] + yy[l//2])/2.0 if l % 2 == 0 else
> yy[l//2] )

It was not added to python for that long lines. This looks better to
me:

        l = len(yy)
        if l == 1:
            y_median = yy[0])
        else:
            yy.sort()
            if l & 1:
                y_median = yy[l // 2]
            else:
                y_median = yy[l // 2 - 1] + yy[l // 2]) / 2.0
        y1.append(y_median)

But even better is to use a stronger library code, for example a
median() able to run in O(n), and a more strong unique(), like:
http://code.activestate.com/recipes/438599/
http://code.activestate.com/recipes/466330/
But using a stronger unique() this time isn't useful, so I like this
compromise:

    # identify unique instances of seqx and seqx
    uniques_x = defaultdict(list)
    for x, y in izip(seqx, seqx):
        uniques_x[x].append(y)

    result_x = []
    result_y = []
    for x, ys in uniques_x.iteritems():
        result_x.append(x)
        result_y.append(median(ys))

If you want to use the last bit of semantics of CPython you can even
write that code like this:

    # identify unique instances of seqx and seqx
    uniques_x = defaultdict(list)
    for x, y in izip(seqx, seqx):
        uniques_x[x].append(y)

    result_x = uniques_x.keys()
    result_y = map(median, uniques_x.itervalues())

I think it works because keys and values are given in the same order,
but in real code I tend to avoid using such subtle things. Because if
you translate that code to another language, or you use another Python
implementation it may not work anymore, and lot of code sooner or
later becomes translated...

Bye,
bearophile