Newbie: finding the key/index of the min/max element

Tim Peters tim.one at comcast.net
Sun May 5 05:22:04 EDT 2002


[David Eppstein]
> ...
> I'm pretty sure it would usually be better just to sort the list and
> return the middle item, instead of using a linear-time median
> finding algorithm, despite the nonlinear big-O time of a sort.

Many years ago I had to write a linear-time selector in Pascal.  It was so
horridly painful I never tried it again.  But I'm old enough now to confront
my fears <wink>, so I tried it in Python, and was pleasantly surprised at
how easily it went.  Its performance isn't bad, considering it's written in
Python, and how much effort went into speeding Python's C sort.  Here are
the seconds each took for finding the median of various sizes of arrays of
random doubles (these are the means of 3 runs at each size):

       n  clever    sort
 -------  ------  ------
       2   0.000   0.000
       3   0.000   0.000
       5   0.000   0.000
       9   0.000   0.000
      17   0.000   0.000
      33   0.000   0.000
      65   0.001   0.000
     129   0.002   0.000
     257   0.002   0.000
     513   0.004   0.001
    1025   0.008   0.002
    2049   0.018   0.004
    4097   0.037   0.009
    8193   0.088   0.020
   16385   0.166   0.047
   32769   0.303   0.106
   65537   0.638   0.236
  131073   1.276   0.528
  262145   2.579   1.152
  524289   5.245   2.523
 1048577  10.674   5.460
 2097153  22.063  11.863

I got bored then.  This is using -O with current CVS Python, which is
somewhat zippier than currently released Pythons.  For contrast, the Python
version doesn't take much longer to find the median than it takes just to
fill the array with random doubles (random.random() is coded in Python too).
There are obvious but unpleasant ways to speed the Python version (it does a
lot of data copying), which I'll skip.

exorcising-old-demons-ly y'rs  - tim


# Find the rank'th-smallest value in a, in worst-case quadratic time.
def short_find(a, rank):
    a.sort()
    return a[rank - 1]

# Find the rank'th-smallest value in a, in worst-case linear time.
def find(a, rank):
    n = len(a)
    assert 1 <= rank <= n
    if n <= 7:
        return short_find(a, rank)

    # Find median of median-of-7's.
    medians = [short_find(a[i : i+7], 4) for i in xrange(0, n-6, 7)]
    median = find(medians, (len(medians) + 1) // 2)

    # Partition around the median.
    # a[:i]   <= median
    # a[j+1:] >= median
    i, j = 0, n-1
    while i <= j:
        while a[i] < median:
            i += 1
        while a[j] > median:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1

    if rank <= i:
        return find(a[:i], rank)
    else:
        return find(a[i:], rank - i)

def tryone(n, tries):
    from random import random
    from time import clock as now

    x = [None] * n
    median_rank = (n+1) // 2
    sum1 = sum2 = 0.0
    for i in range(tries):
        for i in xrange(n):
            x[i] = random()
        y = x[:]

        start = now()
        got1 = find(x, median_rank)
        elapsed = now() - start
        sum1 += elapsed

        start = now()
        got2 = short_find(y, median_rank)
        elapsed = now() - start
        sum2 += elapsed

        if got1 != got2:
            print "OUCH got1 %r got2 %r" % (got1, got2)

    return sum1, sum2

def drive(tries):
    for i in range(23):
        n = 2**i + 1
        fast, slow = tryone(n, tries)
        print "%8d %7.3f %7.3f" % (n, fast/tries, slow/tries)

if __name__ == "__main__":
    drive(3)


PS:  If you're more interested in expected time than worst-case time,
replacing the median-of-median-of-7s business with

        median = short_find([a[0], a[-1], a[n//2]], 2)

yields a Python median-finder that's usually significantly faster than the
sort method starting in the range of 512K to 1M elements.

PPS:  If you do care about worst-case time, the Python version can be made
significantly faster by boosting the median-of-7 gimmick to median-of-k for
larger odd values of k.  Fun for the whole family:  plot speedups against k
until you're all baffled <wink>.






More information about the Python-list mailing list