Comparing lists

Steven D'Aprano steve at REMOVETHIScyber.com.au
Sat Oct 15 08:58:03 EDT 2005


On Sat, 15 Oct 2005 06:31:53 +0200, Christian Stapfer wrote:

> "jon" <jon_usenet at capisco.com> wrote in message 
> news:1129216693.445956.155490 at g47g2000cwa.googlegroups.com...
>>
>> To take the heat out of the discussion:
>>
>> sets are blazingly fast.
> 
> I'd prefer a (however) rough characterization
> of computational complexity in terms of Big-Oh
> (or Big-whatever) *anytime* to marketing-type
> characterizations like this one...

Oh how naive.

The marketing department says: "It's O(N), so it is blindingly fast."

Translation: the amount of computation it does is linearly proportional
to N. The constant of proportionality is 1e10.

The marketing department says: "Our competitor's product is O(N**2), so it
runs like a three-legged dog."

Translation: the amount of computation it does is proportional to the
square of N. The constant of proportionality is 1e-100.

You do the maths.
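
For instance, a quick back-of-the-envelope check (my sketch, using the
made-up constants above) shows where the "slow" O(N**2) product would
actually overtake the "fast" O(N) one:

# Crossover point: 1e-100 * N**2 > 1e10 * N  <=>  N > 1e10 / 1e-100
crossover = 1e10 / 1e-100
print "quadratic overtakes linear at N = %e" % crossover   # about 1e110

That is more items than you will ever have, so for any realistic input
the "worse" algorithm wins.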

Big O notation is close to useless for judging how fast an algorithm
will actually run, or which of two algorithms will be faster at any
particular input size. All it tells you is how a single algorithm will
scale as the input grows.

It is very common for sensible programmers to fall back on a "less
efficient" O(N**2) or even O(2**N) algorithm for small amounts of data, if
that algorithm runs faster than the "more efficient" O(N) or O(log N)
algorithm. In fact, that's exactly what the sort() method does in Python:
for small enough lists (under 64 elements) it is quicker to run an
O(N**2) binary insertion sort than it is to perform the more complex
set-up for the merge-sort variant (timsort) used for larger lists.
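
To make the idea concrete, here is a minimal sketch of that kind of
hybrid strategy -- not Python's actual sort, just an illustration of
falling back to a quadratic algorithm below some cutoff:

CUTOFF = 64   # arbitrary small-list threshold for this sketch

def insertion_sort(items):
    # O(N**2), but with very little overhead -- wins on tiny lists.
    for i in range(1, len(items)):
        value = items[i]
        j = i - 1
        while j >= 0 and items[j] > value:
            items[j+1] = items[j]
            j -= 1
        items[j+1] = value
    return items

def hybrid_sort(items):
    # O(N log N) merge sort that hands small pieces to insertion_sort.
    if len(items) <= CUTOFF:
        return insertion_sort(list(items))
    mid = len(items) // 2
    left = hybrid_sort(items[:mid])
    right = hybrid_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]

On short lists the cheap quadratic branch should come out ahead of
paying for the merge machinery, which is the whole point.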

As for sets, they are based on dicts, which are effectively hash tables.
Hash table lookups and insertions are O(1) on average, degenerating
towards O(N) per operation when there are many collisions. So a very
rough and ready estimate for the set-based comparison of two N-element
lists would be O(N) in the expected case, with a worst case closer to
O(N**2) if the hashing behaves badly -- the details depend on Python's
implementation.
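
A quick way to see that O(1)-per-operation behaviour (this little check
is my own and the names are made up; it just compares membership tests
on a plain list against a Set of the same size):

from sets import Set    # same module the experiment below uses
import time

def membership_timing(N, probes=1000):
    # Time `probes` membership tests against a list and against a Set.
    data = range(N)
    as_list = list(data)
    as_set = Set(data)
    targets = [i * N // probes for i in range(probes)]
    t = time.time()
    for x in targets:
        x in as_list        # linear scan each time: O(N)
    list_time = time.time() - t
    t = time.time()
    for x in targets:
        x in as_set         # hash lookup: O(1) on average
    set_time = time.time() - t
    return list_time, set_time

As N grows you should expect list_time to climb roughly linearly while
set_time stays more or less flat.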

So, let's do an experiment, shall we?

from sets import Set
import time

def compare_and_separate_with_sets(A, B):
    # Return (items only in A, items only in B, items in both),
    # built from set differences rather than nested loops.
    AB = Set(A+B)
    A = Set(A)
    B = Set(B)
    only_A = list(A-B)
    only_B = list(B-A)
    both_AB = list(AB - Set(only_A+only_B))
    return (only_A, only_B, both_AB)

def timeit(f, args, n):
    """Time function f when called with *args. For timing purposes, 
    does n calls of f. Returns the average time used per call in seconds.
    """
    loopit = range(n)
    t = time.time()
    for i in loopit:
        results = f(*args)
    t = time.time() - t
    return t/n

def single_test(N):
    print ("N = %-8d" % N), 
    A = range(N)
    B = range(N/2, 3*N/2)
    return timeit(compare_and_separate_with_sets, (A, B), 20)

def test_Order():
    # test how compare_and_separate_with_sets scales with size
    for i in range(7):
        print single_test(10**i)


Now run the test code:

py> test_Order()
N = 1        0.000219106674194
N = 10       0.000135183334351
N = 100      0.000481128692627
N = 1000     0.0173740386963
N = 10000    0.103679180145
N = 100000   0.655336141586
N = 1000000  8.12827801704

In my humble opinion, that's not bad behaviour. It looks roughly linear,
O(N), to me, and quite fast too: about 8 seconds to compare and separate
two lists of one million items each.
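
One rough way to check that reading of the numbers is to look at how the
time grows each time N grows tenfold (using the timings printed above):
for an O(N) algorithm the ratios should hover around 10 once N is big
enough to swamp the fixed overheads, while for O(log N) they would stay
close to 1.

timings = [0.000219106674194, 0.000135183334351, 0.000481128692627,
           0.0173740386963, 0.103679180145, 0.655336141586,
           8.12827801704]

for i in range(1, len(timings)):
    # growth factor for each tenfold increase in N
    print "10**%d -> 10**%d: ratio %.1f" % (i-1, i, timings[i]/timings[i-1])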

The craziest thing is, the amount of time it took to write and test two
different algorithms was probably 1% of the time it would take to hunt up
theoretical discussions of what the big O behaviour of the algorithms
would be.
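
For comparison, the obvious list-only way of doing the same job (my own
sketch here, not necessarily the other algorithm referred to above) is
the quadratic one, where every membership test scans the whole list:

def compare_and_separate_with_lists(A, B):
    # O(N**2): each `in` test is a linear scan of the other list.
    only_A = [x for x in A if x not in B]
    only_B = [x for x in B if x not in A]
    both_AB = [x for x in A if x in B]
    return (only_A, only_B, both_AB)

Feed that to the same timeit() harness and you should see the running
time blow up long before N reaches a million.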



-- 
Steven.



