Comparing lists

Christian Stapfer nil at dev.nul
Sat Oct 15 12:17:36 EDT 2005


"Steven D'Aprano" <steve at REMOVETHIScyber.com.au> wrote in message 
news:pan.2005.10.15.12.58.03.207379 at REMOVETHIScyber.com.au...
> On Sat, 15 Oct 2005 06:31:53 +0200, Christian Stapfer wrote:
>
>> "jon" <jon_usenet at capisco.com> wrote in message
>> news:1129216693.445956.155490 at g47g2000cwa.googlegroups.com...
>>>
>>> To take the heat out of the discussion:
>>>
>>> sets are blazingly fast.
>>
>> I'd prefer a (however) rough characterization
>> of computational complexity in terms of Big-Oh
>> (or Big-whatever) *anytime* to marketing-type
>> characterizations like this one...
>
> Oh how naive.

Why is it that even computer science undergrads
are required to learn the basics of Big-Oh and
all that? Are computer scientists really morons,
as Xah Lee suggests? I can't believe it, but
maybe you have a point...

> The marketing department says: "It's O(N), so it is blindingly fast."

I might as well interpret "blindingly fast"
as meaning O(1) - why not? Surely marketing
might just as well have reasoned like this:
"It's O(1), so it's blindingly fast." But I
*want*, nay, I *must* know whether it is
O(N) or O(1). So forget about marketing-speak,
it's deadly poison for developers. It might
be OK to induce grandiose feelings in childish
users - but developers are grown-ups: they
must face reality...

> Translation: the amount of computation it does is linearly proportional
> to N. The constant of proportionality is 1e10.
>
> The marketing department says: "Our competitor's product is O(N**2), so it
> runs like a three-legged dog."
>
> Translation: the amount of computation it does is linearly proportional to
> N squared. The constant of proportionality is 1e-100.
>
> You do the maths.
>
> Big O notation is practically useless for judging how fast a single
> algorithm will be, or how one algorithm compares to another.

That's why Knuth liked it so much?
That's why Aho, Hopcroft and Ullman liked it so much?
That's why Gonnet and Baeza-Yates liked it so much?

> It is only useful for telling you how a single algorithm
> will scale as the input increases.

And that's really very useful information indeed.
Given such information for the basic data types
and operations, as implemented by the language
and its standard libraries, I stand a real chance
of determining, with some confidence and accuracy,
the computational complexity of the *particular*
*combination* of data types and algorithms in my
own small utility, or in a critical piece of a
large application on which the future of my
company depends.
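
To give just a toy example of the kind of
reasoning I mean (my own illustration, not
anything taken from the documentation):

# Suppose the documentation told me that "x in some_list"
# costs O(len(some_list)) per test, while "x in some_set"
# is O(1) expected.
def common_elements(A, B):
    return [x for x in A if x in B]
# Then I could conclude, without running a single benchmark,
# that the above is O(|A|*|B|) when B is a list, but only
# O(|A|) expected when B has first been converted to a set.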

> It is very common for sensible programmers to fall back on a "less
> efficient" O(N**2) or even O(2**N) algorithm for small amounts of data, if
> that algorithm runs faster than the "more efficient" O(N) or O(log N)
> algorithm. In fact, that's exactly what the sort() method does in Python:
> for small enough lists, say, under 100 elements, it is quicker to run an
> O(N**2) algorithm (shell sort I believe) than it is to perform the
> complex set up for the merge-sort variant used for larger lists.
>
> As for sets, they are based on dicts, which are effectively hash tables.
> Hash tables are O(1), unless there are collisions,

That depends on the "load factor" of the hash
tables. So we would want to ask, for very large
lists indeed, how much space has to be invested
to keep the load factor low enough that the
membership test really is O(1). Do A-B and A&B
have to walk the entire hash table (which must be
larger than the sets themselves, because of a
load factor < 1)? Also: converting the lists to
sets requires inserting N elements into those
hash tables. That alone already makes the overall
algorithm *at* *least* O(N). So forget about
O(log N).

> in which case the more
> common algorithms degenerate to O(N).

So here, indeed, we have the kind of reasoning
that one ought to be able to deliver based on
what's in the Python documentation. Luckily, you
happen to have both kinds of knowledge: how sets
are implemented, and what Big-Oh attaches to the
hash-table operation of "look up".
  In order to *enable* SUCH reasoning for
*everyone*, starting from the module interface
documentation alone, one clearly needs something
along the lines of what I was suggesting...
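
Just to isolate the cost of that conversion step,
something like the following would do (a rough
sketch only; the function name is mine, and the
absolute numbers will of course depend on the
machine):

from sets import Set
import time

def time_conversion(N, repeats=10):
    # Time nothing but the list-to-set conversion: each of
    # the N elements has to be hashed and inserted into the
    # hash table, so this step alone cannot be better than O(N).
    A = range(N)
    t = time.time()
    for i in range(repeats):
        s = Set(A)
    return (time.time() - t) / repeats

for i in range(3, 7):
    print "N = %-8d  Set(A): %.6f s" % (10**i, time_conversion(10**i))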

> So, a very rough and ready estimate
> of the complexity of the algorithm using sets would be somewhere between
> O(1) and O(N) depending on the details of Python's algorithms.
>
> So, let's do an experiment, shall we?
>
> from sets import Set
> import time
>
> def compare_and_separate_with_sets(A, B):
>    AB = Set(A+B)
>    A = Set(A)
>    B = Set(B)
>    only_A = list(A-B)
>    only_B = list(B-A)
>    both_AB = list(AB - Set(only_A+only_B))
>    return (only_A, only_B, both_AB)
>
> def timeit(f, args, n):
>    """Time function f when called with *args. For timing purposes,
>    does n calls of f. Returns the average time used per call in seconds.
>    """
>    loopit = range(n)
>    t = time.time()
>    for i in loopit:
>        results = f(*args)
>    t = time.time() - t
>    return t/n
>
> def single_test(N):
>    print ("N = %-8d" % N),
>    A = range(N)
>    B = range(N/2, 3*N/2)
>    return timeit(compare_and_separate_with_sets, (A, B), 20)
>
> def test_Order():
>    # test how compare_and_separate_with_sets scales with size
>    for i in range(7):
>        print single_test(10**i)
>
>
> Now run the test code:
>
> py> test_Order()
> N = 1        0.000219106674194
> N = 10       0.000135183334351

Curious: N=10 takes less time than N=1?

> N = 100      0.000481128692627

Why do we have such a comparatively large jump
here, from N=100 to N=1000? Did the hash tables
have to be resized (rehashing all their entries)
during conversion, or something like that?

> N = 1000     0.0173740386963
> N = 10000    0.103679180145
> N = 100000   0.655336141586
> N = 1000000  8.12827801704

Doesn't look quite O(N). Not yet...
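
Taking your numbers at face value, a quick look
at the ratios between successive timings (leaving
out the two smallest N, where fixed overhead
surely dominates):

# Timings quoted above, for N = 100, 1000, ..., 1000000:
timings = [0.000481128692627, 0.0173740386963, 0.103679180145,
           0.655336141586, 8.12827801704]
for small, large in zip(timings, timings[1:]):
    print "10x the data --> %.1fx the time" % (large / small)

For clean O(N) behaviour one would expect a factor
of roughly 10 at each step; what we get instead is
rather erratic (about 36, 6, 6.3 and 12.4), with
the last step clearly above 10.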

> In my humble opinion, that's not bad behaviour.
> It looks O(log N) to me,

How could that be? *Every* element of A and B must
be touched, if only to be copied: that can't make
it O(log N). Also, the conversion of lists to sets
must be at least O(N). And N isn't the right
measure anyway: it would probably have to be in
terms of |A| and |B|. For example, if |A| is very
small compared to |B|, then A-B and A&B can be
determined rather quickly by considering only the
elements of A.
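
Roughly what I have in mind is this (a sketch
only; the names are mine, and B is assumed to
have been turned into a set already):

def compare_small_A(A, B_set):
    # Assuming B_set is already a set (i.e. a hash table),
    # A-B and A&B can be computed by looking only at the
    # elements of A: expected O(|A|) membership tests, no
    # matter how large B_set is.
    only_A  = [x for x in A if x not in B_set]
    both_AB = [x for x in A if x in B_set]
    return only_A, both_AB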

> and quite fast too: about 8 seconds to compare and separate two lists of
> one million items each.
>
> The craziest thing is, the amount of time it took to write and test two
> different algorithms was probably 1% of the time it would take to hunt up
> theoretical discussions of what the big O behaviour of the algorithms
> would be.

You must distinguish questions of principle
from questions of muddling through, like this
bit of testing you've done. It would take me some
time even to be *sure* how to interpret the
result. I would never want to say "it looks
O(log N) to me", as you do, and leave it at
that. Rather, I might say, as you do, "it
looks O(log N) to me", *but* then try to figure
out, given my knowledge of the implementation
(performance-wise - information that is sadly
missing from the Python documentation), *why*
that might be. Then, if my experiments say
"it looks like O(log N)" AND my basic
knowledge of the implementation of the set and
list primitives says "it should be O(log N)"
as well, I would venture, with some *confidence*,
to claim: "it actually IS O(log N)"....

 You do not compare the convert-it-to-sets
approach to the single list-walk either.
Supposing the OP's lists had actually been
sorted to begin with, a single, simultaneous
walk over both lists would be about as fast as
it can get - very, very likely *faster* than
converting to sets would be...
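
Something along these lines, say (a rough sketch
only; it assumes both lists are already sorted
and free of duplicates):

def compare_sorted(A, B):
    # One simultaneous pass over two sorted lists: every
    # element of A and of B is looked at exactly once, so
    # this is O(|A| + |B|) - with no hashing and no extra
    # hash-table storage at all.
    only_A, only_B, both_AB = [], [], []
    i, j = 0, 0
    while i < len(A) and j < len(B):
        if A[i] == B[j]:
            both_AB.append(A[i])
            i += 1
            j += 1
        elif A[i] < B[j]:
            only_A.append(A[i])
            i += 1
        else:
            only_B.append(B[j])
            j += 1
    only_A.extend(A[i:])   # whatever remains in A ...
    only_B.extend(B[j:])   # ... or in B belongs to one side only
    return only_A, only_B, both_AB

It would be interesting to time this against the
set-based version on input that is already sorted.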

Regards,
Christian




