Comparing lists

Tue Oct 11 01:05:53 EDT 2005

"Steve Holden" <steve at holdenweb.com> wrote in message 
news:mailman.1843.1128959845.509.python-list at python.org...
> Christian Stapfer wrote:
>> "George Sakkis" <gsakkis at rutgers.edu> wrote in message 
>> news:1128952417.0544cc73190bbbd4e683448c137969c8 at teranews...
>>
>>>"Christian Stapfer" <nil at dev.nul> wrote:
>>>><ajikoe at gmail.com> wrote:
>>>>
>>>>>try to use set.
>>>>
>>>>    Sorting the two lists and then extracting
>>>>A-B, B-A, A|B, A & B and A ^ B in one single
>>>>pass seems to me very likely to be much faster
>>>>for large lists.
>>>
>>>Why don't you implement it, test it and time it
>>>to be more convincing about your intuition ?
>>
>> The problem is in the generation of the test data.
>> Even merely generating a set of (suitably "average",
>> "random", and suitably "worst case") datasets might
>> turn out to be a major undertaking.
>>  If the documentation stated the order-of-magnitude
>> behavior of those basic operations up front, then
>> I (and *anyone* else who ever wanted to use those
>> operations on large lists / large sets) could do
>> a quick order-of-magnitude estimation of how
>> a certain program design will behave, performance
>> wise.
>>   *Experimenting* is not necessarily as easy to
>> do as you seem to believe. How do you, for example,
>> hit upon the worst-case behavior with your test
>> data? - Without knowing *anything* about the
>> implementation it might a matter sheer luck.
>> If you *know* something about the implementation
>> then, of course, you might be able to figure it
>> out. (But note that if you know *that* much about
>> the implementation, you usually have an order-of-
>> magnitude estimate anyway and don't need to do
>> *any* experimenting in order to answer my question.)
>>
> You are, of course, either assuming that there's a
> single implementation of Python,

Of course not!

> or that all implementations have the same behaviour.

Of course not!

But it is reasonable, anyway, to ask for information
about a specific implementation (that one is *forced*
to use). Performance *will* depend on it, whether we
like it or not, whether that information is given by
the implementer or not.
  And if the implementer wants to complain that
giving such information would break his wonderful
abstraction then I can only answer: It is *reality*
that *will* break your abstraction as regards
performance! It is, therefore, absolutely *no*
use to *pretend* you *can* avoid this form of "breaking
the abstraction" by simply avoiding to mention it
in the documentation...

> Or  alternatively you are asking all implementers
> to do what you seem to consider so difficult
> (i.e. identify worst-case scenarios and then estimate order-of-magnitude 
> behaviour for them).

I consider it the job of the implementer to know
about the trade-offs that he has been making in
choosing one particular implementation, and to
know what computational complexity therefore
attaches to the various operations exposed in
its interface. Why should *every* user of his
module be forced to either read the source
code of his implementation or try to figure
it out experiment-wise on his own?

> Test results with known test data are relatively
> easy to extrapolate from, and if your test data
> are reasonably representative of live data then so will your performance 
> estimates.

How reasonable is it to ask me, or anyone else
for that matter, to extract, experiment-wise
(if it can be done at all with reasonable effort)
information that properly belongs to the implementer
and really should have been exposed in the
documentation in the first place?

> Anyway, aren't you more interested in average
> behaviour than worst-case? Most people are.

It depends. *I* am NOT the OP of this thread.
*The*OP* had asked for an "efficient" way
to do what he needed to do. There are many
measures of efficiency. Even "programmer efficiency"
may be one of them. But I assumed that, since
the OP took the time to ask his question in this
NG, that it really was about "computational
efficiency". Then he was offered solutions -
but they were offered just matter-of-factly.
*Surely* those who had posted those solutions
would be able to answer my question *why*
it was, that they *assumed* conversion into
sets to be more efficient than operating
on the lists in the first place. I was *not*
at all criticizing anyone (except, perhaps,
the Python documentation for its lack of
*insightful* information): I was just asking
for a *small* clarification that I *assumed*
(mistakenly it now appears) others could
*easily* provide.

Regards,
Christian
--
"Those who cannot remember the past are
 condemned to repeat it."
 - George Santayana