Ordering python sets

Lie Ryan lie.1296 at gmail.com
Tue Nov 4 15:40:52 EST 2008


On Sun, 02 Nov 2008 02:08:37 +0000, Steven D'Aprano wrote:

> On Sat, 01 Nov 2008 18:58:59 +0000, Tim Rowe wrote:
> 
>> 2008/10/27  <bearophileHUGS at lycos.com>:
>>> Lie Ryan:
>>>
>>>>Oh no, the two dict implementation would work _exactly_ the same from
>>>>the outside, they are transparently interchangeable. Only the
>>>>performance characteristic differs because of the different
>>>>implementation.<
>>>
>>> I don't agree with the general idea. If the operations done by your
>>> data structure have different computational complexity, then they are
>>> fit for different usages. When you program you must know what
>>> computational complexity has each of the operations of your data
>>> structyre, otherwise there's no way to know the complexity of your
>>> whole program, so instead of programming you are just become a mage
>>> that tries magical spells and hopes for the better.
>> 
>> No, when you program you know how you will be using the data structure,
>> so you can choose the implementation that's right for the application.
> 
> I don't understand why you have prefixed that sentence with "No". It
> seems like you are agreeing with bearophile's point.

On a linguistic note: as a non-native speaker of English, I've never 
relied on the correct usage of Yes and No, and instead rely on the text 
that follows them. In some languages, situations where English speakers 
would use Yes are answered with No, and vice versa.

>> That's what the container libraries for other languages do. At the
>> moment, you just have one implementation, and have to hope it's right
>> for the job.
> 
> To the extent that this is true, it is a sign of the poverty of the
> standard Python library. But in fact it isn't true. Python has quite a
> decent collection of collection types. Just off the top of my head:
> 
> tuple, namedtuple, list, set, frozenset, dict, defaultdict, queue, deque
> 
> set, list and dict are highly optimized for general use, and are very
> suitable for building new data structures. And you are wrong to say that
> we have to "hope it's right for the job", we can build perfectly
> acceptable pure-Python data structures from these building blocks, or
> add our own C extensions. You're never forced to use a standard library
> data structure, it's a trade-off of your time and effort against runtime
> efficiency. And because the standard library types are generally so
> excellent, that trade-off is often (but not always) a no-brainer.

There are cases where the algorithm you use triggers the worst-case 
behaviour of the standard library's data structure.

>> Adding an *optional* parameter that says, in effect, "I want this list
>> optimised for writes and reads at both ends" or "I want this list
>> optimised for fully random reads but writes only at the end" doesn't
>> *lose* you any information about what you're programming with.
> 
> It might not lose information, but it does obfuscate it.
> 
> Right now, we can do this:
<snip>

I've rebutted this same argument multiple times. How often do you think 
you'd need to check an object's implementation? Compare that to how 
often we need to ensure that the object we're handling is a list 
replacement.

> That's assuming the implementation is available at runtime at all; if
> it's not, then you have lost information.

The implementation property is always available, because it's a non-
optional argument when registering the data structure.
 
>> Of course it's essential that the data structure has identical
>> /functional/ behaviour whatever optimisation you use.
> 
> "Essential"? That's an interesting point. Let's look at an actual
> example.
> 
> Consider one of Lie's initial examples (paraphrased):
> 
> D1 = dict({1: 'a', 2: 'b'})  # standard hash-table
> D2 = dict({1: 'a', 2: 'b'}, implementation="binary tree")
> 
> You say it's "essential" that the binary tree implementation has
> identical functional behaviour to standard hash-table implementation of
> dict. Okay, let's see where that leads us. The functional behaviour of
> hash-tables is that they are O(1) for reads, while binary trees are
> O(log(N)). The only way for them to have identical functional behaviour
> is to *artificially* slow down dicts to the lowest common speed. That is
> a Bad Thing, and I don't understand why you think it is essential.

I can't see why you consider a function's performance characteristic to 
be its behavior: a "characteristic" of something is its property or 
character, not its behavior.

> we no longer have to 
> artificially change the behaviour of the type -- but we do have to 
> artificially cripple the API of the type!

I don't get why you think we've got to restrict it to the lowest common 
denominator. We expect people who touch a field called "implementation" 
to know enough about the limitations of the implementation they chose. 
Regular Python programmers today already have to be aware that the 
limitation of the (hashed) dict is that its keys must be immutable. A 
true dict/mapping type doesn't have that limitation, so we already have 
the so-called "artificial" limitations right now. The way it is now 
gives the impression that the hashed dict's limitations are the 
dict/mapping type's limitations. Instead, it should be: dict is a 
mapping type of any object to any object; however, the default hashed 
dict's limitation is that keys must be immutable.
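That immutable-key limitation of the default hashed dict is easy to 
demonstrate with built-in types (a minimal sketch):

```python
# The built-in hashed dict requires hashable (in practice, immutable) keys.
d = {}
d[(1, 2)] = "tuple keys are fine"      # immutable key: accepted

try:
    d[[1, 2]] = "list keys are not"    # mutable key: rejected
except TypeError as e:
    print(e)                           # unhashable type: 'list'
```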

Aside: Also, with my way, we could classify a "feature" into one of 
three statuses: limitation, exact replacement, or extension. A 
limitation is a minor point that the implementation simply couldn't 
fulfill (failing to fulfill a major point would fail registration); 
exact replacement covers the regular things; and an extension is a 
feature that, if used, might make changing implementations harder, so 
it should only be used when that implementation's extension is the 
natural way to write an algorithm.

> But maybe you'll argue that Lie's examples are bad examples. 

Yes, I agree it is a bad example for this issue. When I wrote it, what 
I had in mind was to demonstrate how it would look in a simple way. Too 
simple, it seems.

> Perhaps
> what you have in mind is something like the difference between "dicts
> implemented as hash tables with chaining" and "dicts implemented as hash
> tables with open addressing". (Aside: what sort of open addressing?
> Linear? Quadratic? Something else?) At least now we're talking about
> different implementations with the same API and more-or-less the same
> functional behaviour.

Not quite. I'm looking at more radical differences. This is more 
apparent with an array-list versus a bintree-list: one is better for 
straight traversal and random access, while the other is better for 
searching and inserting.
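The standard library's own deque already illustrates this kind of split: 
two sequence-like structures supporting the same conceptual operations 
at very different costs. A rough timing sketch (absolute numbers will 
vary by machine; only the ratio matters):

```python
# Inserting at the front is O(n) for a built-in list (every element
# shifts) but O(1) for a collections.deque.
from collections import deque
from timeit import timeit

big_list = list(range(10_000))
big_deque = deque(range(10_000))

t_list = timeit('s.insert(0, 0)', globals={'s': big_list}, number=10_000)
t_deque = timeit('s.appendleft(0)', globals={'s': big_deque}, number=10_000)

print(f'list.insert(0, x)   : {t_list:.4f}s')   # O(n) per insert
print(f'deque.appendleft(x) : {t_deque:.4f}s')  # O(1) per insert
```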

> I'd suggest that 99% of the time, client code simply shouldn't care
> about the differences. The existing implementations are Good Enough.
> Have you *ever* found yourself saying "I can't use the built-in dict,
> because it uses chaining and for my application I need linear open
> addressing"? (Or vice versa.)

Yes, that's why we have a reasonable default.

> I doubt you've ever said that. I doubt you know anyone who has said it.
> I doubt you know *of* anyone who has said it. And even if you have, then
> you are free to go and write your own dict type as a C extension type
> and call it dict_with_linear_openaddressing and write this:
> 
> D = dict_with_linear_openaddressing(data)
> 
> instead of:
> 
> D = dict(data, implementation = "linear open addressing")
> 
> Yet again Lie's proposal gains you nothing. In fact, as an application
> writer, you're better off writing your own type rather than waiting for
> it to be added to the built-in implementation of dict.

There are times when it'd be too much work to implement it, test it, 
and integrate it. There are times when we know that an alternative 
implementation could make our program faster, but we simply lack the 
time to implement and test it. I was talking about rapid application 
development, or its wilder cousin, the quick one-time disposable 
script. (I'm not implying that only rapid programming benefits from 
this; it is just one example.)

Plus, currently it is impossible, without extensive reading of the 
whole docs, to find other data structures that implement the same 
abstract type as list/dict/anything. Their documentation is scattered 
here and there. It would be helpful if help(list) could list what 
implementations are available (no pun intended).

<snip>
> Here's an even better way:
> 
> class StandardFoo(data):
>     def __init__(self, data):
>          self.data = foo(data)
>     def __len__(self):
>          return len(self.data)
> 
> class MagicFoo(data):
>     def __init__(self, data):
>          self.data = bar(data)
>     def __len__(self):
>          return len(self.data[0]) + len(self.data[1:])
> 
> class GreenFoo(data):
>     def __init__(self, data):
>          self.data = baz(data)
>     def __len__(self):
>          return self._len
> 
> 
> and let the client code call whichever implementation they want.

You missed the last part: integrating the implementation into existing 
code. That is what makes the difference.

And I wasn't talking about any of those you mentioned.
It's more like this:

class StandardFoo:
    def __init__(self, data):
        self.data = foo(data)
    def __len__(self):
        return len(self.data)

class MagicFoo:
    def __init__(self, data):
        self.data = bar(data)
    def __len__(self):
        return len(self.data[0]) + len(self.data[1:])

class GreenFoo:
    def __init__(self, data):
        self.data = baz(data)
    def __len__(self):
        return self._len

class Foo(AbstractType): pass
Foo.register({'standard': StandardFoo, 'green': GreenFoo,
              'magic': MagicFoo})
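As a sketch of how such a register call could actually work, here is a 
hypothetical AbstractType built on the abc module. The register_impl 
name, the name-to-class dict argument, and the implementation= keyword 
are all my assumptions, not an existing Python API:

```python
from abc import ABC

class AbstractType(ABC):
    """Front-end type that dispatches creation to a named implementation."""
    _impls = {}

    @classmethod
    def register_impl(cls, impls):
        # Map 'implementation name' -> concrete class, and also register
        # each class as a virtual subclass so isinstance(x, Foo) still works.
        for name, impl_cls in impls.items():
            cls._impls[name] = impl_cls
            cls.register(impl_cls)

    def __new__(cls, data, implementation='standard'):
        return cls._impls[implementation](data)

class StandardFoo:
    def __init__(self, data):
        self.data = list(data)
    def __len__(self):
        return len(self.data)

class MagicFoo:
    def __init__(self, data):
        self.data = list(data)
    def __len__(self):
        return len(self.data)

class Foo(AbstractType):
    _impls = {}   # per-abstract-type registry

Foo.register_impl({'standard': StandardFoo, 'magic': MagicFoo})

f = Foo([1, 2, 3])                           # StandardFoo under the hood
g = Foo([1, 2, 3], implementation='magic')   # MagicFoo under the hood
print(type(f).__name__, type(g).__name__, isinstance(f, Foo))
```

Because the concrete classes are registered as virtual subclasses, 
client code keeps using the abstract name in isinstance checks no 
matter which implementation was picked at creation time.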

> [snip]
>> They're not different data structures from the client point of view.
> 
> But of course they are. They have different functional behaviour, and
> different APIs. 

> Unless we artificially cripple the implementations 

Then please state the implementation's limitations in the docs. People 
who know enough to choose a non-standard implementation should be 
knowledgeable enough about the limitations and extensions of the 
implementation they chose.

> and
> make them all identical (which defeats the purpose of having different
> implementations!) they are different data structures, and those
> differences will be visible to the client code.

Yes, they're different data structures, but the same abstract type.

> This scheme is a case of over-generalisation. It's also impractical: who
> is going to spend the time and effort to do it? It's hard enough to get
> somebody to write a binary tree implementation for the standard library,
> without expecting them to *also* write a wrapper for it to make it look
> like a dict. 

It's hard to write because everyone knows no one is going to use it, 
since their implementation would be buried under sixty feet of soil and 
sand somewhere in the dark corners of the docs. And when you say "it's 
hard to get somebody to write", do you realize how many people have, in 
desperation, written their own implementation of something because they 
couldn't find it in the standard library? And I think the 'register' 
way would make it unnecessary to write wrappers: as long as you've 
provided a compatible data structure, the generic register function 
would handle the rest.

> It's working against the grain of Python, which is well
> suited for the one-type = one-API model instead of the one-type = many-
> APIs model which Lie is suggesting. And for what benefit? The only
> benefit suggested so far is that we can write:
> 
> dict({1: 'a', 2: 'b'}, implementation="binary tree")
> 
> instead of
> 
> binarytree({1: 'a', 2: 'b'})
> 
> If you call that a benefit. I don't.

A real benefit is apparent if we have this kind of snippet in our code 
(again, in my habit of oversimplifying: f is handling too many things 
for its own good, isinstance is usually evil, and there is a strong 
smell of bad design):

def f(listofiterable, somelist):
    for lst in listofiterable:
        if isinstance(lst, list):
            lst.extend(somelist)
        elif isinstance(lst, (set, dict)):
            lst.update(somelist)
        else:
            for n in somelist:
                lst += n

If you have that kind of code, with many copies of it scattered around, 
then when you want to change the data structure you've got to hunt all 
those isinstance calls down and change them to the new data structure. 
The other solution, shadowing 'list' (i.e. the previous name), doesn't 
work if we also want to use the original data structure in other, 
unrelated places. With implementation, only the object creation would 
need to be searched, and isinstance(something, list) would accept the 
object as an array-list replacement.
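For comparison, collections.abc already gives part of this today: if 
client code tests against an abstract base class instead of the 
concrete list type, a drop-in replacement passes the same checks. 
BintreeList below is a hypothetical name, backed by a plain list purely 
for illustration; a real one would use a balanced tree:

```python
from collections.abc import MutableSequence

class BintreeList(MutableSequence):
    """Toy list replacement; the plain-list backing store is just a stand-in."""
    def __init__(self, data=()):
        self._d = list(data)
    def __getitem__(self, i):
        return self._d[i]
    def __setitem__(self, i, v):
        self._d[i] = v
    def __delitem__(self, i):
        del self._d[i]
    def __len__(self):
        return len(self._d)
    def insert(self, i, v):
        self._d.insert(i, v)

bl = BintreeList([1, 2])
print(isinstance(bl, MutableSequence))  # True -- abstract check passes
print(isinstance(bl, list))             # False -- concrete check still fails
bl.extend([3, 4])                       # extend() comes free from the ABC mixin
print(list(bl))                         # [1, 2, 3, 4]
```

Implementing the five abstract methods is enough: the ABC mixins supply 
extend, append, iteration, and the rest, so the replacement needs no 
hand-written wrapper.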



