Ordering python sets

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Sat Nov 1 22:08:37 EDT 2008


On Sat, 01 Nov 2008 18:58:59 +0000, Tim Rowe wrote:

> 2008/10/27  <bearophileHUGS at lycos.com>:
>> Lie Ryan:
>>
>>>Oh no, the two dict implementations would work _exactly_ the same from
>>>the outside, they are transparently interchangeable. Only the
>>>performance characteristic differs because of the different
>>>implementation.<
>>
>> I don't agree with the general idea. If the operations done by your
>> data structure have different computational complexity, then they are
>> fit for different usages. When you program you must know the
>> computational complexity of each operation of your data
>> structure, otherwise there's no way to know the complexity of your
>> whole program, so instead of programming you just become a mage
>> who tries magical spells and hopes for the best.
> 
> No, when you program you know how you will be using the data structure,
> so you can choose the implementation that's right for the application.

I don't understand why you have prefixed that sentence with "No". It 
seems like you are agreeing with bearophile's point.



> That's what the container libraries for other languages do. At the
> moment, you just have one implementation, and have to hope it's right
> for the job. 

To the extent that this is true, it is a sign of the poverty of the 
standard Python library. But in fact it isn't true. Python has quite a 
decent collection of collection types. Just off the top of my head:

tuple, namedtuple, list, set, frozenset, dict, defaultdict, queue, deque
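
For instance, deque and defaultdict alone cover two common needs. A
quick sketch (the big-O claims come from the documentation, not
measured here):

from collections import deque, defaultdict

d = deque([1, 2, 3])
d.appendleft(0)   # O(1) at the left end; a list would shift every element
d.append(4)       # O(1) at the right end, just like list.append

counts = defaultdict(int)
for word in "the quick brown the".split():
    counts[word] += 1   # missing keys spring into existence as int(), i.e. 0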

set, list and dict are highly optimized for general use, and are very 
suitable for building new data structures. And you are wrong to say that 
we have to "hope it's right for the job": we can build perfectly 
acceptable pure-Python data structures from these building blocks, or add 
our own C extensions. You're never forced to use a standard library data 
structure; it's a trade-off of your time and effort against runtime 
efficiency. And because the standard library types are generally so 
excellent, that trade-off is often (but not always) a no-brainer.
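
To illustrate, here is a perfectly serviceable stack built on nothing
but a plain list -- a sketch, not a polished library class:

class Stack(object):
    """LIFO stack on top of a list: push and pop at the right end
    are amortised O(1)."""
    def __init__(self):
        self._items = []
    def push(self, item):
        self._items.append(item)
    def pop(self):
        return self._items.pop()
    def __len__(self):
        return len(self._items)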



> Adding an *optional* parameter that says, in effect, "I
> want this list optimised for writes and reads at both ends" or "I want
> this list optimised for fully random reads but writes only at the end"
> doesn't *lose* you any information about what you're programming with.

It might not lose information, but it does obfuscate it.

Right now, we can do this:

if isinstance(obj, list):
    print "optimized for random reads and writes at the end"
elif isinstance(obj, deque):
    print "optimized for writes at both the start and end"

If you care about such things, it's easy to find out, because there is a 
one-to-one mapping from type to behaviour.

Under Lie's suggestion, we would have a one-to-many mapping from type to 
behaviour, forcing us to do this:

if isinstance(obj, list):
    if obj.implementation == 'deque':
        print "optimized for writes at both the start and end"
    else:
        print "optimized for random reads and writes at the end"


That's assuming the implementation is available at runtime at all; if 
it's not, then you have lost information.



> Of course it's essential that the data structure has identical
> /functional/ behaviour whatever optimisation you use.

"Essential"? That's an interesting point. Let's look at an actual example.

Consider one of Lie's initial examples (paraphrased):

D1 = dict({1: 'a', 2: 'b'})  # standard hash-table 
D2 = dict({1: 'a', 2: 'b'}, implementation="binary tree")

You say it's "essential" that the binary tree implementation has 
identical functional behaviour to standard hash-table implementation of 
dict. Okay, let's see where that leads us. The functional behaviour of 
hash-tables is that they are O(1) for reads, while binary trees are 
O(log(N)). The only way for them to have identical functional behaviour 
is to *artificially* slow down dicts to the lowest common speed. That is 
a Bad Thing, and I don't understand why you think it is essential.

But perhaps you didn't actually mean identical behaviour, but merely an 
identical *interface*. That's more reasonable: we no longer have to 
artificially change the behaviour of the type -- but we do have to 
artificially cripple its API!

The advantage of binary trees as a mapping is that they are ordered. 
Binary trees have a rich set of traversal methods: you can walk the tree 
in pre-order, in-order, post-order or breadth-first order, and in-order 
traversal visits the keys in sorted order. But hash-tables don't 
have *any* of those, so now we have to forbid all binary-tree traversal 
methods. In which case, why bother paying the cost of binary trees if you 
don't get the extra functionality?
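
For concreteness, here is a toy unbalanced binary search tree (the
names Node, insert and inorder are mine, invented for the sketch)
showing the kind of ordered traversal that no hash-table can offer:

class Node(object):
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None

def insert(node, key, value):
    # plain unbalanced binary-search-tree insertion
    if node is None:
        return Node(key, value)
    if key < node.key:
        node.left = insert(node.left, key, value)
    else:
        node.right = insert(node.right, key, value)
    return node

def inorder(node):
    # in-order traversal: yields (key, value) pairs in sorted key order
    if node is not None:
        for pair in inorder(node.left):
            yield pair
        yield (node.key, node.value)
        for pair in inorder(node.right):
            yield pair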

But maybe you'll argue that Lie's examples are bad examples. Perhaps what 
you have in mind is something like the difference between "dicts 
implemented as hash tables with chaining" and "dicts implemented as hash 
tables with open addressing". (Aside: what sort of open addressing? 
Linear? Quadratic? Something else?) At least now we're talking about 
different implementations with the same API and more-or-less the same 
functional behaviour.
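
For the record, the two collision strategies differ roughly like this
(a sketch; the helper names and table layouts are invented for
illustration):

def chained_lookup(buckets, key):
    # chaining: each bucket is a list of (key, value) pairs
    for k, v in buckets[hash(key) % len(buckets)]:
        if k == key:
            return v
    raise KeyError(key)

def probing_lookup(slots, key):
    # linear open addressing: step through successive slots until we
    # find the key or an empty slot (assumes the table is never full)
    i = hash(key) % len(slots)
    while slots[i] is not None:
        k, v = slots[i]
        if k == key:
            return v
        i = (i + 1) % len(slots)
    raise KeyError(key)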

I'd suggest that 99% of the time, client code simply shouldn't care about 
the differences. The existing implementations are Good Enough. Have you 
*ever* found yourself saying "I can't use the built-in dict, because it 
uses chaining and for my application I need linear open addressing"? (Or 
vice versa.)

I doubt you've ever said that. I doubt you know anyone who has said it. 
I doubt you know *of* anyone who has said it. And even if you have, then 
you are free to go and write your own dict type as a C extension type and 
call it dict_with_linear_openaddressing and write this:

D = dict_with_linear_openaddressing(data)

instead of:

D = dict(data, implementation="linear open addressing")

Yet again Lie's proposal gains you nothing. In fact, as an application 
writer, you're better off writing your own type rather than waiting for 
it to be added to the built-in implementation of dict.

The one exception to this is when you're developing a *new type* that 
doesn't already exist in the standard library, and you haven't decided 
which implementation(s) you should use, so you write multiple 
implementations for testing. Here's the wrong way to do it:


class Foo(object):
    def __init__(self, data, implementation='standard'):
        self.implementation = implementation
        if implementation == 'standard':
            # save data one way
            self.data = foo(data)
        elif implementation == 'magic':
            # save data another way
            self.data = bar(data)
        elif implementation == 'green':
            # save data a third way, caching the length separately
            self.data = baz(data)
            self._len = len(data)
        else:
            raise ValueError('unknown implementation')
    def __len__(self):
        if self.implementation == 'standard':
            return len(self.data)
        elif self.implementation == 'magic':
            return len(self.data[0]) + len(self.data[1:])
        elif self.implementation == 'green':
            return self._len


Why would you write it that way? Worst Design Ever -- it's confusing and 
error prone and fails to obey the Don't Repeat Yourself principle, thus 
increasing the risk of errors. And testing and documentation are 
significantly more complicated.

Here's a better way:


class Foo(object):

    class _StandardFoo(object):
        def __init__(self, data):
            self.data = foo(data)
        def __len__(self):
            return len(self.data)

    class _MagicFoo(object):
        def __init__(self, data):
            self.data = bar(data)
        def __len__(self):
            return len(self.data[0]) + len(self.data[1:])

    class _GreenFoo(object):
        def __init__(self, data):
            self.data = baz(data)
            self._len = len(data)  # cache the length separately
        def __len__(self):
            return self._len

    _mapping = {'standard': _StandardFoo,
                'magic': _MagicFoo, 'green': _GreenFoo}

    def __new__(cls, data, implementation='standard'):
        try:
            return cls._mapping[implementation](data)
        except KeyError:
            raise ValueError('unknown implementation')


Here's an even better way:

class StandardFoo(object):
    def __init__(self, data):
        self.data = foo(data)
    def __len__(self):
        return len(self.data)

class MagicFoo(object):
    def __init__(self, data):
        self.data = bar(data)
    def __len__(self):
        return len(self.data[0]) + len(self.data[1:])

class GreenFoo(object):
    def __init__(self, data):
        self.data = baz(data)
        self._len = len(data)  # cache the length separately
    def __len__(self):
        return self._len


and let the client code call whichever implementation they want.
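
Client code then says exactly what it means:

data = {1: 'a', 2: 'b'}
foo = MagicFoo(data)   # the choice of implementation is right there in
                       # the name; no string parameter, no hidden dispatch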






[snip]
> They're not different data structures from the client point of view.

But of course they are. They have different functional behaviour, and 
different APIs. Unless we artificially cripple the implementations and 
make them all identical (which defeats the purpose of having different 
implementations!) they are different data structures, and those 
differences will be visible to the client code.

This scheme is a case of over-generalisation. It's also impractical: who 
is going to spend the time and effort to do it? It's hard enough to get 
somebody to write a binary tree implementation for the standard library, 
without expecting them to *also* write a wrapper for it to make it look 
like a dict. It's working against the grain of Python, which is well 
suited for the one-type = one-API model instead of the one-type = many-
APIs model which Lie is suggesting. And for what benefit? The only 
benefit suggested so far is that we can write:

dict({1: 'a', 2: 'b'}, implementation="binary tree")

instead of 

binarytree({1: 'a', 2: 'b'})

If you call that a benefit. I don't.



-- 
Steven


