[Python-Dev] collections module

Fri Jan 9 06:42:34 EST 2004

From: Raymond Hettinger
> I would like to establish a new module for some collection classes.

In general, I approve of this idea. However, one point you don't
make clear - are you proposing that this be coded in Python, or in
C?

I'd be against it being coded in C, but for a Python implementation. A
python module serves not only to provide the feature, but also to
demonstrate good Python code, to show how to code such basic data types
in Python, and to offer explicit, "use the source" form, documentation
of corner cases. As an example of this, I once used the heapq Python
implementation as a basis for developing my own variant, which allowed
updating of the priority. I even went as far as importing some of the
module's private functions to help my implementation. With a C coded
version, I couldn't do that - I'd have had to write the whole thing
for myself. (Yes, I know that Python 2.4 you recoded heapq in C - this
isn't a dig at that as such, although it will be a problem for me. It's
just a note that coding in C is not always a good thing...)

The only reason I can see for coding in C is performance. And performance
just isn't that big a deal to me. And we don't want to give a message
that "dictionaries and lists are slow" by offering "fast" versions of
things which can be coded in terms of them.

> The first type would be a bag

I don't, personally, have much use for this data type. In practice, I
would naturally choose to use a dictionary mapping item to count.
It would be important in the documentation to provide clear use
cases, showing when this type is more appropriate than a simple
dictionary, as I describe, and when it is not.

In terms of "making dictionaries look slow", how would you answer
the question "Why use this rather than just dict.get(k, 0) + 1"
without giving the impression that dictionaries "aren't fast enough"?

> The second type would be a high speed queue

This seems like a good idea. I have found myself on occasion
wanting to use Queue.Queue in non-threaded programs, simply because
a "queue" fits my mental model of what I am doing. But it's for
code readability reasons - not performance.

And the same issue here over performance impressions - how do you
make this seem like a good alternative to list.pop, without making
lists look slow?

> This module could also serve as the "one obvious place to put it"
> for other high performance datatypes that might be added in the
> future (such as fibheaps or some such).

That is the biggest benefit I see to this proposal. There have
recently been a number of discussions of useful library code, where
much of the problem seemed to be about where to put the code,
rather than about semantics, etc.

> Guido wanted this to be discussed by a number of people.  While
> a rich set of collection classes have proven their worth in other
> contexts, it could be that more is not better for python.  One the
> language's elegances is the ability to do just about anything with
> just lists and dicts.  If you have too many mutable containers to 
> choose from, the choice of which to use becomes less obvious.  

> My thoughts are that the existence of the array module, for example, 
> has never impaired anyone's ability to learn or use lists and dicts.

That is true, but equally the array module has a feeling of being
"specialised". I'm not sure I can quantify this, but your description
of the collection module doesn't feel similarly "specialised".

> Also, no one reaches for a bag unless they are counting things.

... and when they are, they reach for a dictionary - d.get(k, 0) + 1
feels like a very natural Python counting idiom.

> Likewise, queues are in a mental concept space distinct from dicts, 
> sets, and such.  My only worry with queues is differentiating
> them from Queue objects which have builtin synchronization for 
> use in threads.

Queues currently "feel like" a threading concept in Python (as opposed
to what they feel like in a more general data structures concept).
Having a queue in a collections module would clash with that. I'm not
against it, but I believe your worry is important - you're fiddling
with people's mental models, something bound to generate controversy...

Apologies for going on about performance. I am in favour of the idea of
a module containing implementations of standard container data structures,
but I believe it should be in Python, and should be presented as the
"best of breed" implementation in terms of handling corner cases, and
covering subtle algorithmic issues, and providing good design, and
offering subclassability where needed. Basically, "everyone can, and
often does, reinvent this stuff in Python - but getting the details
right is tricky, so we've done it for you."

If performance of a Python implementation is a problem, speed up the
bottlenecks in the Python VM (likely to be function calls, I'd guess)
rather than spending effort reimplementing in C...

Sorry if that was a bit rambling. I hope it's of use.
Paul.