[Python-3000] Two proposals for a new list-like type: one modest, one radical

Wed Apr 25 08:04:27 CEST 2007

[Daniel Stutzbach]
> As a pet project, I've written a container type that looks, acts, and
> quacks like a Python list(), but has better asymptotic performance.
> Specifically, O(log n) time for inserts and deletes in the middle of
> the list.  I'd like to offer it for inclusion in the collections
> module.

In principle, I think this new container would be a fine addition to the collections module.  That is a perfect place for alternate 
implementations of standard types. By matching the list-api, you've reduced the learning curve to near zero while still offering 
intermediate-level users an alternative data structure with different performance trade-offs.

In practice, the code for BList is somewhat complex, and its desirability and performance in actual applications is unproven. 
Fortunately, Py2.6 is still a long way off. My recommendation is that you release it right away as a third-party module and let the 
hordes of Pythonistas test it in battle. With the code base thoroughly exercised and some positive user feedback, it would be hard 
to say no to this going into Py2.6.

> I'm open to suggestions for a better name.

I'm hoping for a better name too.   Hope you enjoy the impending naming debate --  very few people will read your code, more than a 
few will read the proposal, but *everyone* will have an opinion on the name ;-)   Perhaps start-off with something romantic and 
enigmatic like "treelist" ;-)

> - just as fast as a Python list() when the list is small
> - getslice runs in O(log n) time
> - making a shallow copy runs in O(1) time
> - setslice runs in O(log n + log k) time if the inserted slice is a
> BList of length k
> - multipling a BList by k takes O(log k) time and O(log k) memory

How does BList's memory consumption compare with the current list over-allocation scheme?

Does it take longer to instantiate an emtpy list?  What are the malloc/free patterns?

> In other words, for short lists,
> a BList works just like Python's array-based list() type.  Thus, it
> has the same good performance on small lists.

While this is a good implementation idea (similar ideas are used to speed-up quicksort for instance), it makes the first listed 
advantage seem like a straw-man designed to mislead proposal reviewers into thinking that the blist structure is more efficient than 
is actually is.  Perhaps rewrite it as: "To keep the performance acceptable for short lists, btreelists fall back to the existing 
implementation."

> The Radical Proposal
> Replace list() with the BList.

FWIW, I prefer that the current list implementation remain in-place. Because what we have is simple, it is not hard to maintain, its 
performance characteristics are easily understood, and we have a straight-forward C-API that allows module writers to have fast, 
intuitive direct access to the structure.

OTOH, if your btreelist goes into Py2.6's collections module and users flock to it instead of regular lists, then there may be some 
basis for further discussion.

BTW, please offer-up a pure python version (like we do for the heapq module).  That will make it easier for the curious to find-out 
how the magic is done.  And, it will make life easier for Jython, IronPython, and PyPy implementer so they won't have to do any 
extra work to support Py2.6.

> There are only a few use-cases (that I can think of) where Python's
> list() regularly outperforms the BList.  These are:
>
> 1. A large LIFO stack, where there are many .append() and .pop(-1)
> operations.  These are O(1) for a Python list, but O(log n) for the
> BList().

This is a somewhat important use-case (we devote two methods to it).

> 2. An large, unchanging list, where there are many getitem() calls,
> and none of the inserts or deletes where the BList() shines.  Again,
> these are O(1) for a Python list, but O(log n) for the BList(), and
> the BList() is just as fast if the list is small.

This is also a critical use-case.

> When I first started learning Python, I often ended up inadvertently
> writing O(n**2) algorithms by putting a O(n) list operation inside a
> loop (which is a problem when n=100,000).  Since nothing in the
> signature of list() informs the user when an operation is going to be
> slow, it took me a while to learn how to write my code to work around
> this constraint.  In fact, I initially assumed "list" meant "linked
> list" so most of my assumptions about which operations would be fast
> vs. slow were backwards.

You've come a long way since then.  Congrats.

Thanks for your contribution and thoughtful discussion.  This weekend, I look forward to reading you source code in detail.

Raymond