[Python-ideas] Trie ABC?

Tue May 7 04:22:25 CEST 2013

Before getting further into trie…

The most common data structure people ask for in Python isn't tries, it's sorted mappings, like C++ std::map and friends. There are a dozen or more implementations out there—and they all extend the (Mutable)Mapping ABC in different ways. For example, https://pypi.python.org/pypi/bintrees/ adds bisect methods, key slices, and heapq-like methods; http://stutzbachenterprises.com/blist/sorteddict.html adds a different set of bisect methods, some subset methods, and a few other things. Wouldn't it be nice if they could all just implement the part that's actually interesting and worth competing over, and inherit collections.abc.MutableSortedMapping to get an ideal set of methods? MutableMapping is useful for exactly that reason.

Maybe we should go through the different sorteddict implementations, pick the best one, improve its API with the best features of the others (and drop any misfeatures), and stick it in the stdlib. But I think people are still going to want to use the alternatives. (In fact, blist was rejected for stdlib inclusion a while back for exactly that reason.)

So, I think it makes sense to add the ABC without adding a reference implementation.

Anyway, on to trie…

> From: Stephen J. Turnbull <turnbull at sk.tsukuba.ac.jp>

> Sent: Monday, May 6, 2013 5:34 PM
> 
> Nick Coghlan writes:
> 
>>  This seems reasonable, although I think it may make more sense in
>>  conjunction with a reference implementation in the standard library.
> 
> Reasonable in the abstract, yes.  But if it's possible to do as well
> as timsort[1], IMHO providing an ABC rather than going directly to a
> concrete implementation violates Occam's Razor.  All I want[2] here is
> somebody to tell me that "no, trie cannot (at this time) be reduced to
> a single algorithm that can be tuned to perform very well in almost
> all cases encountered in practice, and it will be common for different
> users[3] to want to implement different algorithms."

First, it's not an algorithm like sort, it's a data structure like set.

And there's at least one important related but different data structure that has the same API as an (immutable) trie, but a different implementation: a branch-merged trie. This is a huge space savings for sparse lookup tables, like Unicode data. (Actually, there's a whole class of compacted tries that are useful, but the branch-merged one seems to be the one that half the Unicode-data implementations out there use, so I'm being conservative here and saying maybe just that one would be good enough.)

Meanwhile, even if you stick with just the standard data structure, many implementations are hardcoded to deal only with ASCII (or UCS-2 or UTF-32 or some other fixed-size) characters instead of arbitrary objects. And the reason for this isn't just faster and simpler implementation, or ability to wrap up C code with a decade of optimization and testing behind it, but space. People store giant dictionaries in tries, and if our One True Trie took 4x as much space as an ASCII-only trie, people would still use the latter.

So, could we get away with a single trie? No… but maybe a single frozentrie and a single trie (although note that converting between them wouldn't be constant space/time, unlike frozenset/set), with both doing the same kind of transparent fallback as str (use ASCII if possible, UCS-2 if possible, UTF-32 if possible, PyObject* if the elements aren't single-char strings at all), would be a good enough 80% solution.

Is anyone volunteering to build that?