[Python-ideas] Trie ABC?

Tue May 7 20:03:14 CEST 2013

On 07/05/13 03:22, Andrew Barnert wrote:
>> >From: Stephen J. Turnbull<turnbull at sk.tsukuba.ac.jp>
>> >Sent: Monday, May 6, 2013 5:34 PM
>> >
>> >Nick Coghlan writes:
>> >
>>> >>  This seems reasonable, although I think it may make more sense in
>>> >>  conjunction with a reference implementation in the standard library.
>> >
>> >Reasonable in the abstract, yes.  But if it's possible to do as well
>> >as timsort[1], IMHO providing an ABC rather than going directly to a
>> >concrete implementation violates Occam's Razor.  All I want[2] here is
>> >somebody to tell me that "no, trie cannot (at this time) be reduced to
>> >a single algorithm that can be tuned to perform very well in almost
>> >all cases encountered in practice, and it will be common for different
>> >users[3] to want to implement different algorithms."
>
> First, it's not an algorithm like sort, it's a data structure like set.
>
> And there's at least one important related but different data structure that has the same API as an (immutable) trie, but a different implementation: a branch-merged trie. This is a huge space savings for sparse lookup tables, like Unicode data. (Actually, there's a whole class of compacted tries that are useful, but the branch-merged one seems to be the one that half the Unicode-data implementations out there use, so I'm being conservative here and saying maybe just that one would be good enough.)
>
> Meanwhile, even if you stick with just the standard data structure, many implementations are hardcoded to deal only with ASCII (or UCS-2 or UTF-32 or some other fixed-size) characters instead of arbitrary objects. And the reason for this isn't just faster and simpler implementation, or ability to wrap up C code with a decade of optimization and testing behind it, but space. People store giant dictionaries in tries, and if our One True Trie took 4x as much space as an ASCII-only trie, people would still use the latter.

Even more extreme than this is what the marisa-trie package does: it has 
different implementations depending on value type. It has one that can 
hold any Python object, one that can hold identically typed tuples using 
magic with the struct module, and one that can hold bytes objects.

Dealing with giant dictionaries how you encode the value is just as 
important as the key.

Note also libdatrie uses an alphabetmap, which allows arbitrary subsets 
of the Unicode plane to be used. Of course, this then makes it 
impossible to add a new arbitrary key without recreating the entire 
trie, so the mutable/immutable distinction is important.

> So, could we get away with a single trie? No… but maybe a single frozentrie and a single trie (although note that converting between them wouldn't be constant space/time, unlike frozenset/set), with both doing the same kind of transparent fallback as str (use ASCII if possible, UCS-2 if possible, UTF-32 if possible, PyObject* if the elements aren't single-char strings at all), would be a good enough 80% solution.

I'd be totally in favour of putting such a thing in the stdlib… if 
someone ever wrote it. :)

But yes: I think Andrew has succulently explained why I'm not rushing 
into trying to push a specific implementation into the stdlib. I'm not 
against putting one in, but it'll need to be carefully chosen: any 
pure-Python one will likely have undesirable memory usage due to the 
overhead per string object (there's a reason for my implementation 
implements the interface using a binary-search-tree and not a trie!); 
and most modern C implementations are separate libraries: the Python 
wrapper is only really interesting from an API POV, so it doesn't matter 
too much if we choose something not used before.

/gsnedders