[Python-ideas] Trie ABC?
Geoffrey Sneddon
foolistbar at googlemail.com
Tue May 7 20:03:14 CEST 2013
On 07/05/13 03:22, Andrew Barnert wrote:
>> >From: Stephen J. Turnbull<turnbull at sk.tsukuba.ac.jp>
>> >Sent: Monday, May 6, 2013 5:34 PM
>> >
>> >Nick Coghlan writes:
>> >
>>> >> This seems reasonable, although I think it may make more sense in
>>> >> conjunction with a reference implementation in the standard library.
>> >
>> >Reasonable in the abstract, yes. But if it's possible to do as well
>> >as timsort[1], IMHO providing an ABC rather than going directly to a
>> >concrete implementation violates Occam's Razor. All I want[2] here is
>> >somebody to tell me that "no, trie cannot (at this time) be reduced to
>> >a single algorithm that can be tuned to perform very well in almost
>> >all cases encountered in practice, and it will be common for different
>> >users[3] to want to implement different algorithms."
>
> First, it's not an algorithm like sort, it's a data structure like set.
>
> And there's at least one important related but different data structure that has the same API as an (immutable) trie, but a different implementation: a branch-merged trie. This is a huge space savings for sparse lookup tables, like Unicode data. (Actually, there's a whole class of compacted tries that are useful, but the branch-merged one seems to be the one that half the Unicode-data implementations out there use, so I'm being conservative here and saying maybe just that one would be good enough.)
>
> Meanwhile, even if you stick with just the standard data structure, many implementations are hardcoded to deal only with ASCII (or UCS-2 or UTF-32 or some other fixed-size) characters instead of arbitrary objects. And the reason for this isn't just faster and simpler implementation, or ability to wrap up C code with a decade of optimization and testing behind it, but space. People store giant dictionaries in tries, and if our One True Trie took 4x as much space as an ASCII-only trie, people would still use the latter.
Even more extreme than this is what the marisa-trie package does: it has
different implementations depending on value type. It has one that can
hold any Python object, one that can hold identically typed tuples using
magic with the struct module, and one that can hold bytes objects.
Dealing with giant dictionaries how you encode the value is just as
important as the key.
Note also libdatrie uses an alphabetmap, which allows arbitrary subsets
of the Unicode plane to be used. Of course, this then makes it
impossible to add a new arbitrary key without recreating the entire
trie, so the mutable/immutable distinction is important.
> So, could we get away with a single trie? No… but maybe a single frozentrie and a single trie (although note that converting between them wouldn't be constant space/time, unlike frozenset/set), with both doing the same kind of transparent fallback as str (use ASCII if possible, UCS-2 if possible, UTF-32 if possible, PyObject* if the elements aren't single-char strings at all), would be a good enough 80% solution.
I'd be totally in favour of putting such a thing in the stdlib… if
someone ever wrote it. :)
But yes: I think Andrew has succulently explained why I'm not rushing
into trying to push a specific implementation into the stdlib. I'm not
against putting one in, but it'll need to be carefully chosen: any
pure-Python one will likely have undesirable memory usage due to the
overhead per string object (there's a reason for my implementation
implements the interface using a binary-search-tree and not a trie!);
and most modern C implementations are separate libraries: the Python
wrapper is only really interesting from an API POV, so it doesn't matter
too much if we choose something not used before.
/gsnedders
More information about the Python-ideas
mailing list