matching strings in a large set of strings

Sat May 1 09:05:05 EDT 2010

Duncan Booth, 30.04.2010 10:20:
> So more than 3GB just for the strings (and that's for Python 2.x on
> Python 3.x you'll need nearly 5GB).
>
> Running on a 64 bit version of Python should be fine, but for a 32 bit
> system a naive approach just isn't going to work.
>
> Option 1: use a trie. That should reduce storage, maybe it will reduce
> it enough, maybe not. It depends on the data.

Depending on the implementation and the data, a trie can also use a lot 
/more/ space than the set of strings that it contains. The 83 million 14 
character strings can well include a relatively large subset of the 
possible permutations (imagine purely numeric, hex-numeric or lower-case 
alpha-numeric strings, for example), so the trie will likely need to branch 
very often with very low intermediate run length. If you use pointers for 
trie branches, that's at least 8 bytes per branch on a 64bit system, versus 
1 byte per character in a byte string list. Depending on the ratio of 
branches to characters, one or the other may win. So a "naive approach" 
likely won't work for tries either.

Stefan