Another Python rookie trying to port to C/C++ question.

Thu Sep 25 18:36:11 EDT 2003

On Thu, 25 Sep 2003 18:55:34 GMT, Don Bruder <dakidd at sonic.net> wrote:

>Dictionaries/"dict" types...
>
>Am I understanding/interpreting correctly when I go with the idea that a 
>"dict" variable can be looked at as (effectively) two parallel arrays? 
>Something like this C construct:
>
>struct DictStruct
>   {
>      char    *KeyArray[<somenumber>];
>      ValType ValArray[<somenumber>];
>   }

Sort of, in a sense, but not quite. The types of items within the
dictionary (both key and value) are not constrained. This is no
problem as all Python objects are held using a reference scheme. So
(assuming that a pair-object scheme is used in the dictionary) the
struct will be defined much like...

  struct DictPair
  {
    PyObject *Key;
    PyObject *Data;
  }

The items will almost certainly be held in a single 'array' of pair
objects.

The Python dictionaries use hashing. I don't know the details of the
hashing used. Anyway, there are many ways to handle dictionary-like
data structures.

C++ has the std::map template, which uses a form of binary tree called
a red-black tree (the red-black refers to a partial balancing system).

Tree-based mappings can be slightly more flexible than hashing (for
instance, a tree can be 'walked' in sorted order - the hashing process
does not preserve relative ordering) but hashing tends to be faster.

An interesting approach to dictionary-like objects is the ternary
tree. This is rather like cascading binary trees - each subtree
identifies one character and cascades into the subtree that finds the
next character. The third link in the 'ternary' is the 'I've
identified one character and moving on to the next' link.

A ternary tree can be faster than hashing - in particular at deciding
that a particular key is not present (hashing needs to calculate a
hash over the whole tree before looking up the result - a ternary tree
search may fail at the first character, without looking at the whole
key).

Digital trees (or tries, IIRC) can be even faster still - but wastes a
lot of memory. In a digital tree, each node has an array subscripted
by character/digit code of pointers to the next subtree.

Multiway trees (most often used on disk for database indexes - B trees
and B+ trees being types of multiway trees) may well be appropriate in
main store on modern desktop machines with caching and virtual memory.

If these options are just too complicated, sorted arrays can be
extremely effective as long as you don't have too many items. The C++
library includes std::vector (a resizable array) and binary search
algorithms which can ease the implementation.

Odds are, if you are translating Python to C++, you should use an
std::map to replace a dictionary. There are some different
assumptions, though. Dictionaries assume that the key is hashable, but
don't require relative comparisons ('<', '<=' etc). The std::map
requires relative comparisons.

This simply arises out of the different approaches - Python using
hashing and C++ using binary trees. Going from Python to C++, 99 times
out of 100 this is not a problem. From C++ to Python there is the
minor issue that Python dictionaries can't be _directly_ iterated in
sorted order, though there are certainly easy ways around that.

-- 
Steve Horne

steve at ninereeds dot fsnet dot co dot uk