implementing large trees

Michael Schmitt ms at NOMAIL.de
Mon Jul 15 21:15:25 EDT 2002


Hello, thanks for your thought-provoking questions.

Bengt Richter wrote: [ordering of citations changed!]
> - What does your tree actually represent IRL?
The tree represents factors of a maximal length of event sequences. In 
these sequences patterns should be discovered.

> - What about buying more memory ;-)
Well..., most of the algorithms seem to have linear complexity. If this 
would run on hard disc, with a "nice linear factor", it would scale 
sufficiently with problem size. My budget does not scale with problem size, 
so this is not an option at present.

> - Are your nodes and edges class instances? Are you using __slots__ ?
> - What are you using as keys in your dictionaries?
I'm not using nodes instances to avoid object overhead. As it seems, the 
__slot__ suggestion would make it affordable.

Actually, the node dictionary maps node ids to a tuple of parent id, data, 
child ids. The data field contains also information about the incoming edge.
The edge dictionary maps (nodeid,edge) tuples to nodeids.

> - Do you need to touch every node as you follow edges?
Yes. As path construction is a main point, I can see no way around it.

> - How many total nodes and edges?
300.000 are o.k for in-memory, 3.000.000 would be nice, 50.000.000 would 
make me happy.


> - How many edges per node? Min? Max? Usual?
At the root are about 2% of the edges, a bit less for larger node numbers.
At inner nodes very often 1, but up to 0.2 % in very rare cases.

> - Is it very deep or very wide?
Depth is limited to something between 5 to 15. Width at leaf level is about 
10%-20% of the number of nodes.

> - Is the tree essentially static for many traversals, so that it might pay
> to do some special overall pre-procesing to speed traversal?
The tree is static. Build tree, traverse, finished.

> - Is part of the tree "hot" and other parts hardly ever visited? Would
> caching pay off?
The tree is traversed rather uniformly. Probably it is possible to restrict 
traversal to breadth- or depth-first only, which would make caching simpler.

After answering all these questions, i'm not shure, what answer i expected. 
I hoped to find a simple tree on disc universal solution, but as it seems 
it has to be tailored for the algorithms that work on the tree.

Bengt, thanks for your questions.

Thanks for any further hints.

Michael




More information about the Python-list mailing list