implementing large trees

Jonathan Hogg jonathan at onegoodidea.com
Tue Jul 16 04:39:23 EDT 2002


On 15/7/2002 14:22, in article aguia9$bks$06$1 at news.t-online.com, "Michael
Schmitt" <ms at NOMAIL.de> wrote:

> I'm trying to implement a large tree data structure, which does not fit
> into memory. Based on an in-memory implementation, which has two
> dictionaries, one for all nodes and one for all edges, the first idea was to
> use shelve to have disc-based dictionaries.
> 
> But for usual tree operations (depth-first traversal, breadth-first
> traversal) the shelve-based tree was much slower than the tree using two
> dictionaries and relying on swapping.
> 
> What is the "best" way to implement such a tree? Is it possible to improve
> performance without knowing how the tree is traversed? What about having a
> breadth-first and a depth-first cache, that prefetch nodes and edges after
> each access?

Virtual memory is a wonderful thing. I'd focus on optimising your algorithms
and layout to have as much locality of reference as possible and let VM do
the caching for you.

In particular I'd look at how you build the tree in the first place. If you
construct it in a very random fashion so that adjacent edges and nodes are
scattered all over memory, then walking the tree will always result in poor
locality of reference.
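
For example, one way to get better locality (this is only a sketch; the
children_of helper and the list-based representation are invented here, not
taken from your code) is to rebuild the tree once in breadth-first order, so
that related nodes end up numbered, and allocated, next to each other:

    from collections import deque

    def relayout_bfs(root, children_of):
        """Renumber nodes in breadth-first order.

        Returns a list of nodes in BFS order plus, for each node, the
        indices of its children, all built in one contiguous sweep."""
        index = {root: 0}
        nodes = [root]
        child_indices = [[]]
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for child in children_of(node):
                index[child] = len(nodes)
                child_indices[index[node]].append(index[child])
                nodes.append(child)
                child_indices.append([])
                queue.append(child)
        return nodes, child_indices

How much this buys you in Python depends on where the interpreter actually
puts the objects, but building everything in one sweep is usually a better
starting point than inserting nodes in whatever order the data happens to
arrive.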

Take a look at the VM stats. If:

    Number of Page Ins + Number of Page Outs
    ----------------------------------------  ~=  Number of Passes
           Total Virtual Set Size

then you're probably not going to do much better. If the ratio is substantially
larger than the number of tree walks you do, then your memory layout is a
mess and the traversal is striding across it.
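
In concrete terms the check is just this bit of arithmetic (the counter
values below are made up; pull the real page-in/page-out figures and the
virtual set size, all in pages, from vmstat or whatever your platform gives
you):

    def paging_ratio(page_ins, page_outs, virtual_set_size):
        # all three figures counted in pages, over the whole run
        return (page_ins + page_outs) / float(virtual_set_size)

    print(paging_ratio(120000, 118000, 60000))  # ~4: compare with your pass count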

Unless you can get it laid out neatly you'll have trouble. Even if you write
it to a file and try to implement your own intelligent loading, you'll just
end up hitting the same problem with the disk cache.

If you make a large number of passes, then you might want to consider
building the tree in memory and traversing it once to write it out evenly
and compactly to a file. That one-off dump will perform very poorly, but the
resulting file can be mmapped back in and walked neatly.
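
Something along these lines (again only a sketch: the fixed-size record
format and the children_of/value_of helpers are invented for illustration,
and a real version would need to cope with edge data and very deep
recursion):

    import mmap
    import struct

    RECORD = struct.Struct('<qi')     # int64 node value, int32 child count

    def dump_preorder(f, node, children_of, value_of):
        """Write node and its subtree to the open file f in depth-first
        preorder, one fixed-size record per node."""
        kids = children_of(node)
        f.write(RECORD.pack(value_of(node), len(kids)))
        for child in kids:
            dump_preorder(f, child, children_of, value_of)

    def walk_preorder(buf, visit, pos=0, depth=0):
        """Traverse the dumped tree; reads buf strictly front to back."""
        value, n_children = RECORD.unpack_from(buf, pos)
        pos += RECORD.size
        visit(value, depth)
        for _ in range(n_children):
            pos = walk_preorder(buf, visit, pos, depth + 1)
        return pos

    # Dump once, then mmap the file and traverse it sequentially:
    #
    #     with open('tree.bin', 'wb') as f:
    #         dump_preorder(f, root, children_of, value_of)
    #     with open('tree.bin', 'rb') as f:
    #         buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    #         walk_preorder(buf, lambda value, depth: None)

Because every subtree is contiguous in the file, a depth-first walk touches
each page roughly once and the kernel's readahead does the rest.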

Jonathan



