PEP 372 -- Adding an ordered directory to collections

Tue Jun 17 02:20:51 EDT 2008

On Jun 17, 12:28 am, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
> > My guess is that the two main memory allocate/deallocate cases are 1)
> > appending a new item to the end, and 2) GC'ing the entire data
> > structure.  I would optimize these 2 at the expense of all others.
>
> Does that include dictionary lookups?
>
> Regards,
> Martin

Well, I did (try to) qualify my guess as pertaining specifically to
memory allocate/deallocate cases, the other topics in the nearby posts
were talking about updates to the odict, so I hope I can be forgiven
this oversight.  But you're right, the keyed access is one of the main
reasons this is being done with an odict instead of just a list of
tuples.

The point I was trying to make was that sometimes, the GC case occurs
far more frequently than other update methods, yet is overlooked
because it happens invisibly to most Python users.

I worked on a project a while ago where there was a huge performance
penalty in releasing a list structure, because each node was freed
individually, causing surrounding freelist entries to be updated for
every node.  In typical worst-case scenario fashion, a second list was
built at the same time as the first, and the default system memory
allocator interleaved the list nodes.  The truly frustrating part was
that at this point in the program, we were trying to free *all* the
nodes in *both* lists, and would have preferred to just release the
whole chunk of memory, if it had just been allocated in a large chunk
instead of a node at a time.  I implemented a zone-based memory
allocator, so that nodes were allocated within a memory zone, but when
we were done, we just released the entire zone, with tremendous
improvement in system performance.

But what kind of keyed access is likely to be used on an odict?  For
those cases where all of the items are desired, then I would recommend
that the odict iterator be preferred, for this type of retrieval:

for k,v in odictvar.iteritems():
    # do something with key k and value v

and discourage this usage:

for k in odictvar.keys():  # or iterkeys()
    # do something with key k and value odictvar[k]

How about this as a simple first approach?  Let's assume that the
nominal usage pattern for an odict is 1) create all entries in the
odict, probably using append, 2) access some or all of the entries,
either by iterating over the keys or values, or by keyed access to
specific values, and 3) delete the odict.  Have the odict keep an
internal dict representing the keyed lookups, and have this internal
dict set to null whenever the odict is updated.  When the odict is
accessed by specific key, the internal dict is used - if it null, it
is built using dict(odict.iteritems()).  In the nominal usage, this
dict will only be built once, since the keyed accesses wont begin
until after all of the key-value pairs have been added to the odict.

Or as a first-first approach (to avoid premature optimization), just
implement the odict as a list of tuples, and do linear search for
matching key if accessed by key.

I would bias any performance choices about keyed access toward cases
where the queried key exists in the odict - I think that in those
cases where odicts are used, such as ORMs and XML, the keys for a
particular odict are known and expected to exist.  Those applications
where the keys are not known are likely to be meta-type apps, like
debuggers and introspection GUIs.  I contend that the meat-and-
potatoes production apps will be those like database queries - when I
get an odict back after querying an EMPLOYEES table, I really should
reasonably expect odictvar["EMPNO"] to exist.

And what would be the conditions of an abusive form of odict?  How
about an application log kept in an odict, keyed by timestamp?  This
could have many 1000's of entries, and yet, I would guess that an
odict would be the wrong data structure for this, and that a list of
tuples or LogMessage objects would be more appropriate.  But someone
is likely to do it - if they do, what will happen?  Perhaps trying to
anticipate "abusive" or degenerate uses of an odict might shed some
light on ways to avoid, discourage, or maybe even accommodate these
uses in odict.

Well, this has become something of a rant, and a speculative one at
that, but I think the "delete the entire data structure" memory case
should be given reasonably high design attention.  (And maybe this is
already the norm in Python development - I'm really quite ignorant of
this process, not being in the Python development circle.)

Always nice to hear from you Martin - cheers!

-- Paul