Interning own classes like strings for speed and size?

Tue Dec 28 14:23:17 EST 2010

On 12/27/2010 3:05 AM, Ulrich Eckhardt wrote:
> Hi!
>
> I'm trying to solve a computational problem and of course speed and size is
> important there.

     Then you probably shouldn't be using CPython.

> Apart from picking the right algorithm, I came across an
> idea that could help speed up things and keep memory requirements down. What
> I have is regions described by min and max coordinates. At first, I just
> modeled these as a simple class containing two values for each axis.

     There are more effective techniques for that.  In game physics
collision detection systems, it's common to use axis-ordered lists
of bounding boxes to partition objects.  This still works even when
the ranges overlap.  Look up "I-Collide" for a classic implementation.

 > Am I looking in the wrong direction? Is there some better approach?
 > Please
 > don't tell me to use C, as I'm specifically interested in
 > learning Python,
 > I'm pretty sure I could have solved the problem quickly in
 > C++ otherwise.

     CPython is a naive interpreter which does dictionary
lookups for every variable and attribute access.  If you're doing
fast manipulation of linked data structures in CPython, performance
is going to suck.

     Check out Shed Skin 0.7, which is a much faster Python
system.  It's limited, but for the kind of data structure work
you're doing, all the features you should need already work.

				John Nagle
>
> In a second step, I derived this class from tuple instead of object. Some
> code then moved from __init__ to __new__ and some code that modified these
> objects had to be changed to replace them instead. The upside to this is
> that they can be used as keys in sets and dicts, which isn't the case for
> mutable types[1].
>
> What I'm now considering is to only allow a single instance of these objects
> for each set of values, similar to interned strings. What I would gain is
> that I could safely compare objects for identity instead of equality. What
> I'm not yet sure is how much overhead that would give me and/or how to keep
> it low. The idea is to store each instance in a set and after creating a new
> object I would first look up an equal object in the global set and return
> that instead, otherwise add the new one.
>
> The problem I foresee is that if I define equality as identity, this lookup
> when creating will never eliminate duplicates. If I only fall back to
> equality comparison for non-identical objects, I would probably sacrifice
> most of the gain. If I build a dict mapping between the values and the
> actual objects, I would have doubled the required memory and uselessly store
> the same values twice there.
>
>
> Cheers!
>
> Uli
>
>
> [1] Somebody correct me if I'm wrong, but I believe I could have defined a
> hashing function for the type and thus allowed its use in a set or dict,
> right? However, goofing up because you accidentally modified an object and
> changed its hash value is something I don't want to risk anyway.
>