[Python-ideas] Exporting dict Items for direct lookups of specific keys

Eyal Lotem eyal.lotem at gmail.com
Wed Jun 20 11:50:21 CEST 2007


I have created a new thread on Python Ideas to discuss this.
I have also written some code and attached a patch.

In the end I did not have to use a bitmap in dicts; I could reuse
the top hash bit instead:

https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1739789&group_id=5470
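To make the bit trick concrete, here is a rough Python sketch of the idea (the names and the 32-bit width are illustrative only; the actual patch does this in C on the entry's cached hash field):

```python
# Illustrative sketch only: the patch manipulates the cached hash of a C
# dict entry; here we just show how one bit of a hash word can be stolen
# as an "exported" flag without losing the ability to recover the hash.
EXPORTED_BIT = 1 << 31          # top bit of an (assumed) 32-bit hash word

def mark_exported(h):
    """Set the flag bit on a hash word."""
    return h | EXPORTED_BIT

def is_exported(h):
    """Check the flag bit."""
    return bool(h & EXPORTED_BIT)

def stored_hash(h):
    """Recover the original hash bits, with the flag masked off."""
    return h & ~EXPORTED_BIT

h = mark_exported(0x1234)
print(is_exported(h), hex(stored_hash(h)))   # → True 0x1234
```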

pystones and other benchmarks seem to speed up by about 5%. Other
specific benchmarks, built to measure the speed increase of
globals/builtins lookups, measure a 42% speedup.
Regression tests are less than 5% slower, but I assume that if this
acceleration is also applied to attribute lookups, they will be
accelerated too.

Eyal

On 6/12/07, Eyal Lotem <eyal.lotem at gmail.com> wrote:
> On 6/11/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> >
> > "Eyal Lotem" <eyal.lotem at gmail.com> wrote:
> > > On 6/11/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> > > > "Eyal Lotem" <eyal.lotem at gmail.com> wrote:
> > > > > On 6/10/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> > > > > > "Eyal Lotem" <eyal.lotem at gmail.com> wrote:
> > > > > > > On 6/10/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> > > > > > > Only access of exported items is a direct dereference (when accessed via
> > > > > > > your PyDictItem_obj->value); other items must be accessed normally, and
> > > > > > > they take just as much time (or, as I explained and you reiterated, a tad
> > > > > > > longer, since it requires a bitmap check and, in the case of exported
> > > > > > > items, another dereference).
> > > > > >
> > > > > > But you still don't explain *how* these exported keys are going to be
> > > > > > accessed.  Walk me through the steps required to improve access times in
> > > > > > the following case:
> > > > > >
> > > > > > def foo(obj):
> > > > > >     return obj.foo
> > > > > >
> > > > > >
> > > > > I think you missed what I said - I said that the functionality should
> > > > > probably not be exported to Python - as Python has little to gain from
> > > > > it (it would have to getattr a C method just to request the exported
> > > > > item -- which will nullify the speed benefit).
> > > > >
> > > > > It is the C code which can suddenly do direct access to access the
> > > > > exported dict items - not Python code.
> > [snip]
> > > While extensions are an optimization target, the main target is
> > > global/builtin/attribute accessing code.
> >
> > Or really, module globals and __builtin__ accessing.  Arbitrary
> > attribute access is one of those "things most commonly done in Python".
> > But just for the sake of future readers of this thread, could you
> > explicitly enumerate *which* things you intend to speed up with this
> > work.
>
> As for optimizing globals/builtins, this will be the effect of my patch:
>
> global x
> x = 5 # Direct access write instead of dict write
> x # Direct access read
> globals()['x'] = 5 # Same speed as before.
> globals()['x'] # Same speed as before.
> min # Two direct access reads, instead of 2 dict reads.
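The distinction above is visible in the bytecode: the `global x` forms compile to dedicated opcodes (the ones the patch targets), while `globals()['x']` is an ordinary call plus dict subscript. A quick check with a modern `dis` module (shown here for clarity, even though the patch targets the CPython of its day):

```python
import dis

def write_global():
    global x
    x = 5        # compiles to STORE_GLOBAL - a candidate for direct access
    return x     # compiles to LOAD_GLOBAL

def write_via_dict():
    globals()['x'] = 5   # plain call + dict store - untouched by the patch

ops = {ins.opname for ins in dis.get_instructions(write_global)}
print('STORE_GLOBAL' in ops, 'LOAD_GLOBAL' in ops)   # → True True
```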
>
> As for attribute access in classes, the speedup I can gain depends on
> a psyco-like system. Let's assume that we have a system that utilizes a
> new TYPE_FORK opcode to jump to different code according to a
> map of types. For example, if we have:
>
> def f(x, y):
>   x.hello()
>
> Then we will create a TYPE_FORK opcode before x.hello() that takes 'x'
> as an argument and a map of types (initially empty).  When the exact
> type of 'x' isn't in the map, the rest of the code in the code
> object after TYPE_FORK will have a specialized version created for x's
> current type [only if that type doesn't override tp_getattr/o] and
> inserted into the map.
> The specialized version of the code will contain, instead of a
> LOAD_ATTR for the string "hello", a FAST_LOAD_ATTR for the string
> "hello", associated with the direct-access dict item in the mro dict.
> (If there's no mro cache, I actually have a problem here, because I
> don't know which dicts I need to export dict items from - and worse,
> that list may change over time. The simplest solution is to use an
> exported item from an mro cache dict.)
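A pure-Python sketch of the TYPE_FORK idea (all names here are mine, not from the patch): the first call from a given exact type pays for one generic lookup, after which a cached per-type handler stands in for the specialized code object:

```python
# Hypothetical sketch: a call site that builds a per-type map, standing in
# for a TYPE_FORK opcode with specialized code objects per exact type.
def make_forked_call(method_name):
    type_map = {}                       # exact type -> "specialized" handler

    def call(obj, *args):
        handler = type_map.get(type(obj))
        if handler is None:
            # first time this exact type is seen: do one generic lookup
            # and cache it (the patch would instead emit specialized code)
            handler = getattr(type(obj), method_name)
            type_map[type(obj)] = handler
        return handler(obj, *args)      # later calls skip the generic path

    call.type_map = type_map            # exposed for inspection
    return call

class Greeter:
    def hello(self):
        return "hello"

call_hello = make_forked_call("hello")
print(call_hello(Greeter()))            # → hello
print(Greeter in call_hello.type_map)   # → True
```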
>
> FAST_LOAD_ATTR will not call PyObject_GetAttr; instead, it will use the
> exported dict items to find the descriptor/classattr using direct
> access.
> If it finds a data descriptor (one with both __get__ and __set__), it will
> return the result of its __get__ call.
> Otherwise, it will do a normal, expensive lookup on the instance dict
> (for "hello") - unless __slots__ is defined, in which case there is no
> instance dict.
> If that lookup finds something, it will return it.
> Otherwise, it will return the result of the descriptor's __get__ call if it
> has one, or the descriptor itself as a classattr.
>
> In other words, I am reimplementing PyObject_GenericGetAttr here, but
> for mro-side lookups, using my direct lookup.
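For readers unfamiliar with that order, here is what PyObject_GenericGetAttr's class-side/instance-side dance looks like in pure Python (illustrative only; the patch would replace the mro walk below with a direct exported-item dereference):

```python
# Pure-Python sketch of the lookup order described above - the same order
# PyObject_GenericGetAttr uses.
_MISSING = object()

def generic_getattr(obj, name):
    # class-side lookup along the mro (what the exported item would cache)
    attr = _MISSING
    for klass in type(obj).__mro__:
        if name in klass.__dict__:
            attr = klass.__dict__[name]
            break
    # a data descriptor (defines __set__) wins over the instance dict
    if attr is not _MISSING and hasattr(type(attr), '__set__'):
        return type(attr).__get__(attr, obj, type(obj))
    # then the instance dict (absent when __slots__ is defined)
    if name in getattr(obj, '__dict__', {}):
        return obj.__dict__[name]
    # finally a non-data descriptor's __get__, or the class attr itself
    if attr is not _MISSING:
        if hasattr(type(attr), '__get__'):
            return type(attr).__get__(attr, obj, type(obj))
        return attr
    raise AttributeError(name)

class Example:
    greeting = "hi"
    def shout(self):
        return self.greeting.upper()

e = Example()
e.extra = 42
print(generic_getattr(e, 'greeting'),
      generic_getattr(e, 'extra'),
      generic_getattr(e, 'shout')())   # → hi 42 HI
```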
> The result is:
>
> class X(object):
>   def method(self, arg):
>     self.x = arg # One direct-lookup+dict lookup instead of two dict lookups
>     self.other_method() # One direct-lookup+dict lookup instead of two dict lookups
> class Y(object):
>   __slots__ = ["x"]
>   def method(self, arg):
>     self.x = arg # One direct-lookup instead of one dict lookup
>     self.other_method() # One direct-lookup instead of one dict lookup
>
> A direct lookup is significantly cheaper than a dict lookup (as
> optimized as dict is, it still involves C callstack setups/unwinds,
> more conditionals, assignments, potential collisions and far more
> instructions).
> Therefore, with the combination of a psyco-like system I can eliminate
> one of two dict lookup costs, and with the combination of __slots__ as
> well, I can eliminate one of one dict lookup costs.
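There is no pure-Python way to show the C-level cost difference directly, but the shape of the comparison can be sketched: a full dict lookup on every access versus a reference resolved once and then dereferenced, the closest pure-Python analogue of holding an exported PyDictItem:

```python
import timeit

namespace = {'x': 5}

def via_dict():
    return namespace['x']   # hash + probe on every access

cell = [namespace['x']]     # stand-in for a held exported item; note this
                            # caches the *value*, while the real exported
                            # item tracks the entry, so writes stay visible

def via_direct():
    return cell[0]          # a single indexed dereference

assert via_dict() == via_direct()
# Interpreter overhead dominates in pure Python, so treat any timing as a
# rough shape of the comparison, not a measurement of the C patch.
t_dict = timeit.timeit(via_dict, number=100_000)
t_direct = timeit.timeit(via_direct, number=100_000)
```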
>
> > > > > Since a "static lookup" costs a dereference and a conditional, and a
> > > > > dynamic lookup entails at least 4 C function calls (including C stack
> > > > > setup/unwinds), a few C assignments and C conditionals, I believe it
> > > > > is likely that this will pay off as a serious improvement in Python's
> > > > > performance, when combined with a psyco-like system (not an
> > > > > architecture-dependent one).
> > > >
> > > > It's really only useful if you are accessing fixed attributes of a fixed
> > > > object many times.  The only case I can think of where this kind of
> > > > thing would be useful (sufficient accesses to make a positive difference)
> > > > is in the case of module globals, but in that case, we can merely change
> > > > how module globals are implemented (more or less like self.__dict__ = ...
> > > > in the module's __init__ method).
> > >
> > > That's not true.
> > >
> > > As I explained, getattr accesses the type's mro dicts as well. So
> > > even if you are accessing a lot of different instances, and those have
> > > a shared (fixed) type, you can speed up the type-side dict lookup
> > > (even if you still pay for a whole instance-side lookup).  Also,
> >
> > That's MRO caching, which you have already stated is orthogonal to this
> > particular proposal.
> I may have made a mistake before - it's not completely orthogonal. An
> MRO cache dict that can export items for direct access in psyco'd
> code is a clean and simple solution, while without an MRO cache,
> finding which class-side dicts to take exported items from
> may be a difficult problem, possibly requiring a cache of its own.
>
> > > "fixed-object" access can occur when you have a small number of
> > > objects whose attributes are looked up many times. In such a case, a
> > > psyco-like system can create a specialized code object specifically
> > > for _instances_ (not just for types), each code object using "static
> > > lookups" on the instance's dict as well, and not just on the class's
> > > dict.
> >
> > If you re-read my last posting, which you quoted above and I re-quote,
> > you can easily replace 'fixed attributes of a fixed object' with 'fixed
> > attributes of a small set of fixed objects' and get what you say.  Aside
> > from module globals, when is this seen?
> It's seen when many calls are made on singletons.
> It's seen when an inner loop is not memory-intensive but is
> computationally intensive - which would translate to having an
> instance calling methods on itself or other instances.
> You may only have 100 instances relevant in your inner loop, which is
> running many millions of times. In such a case, you will benefit
> greatly if, for every code object in the methods of every instance, you
> create specialized code for every one of the types it is called with
> (say, 3 types per code object). You pay perhaps a factor of 3 in
> space for code objects (which I believe are not a significant portion
> of memory consumption in Python), but your performance will involve
> _no_ dict access at all for attribute lookups; instead it will just
> compare instance pointers and then use direct access.
>
> > > > You may want to change the name.  "Literal" implies a constant, like 1
> > > > or "hello", as in 'x = "hello"'.  LOAD_GLOBAL_FAST would seem to make
> > > > more sense to me, considering that is what it intends to do.
> > >
> > > Well, LOAD_GLOBAL_FAST can only be used when the string that's being
> > > looked up is known at the code-object creation time, which means that
> > > the attribute name was indeed literal.
> >
> > A literal is a value.  A name/identifier is a reference.
> >
> > In:
> >     a = "hello"
> > ... "hello" is a literal.
> >
> > In:
> >     hello = 1
> > ... hello is a name/identifier.
> >
> > In:
> >     b.hello = 1
> > ... hello is a named attribute of an object named/identified by b.
> Then I agree, the use of the word literal here is inappropriate;
> constant/static may be more appropriate.
>
> Eyal
>


