Building CPython

Fri May 15 21:55:49 EDT 2015

On Sat, 16 May 2015 09:27 am, Mark Lawrence wrote:

> On 15/05/2015 23:44, Marko Rauhamaa wrote:
>> BartC <bc at freeuk.com>:
>>
>>> What /is/ a method lookup? Is it when you have this:
>>>
>>>   A.B()
>>>
>>> and need to find whether the expression A (or its class or type) has a
>>> name B associated with it? (And it then needs to check whether B is
>>> something that can be called.)
>>>
>>> If so, does that have to be done using Python's Dict mechanism? (Ie.
>>> searching for a key 'B' by name and seeing if the object associated
>>> with it is a method. That does not sound efficient.)

It's not as inefficient as you may think. Dicts are hash tables, and hash
tables are a standard computer science data structure for performing very
fast searches at constant (or near constant) time.

The dict is basically an array of pointers to (key, value). To look a name
up in the dict, you hash the string which gives you an index into the
array, then look at that position. If it is blank, you know there is no
match. If it points to a string, you compare that string to your string. If
they are equal, then you have a match. If they aren't equal, you have a
collision, and you have to look elsewhere (details differ) but typically
you don't end up looking in more than one or two positions. So all pretty
fast, and close enough to constant time.

To speed things up even further, I think that the hash value is cached with
the string, so it only needs to be calculated the first time.

>> That is a general feature among high-level programming languages. In
>> Python, it is even more complicated:
>>
>>   * first the object's dict is looked up for the method name
>>
>>   * if the method is not found (it usually isn't), the dict of the
>>     object's class is consulted
>>
>>   * if the method is found (it usually is), a function object is
>>     instantiated that delegates to the class's method and embeds a "self"
>>     reference to the object to the call

It's the other way around. The function object already exists: you created
it by writing `def method(self, *args): ... ` inside the class body. def
always makes a function. It's the *method* object which is created on the
fly, delegating to the function.

>> IOW, two dict lookups plus an object construction for each method call.
>>
>>
>> Marko
>>
> 
> As a picture paints a thousand words is anybody aware of a site or sites
> that show this diagramatically, as I think I and possibly others would
> find it far easier to grasp.

No I'm not aware of any such site, but I can try to make it more obvious
with an example.

Suppose we have a hierarchy of classes, starting from the root of the
hierarchy (object) to a specific instance:

class Animal(object): 
    pass

class Mammal(Animal):
    pass

class Dog(Mammal):
    def bark(self): ...

laddy = Dog()

We then look up a method:

laddy.bark()

In a non-dynamic language like Java, the compiler knows exactly where bark
is defined ("in the Dog class") and can call it directly. In dynamic
languages like Python, the compiler can't be sure that bark hasn't been
shadowed or overridden at runtime, so it has to search for the first match
found. Simplified:

* Does laddy.__dict__ contain the key "bark"? If so, we have a match.

* For each class in the MRO (Method Resolution Order), namely 
  [Dog, Mammal, Animal, object], does the class __dict__ contain the
  key "bark"? If so, we have a match.

* Do any of those classes in the MRO have a __getattr__ method? If
  so, then try calling __getattr__("bark") and see if it returns 
  a match.

* Give up and raise AttributeError.

[Aside: some of the things I haven't covered: __slots__, __getattribute__,
how metaclasses can affect this.]

In the case of laddy.bark, the matching attribute is found as
Dog.__dict__['bark']:

    temp = Dog.__bark__['bark']  # laddy.bark, is a function

At this point, the descriptor protocol applies. You can ignore this part if
you like, and just pretend that laddy.bark returns a method instead of a
function, but if you want to know what actually happens in all it's gory
details, it is something like this (again, simplified):

* Does the attribute have a __get__ method? If not, then we just 
  use the object as-is, with no changes.

* But if it does have a __get__ method, then it is a descriptor and
  we call the __get__ method to get the object we actually use.

Since functions are descriptors, we get this:

    temp = temp.__get__(laddy, Dog)  # returns a method object

and finally we can call the method:

    temp()  # laddy.bark()

None of these individual operations are particularly expensive, nor are
there a lot of them. For a typical instance, the MRO usually contains only
two or three classes, and __dict__ lookups are fast. Nevertheless, even
though each method lookup is individually *almost* as fast as the sort of
pre-compiled all-but-instantaneous access which Java can do, it all adds
up. So in Java, a long chain of dots:

    foo.bar.baz.foobar.spam.eggs.cheese

can be resolved at compile-time, and takes no more time than

    foo.cheese

but in Python's case it has to be resolved at run-time, so if you care about
speed, you should try to avoid long chains of dots in performance critical
loops. E.g. instead of:

    for x in sequence:
        foo.bar.baz.foobar.spam.eggs.cheese(x)

you can write:

    cheese = foo.bar.baz.foobar.spam.eggs.cheese
    for x in sequence:
        cheese(x)

-- 
Steven