[pypy-dev] Object identity and dict strategies

Bengt Richter bokr at oz.net
Sun Jul 10 15:48:08 CEST 2011


On 07/09/2011 10:17 AM Armin Rigo wrote:
> Hi,
>
> On Sat, Jul 9, 2011 at 5:20 AM, William ML Leslie
> <william.leslie.ttg at gmail.com>  wrote:
>> My point about small integers (...)
>
> I think that your point about small integers is broken (even assuming
> that smallints are enabled by default, which is not the case).  It
> means that we'd get an id() function which "behaves" as long as taking
> the id() of a 31-bit signed integer, and then doesn't behave as
> expected neither for full 32-bit integers, nor for longs, nor for
> floats.
>
>> This happens to work for larger integers and floats, because the id
>> will be preserved as long as a reference to them exists by their
>> boxedness.
>> Floats could use a similar mechanism to integers, eg, their bit
>> representation shifted left two and bitwise-ored with 2.
>
> I don't understand these two sentences, because they seem to say the
> exact opposite of each other...
>
>>   That does mean that id() is no longer word-sized, but it does not make it
>> unbounded.
>
> The "unbounded" part in my e-mail was about longs.  Obviously if you
> are computing id(x) where x is in some finite set (like ints or
> floats), then the results are in some finite set too.
>
>> What is keeping us from using the underlying rpython string as the
>> basis for id?
>
> This is probably a good enough workaround for strings and unicodes.
>
>> Speaking of, maybe it'd be easier to attempt to get the identity dict
>> into the language proper.
>
> We tried at some point, but python-dev refused the idea.  Maybe the
> idea has more chances for approval now that we can really show with
> performance numbers that it's a deep issue, as opposed to just wave
> our hands.  Feel free to try again.  In the meantime I've decided, at
> least myself, to stick with the approach that Python is whatever is in
> 2.7, and that we have to work around such issues instead of fixing
> them properly in the language.
>
Does that mean you have to follow the vagaries of the 2.7 compiler optimizations
(or lack of them, as the case may be) that make for results like these?:
(note fresh 2.7.2 build below ;-)
[& BTW kudos to those who made it easy with the tarball and config, make]

[10:30 ~]$ python
Python 2.7.2 (default, Jul  8 2011, 23:38:53)
[GCC 4.1.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
   >>> from ut.miscutil import disev,disex
   >>> id(2000) == id(2000)
True
   >>> id(2000) == id(20*100)
False
   >>> disev('id(2000) == id(2000)')
     1           0 LOAD_NAME                0 (id)
                 3 LOAD_CONST               0 (2000)
                 6 CALL_FUNCTION            1
                 9 LOAD_NAME                0 (id)
                12 LOAD_CONST               0 (2000)
                15 CALL_FUNCTION            1
                18 COMPARE_OP               2 (==)
                21 RETURN_VALUE
   >>> disev('id(2000) == id(20*100)')
     1           0 LOAD_NAME                0 (id)
                 3 LOAD_CONST               0 (2000)
                 6 CALL_FUNCTION            1
                 9 LOAD_NAME                0 (id)
                12 LOAD_CONST               3 (2000)
                15 CALL_FUNCTION            1
                18 COMPARE_OP               2 (==)
                21 RETURN_VALUE
   >>>

Notice that 20*100 was folded to 2000, but a different constant was generated,
hence, presumably, the different id. Will the next update of 2.7 do the same?

BTW, disev above is just from a grabbag module of mine, amounting to

def disev(s):
       import dis
       return dis.dis(compile(s,'disev','eval'))

It doesn't have to be about integers:

   >>> id([]) == id([])
True
   >>> id(list()) == id(list())
False
   >>> disev('id([]) == id([])')
     1           0 LOAD_NAME                0 (id)
                 3 BUILD_LIST               0
                 6 CALL_FUNCTION            1
                 9 LOAD_NAME                0 (id)
                12 BUILD_LIST               0
                15 CALL_FUNCTION            1
                18 COMPARE_OP               2 (==)
                21 RETURN_VALUE
   >>> disev('id(list()) == id(list())')
     1           0 LOAD_NAME                0 (id)
                 3 LOAD_NAME                1 (list)
                 6 CALL_FUNCTION            0
                 9 CALL_FUNCTION            1
                12 LOAD_NAME                0 (id)
                15 LOAD_NAME                1 (list)
                18 CALL_FUNCTION            0
                21 CALL_FUNCTION            1
                24 COMPARE_OP               2 (==)
                27 RETURN_VALUE
   >>>

Of course, the is operator call will keep both references alive
on the stack, so whereas you get

  >>> id([]) == id([])
True
  >>> id([]), id([])
(3082823052L, 3082823052L)

Keeping the arguments alive simultaneously gives

  >>> (lambda x,y:(id(x),id(y)))([],[])
(3082823052L, 3082881516L)
  >>> (lambda x,y:(id(x)==id(y)))([],[])
False
  >>>

... assuming the lambda above approximates what `is' does.

Using id comparisons instead of `is' can look superficially a bit weird:

  >>> id([1]) == id([2])
True
  >>>

Bottom line: id does not seem to have a complete abstract definition[1][2],
so how may one define a 'correct' implementation?

ISTM id now (CPython 2.7.2) is more like a version-dependent debugger function
that can be useful in app hacks, but with caveats ;-)

Regards,
Bengt Richter
[1] http://docs.python.org/reference/datamodel.html#index-824
[2] http://docs.python.org/library/functions.html#id

PS. How does the following cachingid.py compare
to your idea of what id should ideally do? (not tested except example)
(Obviously it is just a concept implementation, not resource efficient)
________________________________________

refval = id # for using old id
def id(x, objcache = {}):
     if type(x) in (int,float,bool,tuple,str,unicode): # etc
         t = (type(x), x)
         return refval(objcache.setdefault(t, t)[1])
     else:
         return refval(x) # as now? XXX what is abstract meaning of id(some_mutable)?
________________________________________

It tries to return the ordinary id of the first instance of equivalent immutable objects
passed to it, and otherwise uses the old id, which I'm not finished thinking about ;-)

At least it fixes the difference between id(2000) and id(20*100):

 >>> oldid=id
 >>> from ut.cachingid import id # ut is just my utility collection
 >>> oldid(2000)==oldid(2000)
True
 >>> oldid(2000)==oldid(20*100)
False
 >>> id(2000)==id(2000)
True
 >>> id(2000)==id(20*100)
True
 >>> id.func_defaults
({(<type 'int'>, 2000): (<type 'int'>, 2000)},)
 >>>

>
> A bientôt,
>
> Armin.




More information about the pypy-dev mailing list