[Python-Dev] Alternative implementation of string interning

Tim Peters tim.one@comcast.net
Tue, 02 Jul 2002 15:31:03 -0400


[Tim]
> This may be a problem.  Code can now rely on
> id(some_interned_string) staying the same for the life of a run.

[Oren Tirosh]
> This requires code that stores the id of an object without keeping a
> reference to the actual object.  It also requires that no other piece of
> Python or C code keeps a reference to that object, and yet that its
> identity is somehow still significant.  I find that extremely hard
> to imagine.

I would have guessed you had a more vivid imagination <wink>.  It's
precisely because the id has been guaranteed that a program may not care to
save a reference to an interned string.  For example,

"""
_ids = map(id, map(intern, "if then elif else".split()))
TOKEN_IF, TOKEN_THEN, TOKEN_ELIF, TOKEN_ELSE, TOKEN_NAME = range(5)
id2token = dict(zip(_ids, range(4)))
del _ids

def tokenvector(s):
    return [id2token.get(id(intern(word)), TOKEN_NAME)
            for word in s.split()]

print tokenvector("if this is the example, then what's the question?")
"""

This works reliably today to classify tokens.  I'm not certain I'd care if
it broke, but we have to consider that it hasn't been difficult to write
code that would break.
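
For contrast, here is a minimal sketch of a variant that would not depend on
the id guarantee, because it keeps the interned keyword strings alive
itself.  The name _keywords and the fallback import of intern from sys are
assumptions added for illustration; they are not part of the code above.

```python
try:
    intern                      # builtin in the Python discussed in this thread
except NameError:
    from sys import intern      # assumption: fallback for Pythons where it lives in sys

TOKEN_IF, TOKEN_THEN, TOKEN_ELIF, TOKEN_ELSE, TOKEN_NAME = range(5)

# Hold references to the interned keyword strings for the life of the module.
# While they are alive, no other object can be given one of their ids, even
# if interned strings were to become collectable.
_keywords = [intern(w) for w in "if then elif else".split()]
id2token = dict((id(w), tok) for tok, w in enumerate(_keywords))

def tokenvector(s):
    # Any word that is not one of the four keywords maps to TOKEN_NAME.
    return [id2token.get(id(intern(word)), TOKEN_NAME)
            for word in s.split()]

print(tokenvector("if this is the example, then what's the question?"))
# prints [0, 4, 4, 4, 4, 1, 4, 4, 4]
```

The only change that matters is that _keywords is never deleted, so the
lookup by id stays valid whatever happens to other interned strings.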

>> This was (or at least Guido thought it was <wink>) an important
>> optimization at the time.

> I see.  As far as I can tell, it isn't any more.

Which extension modules have you investigated?  The claim is too vague to
carry weight.  Zope's C code uses the interned-string C API directly, so it
doesn't matter to Zope code.  That's all I've looked at.  Making a case that
the optimization is no longer important requires investigating code.

> Now for something a bit more radical:
>
> Why not make interned strings a type?  <type 'istr'> could be an
> un-subclassable subclass of string.  intern would just be an
> alias for this type.  No two istr instances are equal unless they are
> identical.  I guess PyString_CheckExact would need to be changed to
> accept either String or InternedString.

What would the point be?  That is, instead of "why not?", why?  As to "why
not?", there's something about elevating what's basically an optimization
hack to a type that makes me squirm.
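
To make the quoted idea concrete, here is a rough pure-Python sketch of what
an istr type might look like.  Everything in it (the _table dict, the
__new__ logic) is an assumption for illustration and not anything proposed
in this thread; it does not prevent subclassing, and its table keeps every
instance alive forever.

```python
class istr(str):
    """Hypothetical interned-string type: one instance per distinct value."""

    _table = {}   # maps plain string value -> its unique istr instance

    def __new__(cls, value=""):
        key = str(value)
        try:
            # Reuse the existing instance, so equal istrs are identical.
            return cls._table[key]
        except KeyError:
            obj = str.__new__(cls, key)
            cls._table[key] = obj
            return obj

a = istr("spam")
b = istr("sp" + "am")
print(a is b)             # prints True
print(a == "spam")        # prints True: still compares equal to a plain string
print(isinstance(a, str)) # prints True
```

Because every istr with a given value is the same object, "equal implies
identical" holds among istr instances by construction.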