is implemented with id ?

Sat Nov 3 18:50:28 EDT 2012

On Sun, Nov 4, 2012 at 9:18 AM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> On Sat, 03 Nov 2012 22:49:07 +0100, Hans Mulder wrote:
> Actually, for many applications, the space "savings" may actually be
> *costs*, since interning forces Python to hold onto strings even after
> they would normally be garbage collected. CPython interns strings that
> look like identifiers. It really wouldn't be a good idea for it to
> automatically intern every string.

I don't know about that.

/* This dictionary holds all interned unicode strings.  Note that references
   to strings in this dictionary are *not* counted in the string's ob_refcnt.
   When the interned string reaches a refcnt of 0 the string deallocation
   function will delete the reference from this dictionary.

   Another way to look at this is that to say that the actual reference
   count of a string is:  s->ob_refcnt + (s->state ? 2 : 0)
*/
static PyObject *interned;

Empirical testing (on a Linux build of Python 3.3a0 that I had lying
around) showed the process's memory usage drop, but I closed the
terminal before copying and pasting (oops). Attempting to recreate in
IDLE on Python 3.2 on Windows.

>>> a="$"*1024*1024*256    # Make $$$....$$$ fast!
>>> import sys
>>> sys.getsizeof(a)    # Clearly this is a narrow build
536870942
>>> a="$"*1024*1024*256
--> MemoryError. Blah. This is what I get for only having a gig and a
half in this laptop. And I was working with 1024*1024*1024 on the
other box. Start over...

>>> import sys
>>> a="$"*1024*1024*128
>>> b="$"*1024*1024*128
>>> a is b
False
>>> a=sys.intern(a)
>>> b=sys.intern(b)
>>> c="$"*1024*1024*128
>>> c=sys.intern(c)

Memory usage (according to Task Mangler) goes up to ~512MB when I
create a new string (like c), then back down to ~256MB when I intern
it. So far so good.

>>> del a,b,c

Memory usage has dropped to 12MB. Unnecessarily-interned strings don't
cost anything. (The source does refer to immortal interned strings,
but AFAIK you can't create them in user-level code. At least, I didn't
find it in help(sys.intern) which is the obvious place to look.)
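
For anyone who wants to repeat the experiment without watching the task
manager, here is a rough sketch that uses tracemalloc in place of
process-level numbers. The helper names (make_dollars, traced_mib) and
the 4 MiB size are my own assumptions, and the figures only hold for a
default CPython build in which strings passed to sys.intern stay mortal.

```python
import sys
import tracemalloc

N = 4 * 1024 * 1024  # 4 MiB of "$"; an arbitrary size chosen for this sketch


def make_dollars(n):
    # Built at run time so the compiler cannot fold it into a shared constant.
    return "$" * n


def traced_mib():
    # Memory currently traced by tracemalloc, in MiB.
    return tracemalloc.get_traced_memory()[0] / (1024 * 1024)


tracemalloc.start()
base = traced_mib()

a = make_dollars(N)
b = make_dollars(N)
two_copies = traced_mib() - base   # two separate string objects are alive

a = sys.intern(a)                  # a becomes the canonical copy
b = sys.intern(b)                  # b now refers to a's object; the old b is freed
one_copy = traced_mib() - base

del a, b                           # last references gone, interned entry goes too
after_del = traced_mib() - base
tracemalloc.stop()

print("two copies: %.1f MiB" % two_copies)
print("interned:   %.1f MiB" % one_copy)
print("after del:  %.1f MiB" % after_del)
```

On the builds described above this should show roughly 8, then 4, then
close to 0 MiB, which matches the shape of the Task Mangler numbers.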

> You can make your own intern system with a simple dict:
>
> interned_strings = {}
>
> Then, for every string you care about, do:
>
> s = interned_strings.setdefault(s, s)
>
> to ensure you are always working with a single string object for each
> unique value. In some applications that will save time at the expense of
> space.

Doing it manually like this _will_ hold onto strings after they would
normally be garbage collected, though, unless you periodically check
sys.getrefcount and dispose of unreferenced entries.
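
A small sketch of what I mean, using dict.setdefault for the table. The
names intern_manual and make_dollars are made up for illustration. It
does not attempt the getrefcount-based pruning, since the right
threshold depends on the interpreter; it just shows that the entry
survives after every outside reference is gone.

```python
interned_strings = {}  # hand-rolled intern table (illustrative)


def intern_manual(s):
    # Return the canonical object for s, registering s the first time it is seen.
    return interned_strings.setdefault(s, s)


def make_dollars(n):
    # Built at run time so each call produces its own string object.
    return "$" * n


a = intern_manual(make_dollars(1000))
b = intern_manual(make_dollars(1000))
same_object = a is b               # True: both are the first string registered

del a, b
entries_after_del = len(interned_strings)   # 1: the dict still owns the string

print(same_object, entries_after_del)

# Nothing is released until the table itself lets go of the entry.
interned_strings.clear()
```

The dict holds ordinary counted references, which is exactly what the
comment in the CPython source says the built-in interned dictionary
does not do.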

> And there is no need to write "is" instead of "==", because string
> equality already optimizes the "strings are identical" case. By using ==,
> you don't get into bad habits, you defend against the odd un-interned
> string sneaking in, and you still have high speed equality tests.

This one I haven't checked the source for, but ISTR discussions on
this list about comparison of two unequal interned strings not being
optimized, so they'll end up being compared char-for-char. Using 'is'
guarantees that the check stops with identity. This may or may not be
significant, and as you say, defending against an uninterned string
slipping through is potentially critical.
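
To make the "uninterned string slipping through" case concrete, here is
a minimal sketch (make_dollars is a made-up helper; whether the identity
check on the stray string comes out False is a CPython implementation
detail, so I print it rather than promise a result):

```python
import sys


def make_dollars(n):
    # Built at run time so each call produces its own string object.
    return "$" * n


canonical = sys.intern(make_dollars(1000))
stray = make_dollars(1000)         # equal value, but never passed through sys.intern

print(stray == canonical)          # equality compares values, so this is True
print(stray is canonical)          # identity only holds if both are the same object
print(sys.intern(stray) is canonical)  # interning maps the stray onto the canonical copy
```

So 'is' is only safe when every string reaching the comparison is known
to have gone through sys.intern; '==' gives the right answer either way.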

ChrisA