interning strings

Sun Nov 7 08:08:55 EST 2004

The interning of strings has me puzzled.  It seems to happen sometimes, 
but not others. I can't discern the pattern and I can't seem to find 
documentation regarding it.

I can find documentation of a builtin called 'intern' but its use seems 
frowned upon these days.

For example, using py2.3.3, I find that string interning does seem to 
happen sometimes ...

 >>> s1 = "the"
 >>> s2 = "the"
 >>> s1 is s2
True

And it even happens in this case ...

 >>> s = "aa"
 >>> s1 = s[:1]
 >>> s2 = s[-1:]
 >>> s1, s2
('a', 'a')
 >>> s1 is s2
True

But not in what appears to be an almost identical case ...

 >>> s = "the the"
 >>> s1 = s[:3]
 >>> s2 = s[-3:]
 >>> s1, s2
('the', 'the')
 >>> s1 is s2
False

BUT, oddly, it does seem to happen here ...

 >>> class X:
... 	pass
...
 >>> x = X()
 >>> y = "the"
 >>> x.the = 42
 >>> x.__dict__
{'the': 42}
 >>> y is x.__dict__.keys()[0]
True

Are there any language rules regarding when strings are interned and 
when they are not?  Should I be ignoring the apparent poor status of 
'intern' and using it anyway? At worst, are there any CPython 'accidents 
of implementation' that I can take advantage of?
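
To make the "use 'intern' anyway" option concrete, here is a minimal 
sketch applied to the slicing case above. The try/except shim is an 
assumption on my part, so that the snippet also runs on later Pythons 
where the builtin lives in the sys module; on py2.3.3 the builtin is 
used directly.

```python
try:
    intern                      # builtin in Python 2
except NameError:
    from sys import intern      # assumption: later versions keep it in sys

s = "the the"
s1 = intern(s[:3])              # intern() returns the canonical "the" object
s2 = intern(s[-3:])             # same canonical object comes back here

print(s1 == s2)                 # equal by value, as before
print(s1 is s2)                 # and now also the very same object
```

If I read the 'intern' documentation correctly, both calls must return 
the same object regardless of whether the slices were interned 
automatically.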

Why do I need this?  Well, I have to read in a very large XML document
and convert it into objects, and within the document many attributes
have common string values. To reduce the memory footprint, I'd like to 
intern these commonly referenced strings, and I'm wondering how much work 
I need to do and how much will happen automatically.

Any insights appreciated.

BTW, I'm aware that I can do string interning myself using a dict cache 
(which is what ElementTree does internally).  But this whole subject 
has got me curious now, and I'd like to understand it a bit better. Would,
for example, using the builtin 'intern' give a better result than my 
hand-coded interning?

--
Mike