Storing a big amount of path names

Steven D'Aprano steve at pearwood.info
Fri Feb 12 00:51:21 EST 2016


On Fri, 12 Feb 2016 04:02 pm, Chris Angelico wrote:

> On Fri, Feb 12, 2016 at 3:45 PM, Paulo da Silva
> <p_s_d_a_s_i_l_v_a_ns at netcabo.pt> wrote:
>>> Correct. Two equal strings, passed to sys.intern(), will come back as
>>> identical strings, which means they use the same memory. You can have
>>> a million references to the same string and it takes up no additional
>>> memory.
>> I have been playing with this and found that it is not always true!

It is true, but only for the lifetime of the string. Once the interned
string is garbage collected, it is removed from the cache as well. If you
then intern an equal string again, you may not get the same id.

py> import sys
py> mystr = "hello world"
py> str2 = sys.intern(mystr)
py> str3 = "hello world"
py> mystr is str2  # same string object, as str2 is interned
True
py> mystr is str3  # not the same string object
False


But if we delete all references to the string objects, the string's entry
is removed from the intern cache too, and interning an equal string later
may give us an object with a different id:

py> del str2, str3
py> id(mystr)  # remember this ID number
3079482600
py> del mystr
py> id(sys.intern("hello world"))  # a new entry in the cache
3079227624


This is the behaviour you want: if a string is completely deleted, you don't
want it remaining in the intern cache taking up memory.
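
The flip side is that while you do hold a reference, interning an equal
string hands back that same object. Here is a minimal sketch, assuming
CPython; make_path() is a hypothetical helper that only exists to build the
strings at runtime, so the compiler cannot share a constant between them:

```python
import sys

def make_path(parts):
    # Hypothetical helper: join at runtime so each call builds a new str object.
    return "/".join(parts)

parts = ["", "home", "user", "data.txt"]

# Keep a reference to the interned string for as long as we need it.
a = sys.intern(make_path(parts))

# An equal string built separately is (in CPython) a different object...
b = make_path(parts)
equal_before = (a == b)

# ...but interning it returns the object we are still holding in `a`.
b = sys.intern(b)
same_after = a is b

print(equal_before, same_after)
```

As long as `a` stays alive, every later sys.intern() of an equal string
should return that one object, so the text is stored only once.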

> I'm not 100% sure of what's going on here, but my suspicion is that a
> string that isn't being used is allowed to be flushed from the
> dictionary. If you retain a reference to the string (not to its id,
> but to the string itself), you shouldn't see that change. By doing the
> dict yourself, you guarantee that ALL the strings will be retained,
> which can never be _less_ memory than interning them all, and can
> easily be _more_.


Yep. Back in the early days, interned strings were immortal: once interned,
they were never freed. That wasted memory, and is no longer the case.




-- 
Steven



