Storing a big amount of path names

Chris Angelico rosuav at gmail.com
Fri Feb 12 00:02:43 EST 2016


On Fri, Feb 12, 2016 at 3:45 PM, Paulo da Silva
<p_s_d_a_s_i_l_v_a_ns at netcabo.pt> wrote:
>> Correct. Two equal strings, passed to sys.intern(), will come back as
>> identical strings, which means they use the same memory. You can have
>> a million references to the same string and it takes up no additional
>> memory.
> I have been playing with this and found that it is not always true!
> For example:
>
> In [1]: def f(s):
>    ...:     print(id(sys.intern(s)))
>    ...:
>
> In [2]: import sys
>
> In [3]: f("12345")
> 139805480756480
>
> In [4]: f("12345")
> 139805480755640
>
> In [5]: f("12345")
> 139805480756480
>
> In [6]: f("12345")
> 139805480756480
>
> In [7]: f("12345")
> 139805480750864
>
> I think a dict, as MRAB suggested, is needed.
> At the end of the store process I may delete the dict.

I'm not 100% sure of what's going on here, but my suspicion is that an
interned string that nothing else references is allowed to be flushed
from the interning dictionary. If you retain a reference to the string
(not to its id, but to the string itself), you shouldn't see that
change. By maintaining the dict yourself, you guarantee that ALL the
strings will be retained, which can never use _less_ memory than
interning them all, and can easily use _more_.
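
A minimal sketch of that suspicion, assuming CPython 3. The helper
build_path() and the path it returns are hypothetical, invented for
illustration; the string is assembled at run time so that each call
yields a fresh object rather than a shared compile-time constant.

```python
import sys

def build_path():
    # Hypothetical helper: join at run time so every call creates
    # a brand-new string object with equal contents.
    return "/".join(["", "home", "user", "data"])

# Two un-interned results are equal but are distinct objects.
fresh_a = build_path()
fresh_b = build_path()
distinct_objects = fresh_a is not fresh_b

# While `kept` holds a reference, the interned entry stays alive,
# so interning another equal string returns the very same object.
kept = sys.intern(build_path())
again = sys.intern(build_path())
same_object = kept is again
```

The ids printed in the quoted session can differ because nothing there
holds on to the interned string between calls.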

>> But I reiterate: Don't even bother with this unless you know your
>> program is running short of memory.
>
> Yes, it is.
> This is part of a previous post (sets of equal files) and I need lots of
> memory for performance reasons. I only have 2G in this computer.

How many files, roughly? Do you ever look at the contents of the
files? Most likely, the files' contents will dwarf their names.
Unless you actually have over two million unique files, each one with
over a thousand characters in the name, file names alone cannot use
up all of that 2GB.
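
The arithmetic behind that estimate can be sketched as follows. This is
a rough model only: estimated_name_bytes() is a hypothetical helper, it
assumes every name has the same length, and it ignores the containers
that hold the strings. sys.getsizeof() reports the per-object size on
the running interpreter, which includes a fixed header on top of the
characters.

```python
import sys

def estimated_name_bytes(n_files, avg_name_len):
    # Size one representative string of the given length, then
    # scale by the number of files (a deliberate simplification).
    sample = "x" * avg_name_len
    return n_files * sys.getsizeof(sample)

# The extreme case from the text: two million names of 1000 characters.
extreme = estimated_name_bytes(2_000_000, 1000)

# A hypothetical, more ordinary case: 100,000 names of 100 characters.
ordinary = estimated_name_bytes(100_000, 100)

print("extreme:  %.2f GB" % (extreme / 1e9))
print("ordinary: %.2f MB" % (ordinary / 1e6))
```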

If virtual memory is active, all that'll happen is that you dip into
the swapper / page file a bit... and THAT is when you start looking at
reducing memory usage. Don't bother optimizing until you need to, and
even then, you measure first to see what part of the program actually
needs to be optimized.
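
One way to measure first is the standard library's tracemalloc module.
The sketch below is only an illustration: the list of names is a
made-up workload standing in for the real program, and the three-entry
cutoff is arbitrary.

```python
import tracemalloc

tracemalloc.start()

# Hypothetical workload standing in for the real name storage.
names = ["/some/dir/file%d" % i for i in range(100_000)]

# Bytes currently allocated and the peak since start().
current, peak = tracemalloc.get_traced_memory()

# Group live allocations by source line and keep the biggest few.
snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics("lineno")[:3]

tracemalloc.stop()

print("current: %d bytes, peak: %d bytes" % (current, peak))
for stat in top:
    print(stat)
```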

> I had already implemented a solution. I used two dicts. One to map
> dirnames to an int handler and the other to map the handler to dir
> names. At the end I deleted the first one because I only need to get
> the dirname from the handler. But I thought there should be a better
> choice.

If all your dir names are interned, their identities (approximately
the values returned by id(), but not quite) will be those handlers for
you, without any overhead and without any complexity.
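
A small sketch of that idea, assuming CPython 3. The paths are invented
for illustration. Each entry keeps a reference to the interned
directory string, so files in the same directory share one string
object and no separate handler table is needed.

```python
import os.path
import sys

# Hypothetical file paths; two of them share a directory.
paths = [
    "/data/photos/a.jpg",
    "/data/photos/b.jpg",
    "/data/docs/c.txt",
]

# Store (directory, file name) pairs with the directory interned.
# The list keeps each interned string alive, so equal directory
# names resolve to the same object.
entries = [
    (sys.intern(os.path.dirname(p)), os.path.basename(p))
    for p in paths
]

shared = entries[0][0] is entries[1][0]   # same directory object
unique_dirs = {id(d) for d, _ in entries}  # one id per distinct dir
```

Getting the directory name back from an entry is just entries[i][0];
the string itself plays the role of the handler.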

ChrisA


