Memory Usage of Strings

John Gordon gordon at panix.com
Wed Mar 16 13:51:45 EDT 2011


In <mailman.988.1300289897.1189.python-list at python.org> Amit Dev <amitdev at gmail.com> writes:

> I'm observing a strange memory usage pattern with strings. Consider
> the following session. Idea is to create a list which holds some
> strings so that cumulative characters in the list is 100MB.

> >>> l = []
> >>> for i in xrange(100000):
> ...  l.append(str(i) * (1000/len(str(i))))

> This uses around 100MB of memory as expected and 'del l' will clear that.

> >>> for i in xrange(20000):
> ...  l.append(str(i) * (5000/len(str(i))))

> This is using 165MB of memory. I really don't understand where the
> additional memory usage is coming from.

> If I reduce the string size, it remains high till it reaches around
> 1000. In that case it is back to 100MB usage.

I don't know anything about the internals of Python storage (overhead,
possible merging of like strings, etc.), but some simple character counting
shows that these two loops do not produce the same number of characters.

The first loop produces:

Ten single-digit values of i, each repeated 1000 times, giving strings of
1000 characters for a total of 10000 characters;

Ninety two-digit values of i, each repeated 500 times, giving strings of
1000 characters for a total of 90000 characters;

Nine hundred three-digit values of i, each repeated 333 times, giving
strings of 999 characters for a total of 899100 characters;

Nine thousand four-digit values of i, each repeated 250 times, giving
strings of 1000 characters for a total of 9000000 characters;

Ninety thousand five-digit values of i, each repeated 200 times, giving
strings of 1000 characters for a total of 90000000 characters.

All that adds up to a grand total of 99999100 characters.

Or, to condense the above long-winded text in table form:

range        count digits 1000/len(str(i)) chars/string  total chars
0-9             10      1             1000         1000        10000
10-99           90      2              500         1000        90000
100-999        900      3              333          999       899100
1000-9999     9000      4              250         1000      9000000
10000-99999  90000      5              200         1000     90000000
                                                            ========
                                      grand total chars     99999100

The second loop yields this table:

range        count digits 5000/len(str(i)) chars/string  total chars
0-9             10      1             5000         5000        50000
10-99           90      2             2500         5000       450000
100-999        900      3             1666         4998      4498200
1000-9999     9000      4             1250         5000     45000000
10000-19999  10000      5             1000         5000     50000000
                                                            ========
                                      grand total chars     99998200

The two loops do not produce exactly the same number of characters, but
they differ by only 900 out of roughly 100 million, so the character
counts alone do not account for the difference in storage.
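
The per-range tallies can also be checked mechanically rather than by hand.
A minimal sketch, in Python 3 syntax (so range and // stand in for the
xrange and / of the quoted session); the helper name count_chars is made up
for illustration. For each i it adds the length of the string the loop would
append, len(str(i)) * (width // len(str(i))), grouped by the number of
digits in i:

```python
# Sketch only: Python 3 syntax; count_chars is a hypothetical helper name.
from collections import defaultdict


def count_chars(n, width):
    """Tally characters of str(i) * (width // len(str(i))) for i in range(n).

    Returns (per_digits, grand_total), where per_digits maps the number of
    digits in i to the total characters contributed by values of that length.
    """
    per_digits = defaultdict(int)
    for i in range(n):
        s = str(i)
        repeat = width // len(s)               # integer division, as in the Python 2 session
        per_digits[len(s)] += len(s) * repeat  # length of the string that gets appended
    return dict(per_digits), sum(per_digits.values())


if __name__ == "__main__":
    # (n, width) pairs taken from the two quoted loops.
    for n, width in [(100000, 1000), (20000, 5000)]:
        per_digits, total = count_chars(n, width)
        for digits in sorted(per_digits):
            print(digits, per_digits[digits])
        print("grand total chars:", total)
```

Counting the length of each appended string directly avoids having to
multiply the count, the digit width and the repeat factor by hand.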

P.S.: Please forgive me if I've made some basic math error somewhere.
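
On the overhead side: character counts only give a floor on storage, since
every string object also carries a header. A rough probe, assuming CPython
(where sys.getsizeof reports the size of a single object) and again using a
made-up helper name, totals the object sizes each loop would create. It does
not capture allocator behaviour, so it cannot by itself explain a
process-level figure such as the 165MB quoted above:

```python
# Sketch only: assumes CPython; string_object_bytes is a hypothetical helper name.
import sys


def string_object_bytes(n, width):
    """Sum sys.getsizeof over the strings one loop would append.

    Strings are built one at a time and then discarded, so this measures
    object sizes only, not the memory footprint of the whole process.
    """
    total_chars = 0
    total_bytes = 0
    for i in range(n):
        s = str(i) * (width // len(str(i)))
        total_chars += len(s)
        total_bytes += sys.getsizeof(s)  # character data plus per-object header
    return total_chars, total_bytes


if __name__ == "__main__":
    # (n, width) pairs taken from the two quoted loops.
    for n, width in [(100000, 1000), (20000, 5000)]:
        chars, size = string_object_bytes(n, width)
        print(n, width, chars, size, size - chars)
```

The last column printed is the total header overhead, which can be compared
between the two loops.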

-- 
John Gordon                   A is for Amy, who fell down the stairs
gordon at panix.com              B is for Basil, assaulted by bears
                                -- Edward Gorey, "The Gashlycrumb Tinies"



