[Python-Dev] PEP 393 review

"Martin v. Löwis" martin at v.loewis.de
Mon Aug 29 22:32:01 CEST 2011


tl;dr: PEP 393 reduces the memory usage for strings of a very small
Django app from 7.4MB to 4.4MB, with all other objects accounting for
about 1.9MB.

On 26.08.2011 16:55, Guido van Rossum wrote:
> It would be nice if someone wrote a test to roughly verify these
> numbers, e.g. by allocating lots of strings of a certain size and
> measuring the process size before and after (being careful to adjust
> for the list or other data structure required to keep those objects
> alive).

I have now written a Django application to measure the effect of PEP
393, using the debug mode (to find all strings) and sys.getsizeof:

https://bitbucket.org/t0rsten/pep-393/src/ad02e1b4cad9/pep393utils/djmemprof/count/views.py
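
For reference, the same counting idea as a standalone sketch (not the
actual views.py above, and it misses any string not reachable from a
GC-tracked container; str objects themselves aren't tracked, so they
have to be collected from the referents of tracked objects):

    import gc
    import sys

    def find_strings():
        # str objects are not GC-tracked, so collect them from the
        # referents of the tracked containers instead.
        seen = set()
        strings = []
        for obj in gc.get_objects():
            for ref in gc.get_referents(obj):
                if type(ref) is str and id(ref) not in seen:
                    seen.add(id(ref))
                    strings.append(ref)
        return strings

    strings = find_strings()
    chars = sum(len(s) for s in strings)
    size = sum(sys.getsizeof(s) for s in strings)
    print("%d strings, %d chars, %d bytes" % (len(strings), chars, size))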

The results for 3.3 and pep-393 are attached.

The Django app is small in every respect: trivial ORM, very few
objects (just for the sake of exercising the ORM at all),
no templating, short strings. The memory snapshot is taken in
the middle of a request.

The tests were run on a 64-bit Linux system with 32-bit Py_UNICODE.

The tally of strings by length confirms that both tests have indeed
comparable sets of objects (not surprising, since they run identical
Django source code and the identical application). Most strings in this
benchmark are shorter than 16 characters, and a few are several
thousand characters long. The tally of byte lengths shows that it's the
really long memory blocks that are gone with the PEP.
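
A tally like the one in the attached files can be reproduced with
collections.Counter; a sketch, reusing the strings list from the
snippet above (the byte-length tally is the same with sys.getsizeof(s)
in place of len(s)):

    from collections import Counter

    by_length = Counter(len(s) for s in strings)
    for length in sorted(by_length):
        print("%6d chars: %6d strings" % (length, by_length[length]))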

Digging into the internal representation, it's possible to estimate
"unaccounted" bytes. For PEP 393:

   bytes - 80*strings - (chars+strings) = 190053

This is the total of the wchar_t and UTF-8 representations for objects
that have them, plus any two-byte and four-byte strings accounted for
incorrectly in the above formula. Unfortunately, for "default",

   bytes - 56*strings - 4*(chars+strings) = 0

as unicode__sizeof__ doesn't account for the (separate) PyBytes
object that may cache the default-encoded form of the string. So in
practice, the 3.3 number should be somewhat larger.
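
Spelled out in Python, using the totals from the sketch above (the
constants 80 and 56 are the per-object header sizes from the two
formulas, specific to this 64-bit build):

    # PEP 393: 80-byte header, one byte per character plus a NUL
    # terminator (for strings stored in the one-byte representation).
    unaccounted_pep393 = size - 80*len(strings) - (chars + len(strings))

    # default branch, 32-bit Py_UNICODE: 56-byte header, four bytes
    # per character including the NUL terminator.
    unaccounted_default = size - 56*len(strings) - 4*(chars + len(strings))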

In both cases, the app didn't account for internal fragmentation;
doing so would be possible by rounding up each string size to the next
multiple of 8 (given that it's all allocated through the object
allocator).
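
A sketch of that correction, assuming the allocator rounds every
request up to a multiple of 8 bytes:

    import sys

    def allocated_size(obj):
        # Round sys.getsizeof up to the next multiple of 8 to
        # approximate what the object allocator actually hands out.
        return (sys.getsizeof(obj) + 7) & ~7

    size_rounded = sum(allocated_size(s) for s in strings)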

It should be possible to squeeze a little bit out of the 190kB by
finding objects for which the wchar_t or UTF-8 representations are
created unnecessarily.

Regards,
Martin
Attachments:
3k.txt: <http://mail.python.org/pipermail/python-dev/attachments/20110829/6c37b94c/attachment.txt>
393.txt: <http://mail.python.org/pipermail/python-dev/attachments/20110829/6c37b94c/attachment-0001.txt>

