Is there any way to minimize str()/unicode() objects memory usage [Python 2.6.4] ?

dmtr dchichkov at gmail.com
Sat Aug 7 03:30:35 EDT 2010


On Aug 6, 11:50 pm, Peter Otten <__pete... at web.de> wrote:
> I don't know to what extent it still applies but switching off cyclic garbage
> collection with
>
> import gc
> gc.disable()


Haven't tried it on the real dataset. On the synthetic test it (and
sys.setcheckinterval(100000)) gave ~2% speedup and no change in memory
usage. Not significant. I'll try it on the real dataset though.
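
For reference, the synthetic test is nothing fancy: roughly the build loop
below, wrapped in gc.disable()/gc.enable() (a sketch, not the exact harness):

import gc

d = {}
gc.disable()                      # no cyclic collection passes while the dict grows
try:
    for i in xrange(0, 1000000):
        d[unicode(i)] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
finally:
    gc.enable()                   # back to normal once the structure is built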


> while building large datastructures used to speed up things significantly.
> That's what I would try first with your real data.
>
> Encoding your unicode strings as UTF-8 could save some memory.

Yes...  In fact that's what I'm trying now. Sprinkling .encode('utf-8')
around definitely creates some clutter in the code, but I guess I can
subclass dict... And it does save memory! A lot of it. Seems to be a
bit faster too....
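
Something along these lines should hide the encoding (just a sketch, and the
UTF8KeyDict name is made up; get()/update()/setdefault() would need the same
treatment too):

class UTF8KeyDict(dict):
    # Stores unicode keys as UTF-8 encoded byte strings to save memory.
    def __setitem__(self, key, value):
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        dict.__setitem__(self, key, value)

    def __getitem__(self, key):
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        return dict.__getitem__(self, key)

    def __contains__(self, key):
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        return dict.__contains__(self, key)

d = UTF8KeyDict()
d[u'12345'] = (1, 2, 3)           # key ends up stored as the str '12345'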

> When your integers fit into two bytes, say, you can use an array.array()
> instead of the tuple.

Excellent idea. Thanks!  And it seems to work too, at least for the
test code. Here are some benchmarks (x86 desktop):

Unicode key / tuple:
>>> for i in xrange(0, 1000000): d[unicode(i)] =  (i, i+1, i+2, i+3, i+4, i+5, i+6)
1000000 keys, ['VmPeak:\t  224704 kB', 'VmSize:\t  224704 kB'],
4.079240 seconds, 245143.698209 keys per second

UTF-8 key / array.array:
>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] =  array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
1000000 keys, ['VmPeak:\t  201440 kB', 'VmSize:\t  201440 kB'],
4.985136 seconds, 200596.331486 keys per second

UTF-8 key / tuple:
>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] =  (i, i+1, i+2, i+3, i+4, i+5, i+6)
1000000 keys, ['VmPeak:\t  125652 kB', 'VmSize:\t  125652 kB'],
3.572301 seconds, 279931.625282 keys per second

Almost halved the memory usage. And faster too. Nice.
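
(The VmPeak/VmSize strings are just the matching lines from /proc/self/status;
the helper that collects them is roughly this, Linux-only and not necessarily
the exact code used above:)

def memory_usage():
    # Return the VmPeak/VmSize lines for the current process.
    with open('/proc/self/status') as f:
        return [line.strip() for line in f
                if line.startswith(('VmPeak', 'VmSize'))]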

-- Dmitry


