Is there any way to minimize str()/unicode() objects' memory usage [Python 2.6.4]?

dmtr dchichkov at gmail.com
Fri Aug 6 21:39:27 EDT 2010


Steven, thank you for answering. See my comments inline. Perhaps I
should have formulated my question a bit differently: are there any
*compact*, high-performance containers for unicode()/str() objects in
Python? By *compact* I don't mean compressed; just optimized for
memory footprint first and speed second.

What I'm really looking for is a dict() that maps short unicode
strings to tuples of integers. But even just a *compact* list
container for unicode strings would help a lot (because I could add a
__dict__ on top and go from there).
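To make it concrete, here's roughly the kind of trick I've been
playing with -- a sketch, not a real solution (the class name is made
up, and iteration/keys() aren't handled): store the keys as UTF-8
encoded str instead of unicode, since in CPython 2.x a str object is
smaller than a unicode object of the same length:

# Sketch only: trade unicode keys for UTF-8 encoded str keys.
# Assumes keys are always unicode on the way in.
class Utf8KeyDict(dict):
    def __setitem__(self, key, value):
        dict.__setitem__(self, key.encode('utf-8'), value)
    def __getitem__(self, key):
        return dict.__getitem__(self, key.encode('utf-8'))
    def __contains__(self, key):
        return dict.__contains__(self, key.encode('utf-8'))

d = Utf8KeyDict()
d[u'dmtr'] = (1, 2, 3)
print d[u'dmtr']            # (1, 2, 3)

It shaves something off, but the str keys and the tuples still carry
full per-object overhead.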


> Yes, lots of ways. For example, do you *need* large lists? Often a better
> design is to use generators and iterators to lazily generate data when
> you need it, rather than creating a large list all at once.

Yes, I do need to process large data sets. And no, I can't use an
iterator or generate the data lazily: the lookups are random-access,
so the whole mapping has to stay resident in memory.
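(To spell out what Steven means, something like this -- fine for a
single pass over a file, but useless here; 'f' and the one-name-per-
line format are made up for illustration:)

def usernames(f):
    # yields one decoded name at a time instead of building a list
    for line in f:
        yield line.strip().decode('utf-8')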


> An optimization that sometimes may help is to intern strings, so that
> there's only a single copy of common strings rather than multiple copies
> of the same one.

Unfortunately the strings are almost all unique (think usernames on
Facebook or Wikipedia), so there is nothing to intern. And I can't
afford to push them out to a db/memcached/redis/etc. Too slow.
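(For reference, the interning Steven describes is a builtin in 2.x;
note it also only accepts str, not unicode -- intern() raises
TypeError on a unicode object:)

# two dynamically built, equal strings collapse to one shared object
s1 = intern('tag_' + str(42))
s2 = intern('tag_' + str(42))
assert s1 is s2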


> Can you compress the data and use that? Without knowing what you are
> trying to do, and why, it's really difficult to advise a better way to do
> it (other than vague suggestions like "use generators instead of lists").

Yes, I've tried. But I was unable to find a good, unobtrusive way to
do it. Every attempt either adds pesky boilerplate code, or is slow,
or something like that. See more at: http://bugs.python.org/issue9520
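(For example, block compression with zlib, which is the kind of thing
I tried -- it works, but the pack/unpack bookkeeping leaks into every
call site. 'names' is stand-in data, and the NUL separator assumes no
string contains U+0000:)

import zlib

names = [u'alice', u'bob']            # stand-in data
blob = zlib.compress('\x00'.join(s.encode('utf-8') for s in names))

# to read *anything* back, the whole block gets decompressed:
names_again = [s.decode('utf-8')
               for s in zlib.decompress(blob).split('\x00')]
assert names_again == names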


> Very often, it is cheaper and faster to just put more memory in the
> machine than to try optimizing memory use. Memory is cheap, your time and
> effort is not.

Well... I'd really prefer to spend, say, 16 bytes per 10-character
string and fit the data into 8 GB, rather than pay an extra $1k for
32 GB.

> > Well...  63 bytes per item for very short unicode strings... Is there
> > any way to do better than that? Perhaps some compact unicode objects?
>
> If you think that unicode objects are going to be *smaller* than byte
> strings, I think you're badly informed about the nature of unicode.

I don't think that unicode objects are going to be *smaller*! But
AFAIK CPython stores unicode internally as UCS-2 (or UCS-4 on wide
builds), so the payload of a 10-character string should be 20-40
bytes; 63 bytes per item still seems excessive. My question was: is
there any way to do better than that?
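(This is easy to measure with sys.getsizeof(), new in 2.6. Exact
numbers depend on platform and narrow/wide build, and they exclude
the extra pointer each container slot adds on top:)

import sys
print sys.getsizeof(u'0123456789')   # unicode object, 10 chars
print sys.getsizeof('0123456789')    # plain str, same payload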


> Python is not a low-level language, and it trades off memory compactness
> for ease of use. Python strings are high-level rich objects, not merely a
> contiguous series of bytes. If all else fails, you might have to use
> something like the array module, or even implement your own data type in
> C.

So, once more: are there any *compact*, high-performance containers
(with dict/list interfaces) in Python?
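Something along these lines is what I have in mind -- a rough,
append-only sketch (made-up name, no deletes, no bounds checking):
every string UTF-8 encoded into one shared buffer, plus an offset
array on the side, so per-string object overhead only appears when an
item is actually accessed:

from array import array

class PackedStrings(object):
    """Append-only list of unicode strings, stored packed."""
    def __init__(self):
        self._buf = bytearray()
        # item i lives at _buf[_offsets[i]:_offsets[i + 1]]
        self._offsets = array('L', [0])

    def append(self, u):
        self._buf += u.encode('utf-8')
        self._offsets.append(len(self._buf))

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, i):
        start, end = self._offsets[i], self._offsets[i + 1]
        # build a temporary unicode object only on access
        return str(self._buf[start:end]).decode('utf-8')

names = PackedStrings()
names.append(u'dmtr')
print names[0], len(names)   # dmtr 1

A dict interface could sit on top of something like this. Does a
properly done version already exist somewhere?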

-- Regards, Dmitry


