optimizing memory usage of large in memory structures

Bengt Richter bokr at oz.net
Thu Apr 10 22:24:27 EDT 2003


On 9 Apr 2003 09:17:40 -0700, theory79 at yahoo.com (Hugo Liu) wrote:

>hi,
>
>i'm hoping that someone out there has a solution to hack this problem.
>
>working with corpus analysis and large knowledge bases, i often have
>to load large arrays into memory.  each array element is just a text
>string.  although the text is only 5mb, python consumes 40+ mb.  i
>would suspect that the overhead is from storing type info for each
>element.  if the strings were all fixed length, i'd just store them
>ordered and fixed width into a single string and use bisect, but they
>aren't fixed width.
>
>does anyone know if there is a way to fake static typing for this
>situation or any other way to make this more efficient (without
>sacrificing operating system interoperability or having to write
>external C code?)  any suggestions or pointers would be much
>appreciated!
>
If your strings were put in an efficient-storage magic black box,
how would you want to access the strings? One sequential sweep
is a lot different from zillions of random accesses, and if
you do random access, what are the selection criteria?

You could write a class that holds your 5MB as a single string and uses the array
module to keep an array of 'l'-integers for start positions. That alone is enough
if the big string uses fixed-length delimiters; if not, you could add another array
of 'B'-integers for string lengths (assuming lengths of 0 to 255), or search for
the delimiter each time.

That storage should take little more than 5MB + 4*ns bytes (plus another ns if you
keep a separate length array), where ns is the number of strings.

But then you need access methods. E.g., you could define __getitem__ so that indexing
by n returns the nth string. That would give random or sequential read access to your
variable-length strings as if they were an ordinary list of ordinary Python strings.
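
Here is a minimal sketch combining those two ideas, assuming the strings are simply
concatenated with no delimiter, so a final sentinel offset stands in for a separate
length array; the class name PackedStrings is just made up for illustration:

from array import array

class PackedStrings:
    """Many variable-length strings concatenated into one big string,
    plus an array of 'l' start offsets (about 5MB + 4*ns bytes in all)."""

    def __init__(self, strings):
        self.data = ''.join(strings)      # all the text, end to end
        self.starts = array('l', [0])     # start offset of each string...
        for s in strings:
            self.starts.append(self.starts[-1] + len(s))   # ...plus a final sentinel

    def __len__(self):
        return len(self.starts) - 1       # the sentinel is not a string

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        # slice the i-th string back out of the big string
        return self.data[self.starts[i]:self.starts[i + 1]]

# behaves like a read-only list of strings, for random or sequential access:
ps = PackedStrings(['corpus', 'analysis', 'knowledge'])
assert len(ps) == 3 and ps[1] == 'analysis'
assert list(ps) == ['corpus', 'analysis', 'knowledge']    # iteration falls back on __getitem__

Since each start offset costs 4 bytes in an 'l' array, the total is the 5MB of text
plus roughly 4*ns bytes of index, which is where the estimate above comes from.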

If it is a linguistic corpus, chances are the same words repeat many times, so you could
represent the original 5MB as a sequence of integer word identifiers and use a dictionary
to avoid storing any word-string more than once.
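
A rough sketch of that interning step, assuming whitespace-separated tokens; the names
intern_corpus, wordlist and ids are made up for illustration, and an array of
'l'-integers keeps the ID sequence compact:

from array import array

def intern_corpus(text):
    """Turn the corpus into a compact sequence of integer word IDs,
    storing each distinct word-string only once."""
    word_to_id = {}        # temporary dict: word-string -> integer ID
    wordlist = []          # unique word-strings, in order of first appearance
    ids = array('l')       # the corpus as 4-byte integer IDs
    for token in text.split():
        i = word_to_id.get(token)
        if i is None:
            i = len(wordlist)
            word_to_id[token] = i
            wordlist.append(token)
        ids.append(i)
    return wordlist, ids

wordlist, ids = intern_corpus('the cat sat on the mat near the cat')
assert wordlist == ['the', 'cat', 'sat', 'on', 'mat', 'near']
assert [wordlist[i] for i in ids] == 'the cat sat on the mat near the cat'.split()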

And you could pack the unique word-strings end to end into a homogeneous array, together
with an offset array, and use indices into the offset array as lookup keys; the dictionary
you used for speed while building this compact lookup data could then be thrown away.
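
And a sketch of that final packing step, reusing the offset-array trick from above;
pack_vocabulary and word_from_id are made-up names, and the word ID alone is enough
to recover the string once the build-time dictionary is gone:

from array import array

def pack_vocabulary(wordlist):
    """Pack the unique word-strings end to end with an offset array,
    so that a word ID indexes straight into the packed text."""
    packed = ''.join(wordlist)
    offsets = array('l', [0])
    for w in wordlist:
        offsets.append(offsets[-1] + len(w))   # includes a sentinel end offset
    return packed, offsets

def word_from_id(packed, offsets, word_id):
    # the word ID is the index into the offset array
    return packed[offsets[word_id]:offsets[word_id + 1]]

packed, offsets = pack_vocabulary(['the', 'cat', 'sat', 'on', 'mat', 'near'])
assert word_from_id(packed, offsets, 1) == 'cat'
assert word_from_id(packed, offsets, 4) == 'mat'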

There may well be better ways to store and manipulate your data, depending on what you
really have to do with it.

Then again, if space is your only worry, why not just buy some more memory? ;-)

Regards,
Bengt Richter



