how to store a large array?

Quinn Dunkan quinn at bolivar.ugcs.caltech.edu
Mon Jan 29 15:13:58 EST 2001


On Mon, 29 Jan 2001 13:56:15 GMT, paulus132 at my-deja.com
<paulus132 at my-deja.com> wrote:
>For 2 million documents, I have a "Year" field (4 characters) associated
>with each of them. I want to store this in Python, to relate each
>document to other records.
>
>Is it possible? Which is the fastest method?
>
>The easiest way is a dictionary { doc_num : year }, but how do I estimate
>the memory needed? With old BASICs, the FREE(0) builtin returned the
>amount of free memory. Is there a comparable module/class/function to
>monitor memory use in Python?
>
>Another way is to construct a list doc_years = ["1934", "1942", "1897",
>....] and access it with doc_years[doc_num]. Is doc_years[2000000] as
>fast as doc_years[456]?

Yes.

>Another way is to construct a single string with all years concatenated
>and access it with all_years[4*doc_num:4*doc_num+4]. But I think a 7.6 MB
>string is too big! How long can a string be?

Array access is constant-time, and at the C level an array index compiles
down to exactly the arithmetic in your manual subscript.  So just use normal
indexing; it turns into that at the C / assembly level anyway.  A string can
be as long as you have memory for.  Even if the years all fall in a similar
range and Python interns the strings (effectively compressing the data), the
normal per-reference overhead will take your memory back again.
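
To make that concrete, here's a tiny sketch (toy sizes, reusing the names
from your post) showing that both layouts answer the same lookup:

    # Two layouts for the same data: one string object per document,
    # or one packed string with 4 characters per document.
    doc_years = ["1934", "1942", "1897"]
    all_years = "".join(doc_years)

    doc_num = 1
    assert doc_years[doc_num] == "1942"                    # plain list indexing
    assert all_years[4*doc_num : 4*doc_num+4] == "1942"    # manual slicing

Both lookups are O(1); the list version just lets Python do the arithmetic
for you.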

Note that in Python terminology, a list is what everyone else calls an array.
So you might have been misled into thinking Python lists are linked lists,
aka cons cells, which have linear access time (so doc_years[2000000] would
take longer than doc_years[456]).  But Python's lists are arrays.  Don't ask
me why this confusing terminology was chosen; probably some Dutch people
doing usability tests thought 'list' sounded nicer for ABC a million years
ago.

Dict access is also constant-time, as is usual for a hash table (internally,
the hashed key turns into simple array access).  But if your indices are
integers, a plain array aka list will have less overhead.
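
If you want to see that constant-factor difference yourself, here's a rough
micro-benchmark sketch using a modern Python's timeit (its globals argument
needs 3.5+; absolute numbers will vary by machine):

    import timeit

    n = 2_000_000
    as_list = ["1934"] * n
    as_dict = {i: "1934" for i in range(n)}

    # Both lookups are O(1); the dict just pays extra to hash the key.
    print(timeit.timeit("as_list[1999999]", globals=globals(), number=10**5))
    print(timeit.timeit("as_dict[1999999]", globals=globals(), number=10**5))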

Python experts could tell you the exact memory usage.  It will be something
like bytes_in_string+null + pointer + type_field + maybe_a_cached_hash +
interned_flag.  So at roughly 4 bytes of overhead on top of each 5-byte
string, you're looking at 9 * 2,000,000 / 1024 / 1024 ≈ 17 MB for 2 million
elements.
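
With a modern Python you can measure instead of guessing: sys.getsizeof
reports a single object's footprint (a sketch; note it does not follow
references, so the list and the strings it points at are counted
separately):

    import sys

    # One small string object; exact size is implementation-dependent.
    print(sys.getsizeof("1934"))

    # 2 million list slots, here all pointing at the same string object;
    # this counts only the list's internal pointer array.
    doc_years = ["1934"] * 2_000_000
    print(sys.getsizeof(doc_years))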

Since every platform has a different way of reporting free memory (and most
modern ones lie through their teeth about it), I'm not aware of any standard
way to see *available* memory.  On unix-likes you can use the resource
module to see *consumed* memory; Windows may have equivalents in the win32
modules.  Your best bet, unless you're on an embedded platform, is to just
try it and see what happens.  If you run out of memory, try the array module
to eliminate the per-element overhead.  If you're *still* out of memory,
come up with a simple compression scheme that keeps O(1) access and write a
C module :)  But we're not on Commodore 64s anymore, you know :)
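
As a sketch of those last two suggestions: a year fits in a 16-bit integer,
so the array module stores all 2 million in about 4 MB, and on unix-likes
the resource module reports memory actually *consumed* (ru_maxrss units are
platform-dependent: kilobytes on Linux, bytes on some BSDs and macOS):

    import array
    import resource

    # Unsigned 16-bit ints: 2 bytes per document, no per-element object
    # overhead, and years[doc_num] is still O(1) array indexing.
    years = array.array('H', (1900 for _ in range(2_000_000)))
    print(years.itemsize * len(years))   # payload size in bytes, ~4 MB

    # Peak memory consumed by this process so far.
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

And str(years[doc_num]) gets you back to the four-character form whenever
you need it.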


