how to store great array?

Jeff Epler jepler at inetnebr.com
Mon Jan 29 22:14:35 EST 2001


This kind of problem should be within the grasp of any recent PC with
say 64 megs of memory.  The test program I wrote is crawling along on
my 36 meg laptop, but showed about 96 megs used for a 2 million item
dictionary when I ran it on my 256M desktop machine.

There isn't a portable routine to get the amount of free memory, but
Python will raise an exception (MemoryError?) in case you allocate all
available memory.

If your document numbers are contiguous and you have decided on in-memory
storage, [1934, 1942, ...], a list of integers rather than strings,
is one choice.  Another is the "array" module, which will offer the
lowest overhead.  For instance, you can select a type of "byte" (If
all your documents are expected to fall within a 256-year timespan)
or go all the way to "int" (4 billion year range).  This storage method is
as efficient as a C array, except for a few bytes of overhead.  2 million
entries at one byte each takes two million bytes to store, well within
the range of any machine where Python is likely to run these days.

With a dictionary, both keys and values of integers are probably the
best choice.

However, I might recommend using the "shelve" option.  It's a method of
transparently using a file on disk as though it were a dictionary, assuming
the keys are all strings.  In this way, you can avoid memory exhaustion
problems, and increase the startup speed of your program since there's no
need to read all the items at startup.

So, have a look at the documentation to the "shelve" and "array" modules,
and use whatever program your operating system provides to measure how
much memory your Python script is using. (In Linux, "ps", "free" and
"top" are good tools for this purpose)  I imagine that there is a solution
which will work in the available memory but still be accessible in a
reasonable fashion (for instance, if you choose the "array" method with bytes
so that the zero-year is 1700, you might wrap it in a class instance which
performs this step for you ...
	class Documents:
		def __init__(self, file, bias=1700, typecode = 'b'):
			self.bias = bias
			self.data = array.array(typecode)
			self.data.fromfile(file)

		def __getitem__(self, item):
			return self.data[item] + self.bias

		def __setitem__(self, item, value):
			self.data[item] = value - self.bias

		def tofile(self, file):
			self.data.tofile(file)

	documents = Documents(...)
)

Good luck!  I think this post has become a bit disorganized, but it should
give you a number of ideas to pursue in solving your problem.

Jeff



More information about the Python-list mailing list