Reclaiming (lots of) memory

Sat Oct 2 16:34:58 EDT 2004

Thomas Rast <foo.bar at freesurf.ch.invalid> writes:
> So, do I really have to do this in two separate processes?  Would it
> help if I implemented the data storage part as a C extension module?
> Or am I worrying too much about a problem that is better left to the
> OS's memory management (i.e. swapping)?
> 
> Of course I'd also appreciate ideas for more efficient data layouts

When faced with such problems I always find it worthwhile to remember
that the phone company managed to issue itemized long-distance bills
to each of its customers, even in the 1960's using mainframe computers
with a few hundred kb of memory and a room full of magtape drives like
you see in old movies.  I ask myself "how would a 1960's mainframe
programmer do this, without megabytes of ram?".

I'm not sure what your dictionary of dictionaries is supposed to do,
but from your dissociated press description it sounds like you want to
look at the pairs of adjacent pairs of words and build up a
probability table.  In that case I'd do something like:

1. Sequentially read the logs files, and for each pair of adjacent words,
   write that pair to a temp file tmp1.
2. Sort tmp1 with os.system("sort").
3. Make a sequential pass through tmp1 to build a second temp file tmp2,
   which is a compressed tmp1.  For example, if tmp1 has the lines
          apple pie
          apple computer
          apple pie
          apple sauce

   then make a single line in tmp2: 

          apple: pie:2,computer:1,sauce:1

Do that in the obvious way by building a temporary dict for the words
that follow "apple".  Also, while doing this, build a Python dict
giving the offset in tmp2 for the line for each word, i.e.

    d['apple'] = ofile.tell()
and similarly for all the other words (you have a Python dict with
325k entries which is each a single int; that shouldn't be too large).
There are of course lots of tradeoffs you can make to give a smaller
dict, if you need to.  You could also use the array module to implement
an int-valued hash table without the overhead of a Python object in each slot.

4. Use os.mmap to map the file tmp2 into memory.  Then to get the
successor words to 'apple', simply use the in-memory dict to get the
appropriate offset in tmp2, read and parse the line from that offset, 
and emit the appropriate word.

5. There's lots of other optimization and compression you can do to
make tmp2 smaller, which can reduce your paging load, but the above
should be enough to get you started.