optimizing memory utilization

Jeremy Bowers jerf at jerf.org
Wed Sep 15 16:44:51 EDT 2004


On Tue, 14 Sep 2004 14:55:56 -0700, Wes Crucius wrote:
> Maybe someone has a better suggestion if I give a little more info:  I
> want to iterate through about 10,000 strings (each ~256 characters or
> less) looking for occurrences (within those 10,000 strings) of any one
> of about 500,000 smaller (~30-50 characters average) strings.  Then I
> generate an XML doc that records which of the 500,000 strings were
> found within each of the 10,000 strings.

There may be a good reason for this, but it should be asked: Why not
stream one set of strings or the other in, instead of loading them up
front?
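For instance (a rough sketch only; the file name and the simple
'substring in string' test are my own placeholders, not your code), you
could keep just the 500,000 small strings resident and stream the 10,000
larger ones off disk one at a time, emitting results as you go:

    def stream_lines(path):
        # Yield strings one at a time instead of holding the whole file.
        for line in open(path):
            yield line.rstrip("\n")

    def scan(big_path, small_strings):
        # Only the small strings stay in memory; each big string is
        # examined and then discarded.
        for big in stream_lines(big_path):
            found = [s for s in small_strings if s in big]
            yield big, found

You could just as easily flip it and stream the 500,000 while holding the
10,000, depending on which set is actually blowing out your memory.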

The other obvious answer isn't cute, but it will probably be faster to
program and still run faster than a process that runs into swap: break the
problem up. Either check some percentage of the 10,000 in one run and
start over, or check some percentage of the 500,000 and start over with a
new chunk, as in the sketch below.
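A minimal sketch of the second variant, assuming the 500,000 small strings
live in a file and you are willing to read a slice of it per run (the
function names and chunk size here are purely illustrative assumptions):

    def read_chunk(path, start, count):
        # Pull in only lines [start, start + count) so a single run never
        # holds all 500,000 small strings at once.
        chunk = []
        for i, line in enumerate(open(path)):
            if i >= start + count:
                break
            if i >= start:
                chunk.append(line.rstrip("\n"))
        return chunk

    def one_run(big_strings, small_path, start, count=50000):
        # One pass: test this slice of the small strings against every
        # big string, then let the caller start over with the next slice.
        small = read_chunk(small_path, start, count)
        hits = {}
        for big in big_strings:
            found = [s for s in small if s in big]
            if found:
                hits[big] = found
        return hits

Merge the per-run results at the end (or append them to your XML as you
go) and the peak memory stays bounded by the chunk size.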

There's nothing wrong with such chunking; when you start asking for
results from datasets that challenge your machine, there are many
techniques that seem wasteful but really aren't. One of my personal
canonical examples is "iterative deepening depth-first search" for large
graphs that may have cycles (which can trap an ordinary depth-first
search). On the first run, you search only the first level. On the second,
you search down two levels, and so on. By stopping at a fixed depth you
ensure that you cover every node down to that depth without being trapped
in a cycle. It sounds horridly wasteful compared to keeping track of the
visited nodes, but it turns out it may cost double the execution time or
less, while saving vast swathes of memory. It still boggles my mind.
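In case that sounds abstract, here is a bare-bones sketch of the idea over
a graph stored as an adjacency dict (the names are mine, not from any
particular library):

    def depth_limited(graph, node, goal, limit):
        # Ordinary depth-first search, except it refuses to descend more
        # than 'limit' levels -- so a cycle can't trap it.
        if node == goal:
            return True
        if limit == 0:
            return False
        for child in graph.get(node, []):
            if depth_limited(graph, child, goal, limit - 1):
                return True
        return False

    def iterative_deepening(graph, start, goal, max_depth):
        # Re-run the depth-limited search with a growing bound instead of
        # keeping a visited set in memory.
        for depth in range(max_depth + 1):
            if depth_limited(graph, start, goal, depth):
                return depth  # shallowest depth at which the goal appears
        return None

The shallow levels get re-searched on every pass, but the deepest level
tends to dominate the total work, which is why the overall cost usually
stays within that factor of two.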


