optimizing memory utilization

Bengt Richter bokr at oz.net
Tue Sep 14 22:20:38 EDT 2004


On 14 Sep 2004 14:55:56 -0700, wcrucius at sandc.com (Wes Crucius) wrote:

>"Frithiof Andreas Jensen" <frithiof.jensen at die_spammer_die.ericsson.com> wrote in message news:<ci6f73$921$1 at newstree.wise.edt.ericsson.se>...
>> "Anon" <anon at ymous.com> wrote in message
>> news:pan.2004.09.14.04.38.02.647096 at ymous.com...
>> 
>> > Any data structures suggestions for this application?  BTW, the later
>> > accesses to this database would not benefit in any way from being
>> > presorted,
>> 
>> You keep saying it yourself: Use a Database ;-).
>> 
>> Databases "knows" about stuff in memory, as well as any searching and
>> sorting you might dream up later.
>> 
>> Python supports several, the simplest is probably PySQLite:
>> http://sourceforge.net/projects/pysqlite/
>
>Thanks for the specific suggestion, but it's an approach I'd hoped to
>avoid.  I'd rather get 2Gig of RAM if I was confident I could make it
>work that way...  The problem with a file-based database (versus an
>in-memory data-structure-based database as I was meaning) is
>performance.
>
>Maybe someone has a better suggestion if I give little more info:  I
>want to iterate through about 10,000 strings (each ~256 characters or
>less) looking for occurrences (within those 10,000 strings) of any one
>of about 500,000 smaller (~30-50 characters average) strings.  Then I
>generate an XML doc that records which of the 500,000 strings were
>found within each of the 10,000 strings.

Need example of raw input and desired output ;-)

Can patterns start at any character position? What if you have overlapping matches? I.e.,
if you had a string 'abbcde123' and were looking for 'bb' and 'bc',
would that count as two matches? Would longer matches have priority?
E.g., would matching 'bbcd' override 'bb' and 'cd' as separate matches?

Anyway, let's see what a simple brute-force approach does. E.g., just put the
500,000 small strings in a small tree of dictionaries keyed first by string length, then
by the string itself (you could tier it again by some leading characters and the rest, but I'd
try the easy thing first). Anyway, the idea (practically untested) is something like:

This finds repeating and overlapping patterns, since it was easy. That may not be what
you want, but you didn't say ;-)

----< s500kin10k.py >---------------------------------------
def main(path500k, path10k):
    root = {}
    for line in file(path500k):
        line = line.rstrip()
        if not line: continue
        root.setdefault(len(line), {})[line] = None  # tier by length, then by exact string
    lengths = root.keys()
    lengths.sort()
    print lengths  # debug: the distinct pattern lengths
    print root     # debug: dumps the whole table -- drop this for the real 500k strings
    # and then walk through your 10k strings of ~256 characters

    for nl, line in enumerate(file(path10k)):
        line = line.rstrip()
        hdr = 'Line %s: %r\n' % (nl, line)
        nc = len(line)
        for length in lengths:
            if length > nc: break
            tier1 = root[length]
            for nstep in xrange(nc+1-length):  # walk a window of length chars
                if line[nstep:nstep+length] in tier1:
                    print '%s    %r' % (hdr, line[nstep:nstep+length])
                    hdr = ''

if __name__ == '__main__':
    import sys
    if len(sys.argv)!=3: raise SystemExit, """
    Usage: s500kin10k.py path500k path10k
        where path500k has 500k lines of one string each, and
        path10k has 10k lines to search for the 500k in any position.
"""
    main(*sys.argv[1:])
------------------------------------------------------------
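Totally untested toy driver, if you want to see the input => output shape before
pointing it at the real files (the file names and sample strings here are just made up):

----< try_s500kin10k.py >-----------------------------------
# Makes two tiny input files and runs s500kin10k.main on them,
# assuming the script above was saved as s500kin10k.py alongside this.
patterns = ['bb', 'bc', 'cd']        # stand-ins for the 500k small strings
bigs     = ['abbcde123', 'xyz']      # stand-ins for the 10k big strings

f = open('tiny500k.txt', 'w'); f.write('\n'.join(patterns) + '\n'); f.close()
f = open('tiny10k.txt', 'w');  f.write('\n'.join(bigs) + '\n');     f.close()

import s500kin10k
s500kin10k.main('tiny500k.txt', 'tiny10k.txt')

# After the two debug prints, the matches come out like:
#   Line 0: 'abbcde123'
#       'bb'
#       'bc'
#       'cd'
------------------------------------------------------------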

>
>I guess I am hoping to optimize memory usage in order to avoid
>thrashing my drives to death (due to using a file based database).  I
>can't believe that python requires memory 200% "overhead" to store my
>data as lists of lists.  I gotta believe I've chosen the wrong
>data-type or have mis-applied it somehow??
>
>BTW, my illustration wasn't my input data, it was literally the output
>of "print Albums" (with only a couple albums worth of dummy data
>loaded).  Albums is a list containing a list for each album, which in
>turn contains another list for each track on the album, which in-turn
>contains a couple of items of data about the given track.
>
>Bottom line question here:  Is a 'list' the most efficient data type
>to use for storing "arrays" of "arrays" of "arrays" and strings",
>given that I'm happy to iterate and don't need any sort of lookup or
>search functionality??
>
>
I suspect that you won't be that happy iterating through plain arrays,
unless they are arrays of pattern-matching tables like the ones IIRC flex generates
to do parsing. Basically that means doing tricky parallel pattern matching by keeping
track of which patterns are still alive as you go from character to character
(rough sketch of that idea below). But 500k patterns are a lot of patterns....
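A very rough, untested sketch of that "alive patterns" idea in pure Python -- a trie
of nested dicts with a list of live states advanced one character at a time (all the
names here are made up, and whether it actually beats the brute force above on 500k
patterns is another question):

----< trie_sketch.py >--------------------------------------
# Build a trie of nested dicts from the patterns, then for each
# character of the text keep a list of trie nodes for matches
# still in progress, so all patterns are matched in one pass.
END = None  # marker key meaning "a pattern ends at this node"

def build_trie(patterns):
    root = {}
    for pat in patterns:
        node = root
        for ch in pat:
            node = node.setdefault(ch, {})
        node[END] = pat
    return root

def find_all(trie, text):
    """Yield (end_position, pattern) for every (possibly overlapping) match."""
    live = []                      # trie nodes for matches in progress
    for i, ch in enumerate(text):
        live.append(trie)          # a new match may start at this character
        nextlive = []
        for node in live:
            node = node.get(ch)
            if node is None:
                continue           # this partial match just died
            if END in node:
                yield i, node[END]
            nextlive.append(node)
        live = nextlive

if __name__ == '__main__':
    trie = build_trie(['bb', 'bc', 'cd'])
    for pos, pat in find_all(trie, 'abbcde123'):
        print pos, pat
------------------------------------------------------------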
>Thanks again,

See if the brute-force s500kin10k.py above runs you out of memory. Or time ;-)
I'd try it on shorter files first. You didn't say what you wanted
your final output to look like. Small examples of input => output communicate well ;-)

Regards,
Bengt Richter


