Advice on optimum data structure for billion-long list?

Costas Menico costas at meezon.com
Sun May 13 10:25:25 EDT 2001


Mark blobby Robinson <m.1.robinson at herts.ac.uk> wrote:

>Hey guys,
>
>I'd just like to pick the best of the world's Python brains, if that's
>OK. I am building a program that searches for patterns in DNA sequences
>and generates a list of combinations of 3 patterns that meet certain
>criteria. My problem is that this list could potentially grow to
>~1.4 billion entries. Originally I was using a dictionary with a
>15-character string as the key (the three patterns concatenated) and,
>as the value, simply a count of the number of hits for that pattern
>combination. Obviously that hit memory problems pretty soon, so I
>started storing the information in shelves, which worked fine except
>that as the size of the shelf increased it started taking forever to
>update the shelf file (each update involved a check for the key and
>then either an update of the existing entry or the creation of a new
>one). This slowed things down to the point where the run would have
>taken literally weeks. So next I wrote an extension in C which
>basically creates a huge C array on disk, where the offset position
>for each pattern combination is a hash generated from the unique
>string. This is still way too slow, and I am just wondering whether I
>am ignorant of some far easier or more effective way of tackling this
>problem. Maybe I should be using some kind of database, but I thought
>that was basically what the shelve operation did... I was using it
>with gdbm.
>
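
If I follow the description, the dict and shelve versions amount to
roughly the following (illustrative names only; pattern_combos() is a
stand-in for whatever generates the 3-pattern combinations):

    counts = {}
    for p1, p2, p3 in pattern_combos():    # stand-in generator
        key = p1 + p2 + p3                  # the 15-character key
        counts[key] = counts.get(key, 0) + 1

    # ...and the shelve variant, which pays a lookup plus a read
    # and a write against the disk file for every single combination:
    import shelve

    db = shelve.open('counts')              # gdbm-backed, per your setup
    for p1, p2, p3 in pattern_combos():
        key = p1 + p2 + p3
        if db.has_key(key):                 # check for an existing entry
            db[key] = db[key] + 1
        else:
            db[key] = 1
    db.close()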


That's a lot of data, but from your description there are too many
unknowns. Where is the source data coming from? What does an example of
the 15-character string look like? Can it be compressed at the source?
What does the data structure look like?
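
On the compression point: if the patterns are built only from the four
bases A, C, G and T (an assumption on my part), each 15-character key
packs into 30 bits and fits in a plain integer, which is a much cheaper
dict key than a 15-byte string and also gives a natural fixed-width
record on disk. A rough sketch:

    # Sketch only -- assumes every character in the key is one of A/C/G/T.
    CODE  = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    BASES = 'ACGT'

    def pack(key15):
        # 15 characters x 2 bits = 30 bits, so the result fits in an int
        n = 0
        for base in key15:
            n = (n << 2) | CODE[base]
        return n

    def unpack(n):
        chars = []
        for i in range(15):
            chars.append(BASES[n & 3])      # low 2 bits = last base
            n = n >> 2
        chars.reverse()
        return ''.join(chars)

With keys like that, the packed value could even serve directly as the
offset into your on-disk array of counts (4**15 is roughly a billion
slots), which would remove the hashing step from the C extension --
assuming, again, that the alphabet really is just those four letters.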

Did you try getting a faster machine, more RAM, multiple CPUs, etc.?

Costas



