Advice on optimum data structure for billion long list?

Alexandre Fayolle alf at orion.logilab.fr
Sun May 13 10:38:52 EDT 2001


On Sat, 12 May 2001 17:12:22 +0100, Mark blobby Robinson 
<m.1.robinson at herts.ac.uk> wrote:
>I'd just like to pick the best of the world's Python brains, if that's OK.

I'm not sure I qualify here, but I'll throw in a few considerations.

>it would take literally weeks to run.

Well, you want to process 1.4 billion elements. Assuming you can process 
1000 elements per second, this still leaves you with roughly 16 days of 
processing. There ain't no such thing as a free lunch. 
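
To make the arithmetic explicit, here is a minimal sketch in Python. The
rate of 1000 elements per second is only the assumption used above, not a
measurement, and the name processing_days is made up for illustration:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def processing_days(n_elements, rate_per_second):
    """Return the days needed to process n_elements at a constant rate."""
    return n_elements / rate_per_second / SECONDS_PER_DAY


# 1.4 billion elements at an assumed 1000 elements per second.
days = processing_days(1400000000, 1000)
print("%.1f days" % days)

# Rate that would be needed to finish within a single day instead.
rate_for_one_day = 1400000000 / SECONDS_PER_DAY
print("%.0f elements per second" % rate_for_one_day)
```

Plugging in your own measured rate should tell you quickly whether a
single machine is realistic at all.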

Regarding the storage concern you have, you may want to note the following
things:

 * Real-world DBMSs (such as PostgreSQL, DB2, Oracle...) are made to 
handle tables with a size of the order of 10e7 rows. These are considered
'big' databases. A table with billions of rows is a really, really big 
database, for which you generally need special support from the OS and
the DB vendor, not to mention the hardware (RAID, anyone?), in order to 
get decent performance. Oracle and IBM will be happy to sell you support 
to help you create and tune the DB.

 * There is absolutely no way you'll be able to use gdbm on an average 
workstation to store 1.4 billion rows. In my personal experience, gdbm 
is OK up to about 10000 rows per file. I'm currently dealing with 300000 
rows for an application, and I use 26 gdbm files to pre-hash the data 
into reasonably sized files. If you want to go in this direction, be 
aware that at roughly 10000 rows per file it means about 150000 files. 
Using Postgres will ease things, but you'll have to deal with 
DBMS-specific problems (index updates, datafile size...)
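
The pre-hashing idea from the gdbm point above can be sketched as follows.
This is only an illustration, not the scheme I actually use: the CRC32
hash, the "shard" file prefix and the function names are all assumptions
made for the example.

```python
import zlib


def buckets_needed(n_rows, rows_per_file):
    """Number of files needed to keep each file at or below rows_per_file."""
    return -(-n_rows // rows_per_file)  # ceiling division


def bucket_for(key, n_buckets):
    """Map a string key to a bucket number with a hash stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % n_buckets


def bucket_filename(key, n_buckets, prefix="shard"):
    """Name of the (hypothetical) database file that should hold this key."""
    return "%s_%06d.db" % (prefix, bucket_for(key, n_buckets))


# 1.4 billion rows at roughly 10000 rows per file.
n_files = buckets_needed(1400000000, 10000)
print(n_files)
print(bucket_filename("some key", n_files))
```

Each resulting file name could then be opened with a dbm-style module. A
stable hash such as CRC32 is used here rather than the built-in hash(),
because the same key has to land in the same file on every run.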
 

Nowadays, when you need to process such a huge amount of data, the usual 
approach is to parallelize the code and distribute the computation
across several machines. If the lab you're working in has several hundred
workstations doing nothing at night, you may want to use them to do the
computation for you. Otherwise, you can talk your manager into buying you
one hundred Linux boxes and use them. Or you could try contacting 
distributed.net to see if they're interested in your project. 
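
As a rough sketch of the first step of such a distribution, the element
indices can be cut into contiguous chunks, one per machine. The function
name split_ranges, the figure of 100 machines and the rate of 1000
elements per second are assumptions for the example. How the chunks
actually reach the machines is left out entirely.

```python
def split_ranges(n_items, n_workers):
    """Split range(n_items) into n_workers contiguous (start, stop) ranges.

    The ranges are half-open and differ in size by at most one item.
    """
    base, extra = divmod(n_items, n_workers)
    ranges = []
    start = 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# 1.4 billion elements spread over an assumed 100 machines.
ranges = split_ranges(1400000000, 100)
first_start, first_stop = ranges[0]

# Hours per machine at an assumed 1000 elements per second.
hours_per_machine = (first_stop - first_start) / 1000 / 3600
print("%d chunks, about %.1f hours each" % (len(ranges), hours_per_machine))
```

This only works this simply if the elements can be processed
independently of each other. If they cannot, the partitioning has to
follow the dependencies in the data.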

Now, whatever path you choose to follow, it will take time (rewriting
the code, setting it up on the machines, running it...), probably several 
weeks, so I have to say that you'll probably have to wait for some time 
before you get any results. 

Alexandre Fayolle
-- 
http://www.logilab.com 
Narval is the first software agent available as free software (GPL).
LOGILAB, Paris (France).


