Using Python for processing large datasets (convincing management)

Thomas Jensen spam at ob_scure.dk
Mon Jul 8 04:47:27 EDT 2002


William Park wrote:
> Thomas Jensen <spam at ob_scure.dk> wrote:
> 
>>We already have 2 DB Servers, a master replicating changes to a slave.
>>Our analysis shows that most database operations are/will be SELECTs.
>>Adding more DB servers is trivial, especially if we migrate to MySQL
>>(well, cheaper at least :-)
> 
> As I and others have said, deal with algorithm issues first.  Especially,
> since you already have something that is working.

Simply rewriting the current job to be distributed has never been the 
plan. I am very grateful for all the kind advice regarding algorithm 
design, and I assure you that a considerable amount of time has already 
gone into it.

> It may be that you are getting killed by overheads.  For example, if your
> situation goes something like
>     Given table of (a, b, x, y, z), 
> 	select a=1, b=1; then do something with x, y, z; insert it back.
> 	select a=1, b=2; then do something with x, y, z; insert it back.
> 	...
> 	select a=2, b=1; then do something with x, y, z; insert it back.
> 	select a=2, b=2; then do something with x, y, z; insert it back.
> 	...
> 	(1 million lines)
> Then, you have
>     1e6 x 2 x (connect time, search time, load time, disconnect time)
>     
> Can you dump the whole thing as text file in one-shot, do whatever with
> (a,b,x,y,z), and load it back in one-shot?

It's something like that; actually, it's more like this:
     select a from T_A;
     select b from T_B;
     select c from T_C;
     calculate and update c;
The problem is that most of the job is built around this model (there 
are also T_D, T_E, T_F, etc., some alike, some not), so changing the 
general approach would require rewriting most of the program anyway 
(touching at least 75% of the code, I estimate).

The current model looks more like this (a much simplified example 
showing only a small part of the calculation):
     select date, value from T_A where unitid = x order by date;
     calculate T_B values from T_A and perhaps external values
     select date, value from T_B where unitid = x order by date;
     update T_B where it differs from the calculated values
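
In DB-API terms I picture that step roughly like this (a sketch only: 
calculate_b() is a stand-in for the real calculation, and the MySQLdb 
import merely reflects the MySQL migration we are considering):

    import MySQLdb

    def calculate_b(a_rows):
        # Stand-in for the real calculation; returns {date: value}.
        return dict(a_rows)

    def recalculate_unit(conn, unitid):
        cur = conn.cursor()
        # Fetch the inputs, ordered by date.
        cur.execute("SELECT date, value FROM T_A"
                    " WHERE unitid = %s ORDER BY date", (unitid,))
        expected = calculate_b(cur.fetchall())
        # Fetch current T_B values; update only the rows that differ
        # from the calculated ones.
        cur.execute("SELECT date, value FROM T_B"
                    " WHERE unitid = %s ORDER BY date", (unitid,))
        for date, value in cur.fetchall():
            new_value = expected.get(date)
            if new_value is not None and new_value != value:
                cur.execute("UPDATE T_B SET value = %s"
                            " WHERE unitid = %s AND date = %s",
                            (new_value, unitid, date))
        conn.commit()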

We are aware that there may be other models which would be faster; 
however, apart from being fast, the calculations must *always* be 
correct. This is the model we have chosen to achieve that (since 
factors other than T_A may affect the value of T_B).

I don't think dumping anything to a text file will be necessary; 
however, I will consider it should problems arise.
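
Should it come to that, the processing side at least looks simple 
enough, along these lines (the file names, the tab-separated format 
and transform() are all made up for illustration):

    def transform(fields):
        # Stand-in for the real per-row calculation.
        return fields

    infile = open("t_a_dump.txt")
    outfile = open("t_a_new.txt", "w")
    for line in infile:
        fields = line.rstrip("\n").split("\t")
        outfile.write("\t".join(transform(fields)) + "\n")
    infile.close()
    outfile.close()

The transformed file could then be loaded back in one shot, e.g. with 
MySQL's LOAD DATA INFILE.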


-- 
Best Regards
Thomas Jensen
(remove underscore in email address to mail me)
