Using Python for processing of large datasets (convincing management)
Thomas Jensen
spam at ob_scure.dk
Mon Jul 8 04:47:27 EDT 2002
William Park wrote:
> Thomas Jensen <spam at ob_scure.dk> wrote:
>
>>We already have 2 DB Servers, a master replicating changes to a slave.
>>Our analysis shows that most database operations are/will be SELECTs.
>>Adding more DB servers is trivial, especially if we migrate to MySQL
>>(well, cheaper at least :-)
>
> As I and others have said, deal with algorithm issues first. Especially,
> since you already have something that is working.
Simply rewriting the current job to be distributed has never been the
plan. I am very grateful for all the kind advice regarding algorithm
design, and I assure you that a considerable amount of time has already
gone into algorithm design.
> It may be that you are getting killed by overheads. For example, if your
> situation goes something like
> Given table of (a, b, x, y, z),
> select a=1, b=1; then do something with x, y, z; insert it back.
> select a=1, b=2; then do something with x, y, z; insert it back.
> ...
> select a=2, b=1; then do something with x, y, z; insert it back.
> select a=2, b=2; then do something with x, y, z; insert it back.
> ...
> (1 million lines)
> Then, you have
> 1e6 x 2 x (connect time, search time, load time, disconnect time)
>
> Can you dump the whole thing as text file in one-shot, do whatever with
> (a,b,x,y,z), and load it back in one-shot?
It's something like that; actually, it's more like this:
select a from T_A;
select b from T_B;
select c from T_C;
calculate and update c;
The problem is that most of the job is built around this model (there
are also T_D, T_E, T_F, etc., some alike, some not), so changing the
general approach would require rewriting most of the program anyway
(touching at least 75% of the code, I estimate).
The current model looks more like this (a very simplified example
showing only a small part of the calculation):
select date, value from T_A where unitid = x order by date;
calculate T_B values from T_A and perhaps external values
select date, value from T_B where unitid = x order by date;
update T_B where it differs from the calculated values
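In Python DB-API terms, that cycle looks roughly like the sketch below.
This is illustrative only, not our actual code: sqlite3 is used here
just to keep the example self-contained (a MySQL driver would use %s
placeholders instead of ?), and calc_b stands in for the real
calculation, which may also use external values.

```python
import sqlite3

def recalculate_unit(conn, unitid, calc_b):
    """Fetch T_A for one unit, compute T_B, and update only changed rows.

    calc_b is a placeholder for the real calculation: it takes the
    (date, value) rows of T_A and returns a {date: value} mapping for T_B.
    Returns the number of T_B rows that were updated.
    """
    cur = conn.cursor()

    # select date, value from T_A where unitid = x order by date
    cur.execute("SELECT date, value FROM T_A WHERE unitid = ? ORDER BY date",
                (unitid,))
    a_rows = cur.fetchall()

    # calculate T_B values from T_A (and perhaps external values)
    calculated = calc_b(a_rows)

    # select date, value from T_B where unitid = x order by date
    cur.execute("SELECT date, value FROM T_B WHERE unitid = ? ORDER BY date",
                (unitid,))
    stored = dict(cur.fetchall())

    # update T_B where it differs from the calculated values
    changed = 0
    for date, value in calculated.items():
        if stored.get(date) != value:
            cur.execute("UPDATE T_B SET value = ? WHERE unitid = ? AND date = ?",
                        (value, unitid, date))
            changed += 1
    conn.commit()
    return changed
```

Updating only the differing rows keeps the write load on the replicating
master small when most recalculated values turn out unchanged.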
We are aware that there may be other models that would be faster;
however, besides being fast, the calculations must *always* be
correct. This is the model we have chosen to achieve that (since
factors other than T_A may affect the value of T_B).
I don't think dumping anything to a text file will be necessary, but I
will consider it should problems arise.
--
Best Regards
Thomas Jensen
(remove underscore in email address to mail me)