Using Python for processing of large datasets (convincing management)

Paul Rubin phr-n2002b at NOSPAMnightsong.com
Sun Jul 7 09:02:16 EDT 2002


Thomas Jensen <spam at ob_scure.dk> writes:
> Please read my original post again. I merely said that one of the
> design goals was scalability.
> The current job takes about 5 hours to complete! 

Is 5 hours acceptable if your data doesn't get any bigger?
If not, what's the maximum you can accept?

> I am absolutely certain that I would be able to write a new job in
> any language whatsoever (be it Python, C++ or even *shiver* Basic)
> that would complete the job in less than 30 minutes, given the right
> DB optimizations and program design. It could be written entirely in
> SQL for that matter (which would probably perform rather well!).

OK, you're certain you can do it in 30 minutes.  Are you certain
you CAN'T do it in 5 minutes?  If you can do it in 5 minutes, maybe
you can stop worrying about scaling.

> "Spend your time trying to understand your problem better rather than
> throwing fancy technology at it"
> 
> This is exactly what I'm trying to do. Distributed computing is far
> from the number one priority in this project!
> However, since no one knows exactly how much data we will be handling
> in a year or two, I have been asked to make sure the job is written in
> a scalable manner.

In another post you said you wanted to handle 10 times as much data
as you currently handle.  Now you say it's not known exactly--do you
have an idea or not?

If it's acceptable for the program to need 3 hours, and you can handle
the current data size in 10 minutes, then you can handle 10x the data
size with plenty of speed to spare (assuming no
seriously-worse-than-linear-time processes).
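
A quick back-of-the-envelope check in Python (the numbers are just the
figures being discussed in this thread, not measurements of your job):

    # Rough scaling estimate, assuming roughly linear behaviour.
    current_minutes = 10        # time for today's data after optimization
    growth_factor = 10          # "10 times as much data"
    budget_minutes = 3 * 60     # the 3-hour ceiling

    projected = current_minutes * growth_factor
    print("projected run time: %d minutes" % projected)         # 100 minutes
    print("within budget: %s" % (projected <= budget_minutes))  # True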

> Once the algorithms and database design have been optimized, there is
> still an upper bound on how much data one CPU can handle, don't you
> agree?

I think the bottleneck is going to be the database.  You might not get
better throughput with multiple client CPUs than with just one.  If
you do, maybe your client application needs more optimization.
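
One way to check is to time the two halves of the job separately.  A
rough sketch (the connection, query, and per-row function here are
placeholders for whatever the real job uses):

    import time

    def time_job(conn, query, process_row):
        # Split the run into "waiting on the database" and "client-side
        # work".  conn is any DB-API connection; query and process_row
        # stand in for the real job's query and per-row processing.
        t0 = time.time()
        cursor = conn.cursor()
        cursor.execute(query)
        rows = cursor.fetchall()
        t1 = time.time()

        for row in rows:
            process_row(row)
        t2 = time.time()

        print("database time: %.1f seconds" % (t1 - t0))
        print("client time:   %.1f seconds" % (t2 - t1))

If the database time dominates, more client CPUs won't help; if the
client time dominates, profiling the client code is the next step.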

> I expect it to be much easier to build the job around a distributed
> core now, rather than adding support later.

First gather some evidence that a distributed client will be better
than a single one for large datasets.  It could well be that you'll
never have reason to add distributed support.

What is the application?  What is the data and what do you REALLY need
to do with it?  How much is there ever REALLY likely to be?  Is an SQL
database even really the best way to store and access it?  If there
aren't multiple processes updating it, maybe you don't need its overhead.
Could a 1960s mainframe programmer deal with your problem, and if
s/he could deal with it at all, why do you need multiple CPUs when
each one is 1000 times faster than the 1960s computer?
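
To make that last point concrete: if the job is essentially "read every
record once and aggregate", a flat file plus a dictionary goes a long
way.  A toy sketch (the file name and comma-separated layout are made
up):

    # Stream a flat file and aggregate in memory, instead of paying for
    # a database round-trip per row.  "data.txt" and its key,value
    # format are hypothetical.
    totals = {}
    for line in open("data.txt"):
        key, value = line.strip().split(",")
        totals[key] = totals.get(key, 0) + float(value)

    for key in sorted(totals):
        print("%s: %.2f" % (key, totals[key]))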

Inside most complicated programs there's a simple program struggling
to get out.


