Using Python for processing of large datasets (convincing management)

Thomas Jensen spam at ob_scure.dk
Sun Jul 7 08:11:36 EDT 2002


Paul Rubin wrote:
> Thomas Jensen <spam at ob_scure.dk> writes:
> 
>>I am quite certain that scaling (well) to multiple CPUs requires one to
>>use threading at least. Scaling to several physical machines might be
>>(relatively) easy with Java, but I imagine it must require some coding,
>>or?
> 
> You are obsessed with this multi-CPU stuff but have not given the
> slightest bit of evidence that you need that complexity to meet your
> performance goals.  Spend your time trying to understand your problem
> better rather than throwing fancy technology at it.  Chances are you
> can do what you need with a simple, single-CPU approach.

:-)

Please read my original post again. I merely said that one of the 
design goals was scalability.
The current job takes about 5 hours to complete! I am absolutely certain 
that I would be able to write a new job in any language whatsoever (be 
it Python, C++ or even *shiver* Basic) that would complete the job in 
less than 30 minutes, given the right DB optimizations and program 
design. It could be written entirely in SQL for that matter (which would 
probably perform rather well!).

"Spend your time trying to understand your problem better rather than 
throwing fancy technology at it"

This is exactly what I'm trying to do. Distributed computing is far from 
the number one priority in this project!
However, since no one knows exactly how much data we will be handling in 
a year or two, I have been asked to make sure the job is written in a 
scalable manner.
Once the algorithms and database design have been optimized, there is 
still an upper bound on how much data one CPU can handle, don't you 
agree?
I expect it to be much easier to build the job around a distributed core 
now, rather than adding support later.
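To illustrate what I mean by building around a distributed core: a minimal sketch, assuming the job can be split into independent batches of records (the `process_batch` function here is hypothetical and stands in for the real per-record computation):

```python
# Sketch: structure the job around batches that can be processed
# independently, so the same code scales from one CPU to many.
from multiprocessing import Pool


def process_batch(batch):
    # Placeholder for the real work done on one batch of records;
    # here we just sum the squares of the values.
    return sum(x * x for x in batch)


def run_job(records, batch_size=1000, workers=4):
    # Split the input into fixed-size batches...
    batches = [records[i:i + batch_size]
               for i in range(0, len(records), batch_size)]
    # ...and fan them out across a pool of worker processes,
    # combining the per-batch results at the end.
    with Pool(workers) as pool:
        return sum(pool.map(process_batch, batches))


if __name__ == "__main__":
    print(run_job(list(range(10000))))
```

The point is not the worker pool itself but the shape of the code: once the work is expressed as independent batches plus a combine step, swapping the local pool for several physical machines is a change of plumbing, not of design.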

-- 
Best Regards
Thomas Jensen
(remove underscore in email address to mail me)



