Using Python for processing of large datasets (convincing management)
Thomas Jensen
spam at ob_scure.dk
Sun Jul 7 08:11:36 EDT 2002
Paul Rubin wrote:
> Thomas Jensen <spam at ob_scure.dk> writes:
>
>>I am quite certain that scaling (well) to multiple CPUs requires one to
>>use threading at least. Scaling to several physical machines might be
>>(relatively) easy with Java, but I imagine it must require some coding,
>>or?
>
> You are obsessed with this multi-CPU stuff but have not given the
> slightest bit of evidence that you need that complexity to meet your
> performance goals. Spend your time trying to understand your problem
> better rather than throwing fancy technology at it. Chances are you
> can do what you need with a simple, single-CPU approach.
:-)
Please read my original post again. I merely said that one of the
design goals was scalability.
The current job takes about 5 hours to complete! I am absolutely certain
that I would be able to write a new job in any language whatsoever (be
it Python, C++ or even *shiver* Basic) that would complete the job in
less than 30 minutes, given the right DB optimizations and program
design. It could be written entirely in SQL for that matter (which would
probably perform rather well!).
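To illustrate the kind of DB optimization I have in mind (the schema, table
and column names here are made up for the example, not the real job's), the
biggest win is usually replacing a row-at-a-time Python loop with one
set-based SQL statement, sketched here against an in-memory SQLite database:

```python
import sqlite3

# Hypothetical schema standing in for the real job's tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, total REAL)")
conn.executemany("INSERT INTO orders (amount) VALUES (?)",
                 [(float(i),) for i in range(1000)])

# Slow pattern: pull every row into Python and update it one at a time.
# for rowid, amount in conn.execute("SELECT id, amount FROM orders"):
#     conn.execute("UPDATE orders SET total = ? WHERE id = ?",
#                  (amount * 1.25, rowid))

# Set-based pattern: one statement, the database engine does all the work.
conn.execute("UPDATE orders SET total = amount * 1.25")
conn.commit()

# Every row now has a computed total.
updated = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE total IS NOT NULL").fetchone()[0]
print(updated)
```

The same principle applies whatever the real database is; the point is that
the 5-hour job is far more likely to be dominated by per-row round trips
than by raw CPU.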
"Spend your time trying to understand your problem better rather than
throwing fancy technology at it"
This is exactly what I'm trying to do. Distributed computing is far from
the number one priority in this project!
However, since no one knows exactly how much data we will be handling in
a year or two, I have been asked to make sure the job is written in a
scalable manner.
Once the algorithms and database design have been optimized, there is
still an upper bound as to how much data one CPU can handle, don't you
agree?
I expect it to be much easier to build the job around a distributed core
now, rather than adding support later.
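By "distributed core" I mean nothing fancier than structuring the job as
independent chunks of work behind one dispatch function. A minimal sketch of
that shape, using a local process pool (the chunk size and the per-record
computation are placeholders, not the real job's logic):

```python
from multiprocessing import Pool

def process_chunk(rows):
    # Stand-in for the real per-record computation (hypothetical).
    return sum(r * r for r in rows)

def run(data, workers=4, chunksize=250):
    # Split the dataset into independent chunks. Because the chunks share
    # no state, the same structure can later dispatch to several machines
    # instead of local worker processes, without rewriting the core.
    chunks = [data[i:i + chunksize] for i in range(0, len(data), chunksize)]
    with Pool(workers) as pool:
        return sum(pool.map(process_chunk, chunks))

if __name__ == "__main__":
    print(run(list(range(1000))))
```

Keeping the chunking boundary explicit from day one is cheap; retrofitting
it into a job written as one monolithic loop over the whole dataset is not.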
--
Best Regards
Thomas Jensen
(remove underscore in email address to mail me)