Using Python for processing of large datasets (convincing management)

Paul Rubin phr-n2002b at NOSPAMnightsong.com
Mon Jul 8 00:23:51 EDT 2002


Thomas Jensen <spam at ob_scure.dk> writes:
> > Is 5 hours acceptable, if your data doesn't get any bigger?
> > If not, what's the maximum you can accept?
> 
> 5 hours is about the maximum acceptable. Usually the time is a little
> shorter, but 5 hours happens when a lot of new data is added.
> However, this is a case of "faster is better": 5 hours is acceptable,
> but 1 minute would open up new business opportunities.

OK, that makes sense.

> > I think the bottleneck is going to be the database.  You might not get
> > better throughput with multiple client CPUs than with just one.  If
> > you do, maybe your client application needs more optimization.
> 
> We already have 2 DB Servers, a master replicating changes to a slave.
> Our analysis shows that most database operations are/will be SELECTs.
> Adding more DB servers is trivial, especially if we migrate to MySQL
> (well, cheaper at least :-)

If you're doing all these single-row selects without many updates,
and you're not doing many joins, it really sounds more and more like
an SQL database isn't the best tool for your task.

> Before going on with the distributed approach, I will probably write a
> "proof of concept" demo. Should this demo show, that it is not worth
> the effort, I will put it aside for now.

Fair enough.  You could just check the CPU load on your SQL server
right now, as your single client runs.

> But all that aside - the distributed part is not really the hard or
> complex part of this project. I understand that as soon as the
> calculations take place in more than one thread (be it on one or more
> CPUs/machines) it adds some complexity. However, designing the
> application in such a way that parallel computations are possible
> can't be that bad, I think.

What kinds of calculations are these, really?  The only one you've
described so far is selecting a bunch of rows and computing the SD of
the numbers in one column.  It may be fastest to do that with a
server-side stored procedure.
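
Even without a stored procedure, you can push the aggregation to the
server and pull back just three numbers instead of every row.  A rough
sketch, assuming a DB-API connection and made-up table/column names
("mytable", "x"):

    import math

    def column_stddev(conn):
        # Let the server do the aggregation; only three values come
        # back over the wire (table and column names are placeholders).
        cur = conn.cursor()
        cur.execute("SELECT COUNT(x), SUM(x), SUM(x*x) FROM mytable")
        n, s, ss = [float(v) for v in cur.fetchone()]
        cur.close()
        # Population SD from the aggregates: sqrt(E[x^2] - E[x]^2)
        return math.sqrt(ss/n - (s/n)**2)

One round trip like that usually beats fetching and looping over
thousands of rows in the client.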

> I really see all this distribution talk as one among several
> optimization strategies.

OK, that's good, as long as you see there's a range of approaches.
Sometimes all someone has is a hammer, and everything looks like
a thumb ;-).

> An extreme example of another strategy: Develop the entire thing in
> assembler, using flat files or entirely bypassing the file-system.
> If done correctly, it would probably outperform other strategies by
> far, but it would also be:
> * Less maintainable
> * Less readable
> * a lot harder to use from ASP/PHP
> * etc

If your data layout is simple enough, you might just store it in a
fixed-width record format, then mmap() it into memory and crunch it
with a C program (or even a Python program).  That approach is
generally simple and fast.  It will probably outperform any SQL
approach by orders of magnitude.
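
Just to make that concrete, here's a toy sketch in Python, assuming one
8-byte double per record; the file name and layout are invented for
illustration:

    import mmap, struct, math

    def file_stddev(path):
        # Map the whole file and walk it record by record.
        # Assumed layout: one 8-byte double per record.
        f = open(path, "rb")
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        recsize = struct.calcsize("d")
        n = len(m) // recsize
        total = total_sq = 0.0
        for i in range(n):
            (x,) = struct.unpack("d", m[i*recsize:(i+1)*recsize])
            total += x
            total_sq += x*x
        m.close()
        f.close()
        return math.sqrt(total_sq/n - (total/n)**2)

No query parsing, no network round trips, no row objects -- the OS just
pages the data in as you touch it.  A real record format would carry
more fields, but the shape of the solution stays this simple.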


