Using Python for processing of large datasets (convincing management)

Thomas Jensen spam at ob_scure.dk
Mon Jul 8 19:51:34 EDT 2002


Cameron Laird wrote:
> In article <3D2A078A.7040502 at ob_scure.dk>,

> I call SQL noodling "scalable" in the sense that good
> SQL queries can be hosted on bigger and bigger servers.
> We know how to do that--it's a commercial reality.

Ok, I understand.
I think it's often a question of choosing the right tool for the job.
Consider the following example: find the average of a series of values
found in a table. Of course(?) doing a "SELECT AVG(value) FROM T_MyTable
WHERE ..." would be much faster than retrieving all the values and doing
the calculations on the client/app-server side. However, if for some
reason the contents of T_MyTable were already in the client's memory
(perhaps they were calculated there), calculating the average on the
client would perhaps be faster, since no round trip to the server is
needed.
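
To make the two options concrete, here is a minimal sketch. It assumes
an in-memory SQLite database as a stand-in for the real server, and the
helper names (average_in_sql, average_on_client) are made up for
illustration only.

```python
import sqlite3

# Hypothetical stand-in for the real database: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_MyTable (value REAL)")
conn.executemany(
    "INSERT INTO T_MyTable (value) VALUES (?)",
    [(v,) for v in (1.0, 2.0, 3.0, 4.0)],
)


def average_in_sql(conn):
    """Let the database compute the average; only one row comes back."""
    (avg,) = conn.execute("SELECT AVG(value) FROM T_MyTable").fetchone()
    return avg


def average_on_client(values):
    """Average values that are already held in client memory."""
    values = list(values)
    return sum(values) / len(values) if values else None


# Option 1: the server does the work.
sql_avg = average_in_sql(conn)

# Option 2: the values are on the client already (here they are fetched
# first, which is exactly the transfer cost option 1 avoids).
rows = [row[0] for row in conn.execute("SELECT value FROM T_MyTable")]
client_avg = average_on_client(rows)
```

Both give the same number; which one is faster depends on where the
data already lives.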

Be assured, though, that for each calculation both SQL and
Python(/C++/VB or whatever it ends up being) solutions will be written
and the fastest chosen. As has been noted, the result might be that
the SQL approach is the fastest; only time will tell.
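
A rough sketch of how such a comparison could be run, again assuming
SQLite as a placeholder database. The functions best_time and
pick_fastest are hypothetical helpers, not part of any library, and
timings on a real server will of course look different.

```python
import sqlite3
import time


def best_time(func, repeats=5):
    """Return the best wall-clock time (seconds) over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def pick_fastest(candidates, repeats=5):
    """Time each zero-argument callable; return (winner_name, timings)."""
    timings = {name: best_time(f, repeats) for name, f in candidates.items()}
    return min(timings, key=timings.get), timings


# Placeholder data set; the real tables would be far larger.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_MyTable (value REAL)")
conn.executemany(
    "INSERT INTO T_MyTable (value) VALUES (?)",
    [(float(i),) for i in range(10000)],
)


def sql_version():
    return conn.execute("SELECT AVG(value) FROM T_MyTable").fetchone()[0]


def python_version():
    values = [row[0] for row in conn.execute("SELECT value FROM T_MyTable")]
    return sum(values) / len(values)


winner, timings = pick_fastest({"sql": sql_version, "python": python_version})
```

Taking the best of several runs reduces noise from caching and other
load, but it is still only a measurement on one machine.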

> I *like* distributed computing.  I've spent much of the
> last eighteen months promoting SOAP, XML-RPC, and CORBA.
> Your mention of Linda and its descendants, including
> T-Spaces, thrilled me.  HOWEVER, I rarely recommend
> distribution for performance objectives, for reasons
> that have mostly appeared already in this thread.  Com-
> mercial applications (as opposed to scientific ones)
> just don't find success that way.

Well, you might be right, I don't know.
I'm a little scared, though, of using SQL too extensively.
I might be too much of an SQL newbie, but there's just some stuff that's
hard to write in (portable) SQL.
For example, I've done some quite fancy calculations using multiple
"DECLARE CURSOR" statements, etc. in MSSQL. However, trying to run these
through MySQL is, well, problematic.
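
One possible way around this, sketched under the assumption that the
row-by-row step is something as simple as a running total: keep the SQL
down to a plain, portable SELECT and do the cursor-style loop in Python.
The function running_totals is a made-up name, and SQLite only stands
in for the real database here.

```python
import sqlite3


def running_totals(conn, query):
    """Yield (key, value, running_total) for each row of `query`.

    `query` is assumed to select exactly two columns, (key, value),
    already in the desired order.
    """
    total = 0.0
    cursor = conn.cursor()
    cursor.execute(query)
    # Iterating over the cursor is an optional DB-API extension that
    # sqlite3 supports; other drivers may need fetchone()/fetchmany().
    for key, value in cursor:
        total += value
        yield key, value, total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_MyTable (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO T_MyTable (id, value) VALUES (?, ?)",
    [(1, 10.0), (2, 5.0), (3, 2.5)],
)

result = list(
    running_totals(conn, "SELECT id, value FROM T_MyTable ORDER BY id")
)
```

The price is that every row travels to the client, so this trades
server-side speed for portability.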

> Your situation might be an exception.  It's hard to know.
> The computations you describe--DB retrievals, elementary
> statistics, ...--sound to me like ones that I've seen
> most successfully hosted on conventional architectures.

I think I'm currently planning on a 90% conventional architecture, with
the possibility of later expansion to distributed computing :-)

For the last 6 months I've been working almost exclusively on a
(commercial) project heavily based on SOAP (not for performance
objectives though :-).
That part really doesn't scare me :-)

-- 
Best Regards
Thomas Jensen
(remove underscore in email address to mail me)



