Using Python for processing of large datasets (convincing management)

Mon Jul 8 05:12:11 EDT 2002

Paul Rubin wrote:
> Thomas Jensen <spam at ob_scure.dk> writes:

> If you're doing all these single row selects without many updates,
> and you're not doing many joins, it really sounds more and more like
> an SQL database isn't the best tool for your task.

I agree. However, the data needs to end up in a SQL database, since we 
have quite a lot of code (ASP code and COM components) depending on it 
being a SQL database (some doing quite complex joins, etc.).

>>Before going on with the distributed approach, I will probably write a
>>"proof of concept" demo. Should this demo show, that it is not worth
>>the effort, I will put it aside for now.
> 
> Fair enough.  You could just check the CPU load on your SQL server
> right now, as your single client runs.

25-50%, but that's not a realistic measure, since the client makes a huge 
number of small selects, which probably makes network latency play some 
role.
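
To illustrate the latency point: each small select pays one network round 
trip, so folding many of them into one grouped query pays it only once. A 
minimal sketch, assuming a made-up "prices" table and using Python's 
built-in sqlite3 module purely as a stand-in for the real MSSQL server 
(the table, columns and function names are all hypothetical):

```python
import sqlite3

# sqlite3 is only a stand-in for the real server; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (fund_id INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [(1, 10.0), (1, 12.0), (2, 20.0), (2, 26.0), (3, 5.0)],
)


def averages_one_by_one(conn, fund_ids):
    """One query per fund: the round-trip cost is paid len(fund_ids) times."""
    result = {}
    for fund_id in fund_ids:
        row = conn.execute(
            "SELECT AVG(price) FROM prices WHERE fund_id = ?", (fund_id,)
        ).fetchone()
        result[fund_id] = row[0]
    return result


def averages_batched(conn):
    """One grouped query for all funds: the round-trip cost is paid once."""
    rows = conn.execute(
        "SELECT fund_id, AVG(price) FROM prices GROUP BY fund_id"
    ).fetchall()
    return dict(rows)
```

Both functions return the same mapping of fund id to average; with an 
in-memory database there is no network involved, so this shows only the 
shape of the two access patterns, not the speed difference.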

> What kinds of calculations are these really?  The only one you've

All kinds, ranging from complex financial calculations to simple AVG, 
MIN, MAX. Sometimes the output of a calculation is smaller (byte-wise) 
than the input, sometimes larger.

> described so far is selecting a bunch of rows and computing the SD of
> the numbers on one column.  It may be fastest to do that with a server
> side stored procedure.

I agree, and indeed we considered writing the entire job using Stored 
Procs. We ditched the idea because:
* MSSQL7 is not integrated with SourceSafe (AFAIK?)
* We generally like the idea of separating data and code.
* (eh, other stuff I don't remember right now :-)
Actually part of the calculations are already written in SPs (for some 
realtime calculations), but to our surprise it didn't show the 
performance one would expect.
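
For comparison, the simple aggregates discussed above (AVG, MIN, MAX and 
the standard deviation of one column) can be computed on the client in a 
single pass over the fetched values. This is only a sketch of one way to 
do it, using Welford's running-variance update; the function name 
"summarize" is made up, and it computes the population SD (divide by n), 
which may or may not match what the server's own function returns:

```python
import math


def summarize(values):
    """Return (count, mean, minimum, maximum, population SD) in one pass."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    lo = math.inf
    hi = -math.inf
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # Welford's update
        lo = min(lo, x)
        hi = max(hi, x)
    if count == 0:
        raise ValueError("summarize() needs at least one value")
    return count, mean, lo, hi, math.sqrt(m2 / count)
```

Since it accepts any iterable, the values can be fed straight from a 
database cursor without building a list first.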

>>I really see all this distribution talk as one among several
>>optimization strategies.
> 
> OK, that's good, as long as you see there's a range of approaches.
> Sometimes all someone will have is a hammer and everything looks like
> a thumb ;-).

Hehe, that's right :-)

> If your data layout is simple enough you might just store it in a
> fixed-width record format, then mmap() it into memory and crunch it
> with a C program (or even a Python program).  That approach is
> generally simple and fast.  It will probably outperform any SQL
> approach by orders of magnitude.

Please see above why this wouldn't work.
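
For readers who have not seen the fixed-width/mmap() technique from the 
quoted paragraph, here is roughly what it looks like in Python. It does 
not address the requirement that the data end up in a SQL database. The 
record layout (a 32-bit id followed by a 64-bit float) and the function 
names are assumptions made up for this sketch:

```python
import mmap
import struct

# Hypothetical fixed-width record: little-endian int32 id + float64 value.
RECORD = struct.Struct("<id")


def write_records(path, rows):
    """Write (id, value) pairs as packed fixed-width records."""
    with open(path, "wb") as f:
        for row_id, value in rows:
            f.write(RECORD.pack(row_id, value))


def sum_values(path):
    """Map the file into memory and add up the value field of each record.

    Note: mmap cannot map an empty file, so the file must hold
    at least one record.
    """
    total = 0.0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for offset in range(0, len(mm), RECORD.size):
                row_id, value = RECORD.unpack_from(mm, offset)
                total += value
    return total
```

Because every record has the same size, record n starts at byte offset 
n * RECORD.size, so any row can be read without scanning the ones before 
it.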

-- 
Best Regards
Thomas Jensen
(remove underscore in email address to mail me)