Using Python for processing of large datasets (convincing management)
Thomas Jensen
spam at ob_scure.dk
Mon Jul 8 05:12:11 EDT 2002
Paul Rubin wrote:
> Thomas Jensen <spam at ob_scure.dk> writes:
> If you're doing all these single row selects without many updates,
> and you're not doing many joins, it really sounds more and more like
> an SQL database isn't the best tool for your task.
I agree, however the data needs to end up in a SQL database, since we
have quite a lot of code (ASP code and COM components) depending on it
being a SQL database (some doing quite complex joins, etc).
>>Before going on with the distributed approach, I will probably write a
>>"proof of concept" demo. Should this demo show, that it is not worth
>>the effort, I will put it aside for now.
>
> Fair enough. You could just check the CPU load on your SQL server
> right now, as your single client runs.
25-50%, but that's not a realistic measure, since the client makes a huge
amount of small selects, which probably makes network latency play some
role.
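
To illustrate the latency point, here is a minimal sketch comparing one
select per row with a single batched select. It uses Python's sqlite3
module with an in-memory database purely as a stand-in (the real server
here is MSSQL), and the table name "prices" and both function names are
made up for the example. An in-memory SQLite has no network hop, so this
only shows the shape of the two access patterns, not the timing.

```python
import sqlite3


def make_db():
    # Hypothetical table; stands in for whatever the real job reads.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE prices (id INTEGER PRIMARY KEY, value REAL)")
    conn.executemany(
        "INSERT INTO prices (id, value) VALUES (?, ?)",
        [(i, i * 1.5) for i in range(1, 101)],
    )
    return conn


def fetch_one_by_one(conn, ids):
    # One query (and, against a remote server, one round trip) per id.
    result = {}
    for i in ids:
        row = conn.execute(
            "SELECT value FROM prices WHERE id = ?", (i,)
        ).fetchone()
        if row is not None:
            result[i] = row[0]
    return result


def fetch_batched(conn, ids):
    # A single query for the whole batch of ids.
    ids = list(ids)
    if not ids:
        return {}
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        "SELECT id, value FROM prices WHERE id IN (%s)" % placeholders, ids
    ).fetchall()
    return dict(rows)
```

Both functions return the same dictionary; the batched one just asks
the server once instead of len(ids) times. Most drivers cap the number
of parameters per statement, so very large batches would need chunking.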
> What kinds of calculations are these really? The only one you've
All kinds, ranging from complex financial calculations to simple AVG,
MIN, MAX. Sometimes the output of a calculation is smaller (byte-wise)
than the input, sometimes larger.
> described so far is selecting a bunch of rows and computing the SD of
> the numbers on one column. It may be fastest to do that with a server
> sided stored procedure.
I agree, and indeed we considered writing the entire job using Stored
Procs. We ditched the idea because:
* MSSQL7 is not integrated with SourceSafe (AFAIK?)
* We generally like the idea of separating data and code.
* (eh, other stuff I don't remember right now :-)
Actually part of the calculations are already written in SPs (for some
realtime calculations), but to our surprise it didn't show the
performance one would expect.
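
For the one calculation named in the quote above (the SD of one column),
a client-side version could look roughly like the sketch below. It is
not the code used in the actual job; the function name is made up, and
it assumes the sample standard deviation (dividing by n - 1) is wanted.
It uses Welford's single-pass update, so the rows can be consumed
straight from a cursor without keeping them all in memory.

```python
import math


def running_sd(values):
    # Welford's algorithm: one pass, numerically stable.
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n < 2:
        raise ValueError("need at least two values")
    # Assumption: sample SD (n - 1); use n for the population SD.
    return math.sqrt(m2 / (n - 1))
```

Any iterable of numbers works as input, e.g. a generator yielding one
column value per fetched row.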
>>I really see all this distribution talk as one among several
>>optimization strategies.
>
> OK, that's good, as long as you see there's a range of approaches.
> Sometimes all someone will have is a hammer and everything looks like
> a thumb ;-).
Hehe, that's right :-)
> If your data layout is simple enough you might just store it in a
> fixed-width record format, then mmap() it into memory and crunch it
> with a C program (or even a Python program). That approach is
> generally simple and fast. It will probably outperform any SQL
> approach by orders of magnitude.
Please see above why this wouldn't work.
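
For reference, the fixed-width record plus mmap() idea from the quote
might look something like this in Python. This is only a sketch: the
record layout (a 32-bit id followed by a 64-bit float, little-endian,
no padding) and the function names are assumptions for the example,
not anything from the real data.

```python
import mmap
import os
import struct

# Hypothetical layout: int32 id + float64 value = 12 bytes per record.
RECORD = struct.Struct("<id")


def write_records(path, rows):
    # Write (id, value) pairs as fixed-width binary records.
    with open(path, "wb") as f:
        for rec_id, value in rows:
            f.write(RECORD.pack(rec_id, value))


def mean_value(path):
    # Map the file read-only and average the value field of every record.
    if os.path.getsize(path) == 0:
        return 0.0  # an empty file cannot be mapped
    total = 0.0
    count = 0
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            for offset in range(0, len(mm), RECORD.size):
                _rec_id, value = RECORD.unpack_from(mm, offset)
                total += value
                count += 1
        finally:
            mm.close()
    return total / count
```

Because every record has the same size, record k starts at byte offset
k * RECORD.size, so random access needs no index.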
--
Best Regards
Thomas Jensen
(remove underscore in email address to mail me)