[Tutor] toy program to find standard deviation of 2 columns of a sqlite3 database

Peter Otten __peter__ at web.de
Sun Jul 3 05:02:19 EDT 2022


On 03/07/2022 06:23, Manprit Singh wrote:
> Yes it is obviously a homework kind of thing ....I do create problems for
> myself , try to solve and  try to find better ways .
>
> I was trying to learn the use of create- aggregate() .
>
> Thank you for the hint that is given by you . Let me try .
>
> My purpose is not to find the std dev. It is actually to learn how to use
> the functions
>
>
> On Sun, 3 Jul, 2022, 09:28 , <avi.e.gross at gmail.com> wrote:
>
>> Maybe a dumb question but why the need to do a calculation of a standard
>> deviation so indirectly in SQL and in the database but starting from
>> Python?
>>
>> R has a built-in function that calculates a standard deviation. You can
>> easily save it where you want after.
>>
>> As for memory use in general, there are several ways to calculate a
>> standard
>> deviation but there is a tradeoff. You could read in an entry at a time and
>> add it to a continuing sum while keeping track of the number of entries.
>> You
>> then calculate the mean. Then you can read it al in AGAIN and calculate the
>> difference between each number and the mean, and do the rest of the
>> calculation by squaring that and so on as you sum that and finally play
>> with
>> a division and a square root.
>>
>> But that may not be needed except with large amounts of data.
>>
>> What am I missing? Is this an artificial HW situation?
>>
>>
>> -----Original Message-----
>> From: Tutor <tutor-bounces+avi.e.gross=gmail.com at python.org> On Behalf Of
>> Manprit Singh
>> Sent: Saturday, July 2, 2022 10:50 PM
>> To: tutor at python.org
>> Subject: [Tutor] toy program to find standard deviation of 2 columns of a
>> sqlite3 database
>>
>> Dear sir ,
>>
>> I have tried writing a program in which I am calculating the population
>> standard deviation of two columns X1  & X2 of a table of sqlite3  in -
>> memory database .
>> import sqlite3
>> import statistics
>>
>> class StdDev:
>>      def __init__(self):
>>          self.lst = []
>>
>>      def step(self, value):
>>          self.lst.append(value)
>>
>>      def finalize(self):
>>          return statistics.pstdev(self.lst)
>>
>>
>> con = sqlite3.connect(":memory:")
>> cur = con.cursor()
>> cur.execute("create table table1(X1 int, X2 int)") ls = [(2, 4),
>>        (3, 5),
>>        (4, 7),
>>        (5, 8)]
>> cur.executemany("insert into table1 values(?, ?)", ls)
>> con.commit()
>> con.create_aggregate("stddev", 1, StdDev) cur.execute("select stddev(X1),
>> stddev(X2) from table1")
>> print(cur.fetchone())
>> cur.close()
>> con.close()
>>
>> prints the output as :
>>
>> (1.118033988749895, 1.5811388300841898)
>>
>> which is correct .
>>
>> My question is, as you can see i have used list inside the class StdDev,
>> which
>>
>> I think is an inefficient way to do this kind of problem because there may
>> be
>>
>> a large number of values in a column and it can take a huge amount of
>> memory.
>>
>> Can this problem be solved with the use of iterators ?

I don't think that you can convert the callback into a generator here;
and if you could it probably wouldn't help as the statistics module uses
a two-pass algorithm.

You could switch to a simpler algorithm like the one used by some old
calculators that only keep track of sum(xi), sum(xi*xi) and count, and
calculate stddev from these in the finalize() method.

  What would be the
>> best
>>
>> approach to do it ?
>>
>> Regards
>>
>> Manprit Singh
>> _______________________________________________
>> Tutor maillist  -  Tutor at python.org
>> To unsubscribe or change subscription options:
>> https://mail.python.org/mailman/listinfo/tutor
>>
>>
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor
>



More information about the Tutor mailing list