[Tutor] toy program to find standard deviation of 2 columns of a sqlite3 database

Mon Jul 4 01:20:45 EDT 2022

Manprit,

Did you just violate your condition to not keep the entire list of results
in memory with your use of pandas? Yes, what you want is straightforward
when you are not trying to do it some other way.

Your earlier comment though suggested you were more interested in just using
a database as some kind of portable format. Pandas can equally trivially
open a CSV file which strikes me as even more portable.  Other formats with
more flexibility that you can bring in fairly portably can be in JSON format
or XML and so on. No reason not to use a database but a great strength of a
database is that it can be easy to do much more complex queries such as only
getting the data selectively such as the numbers for group A or those with
dates in some range. You can of course do the same thing with data you load
into pandas or numpy or any other format, but would need to do it within
python.

As mentioned earlier, if the size of memory is a concern, then asking the
database to calculate things like a standard deviation simply pushes the
memory usage elsewhere. If it is inside your own computer, that does not
sound like serious savings unless SQLLITE uses a method that does not keep
everything in memory. 

-----Original Message-----
From: Tutor <tutor-bounces+avi.e.gross=gmail.com at python.org> On Behalf Of
Manprit Singh
Sent: Sunday, July 3, 2022 9:44 PM
To: tutor at python.org
Subject: Re: [Tutor] toy program to find standard deviation of 2 columns of
a sqlite3 database

Dear Sir,

In Pandas, handling an sql query is so simple as given below:
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table test(i, j)")
ls = [(2, 4), (3, 5), (4, 2), (7, 9)]
cur.executemany("insert into test(i, j) values (?, ?)", ls)

pd.read_sql("select i, j from test", con).std(ddof=0)

will give the desired result:

i    1.870829
j    2.549510
dtype: float64

On Mon, Jul 4, 2022 at 2:13 AM Dennis Lee Bieber <wlfraed at ix.netcom.com>
wrote:

> On Sun, 3 Jul 2022 22:29:41 +0530, Manprit Singh 
> <manpritsinghece at gmail.com> declaimed the following:
>
> >Now as sum is an aggregate function, same way population standard 
> >deviation is also an aggregate function. We should be able to make a 
> >user defined function
> >
>
>         It is also superfluous: SQLite3 already has count(), sum() and 
> even
> avg() built-in (though it lacks many of the bigger statistical 
> computations
> -- variance, std. dev, covariance, correlation, linear regression -- 
> that many of the bigger client/server RDBMs support).
>
> >
> >for getting mean.I am not getting the mechanism to subtract the mean 
> >from each value of the column in the same step method or by any other 
> >way
> >
>         Note that the definition for creating aggregates includes 
> something for number of arguments. Figure out how to specify multiple 
> arguments and you might be able to have SQLite3 provide "current item"
> and "mean" (avg) to the step() method. I'm not going to take the time 
> to experiment (for the most part, I'd consider it simpler to just grab 
> the entire dataset from the database, and run the number crunching in 
> Python, rather than the overhead of having SQLite3 invoke a Python 
> "callback" method for each item, just to be able to have the SQLite3 
> return a single computed value.
>
>
> >Btw I would like to write a one liner to calculate population std 
> >deviation  of a list:
> >lst = [2, 5, 7, 9, 10]
> >mean = sum(lst)/len(lst)
> >std_dev = (sum((ele-mean)**2 for ele in lst)/len(lst))**0.5
> >print(std_dev)
>
>         Literally, except for the imports, that is just...
>
> print(statistics.pstdev(lst))
>
> >>> import math as m
> >>> import statistics as s
> >>> lst = [2, 5, 7, 9, 10]
> >>> print(s.pstdev(lst))
> 2.870540018881465
> >>>
>
>         Going up a level in complexity (IE -- not using the imported
> pstdev())
>
> >>> print("Population Std. Dev.: %s" % m.sqrt( s.mean( (ele -
> >>> s.mean(lst))
> ** 2 for ele in lst)))
> Population Std. Dev.: 2.870540018881465
> >>>
>
>         This has the problem  that it invokes mean(lst) for each 
> element, so may be slower for large data sets (that problem will also 
> exist if you manage a multi-argument step() for SQLite3).
>
>         Anytime you have
>
>                 sum(equation-with-elements-of-data) / len(data)
>
> you can replace it with just
>
>                 mean(equation...)
>
>
> --
>         Wulfraed                 Dennis Lee Bieber         AF6VN
>         wlfraed at ix.netcom.com
> http://wlfraed.microdiversity.freeddns.org/
>
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor
>
_______________________________________________
Tutor maillist  -  Tutor at python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor