[Tutor] toy program to find standard deviation of 2 columns of a sqlite3 database

Sun Jul 3 12:59:41 EDT 2022

Dear Sir,

Leaving all that standard deviation thing. I would say something about
python's sqlite3 module . There is a
create_aggregate(*name*, *n_arg*, *aggregate_class*)  function there, which
can create a user defined aggregate function .
This function requires an aggregate class as argument, this class must
contain a step method and finalize method, as
per written in Python documentation .

I just want to learn about using this create_aggregate().
So far what i have concluded is the step method of the class is called for
each element of the column and finalize is to
return the final result of the aggregate.

Again coming to the example given in the documentation :

import sqlite3
class MySum:
    def __init__(self):
        self.count = 0

    def step(self, value):
        self.count += value

    def finalize(self):
        return self.count
con = sqlite3.connect(":memory:")con.create_aggregate("mysum", 1,
MySum)cur = con.cursor()cur.execute("create table
test(i)")cur.execute("insert into test(i) values
(1)")cur.execute("insert into test(i) values (2)")cur.execute("select
mysum(i) from test")print(cur.fetchone()[0])
con.close()

The answer is 3 , which is the correct answer and is the sum of values
in the column named i in the table test.It was easy to implement as
the need only

is to add all values of the column. For each value of the column i,
the step is called and each value of i gets added to self.count. at
last self.count

will represent the sum of all values of the column and is returned
through the finalize method.

Now as sum is an aggregate function, same way population standard
deviation is also an aggregate function. We should be able to make a
user defined function

to find the population standard deviation of a column or multiple
columns of a sqlite3 database . Hopefully you agree ?

Now to find this if i am going to write the class, in the step method,
i can only count the values in the column and find the sum of all
values

for getting mean.I am not getting the mechanism to subtract the mean
from each value of the column in the same step method or by any other
way

in the class.

Hopefully my question is more clear now.

Btw I would like to write a one liner to calculate population std
deviation  of a list:
lst = [2, 5, 7, 9, 10]
mean = sum(lst)/len(lst)
std_dev = (sum((ele-mean)**2 for ele in lst)/len(lst))**0.5
print(std_dev)

prints
2.870540018881465 which is the right answer.

Regards

Manprit Singh

On Sun, Jul 3, 2022 at 9:10 PM <avi.e.gross at gmail.com> wrote:

> Manprit,
>
> At some point, questions are not about Python but are about a specialized
> package or even about SQL statements.
>
> I can understand your wanting to learn by experimentation and the gist of
> your programming ideas seems to be to use Python and a set of functions to
> drive an SQL database when, as has been pointed out, it can be done
> directly
> in the database.
>
> But I think you have made your question clearer and you seem focused on
> ways
> to minimize memory use such as you might find on a device like Raspberry
> Pi.
>
> So if you asked without the SQL part, people might get into the question.
>
> If I understand it, you wish to make your own class, StdDev, to somehow
> manage getting the data incrementally and calculating a standard deviation,
> rather than reading it all at once. Is that the question?
>
> Your code, of course, makes absolutely no sense as written as all it does
> is
> create a few items and stores them in a new table, just so it can get them
> again! So of course you already fill your memory with the list you made.
> The
> code you want to share with us does not need any of these steps. It needs
> to
> start with a database out there that you want to read from. There is
> nothing
> wrong with your code just that it gets in the way of seeing what you want
> to
> do.
>
> You then leave some of us (meaning me) having to do research to look up
> what
> the heck create_aggregate() does and when I found out, I stopped wanting to
> continue.
>
> Someone else may want to help you but I am heading our for a long drive and
> already answered you in a way that I find reasonable.
>
> Look up the definition for a variance and then standard deviation. Decide
> which version to use and note some divide by N and some by N-1 depending on
> whether you have all of the population or a sample. Overall the scheme is
> to
> take each number minus the mean and square it and sum that and finally
> divide by N-1 for the variance and then take the square root of that for
> the
> standard deviation. It can be done trivially using functions already
> available but that does use memory all at once.
>
> If you want to fetch one pair of numbers at a time, the algorithm is fairly
> simple.
>
> Start with two accumulator variables, one for each of the numbers you want
> to calculate the standard deviation for.
>
> In a loop, read one row at a time, or some small number of rows like 100.
> Add the right numbers to the right accumulator, handling any NA values if
> needed, and keeping track of how many valid items in each you processed.
>
> When done, calculate the mean of each.
>
> Restart a new query and again in a loop get your numbers one at a time (or
> a
> hundred) and this time use a new accumulator in which you keep adding the
> current number minus the mean calculated and squared.
>
> When all the data is calculated, you have the sums. Divide by N-1 then take
> the square root.
>
> So you want to know how to use something that only gets one row of data at
> a
> time. That is here not really a Python issue as you are able to call some
> function in your module that presumably gets the next row of an active
> query. If you can call that directly, fine. If you want to hide it in an
> iterator, also fine.
>
> Given your current class, your question for this part of the exercise seems
> to be how to rewrite the class. But my perspective may not match yours as
> you seem to want to hand the object to a function that calls on it somehow
> repeatedly to get the task done. You show no code as to how you might do it
> in a way I am thinking of.
>
> The dunder init section creates an empty list. You no longer want the list.
> So what do you want? My guess is this could be the place you open a
> connection to the database or a specific query.
>
> Your step() function presumably no longer wants to append a value to the no
> longer existent list. What makes sense for you here? Is the calculation
> happening inside the class? If so, the step not only gets a row of data but
> is perhaps working on summing to eventually calculate a mean. As noted
> above, what I am talking about requires two passes through the data so two
> sets of steps like this. Maybe you want a step1() that is summing the
> original data and another function called step2() that is in some ways
> similar and sums the second part once it has a mean.
>
> And what does your finalize() method do? Right now it returns the
> calculation on the list that you no longer want to use. In the outline I am
> sketching, you might want to use it just to return the computed values and
> maybe close the database connection.
>
> Maybe I am confused but you seem to ask to calculate the standard deviation
> for TWO columns of data but your code seems to also work on one set of
> numbers. You need to get that straight.
>
> Have fun.
>
> I am dropping out of this one.
>
>
>
>
> -----Original Message-----
> From: Tutor <tutor-bounces+avi.e.gross=gmail.com at python.org> On Behalf Of
> Manprit Singh
> Sent: Sunday, July 3, 2022 3:55 AM
> To: tutor at python.org
> Subject: Re: [Tutor] toy program to find standard deviation of 2 columns of
> a sqlite3 database
>
> Dear Sir,
>
> I have chosen this standard deviation as an exercise because there are two
> steps: first you have to find the mean, then subtract the mean from each
> value of the column .
>
> Writing an aggregate function for this using python's sqlite3 seems a
> little
> difficult as there is only single step function inside the class, used to
> make that . Kindly put some light .
>
> This is just an exercise to understand how this create_aggregate() works .
> Kindly help
> Regards
> Manprit Singh
>
>
>
> On Sun, Jul 3, 2022 at 9:53 AM Manprit Singh <manpritsinghece at gmail.com>
> wrote:
>
> > Yes it is obviously a homework kind of thing ....I do create problems
> > for myself , try to solve and  try to find better ways .
> >
> > I was trying to learn the use of create- aggregate() .
> >
> > Thank you for the hint that is given by you . Let me try .
> >
> > My purpose is not to find the std dev. It is actually to learn how to
> > use the functions
> >
> >
> > On Sun, 3 Jul, 2022, 09:28 , <avi.e.gross at gmail.com> wrote:
> >
> >> Maybe a dumb question but why the need to do a calculation of a
> >> standard deviation so indirectly in SQL and in the database but
> >> starting from Python?
> >>
> >> R has a built-in function that calculates a standard deviation. You
> >> can easily save it where you want after.
> >>
> >> As for memory use in general, there are several ways to calculate a
> >> standard deviation but there is a tradeoff. You could read in an
> >> entry at a time and add it to a continuing sum while keeping track of
> >> the number of entries.
> >> You
> >> then calculate the mean. Then you can read it al in AGAIN and
> >> calculate the difference between each number and the mean, and do the
> >> rest of the calculation by squaring that and so on as you sum that
> >> and finally play with a division and a square root.
> >>
> >> But that may not be needed except with large amounts of data.
> >>
> >> What am I missing? Is this an artificial HW situation?
> >>
> >>
> >> -----Original Message-----
> >> From: Tutor <tutor-bounces+avi.e.gross=gmail.com at python.org> On
> >> Behalf Of Manprit Singh
> >> Sent: Saturday, July 2, 2022 10:50 PM
> >> To: tutor at python.org
> >> Subject: [Tutor] toy program to find standard deviation of 2 columns
> >> of a
> >> sqlite3 database
> >>
> >> Dear sir ,
> >>
> >> I have tried writing a program in which I am calculating the
> >> population standard deviation of two columns X1  & X2 of a table of
> >> sqlite3  in - memory database .
> >> import sqlite3
> >> import statistics
> >>
> >> class StdDev:
> >>     def __init__(self):
> >>         self.lst = []
> >>
> >>     def step(self, value):
> >>         self.lst.append(value)
> >>
> >>     def finalize(self):
> >>         return statistics.pstdev(self.lst)
> >>
> >>
> >> con = sqlite3.connect(":memory:")
> >> cur = con.cursor()
> >> cur.execute("create table table1(X1 int, X2 int)") ls = [(2, 4),
> >>       (3, 5),
> >>       (4, 7),
> >>       (5, 8)]
> >> cur.executemany("insert into table1 values(?, ?)", ls)
> >> con.commit()
> >> con.create_aggregate("stddev", 1, StdDev) cur.execute("select
> >> stddev(X1),
> >> stddev(X2) from table1")
> >> print(cur.fetchone())
> >> cur.close()
> >> con.close()
> >>
> >> prints the output as :
> >>
> >> (1.118033988749895, 1.5811388300841898)
> >>
> >> which is correct .
> >>
> >> My question is, as you can see i have used list inside the class
> >> StdDev, which
> >>
> >> I think is an inefficient way to do this kind of problem because
> >> there may be
> >>
> >> a large number of values in a column and it can take a huge amount of
> >> memory.
> >>
> >> Can this problem be solved with the use of iterators ? What would be
> >> the best
> >>
> >> approach to do it ?
> >>
> >> Regards
> >>
> >> Manprit Singh
> >> _______________________________________________
> >> Tutor maillist  -  Tutor at python.org
> >> To unsubscribe or change subscription options:
> >> https://mail.python.org/mailman/listinfo/tutor
> >>
> >>
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor
>
>