[Tutor] toy program to find standard deviation of 2 columns of a sqlite3 database

Manprit Singh manpritsinghece at gmail.com
Sun Jul 3 14:12:26 EDT 2022


Dear Sir,

Many Many thanks to Dennis Lee Bieber. (Not for the code he wrote ) But for
the last lines after the code in which he mentions various ways to
calculate Std. Dev (Naive method, two pass method & Welford's Algo) - That
clearly shows I need to study more  . I found out various ways to calculate
Standard deviation. My Question is finally answered.  Will be back with the
implementation once gone through all.

Regards
Manprit Singh


On Sun, Jul 3, 2022 at 10:37 PM Dennis Lee Bieber <wlfraed at ix.netcom.com>
wrote:

> On Sun, 3 Jul 2022 08:19:43 +0530, Manprit Singh
> <manpritsinghece at gmail.com> declaimed the following:
>
>         {Seems gmane wasn't updating yesterday}
>
>
> >
> >My question is, as you can see i have used list inside the class StdDev,
> which
> >
> >I think is an inefficient way to do this kind of problem because there
> may be
> >
> >a large number of values in a column and it can take a huge amount of
> memory.
> >
> >Can this problem be solved with the use of iterators ? What would be the
> best
> >
> >approach to do it ?
> >
>
>         First off... What do you consider a "huge amount of memory"?
>
> >>> lrglst = list(range(100000000))
> >>> sys.getsizeof(lrglst)
> 800000056
> >>>
>
>         That is a list of 100 MILLION integers. In Python, it consumes 800
> Mbytes (plus some overhead).
>
>         Unless you are running your application on a Raspberry-Pi you
> shouldn't
> have any concern for memory (and even then, if you have a spinning disk on
> USB with a swap file it should only slow you down, not crash -- R-PI 4B can
> be had with up to 8GB of RAM, so might not need swap at all). Who is
> running a computer these days with less than 8GB (my decade old system has
> 12GB, and rarely activates swap).
>
>         Granted, each invocation (you have two in the example) will add
> another
> such list... Still, that would only come to 1.6GB of RAM. Which, ideally,
> would be freed up just as soon as the SQL statement finished processing and
> returned the results.
>
>          SQLite3 is already doing iteration -- it invokes the .step()
> method
> for each record in the data set. The only way to avoid collecting that
> (large?) list is to change the algorithm for standard deviation itself --
> and not use the one provided by the statistics module.
>
>         The "naive" algorithm has been mentioned (in association with
> calculators -- which work with single data point entry at a time (well, the
> better ones handled X and Y on each step).
>
>
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Na%C3%AFve_algorithm
>
>         UNTESTED
>
> class sd_Pop:
>         def __init__(self):
>                 self.cnt = 0
>                 self.sum = 0.0
>                 self.sum_squares = 0.0
>
>         def step(self, x):
>                 self.cnt +=1
>                 self.sum += x
>                 self.sum_squares += (x * x)
>
>         def finalize(self):
>                 return sqrt(self.sum_squares -
>                         ((self.sum * self.sum) / self.cnt)) /  self.cnt)
>
>         There... No accumulation of a long list, just one integer and two
> floating point values.
>
>         Note that the Wikipedia link under "Computing shifted data" is a
> hypothetical improvement over the pure "naive" algorithm, and (if you made
> the sample code a class so all those "global" references change to
> self.###) could also fit into the SQLite3 aggregate.
>
>
> --
>         Wulfraed                 Dennis Lee Bieber         AF6VN
>         wlfraed at ix.netcom.com
> http://wlfraed.microdiversity.freeddns.org/
>
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor
>


More information about the Tutor mailing list