[Tutor] toy program to find standard deviation of 2 columns of a sqlite3 database

Sun Jul 3 13:05:56 EDT 2022

On Sun, 3 Jul 2022 08:19:43 +0530, Manprit Singh
<manpritsinghece at gmail.com> declaimed the following:

	{Seems gmane wasn't updating yesterday}

>
>My question is, as you can see i have used list inside the class StdDev, which
>
>I think is an inefficient way to do this kind of problem because there may be
>
>a large number of values in a column and it can take a huge amount of memory.
>
>Can this problem be solved with the use of iterators ? What would be the best
>
>approach to do it ?
>

	First off... What do you consider a "huge amount of memory"? 

>>> lrglst = list(range(100000000))
>>> sys.getsizeof(lrglst)
800000056
>>> 

	That is a list of 100 MILLION integers. In Python, it consumes 800
Mbytes (plus some overhead). 

	Unless you are running your application on a Raspberry-Pi you shouldn't
have any concern for memory (and even then, if you have a spinning disk on
USB with a swap file it should only slow you down, not crash -- R-PI 4B can
be had with up to 8GB of RAM, so might not need swap at all). Who is
running a computer these days with less than 8GB (my decade old system has
12GB, and rarely activates swap).

	Granted, each invocation (you have two in the example) will add another
such list... Still, that would only come to 1.6GB of RAM. Which, ideally,
would be freed up just as soon as the SQL statement finished processing and
returned the results.

	 SQLite3 is already doing iteration -- it invokes the .step() method
for each record in the data set. The only way to avoid collecting that
(large?) list is to change the algorithm for standard deviation itself --
and not use the one provided by the statistics module.

	The "naive" algorithm has been mentioned (in association with
calculators -- which work with single data point entry at a time (well, the
better ones handled X and Y on each step).

https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Na%C3%AFve_algorithm

	UNTESTED

class sd_Pop:
	def __init__(self):
		self.cnt = 0
		self.sum = 0.0
		self.sum_squares = 0.0

	def step(self, x):
		self.cnt +=1
		self.sum += x
		self.sum_squares += (x * x)

	def finalize(self):
		return sqrt(self.sum_squares - 
			((self.sum * self.sum) / self.cnt)) /  self.cnt)

	There... No accumulation of a long list, just one integer and two
floating point values.

	Note that the Wikipedia link under "Computing shifted data" is a
hypothetical improvement over the pure "naive" algorithm, and (if you made
the sample code a class so all those "global" references change to
self.###) could also fit into the SQLite3 aggregate.

-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
	wlfraed at ix.netcom.com    http://wlfraed.microdiversity.freeddns.org/