[issue33084] Computing median, median_high and median_low in statistics library

Luc report at bugs.python.org
Fri Mar 16 17:14:39 EDT 2018


Luc <ouaganet at gmail.com> added the comment:

If we are going to fix this, the median functions should behave the way mean and harmonic_mean in the statistics library do when there are missing values in the data.  That way, the result is consistent with how the rest of the statistics library handles NaNs in the data.  In any case, the behavior should be mentioned somewhere in the docs.

import statistics as stats
import numpy as np
import pandas as pd
data = [75, 90, 85, 92, 95, 80, np.nan]
stats.mean(data)
nan
stats.harmonic_mean(data)
nan
stats.stdev(data)
nan
As you can see, when there is a missing value, computing the mean, harmonic mean and sample standard deviation with the statistics library returns nan.
However, median, median_high and median_low compute an incorrect result when missing values are present in the data.
It would be better to return nan and let the user drop (or resolve) any missing values before computing.
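A minimal sketch of the behavior proposed above, for illustration only. The helper name median_or_nan is hypothetical and not part of the statistics library; it assumes the data holds ints and floats, with missing values represented as float NaN (np.nan is one).

```python
import math
import statistics


def median_or_nan(data):
    """Return nan if any value is NaN, otherwise the ordinary median.

    Hypothetical wrapper, not a statistics library function.
    """
    values = list(data)
    # Only floats can be NaN here; ints pass straight through.
    if any(isinstance(x, float) and math.isnan(x) for x in values):
        return float("nan")
    return statistics.median(values)


with_missing = [75, 90, 85, 92, 95, 80, float("nan")]
without_missing = [75, 90, 85, 92, 95, 80]

print(median_or_nan(with_missing))     # nan
print(median_or_nan(without_missing))  # 87.5
```

With this wrapper the median propagates nan the same way mean, harmonic_mean and stdev do in the session above, leaving it to the caller to decide how missing values should be handled.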
## Another example using a pandas Series
df = pd.DataFrame(data, columns=['data'])
df.head()
        data
0	75.0
1	90.0
2	85.0
3	92.0
4	95.0
5	80.0
6	NaN

### Use the statistics library to compute the median of the Series
stats.median(df['data'])
90
 
## Now use pandas to compute the median of the Series with the missing value
## Pandas returns the correct median by dropping the missing value
df['data'].median()
87.5
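For comparison, the same result can be obtained with the statistics library alone if the missing value is dropped first. This is only a sketch of a user-side workaround, assuming missing values are float NaN; the names clean and result are illustrative.

```python
import math
import statistics

data = [75, 90, 85, 92, 95, 80, float("nan")]

# Drop NaN entries before handing the data to statistics.median.
clean = [x for x in data if not math.isnan(x)]

result = statistics.median(clean)
print(result)  # 87.5, the same value pandas reports
```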

I did not test median_grouped in the statistics library, but I will let you know afterwards if it is affected as well.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue33084>
_______________________________________

