[SciPy-Dev] two new scipy.stats requests code included:

Mon Oct 15 20:25:44 EDT 2018

Thank you for both your replies.  I am very grateful.  

There is no basic z-test or confidence interval in StatsModels or scipy.stats.  I had to write these myself to pass two recent statistics courses I took online with Stanford and Carnegie Mellon.

The first is for quantitative data.  The second is for qualitative (categorical) data.

import math
import scipy.stats as st

def ztest(array1, array2, confidence_desired):
    # array1 is the sample; array2 supplies the population mean and standard deviation.
    standard_error = array2.std() / math.sqrt(len(array1))
    z = (array1.mean() - array2.mean()) / standard_error
    p = st.norm.cdf(z)  # lower-tail (one-sided) p-value
    # ppf(confidence_desired) is a one-sided critical value (about 1.645 for 0.95).
    margin_of_error = st.norm.ppf(confidence_desired) * standard_error
    return (z, p, array1.mean() - margin_of_error, array1.mean() + margin_of_error)

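def ztest_categorical(proportion1, proportion2, proportion1_sample_size):
    # proportion1 is the sample proportion; proportion2 is the population proportion.
    z = (proportion1 - proportion2) / math.sqrt(proportion2 * (1 - proportion2) / proportion1_sample_size)
    p = st.norm.cdf(z)  # lower-tail (one-sided) p-value
    return (z, p)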

Let me know what you think.
Jon Stein

On Oct 15, 2018, at 01:11 PM, josef.pktd at gmail.com wrote:

On Mon, Oct 15, 2018 at 12:45 PM Paul Hobson <pmhobson at gmail.com> wrote:
Hey Jon,

To incorporate this into scipy, you'll need to open a pull request on GitHub: 
https://github.com/scipy/scipy

I'm not a scipy contributor, but I can tell you that you'll also need to include tests that preferably use a (small) published dataset and confirm that your functions reproduce the published results.

Also, I don't think your return statements are behaving the way you think they are. I believe that the preference is now to return a NamedTuple.
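
A minimal sketch of what a namedtuple return could look like for the one-sample z-test. The names ZTestResult and ztest_1sample and the field names are hypothetical, not an existing scipy API. The sketch uses the standard library's statistics.NormalDist in place of scipy.stats.norm so that it runs without dependencies, and it assumes a two-sided p-value and a two-sided confidence interval, which differs from the one-sided calculations in the posted code.

```python
import math
from collections import namedtuple
from statistics import NormalDist, fmean

# Hypothetical result type: fields are accessible by name or by position.
ZTestResult = namedtuple("ZTestResult", ["statistic", "pvalue", "ci_low", "ci_high"])


def ztest_1sample(sample, population_mean, population_std, confidence=0.95):
    """One-sample z-test with a known population standard deviation."""
    norm = NormalDist()  # standard normal, stand-in for scipy.stats.norm
    sample_mean = fmean(sample)
    standard_error = population_std / math.sqrt(len(sample))
    z = (sample_mean - population_mean) / standard_error
    # Two-sided p-value: probability of a |z| at least this large under the null.
    pvalue = 2.0 * (1.0 - norm.cdf(abs(z)))
    # Two-sided critical value (assumption: symmetric interval around the mean).
    z_crit = norm.inv_cdf(0.5 + confidence / 2.0)
    margin = z_crit * standard_error
    return ZTestResult(z, pvalue, sample_mean - margin, sample_mean + margin)


result = ztest_1sample([5.1, 4.9, 5.3, 5.2, 5.0], population_mean=5.0, population_std=0.2)
print(result.statistic, result.pvalue, result.ci_low, result.ci_high)
```

Callers can then write result.pvalue instead of picking values out of a tuple that also contains label strings.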

Hope that helps,
-Paul

On Mon, Oct 15, 2018 at 2:54 AM Jon Stein <oneday2one at icloud.com> wrote:
Scipy-dev,

Two additions to the scipy.stats module are missing and needed:

One addition is needed for a one sample z-test including confidence interval when the population mean and standard deviation are known:

def ztest(array_A, population_mean, population_stdv, level_of_confidence=0.95):
    z_statistic = (array_A.mean() - population_mean) / (population_stdv / math.sqrt(len(array_A)))
    p_value = st.norm.cdf(z_statistic)
    standard_error = population_stdv / math.sqrt(len(array_A))
    margin_of_error = st.norm.ppf(level_of_confidence) * standard_error
    MoE = margin_of_error
    return('z statistic =', z_statistic, 'p-value =', p_value, array_A.mean() - MoE, array_A.mean() + MoE)

And one addition is needed for a one-sample z-test for a categorical sample (*not quantitative*):

def ztest_1sample_categorical(sample_proportion, population_proportion, sample_size):
    sp, pp = sample_proportion, population_proportion
    z = (sp - pp) / math.sqrt((pp * (1 - pp)) / sample_size)
    p = st.norm.cdf(z)
    return('z statistic =', z, 'p value =', p)

Let me know what you think.
Jon Stein

I think some discussion and decisions are needed for whether and how to add this.

None of the hypothesis tests currently returns a confidence interval.
Tuples are a pain because we cannot just return additional results without breaking backwards compatibility.
Both ztests are based on summary statistics, for which scipy.stats already has some cases.
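
To illustrate the backwards-compatibility problem with tuples: if a function that returns (statistic, pvalue) starts returning four values, every caller that writes `z, p = f(...)` breaks. One possible workaround, sketched below under the assumption that extra results can live as attributes, is a tuple subclass that still unpacks to two values. The class name ZResult and the attribute confidence_interval are hypothetical and not part of scipy.stats.

```python
class ZResult(tuple):
    """Unpacks as (statistic, pvalue) but can carry extra named attributes."""

    def __new__(cls, statistic, pvalue, **extras):
        # The tuple itself only ever holds the two original values.
        self = super().__new__(cls, (statistic, pvalue))
        self.statistic = statistic
        self.pvalue = pvalue
        # Extra results are attached as attributes, not as tuple items.
        for name, value in extras.items():
            setattr(self, name, value)
        return self


# Existing calling code keeps working: the result still unpacks to two values.
new = ZResult(1.118, 0.2636, confidence_interval=(4.92, 5.28))
z, p = new

# Newer code can reach the additional result by name.
print(new.confidence_interval)
```

The trade-off is that the extra attributes are invisible to code that treats the result purely as a tuple.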

Adding special cases like ztest_1sample_categorical opens up a large set of statistical functions that could similarly be added, e.g. for poisson rates.
Additionally, some tests have a choice of methods across statistics packages, e.g. using pp corresponds to a score test (variance under the Null). An alternative is to use the variance based on sp, which corresponds to a Wald test.
In the statsmodels version there is an extra option, but it doesn't have the correct default.
For a two sample version for comparing proportions, the number of options and available methods becomes much larger.
(Development for this in statsmodels is slow because I only find time every once in a while to review or prepare PRs
https://github.com/statsmodels/statsmodels/pull/4829 )
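
The score versus Wald distinction for the one-sample proportion test can be sketched as follows. The function name proportion_ztest and its variance argument are hypothetical, chosen only for this illustration; the sketch assumes a two-sided p-value and uses the standard library's statistics.NormalDist rather than scipy.stats.norm so that it runs without dependencies.

```python
import math
from statistics import NormalDist


def proportion_ztest(sample_proportion, null_proportion, sample_size, variance="score"):
    """One-sample z-test for a proportion with a selectable variance estimate."""
    if variance == "score":
        # Score test: variance evaluated under the null hypothesis (uses pp).
        p_var = null_proportion
    elif variance == "wald":
        # Wald test: variance evaluated at the sample estimate (uses sp).
        p_var = sample_proportion
    else:
        raise ValueError("variance must be 'score' or 'wald'")
    standard_error = math.sqrt(p_var * (1.0 - p_var) / sample_size)
    z = (sample_proportion - null_proportion) / standard_error
    # Two-sided p-value (assumption; the posted code uses a one-sided cdf).
    pvalue = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, pvalue


z_score, p_score = proportion_ztest(0.6, 0.5, 100, variance="score")
z_wald, p_wald = proportion_ztest(0.6, 0.5, 100, variance="wald")
print(z_score, p_score)
print(z_wald, p_wald)
```

With the same inputs the two variants give different statistics (about 2.00 for the score version and about 2.04 for the Wald version here), which is why the choice of default matters.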

I think some overlap in basic statistics functions between scipy.stats and statsmodels is useful. However, the question of where to draw the boundary is always open.

Josef

_______________________________________________
SciPy-Dev mailing list
SciPy-Dev at python.org
https://mail.python.org/mailman/listinfo/scipy-dev