[SciPy-Dev] Proposal for a new function nanpdist that treats NaNs as missing values

Fri Aug 22 13:06:06 EDT 2014

Thank you for your response, Nathaniel.

I was a bit concerned that going into the application would turn this
into a discussion about the method rather than about whether this is a
desirable concept for SciPy. I suppose it's not possible to fully separate
the two issues, so I will indulge you.

On Fri, Aug 22, 2014 at 4:33 PM, Nathaniel Smith <njs at pobox.com> wrote:

>
> Just as a scientific issue this seems very odd to me and not at all
> what statisticians usually mean by missing data. Surely if you want to
> determine "which treatments introduce similar gene expression
> patterns" then two treatments that both produce no effect on the
> expression of the same gene should be counted as more similar to each
> other? If you've measured an expression change to be near 0 then
> that's a known measured value that happens to be near 0 -- not an
> unknown value that could be arbitrarily large or small and you have no
> idea which. (Obviously I don't know any of the details about your
> setting, but in particular I worry that your reasoning sounds similar
> to common misconceptions about what "significant" actually means. "Not
> significantly different from zero" might well be "significantly
> different from 1000".)
>

Since I didn't want the discussion to be about the method, I tried to
describe the situation briefly and did not give you the whole story. My
apologies.

The real situation is the following: The gene expression data are mapped
onto pathways using information on links between proteins and coding genes.
The pathway definitions come from a multitude of source databases and were
collected in a single database (http://consensuspathdb.org/). Each pathway
with five or more available scores is assigned a mean score; pathways with
fewer scores are not considered (the cut-off is somewhat arbitrary, I
suppose). You can read up on more
specifics in [1]. So I consider those pathways that did not make the
cut-off of 5 scores as "missing values". If all the treatments had missing
values at the same pathways, I'd be tempted to just throw those out. We are
considering treatments from different studies, however, and the studies
report gene expression changes for different genes, and consequently
different pathways end up having no scores. I still want to be able to
compare treatments between different studies. One approach could be to
rethink the scoring of pathways and introduce an uncertainty that is larger
for pathways with missing scores. But since I'm sitting at the end of a
pipeline that lands the treatments and pathway response scores in my lap,
my preferred way of dealing with this is to simply scale up the distance
between two treatments when one has a pathway score and the other is
missing it.
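
To make the idea concrete, here is a minimal sketch of the two steps
described above: assigning a pathway a mean score only when at least five
scores are available, and computing a distance that is scaled up when a
pathway score is present for one treatment but missing for the other. The
function names (pathway_score, scaled_nan_euclidean) and the exact scaling
rule (squared distance over the shared pathways multiplied by the ratio of
pathways scored in at least one treatment to pathways scored in both) are
assumptions made for illustration. This is not the proposed nanpdist
implementation.

```python
import numpy as np

MIN_SCORES = 5  # cut-off of five available scores, as described above


def pathway_score(gene_scores, min_scores=MIN_SCORES):
    """Mean of the available scores, or NaN if fewer than min_scores exist.

    Hypothetical helper: NaN marks a pathway that missed the cut-off.
    """
    gene_scores = np.asarray(gene_scores, dtype=float)
    available = gene_scores[~np.isnan(gene_scores)]
    if available.size < min_scores:
        return np.nan
    return available.mean()


def scaled_nan_euclidean(u, v):
    """Euclidean distance over pathways scored in both u and v, scaled up.

    Assumed scaling rule: the squared distance over the shared pathways is
    multiplied by n_any / n_shared, where n_any counts pathways scored in
    at least one of the two treatments. Pathways missing in both treatments
    do not contribute to the scaling.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    has_u = ~np.isnan(u)
    has_v = ~np.isnan(v)
    shared = has_u & has_v
    n_shared = shared.sum()
    if n_shared == 0:
        # No pathway is scored in both treatments: distance is undefined.
        return np.nan
    n_any = (has_u | has_v).sum()
    squared = np.sum((u[shared] - v[shared]) ** 2)
    return np.sqrt(squared * n_any / n_shared)


# Two treatments with pathway scores; NaN marks a missing pathway score.
treatment_a = [1.0, 2.0, np.nan, 4.0]
treatment_b = [1.0, np.nan, np.nan, 2.0]
distance = scaled_nan_euclidean(treatment_a, treatment_b)
```

With this rule, two treatments that share all their scored pathways get the
ordinary Euclidean distance, and each pathway scored in only one of them
inflates the distance.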

If this seems unreasonable to you, I'm all ears. It does make sense in my
mind.

Cheers,
Moritz

[1] http://toxsci.oxfordjournals.org/content/124/2/278.full, in particular
the subsection "pathway response analysis"