[SciPy-User] Probability Density Estimation

Fri Apr 29 17:46:25 EDT 2011

On Fri, Apr 29, 2011 at 04:06:46PM -0400, josef.pktd at gmail.com wrote:
> I need to play with it to see where it differs from the scipy or my
> older version (except for the exclude trimming).
> I didn't see yet how you set the bandwidth from 2 samples to use for
> each individual, but it sounds like an interesting idea that I need to
> check for my case in mutual information.

I do this:
  fC = xkde(C)
  H = fC.getBandwidth()
  fA = xkde(A,bandwidth=H)
  fB = xkde(B,bandwidth=H)
where A and B are the samples of X where Y=0 and Y=1 respectively,
and C is the complete sample of X (i.e. the union of A and B).
This calculates the three KDEs with the same bandwidth.
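
The shared-bandwidth idea can be sketched with plain numpy.  This is
only an illustration and not the xkde code: the helper names
silverman_bandwidth and gaussian_kde_1d are hypothetical, the data are
synthetic, the sample is one-dimensional, and the rule-of-thumb
bandwidth is an assumption that may differ from what xkde does
internally.

```python
import numpy as np

def silverman_bandwidth(sample):
    """Rule-of-thumb bandwidth for a 1-D sample (assumed here; xkde may differ)."""
    sample = np.asarray(sample, dtype=float)
    return 1.06 * sample.std(ddof=1) * sample.size ** (-1.0 / 5.0)

def gaussian_kde_1d(sample, bandwidth):
    """Return a function that evaluates a fixed-bandwidth Gaussian KDE."""
    sample = np.asarray(sample, dtype=float)
    norm = sample.size * bandwidth * np.sqrt(2.0 * np.pi)

    def density(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        # One row per evaluation point, one column per sample point.
        z = (x[:, None] - sample[None, :]) / bandwidth
        return np.exp(-0.5 * z ** 2).sum(axis=1) / norm

    return density

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=300)   # synthetic stand-in for X where Y=0
B = rng.normal(0.5, 1.0, size=200)   # synthetic stand-in for X where Y=1
C = np.concatenate([A, B])           # the complete sample

H = silverman_bandwidth(C)           # bandwidth chosen once, from C
fC = gaussian_kde_1d(C, H)
fA = gaussian_kde_1d(A, H)           # same H reused for both subsamples
fB = gaussian_kde_1d(B, H)
```

With a common bandwidth, fC is exactly the size-weighted mixture of fA
and fB, so the three estimates are mutually consistent by construction.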

I experimented a bit using the exclude argument without finding
any evidence that it is useful.

> You didn't specify a license, which would be good so that readers know
> what they are allowed to do with the code.

I didn't, and I will look into this before I publish any larger
quantity of code.  I simply have not got around to it yet.

I see no distinction between source code and other text; in other words,
I expect proper citation and acknowledgement.

> And as an observation: I find the literate programming comments pretty
> distracting (at least reading it in the browser without a highlighting
> editor, wrong intend for my taste) and would prefer numpy style
> docstrings, e.g. I didn't find a description of exclude.

Agreed.  It is really written to be compiled and read from PDF.

I am not satisfied with the solution, but I really need this feature
of literate programming.  The docstrings can certainly be improved.
Do you have a good pointer to good practice?  You mention
numpy style; should I look in the numpy docs?

> some details I'm not sure about when reading it, for example
> 
> idx = [ i for i in xrange(self.n) if B[:,i].all() ]
> isn't this the same as
> idx = np.nonzero(B.all(0))

Probably.  I am not sure how numpy.nonzero deals with Boolean
values.
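
A small check of the two forms, on a made-up Boolean array.  It assumes
that self.n in the original is the number of columns of B.
numpy.nonzero treats True as nonzero, but it returns a tuple with one
index array per dimension, so the 1-D case needs [0] (or
numpy.flatnonzero) to get the same indices as the list comprehension.

```python
import numpy as np

# Made-up Boolean matrix: columns 0 and 3 are True in every row.
B = np.array([[True, True, False, True],
              [True, False, False, True]])

# Loop form, with range(B.shape[1]) standing in for xrange(self.n).
idx_loop = [i for i in range(B.shape[1]) if B[:, i].all()]

# Vectorised forms: all(axis=0) reduces over rows, one Boolean per column.
idx_tuple = np.nonzero(B.all(axis=0))     # a tuple holding one index array
idx_np = idx_tuple[0]                     # the index array itself
idx_flat = np.flatnonzero(B.all(axis=0))  # same indices, no tuple
```

The result is a numpy integer array instead of a Python list, which
matters only if the code later relies on list methods.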

> ahmadlin looks like it would be a nice extension to scipy.stats also
> as a standalone function (like your contEntropy), if the automatic
> bandwidth choice can be made robust enough.

Sure.  I am really not sure how to go about testing the robustness.
Are you doing research in a similar area?  Information Theory?
There is not an awful lot in the literature about multivariate
continuous entropy, so there might be room for publishing some
theory here (or even experimental evaluations of the little 
theory there is), if we can develop it.

The reason for the two identifiers, ahmadlin and contEntropy,
is that I was considering estimates other than that of
Ahmad and Lin.  Thus contEntropy may change drastically
while ahmadlin should not.  I am not sure how useful it would be
to have more than one estimate in a published version.
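
For readers unfamiliar with it, the Ahmad and Lin estimate, as it is
usually stated, is the resubstitution estimate: minus the average log
of the kernel density estimate evaluated at the sample points
themselves.  The sketch below is a one-dimensional illustration and
not the ahmadlin function discussed here; the name
resubstitution_entropy, the Gaussian kernel, and the rule-of-thumb
bandwidth are all assumptions.

```python
import numpy as np

def resubstitution_entropy(sample, bandwidth):
    """Estimate differential entropy as -mean(log f_hat(x_i)) for a 1-D sample."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # Pairwise scaled differences; row i holds the kernel arguments for x_i.
    z = (sample[:, None] - sample[None, :]) / bandwidth
    density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2.0 * np.pi))
    return -np.mean(np.log(density))

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=2000)

# Rule-of-thumb bandwidth (an assumption, not necessarily the automatic choice used here).
h = 1.06 * x.std(ddof=1) * x.size ** (-0.2)

estimate = resubstitution_entropy(x, h)
exact = 0.5 * np.log(2.0 * np.pi * np.e)   # entropy of a standard normal, in nats
```

Comparing estimate with exact on distributions of known entropy is one
simple way to begin probing the robustness of a bandwidth choice.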

But, well, if some scipy developers want to push some of 
it into scipy, I won't object as long as I am consulted.  
I am not, however, going to find the right channel to make 
the proposal myself in the foreseeable future.

> Was the outlier trimming enough to solve your problem with estimating
> a kernel density, or did you try also try other kernels?

Actually, the effect of the outliers was to produce extreme values
for the bandwidth.  Thus, using a consistent bandwidth in the three
calculations was sufficient to give a plausible result.  In the
end, I kept the outliers.
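
One way an outlier can distort an automatic bandwidth is through the
sample standard deviation.  The sketch below uses a standard
rule-of-thumb formula as a stand-in; the function name rule_of_thumb
and the synthetic data are assumptions, and the bandwidth selection
actually used here may react differently.

```python
import numpy as np

def rule_of_thumb(sample):
    """Standard-deviation-based bandwidth for a 1-D sample (illustrative only)."""
    sample = np.asarray(sample, dtype=float)
    return 1.06 * sample.std(ddof=1) * sample.size ** (-0.2)

rng = np.random.default_rng(2)
clean = rng.normal(size=200)       # well-behaved synthetic sample
dirty = np.append(clean, 1000.0)   # the same sample plus one gross outlier

h_clean = rule_of_thumb(clean)
h_dirty = rule_of_thumb(dirty)     # inflated, because the outlier dominates the std
```

If the outlier lands in only one of the subsamples, per-sample
bandwidths end up on very different scales, which is the situation a
common bandwidth avoids.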

As I mentioned, the purpose was feature selection for machine
learning.  The features with the outliers used to show up with
implausibly high heuristic values.  With the new approach to
bandwidth, they show up as worthless, which they can indeed be
confirmed to be.  Thus the problem is gone, and I need to find more
trouble before I know anything more about outliers.

-- 
:-- Hans Georg