looking for "optimal weighting" algorithm

Terry Reedy tjreedy at udel.edu
Thu Apr 10 11:29:41 EDT 2003


"Alex Martelli" <aleaxit at yahoo.com> wrote in message
news:b73ql8011te at enews4.newsguy.com...
> I _know_ there's a decent algorithm to solve the following problem,
> but I don't recall its name and thus can't rapidly google for the
> details... can somebody help?

Linear discriminant analysis.  I have done several over my career.

> The problem: I need to design a "decision criterion" to classify
> "observations".  For each observation I measure the values of a
> number N of features, x1, x2, ... xN; the desired design criterion
> is a set of weights w1, w2, ... wN such that for any observation I
> will then just compute a weighted sum
>   S = w1*x1 + w2*x2 + ... + wN*xN

+ constant term C

> and classify the observation as Black if S<=1, White if S>1.

For k groups, the procedure results in k classifier functions
(weighted sums + constant).  To classify an unknown, calculate all k
sums and assign it to the group with the highest one.  For two
groups, you can subtract the two functions to get a single function
and then compare it to 0.
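
Roughly, in numpy terms (my own sketch of the textbook two-group
case, not code from any particular package; equal a priori
probabilities are assumed here, see the note on the constant term
below):

import numpy as np

def fit_two_group(xb, xw):
    """Two-group linear discriminant from the textbook matrix equations.
    xb, xw: (n_b, N) and (n_w, N) arrays of training observations for
    the Black and White groups.  Returns (w, c) so that
    S(x) = dot(w, x) + c, and you classify White when S > 0.
    Equal a priori probabilities are assumed (see below)."""
    mu_b, mu_w = xb.mean(axis=0), xw.mean(axis=0)
    # pooled within-group covariance; atleast_2d keeps the
    # single-feature case working
    cov = np.atleast_2d(
        ((len(xb) - 1) * np.cov(xb, rowvar=False)
         + (len(xw) - 1) * np.cov(xw, rowvar=False))
        / (len(xb) + len(xw) - 2))
    cov_inv = np.linalg.inv(cov)  # pinv is safer with redundant features
    # one classifier function per group:
    #   f_g(x) = mu_g' Sinv x  -  0.5 * mu_g' Sinv mu_g
    w_b, c_b = cov_inv @ mu_b, -0.5 * mu_b @ cov_inv @ mu_b
    w_w, c_w = cov_inv @ mu_w, -0.5 * mu_w @ cov_inv @ mu_w
    # subtract the two functions so a single sum compared to 0 decides
    return w_w - w_b, c_w - c_b

# usage (array names are hypothetical):
#   w, c = fit_two_group(black_obs, white_obs)
#   label = "White" if obs @ w + c > 0 else "Black"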

>  To
> train my classifier I have a large corpus of observations already
> made and annotated with "ground-truth" data about whether each
> given observation should have been classified as B or W, and an
> error-cost value for each kind of classification (Ebw is the cost
> of erroneously classifying a feature as W when it should be B,
> Ewb is that of classifying it as B when it should be W).  So,
> what's the algorithm to estimate the weights given the corpus of
> observation and ground-truth data, and the error-costs?

The added constant(s) are the fudge factor(s).  They are usually
adjusted according to specified a priori probabilities of membership
in each group.  I do not know how to go from error costs to adjusted
a priori probabilities, but perhaps someone has worked this out and
published it.
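
For what it is worth, in the textbook two-group form the a priori
probabilities enter only as a log-ratio added to the constant of the
sketch above; turning the Ebw/Ewb error costs into such an adjustment
is exactly the open question:

import numpy as np

def adjust_constant(c, prior_b, prior_w):
    """Shift the equal-priors constant from fit_two_group() above for
    unequal a priori probabilities (prior_b + prior_w == 1)."""
    return c + np.log(prior_w / prior_b)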

The basic matrix equations are available in several statistics books.
However, there is a practical problem of over-fitting.  Some of your
features may be highly correlated and redundant.  Others may be
useless for discrimination even when the effect of other variables is
accounted for.  Wasting input information to estimate
redundant/useless coefficients (which thereby constitute noise) can
and likely will result in more classification errors, not fewer.  So,
for best performance in classifying unknowns, you want a minimal set
of jointly significant features.  This can be developed by a stepwise
procedure, which is *much* messier.  The Computation Methods appendix
of the BMDP Statistical Software Manual is one source of the gory
details.  If this is not a toy project and you actually care about the
results, you will probably do better with professionally written
statistics software that has this procedure.
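
If you do roll your own anyway, a crude greedy forward selection
built on the fit_two_group() sketch above (my simplification; real
stepwise discriminant analysis uses F-to-enter/F-to-remove tests and
can also drop variables again) might look like:

import numpy as np

def errors(w, c, xb, xw):
    """Misclassifications on the training samples by S(x) = dot(w, x) + c.
    (One could weight the two terms by the Ebw/Ewb error costs.)"""
    return int(np.sum(xb @ w + c > 0)) + int(np.sum(xw @ w + c <= 0))

def forward_select(xb, xw):
    """Greedily add the feature that most reduces training error;
    stop when no addition helps.  Uses fit_two_group() from above."""
    n_feat = xb.shape[1]
    chosen, best_err = [], None
    while len(chosen) < n_feat:
        trials = []
        for j in range(n_feat):
            if j in chosen:
                continue
            cols = chosen + [j]
            w, c = fit_two_group(xb[:, cols], xw[:, cols])
            trials.append((errors(w, c, xb[:, cols], xw[:, cols]), j))
        err, j = min(trials)
        if best_err is not None and err >= best_err:
            break               # no improvement from any remaining feature
        chosen.append(j)
        best_err = err
    return chosen               # indices of the selected features

Training-set error is an optimistic yardstick, so a held-out sample
or cross-validation at each step is the honest way to judge whether a
feature really helps.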

Terry J. Reedy