Any Neural Net code in Python? I want to filter out spam email

Alex Martelli aleaxit at yahoo.com
Thu Apr 19 03:23:44 EDT 2001


"Ken Seehof" <kens at sightreader.com> wrote in message
news:mailman.987656068.4191.python-list at python.org...
    [snip -- some quoting-level problems -- Ken's quoting Dan]
"""
> How about this - apply a whole set of tests to the message. Each test
> gives a "spammness" score - e.g. 10 points for being all caps, 50 points
> for having the word 'viagara', 100 points for having a suspicious From:
> address like *@yahoo.com. Add the scores from the different tests, and
> if the sum exceeds, say, 200 points, then call it "spam."
>
> So, how do you figure out a good value for each test score? This is where
> you could use a neural network or genetic algorithm. Pick a set of
> scores, feed the program lots of messages (both spam and non-spam), and
> see how accurate it is. Iterate until it rejects every spam email and
> accepts every non-spam...
"""

There may not exist a vector of feature weights that performs
perfectly, of course.  What one generally wants is a vector of
feature weights that _optimizes_ some performance score.
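
To make the scheme quoted above concrete, here is a minimal sketch of
a weighted-sum scorer.  The message layout (a dict with "from" and
"body" keys), the three test functions and error_count are all
hypothetical names made up for illustration; the weights 10/50/100 and
the threshold 200 are just Dan's example numbers.

```python
import re

# Hypothetical feature tests: each returns 1.0 if the message trips it.
def is_all_caps(msg):
    letters = [c for c in msg["body"] if c.isalpha()]
    return 1.0 if letters and all(c.isupper() for c in letters) else 0.0

def mentions_viagra(msg):
    # Matches both "viagra" and the "viagara" spelling.
    return 1.0 if re.search(r"viaga?ra", msg["body"], re.IGNORECASE) else 0.0

def suspicious_from(msg):
    return 1.0 if msg["from"].lower().endswith("@yahoo.com") else 0.0

TESTS = [is_all_caps, mentions_viagra, suspicious_from]

def features(msg):
    """Feature vector: one entry per test, in TESTS order."""
    return [test(msg) for test in TESTS]

def spam_score(msg, weights):
    """Weighted sum of the feature vector."""
    return sum(w * f for w, f in zip(weights, features(msg)))

def is_spam(msg, weights, threshold=200.0):
    return spam_score(msg, weights) > threshold

def error_count(weights, training_set, threshold=200.0):
    """Number of misclassified (msg, label) pairs; label is True for spam."""
    return sum(1 for msg, label in training_set
               if is_spam(msg, weights, threshold) != label)
```

Here error_count plays the role of the performance score: the job of
the optimizer is to pick the weights that make it (or some refinement
of it) as small as possible on the training set.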

"""
Excellent idea, Dan.  That conveniently sidesteps the most difficult
issue: getting the neural network to actually come up with linguistic
rules.  Once an intelligent human specifies the set of rules, the neural
"""

Right.  Extracting the features for classification is an order
of magnitude harder than weighting them optimally.


My old-fashioned approach to such feature-weighting problems is
to apply a general-purpose optimization algorithm (simulated
annealing, for choice).  That's easy to code/test/tune and lets
me experiment with all sorts of "weird" nonlinearities in the
classification engine, as long as I can get a classifier that
takes a vector of N real parameters and can be run on the training
set to produce a classification whose 'cost' is then measurable.
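
As an illustration of the general idea only, a bare-bones annealer
over a vector of N real parameters might look like the sketch below.
It is a plain Metropolis-style loop with geometric cooling, not a
transcription of Goffe's program; the function name anneal and all of
its parameters and defaults are assumptions chosen for the example.

```python
import math
import random

def anneal(cost, x0, step=1.0, t0=1.0, cooling=0.95,
           sweeps=200, moves_per_sweep=50, rng=None):
    """Minimize cost(x) over a list of floats, starting from x0.

    Returns (best_vector, best_cost) seen during the run.
    """
    rng = rng or random.Random()
    x = list(x0)
    fx = cost(x)
    best, fbest = list(x), fx
    t = t0
    for _ in range(sweeps):
        for _ in range(moves_per_sweep):
            # Perturb one randomly chosen parameter.
            i = rng.randrange(len(x))
            cand = list(x)
            cand[i] += rng.uniform(-step, step)
            fc = cost(cand)
            # Always accept improvements; accept a worse point with
            # probability exp(-(increase in cost) / temperature).
            if fc <= fx or rng.random() < math.exp((fx - fc) / t):
                x, fx = cand, fc
                if fx < fbest:
                    best, fbest = list(x), fx
        t *= cooling  # geometric cooling schedule
    return best, fbest

# Toy usage: in the spam setting, cost(weights) would run the
# classifier on the training set and return its measured cost.
def toy_cost(v):
    return (v[0] - 3.0) ** 2 + (v[1] + 1.0) ** 2

best, fbest = anneal(toy_cost, [0.0, 0.0], rng=random.Random(42))
```

The only thing the annealer needs from the classifier is the cost
function, which is why arbitrary nonlinearities in the classification
engine cause it no trouble.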

False positives and false negatives can of course easily be
given different costs in this approach, and in some cases being
able to get a three-way classifier (yes/no/dunno, with some
cost for each dunno answer of course) can be important.
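
A minimal sketch of such a cost function, assuming the classifier
boils each message down to a single numeric score and uses two
cutoffs.  The names classify and total_cost, the cutoffs lo and hi,
and the three cost constants are hypothetical; the particular values
are arbitrary and only meant to show the asymmetry.

```python
# Hypothetical costs: losing good mail hurts more than seeing spam.
COST_FALSE_POSITIVE = 10.0   # non-spam classified as spam
COST_FALSE_NEGATIVE = 1.0    # spam classified as non-spam
COST_DUNNO = 0.5             # classifier declined to decide

def classify(score, lo, hi):
    """Three-way verdict: 'spam', 'ham' or 'dunno'."""
    if score >= hi:
        return "spam"
    if score <= lo:
        return "ham"
    return "dunno"

def total_cost(scored, lo, hi):
    """Sum the costs over (score, is_spam) pairs from the training set."""
    cost = 0.0
    for score, is_spam in scored:
        verdict = classify(score, lo, hi)
        if verdict == "dunno":
            cost += COST_DUNNO
        elif verdict == "spam" and not is_spam:
            cost += COST_FALSE_POSITIVE
        elif verdict == "ham" and is_spam:
            cost += COST_FALSE_NEGATIVE
    return cost
```

The cutoffs lo and hi can simply be appended to the parameter vector
and optimized along with the feature weights.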

A faithful Python transcription of Goffe's Fortran tutorial
program for simulated annealing (the Fortran original is at
http://emlab.berkeley.edu/Software/abstracts/goffe895.html)
turns out to be less than 600 lines, over half of which are
docstrings, comments and printing-functions that only exist
to help gain understanding about the algorithm, the function
one is studying, etc.  Unfortunately, I'm not sure I can
redistribute that transcription, given Goffe's copyright --
it IS a derived work of his copyrighted one.  It could of
course be redone in a more Pythonic mold, using some
underlying extension module if available.  (I am not aware of
other Simulated Annealing implementations in Python, or as
Python extension modules, at this time; it's likely that
many exist, but I can't find them on the net!)  I have
written Dr Goffe asking for permission, and
I think I can in the meantime email sa.py privately (though
not "publish" it) if requested.


Alex







