[spambayes-dev] Another incremental training idea...

Tim Peters tim.one at comcast.net
Wed Jan 14 22:13:45 EST 2004


[Barry Warsaw]
> ...
> A generalization might be to score each attachment (or possibly just
> each message/rfc822 type attachment) separately.  Then choose an
> algorithm for combining the scores, e.g. outer-only, inner-only,
> combined, etc.

That should simplify things <wink>.
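
For concreteness, here is a minimal sketch of the per-attachment idea using only the standard library email package. It is not SpamBayes code: own_text, inner_messages, score_parts, combine and toy_score are hypothetical names, toy_score stands in for a real classifier, and treating "combined" as the maximum of all scores is just one assumed choice of combining rule.

```python
from email.mime.message import MIMEMessage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def own_text(msg):
    """Subject plus text parts of msg, not descending into attached messages."""
    chunks = [msg.get("Subject", "")]

    def visit(part):
        if part.get_content_type() == "message/rfc822":
            return  # an attached message gets scored on its own
        if part.is_multipart():
            for sub in part.get_payload():
                visit(sub)
        elif part.get_content_maintype() == "text":
            chunks.append(part.get_payload())  # note: not transfer-decoded

    visit(msg)
    return "\n".join(chunks)


def inner_messages(msg):
    """Yield every message/rfc822 attachment, however deeply nested."""
    for part in msg.walk():
        if part.get_content_type() == "message/rfc822":
            payload = part.get_payload()
            if isinstance(payload, list) and payload:
                yield payload[0]


def score_parts(msg, score_text):
    """Return (outer score, list of scores for each attached message)."""
    outer = score_text(own_text(msg))
    inners = [score_text(own_text(inner)) for inner in inner_messages(msg)]
    return outer, inners


def combine(outer, inners, how):
    """Combine scores; the 'combined' rule (max) is an assumption."""
    if how == "outer-only":
        return outer
    if how == "inner-only":
        return max(inners) if inners else outer
    if how == "combined":
        return max([outer] + inners)
    raise ValueError("unknown combining rule: %r" % how)


def toy_score(text):
    """Stand-in for a real classifier: 1.0 looks like spam, 0.0 like ham."""
    return 1.0 if "cheap pills" in text.lower() else 0.0


# A Mailman-style wrapper around an attached spam message.
inner = MIMEText("Buy cheap pills now")
inner["Subject"] = "hello"
wrapper = MIMEMultipart()
wrapper["Subject"] = "post requires approval"
wrapper.attach(MIMEText("As list administrator, your authorization is requested."))
wrapper.attach(MIMEMessage(inner))

outer_score, inner_scores = score_parts(wrapper, toy_score)
for rule in ("outer-only", "inner-only", "combined"):
    print(rule, combine(outer_score, inner_scores, rule))
```

With this toy scorer the wrapper text alone looks like ham, so outer-only misses the attached spam while the other two rules catch it.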

Or you could upgrade to Outlook:  I don't think we have any real idea which
attachments we do and don't get back from Outlook when we synthesize a
plain-text message for your picky email parser to chew on ("standards" --
what a stupid idea that was <wink>), but I know for a fact that we *don't*
get the body of messages attached to things I get from Mailman in my
capacity as list admin.  So I routinely train on Mailman-wrapped spam and
ham, meaning that I've trained on a grand total of about two of them, and
all wrapped msgs from Mailman have scored 0% for me thereafter.

Something to note:  my personal classifier is using the experimental bigrams
gimmick, and bigram Mailmanisms like

    Confirmation succeeded
    list administrator,
    list posting:
    List:    PSF-Board at python.org
    Reason:  Post
    following mailing

act like strong lexical fingerprints for Mailman-generated administrivia,
never appearing in ham or spam other than the Mailman stuff.  This is one
clear way in which bigrams can generate a killer-strong collection of
hapaxes sufficient to nail an entire large class of messages from just one
training example.
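
To illustrate why word pairs can be so much more distinctive than single words, here is a rough sketch. It is not the SpamBayes tokenizer: unigrams_and_bigrams is a hypothetical helper, the "bi:" prefix and the lowercasing are assumptions, and the two sample texts are made up.

```python
def unigrams_and_bigrams(text):
    """Return the set of lowercased words plus adjacent word pairs."""
    words = text.lower().split()
    tokens = set(words)
    # Mark pairs with a prefix so they can't collide with single words.
    tokens.update("bi:%s %s" % pair for pair in zip(words, words[1:]))
    return tokens


# One Mailman-style notice and one ordinary message that happens to
# reuse most of the same words in a different order.
mailman_notice = "Confirmation succeeded for the following mailing list posting"
ordinary_ham = "The mailing succeeded after confirmation of the following list"

# Tokens seen in the notice but never in the other message.
fingerprint = unigrams_and_bigrams(mailman_notice) - unigrams_and_bigrams(ordinary_ham)

distinct_words = sorted(t for t in fingerprint if not t.startswith("bi:"))
distinct_pairs = sorted(t for t in fingerprint if t.startswith("bi:"))
print(distinct_words)
print(distinct_pairs)
```

Only two of the notice's single words are unique to it here, but six of its seven word pairs are, which is the effect described above: the pairs carry the fingerprint even when the individual words are common.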

Of course, that also sets me up for a spectacularly bad false negative
someday.



