[spambayes-dev] A spambayes-esque approach for 4 email categories

Tim Peters tim.peters at gmail.com
Thu Apr 25 00:59:08 EDT 2024


[Amedee Van Gasse <amedee.vangasse at gmail.com>]

> ...
>

> The emails are labelled thus:
> * Phishing (3k) -> these are emails with a direct security threat, like
> password stealing
> * Spam (2k) --> typical junk mail that is not a direct security threat
> * Graymail (3k) --> newsletters, mails from sales people, invitations for
> conferences... all somewhat relevant for our industry, but recipients just
> aren't interested. This is "it's not actually spam because I subscribed a
> long time ago and now I am too lazy to unsubscribe"
> * False positive (0.5k) --> emails that were mistakenly reported as spam
> * Uncategorized (1.5k) --> these emails have not yet been manually reviewed
>
> I know that Spambayes works with just two buckets: spam and not-spam.
> Given the number of manually categorized emails I already have, how
> feasible would it be to write something similar but with 4 buckets, and to
> have the emails as training data? I am not concerned with 100% accuracy,
> even 80% is good enough.
> Maybe I could use 4 separate databases instead of just one?
>

I haven't done any work in this field for over a decade either, alas. Way
back when, I tried adapting the approach to a multi-category system, but
with scant success. Then again, I didn't have much time to give to it.

Wikipedia has a good overview of strategies that _try_ to use a
2-category system to do N-way classification:

https://en.wikipedia.org/wiki/Multiclass_classification
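
One of the strategies on that page is "one-vs-rest": train one 2-category
classifier per label (that label against everything else) and pick the label
whose classifier is most confident. Here is a rough sketch of the idea. Note
that BinaryScorer is a toy stand-in I made up for illustration, not the
Spambayes classifier or its math, and the "ham" label and the tiny training
set are likewise invented:

```python
import math
from collections import Counter


class BinaryScorer:
    """Toy 2-category scorer (hypothetical; NOT the Spambayes math)."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.totals = {True: 0, False: 0}

    def train(self, tokens, is_target):
        self.counts[is_target].update(tokens)
        self.totals[is_target] += len(tokens)

    def score(self, tokens):
        """Return a value in (0, 1); higher means 'more like the target'."""
        vocab = len(set(self.counts[True]) | set(self.counts[False])) or 1
        log_odds = 0.0
        for tok in tokens:
            # Add-one smoothing so unseen tokens don't zero things out.
            p_t = (self.counts[True][tok] + 1) / (self.totals[True] + vocab)
            p_r = (self.counts[False][tok] + 1) / (self.totals[False] + vocab)
            log_odds += math.log(p_t / p_r)
        log_odds = max(-50.0, min(50.0, log_odds))  # avoid exp() overflow
        return 1.0 / (1.0 + math.exp(-log_odds))


def train_one_vs_rest(labelled):
    """Build one binary scorer per label: that label vs. all the others."""
    labels = sorted({label for _, label in labelled})
    scorers = {label: BinaryScorer() for label in labels}
    for tokens, label in labelled:
        for name, scorer in scorers.items():
            scorer.train(tokens, is_target=(name == label))
    return scorers


def classify(scorers, tokens):
    """Pick the label whose scorer is most confident."""
    return max(scorers, key=lambda name: scorers[name].score(tokens))


# Invented toy data; real training would use the hand-labelled emails.
ovr_data = [
    ("verify your password now".split(), "phishing"),
    ("account suspended confirm password".split(), "phishing"),
    ("cheap pills buy now".split(), "spam"),
    ("win cheap prize today".split(), "spam"),
    ("monthly newsletter conference invitation".split(), "graymail"),
    ("webinar invitation from sales".split(), "graymail"),
    ("meeting notes from tuesday".split(), "ham"),
    ("lunch tuesday with the team".split(), "ham"),
]

scorers = train_one_vs_rest(ovr_data)
print(classify(scorers, "confirm your password".split()))
```

This is roughly the "4 separate databases" idea. One known wrinkle is that
each scorer sees far more "rest" than "target" training data, and the scores
from different scorers aren't guaranteed to be comparable.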

While it says naive Bayes "is naturally extensible to the case of having
more than two classes", the math behind Spambayes is more advanced than
plain naive Bayes, and it deeply believes it's a 2-category world.
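
For comparison, here is a bare-bones sketch of what "naturally extensible"
looks like for textbook naive Bayes: keep per-label token counts and pick
the label with the highest log-probability. This is the plain multinomial
variant with add-one smoothing, not the Spambayes scoring math, and the
function names, labels and sample data are all made up for illustration:

```python
import math
from collections import Counter, defaultdict


def train_nb(labelled):
    """Collect per-label token counts and per-label message counts."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for tokens, label in labelled:
        word_counts[label].update(tokens)
        doc_counts[label] += 1
    vocab = {tok for counts in word_counts.values() for tok in counts}
    return word_counts, doc_counts, vocab


def classify_nb(model, tokens):
    """Return the label maximizing log P(label) + sum of log P(tok | label)."""
    word_counts, doc_counts, vocab = model
    n_docs = sum(doc_counts.values())
    best_label, best_logp = None, -math.inf
    for label in sorted(doc_counts):
        total = sum(word_counts[label].values())
        logp = math.log(doc_counts[label] / n_docs)
        for tok in tokens:
            # Add-one smoothing over the shared vocabulary.
            logp += math.log(
                (word_counts[label][tok] + 1) / (total + len(vocab))
            )
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Invented toy data; any number of labels works the same way.
nb_data = [
    ("verify your password now".split(), "phishing"),
    ("account suspended confirm password".split(), "phishing"),
    ("cheap pills buy now".split(), "spam"),
    ("win cheap prize today".split(), "spam"),
    ("monthly newsletter conference invitation".split(), "graymail"),
    ("webinar invitation from sales".split(), "graymail"),
    ("meeting notes from tuesday".split(), "ham"),
    ("lunch tuesday with the team".split(), "ham"),
]

nb_model = train_nb(nb_data)
print(classify_nb(nb_model, "conference invitation".split()))
```

Nothing in that loop cares whether there are 2 labels or 4, which is the
sense in which it extends naturally.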

But the best way to find out about anything is to try it :-) Good luck!