From amedee.vangasse at gmail.com  Mon Apr 15 04:29:57 2024
From: amedee.vangasse at gmail.com (Amedee Van Gasse)
Date: Mon, 15 Apr 2024 10:29:57 +0200
Subject: [spambayes-dev] A spambayes-esque approach for 4 email categories
Message-ID: <CALhozi-mumPjaZw_EjSXEL4AnBKmtUV7=r_1r6thE8VmyWM75Q@mail.gmail.com>

It's been over a decade since I last used (or needed) Spambayes, but I have
good memories of it and I really liked it a lot.

I'm currently working on an idea where I think Spambayes, or a
Spambayes-like approach, may come in handy.

The software I'm working with is a local Outlook client (whatever the
current version in Office 365 is) and an on-premises Exchange server.
The OS is Windows 10.
The mailbox is a shared mailbox, not a local PST file.

There is a mail folder with 10k emails. Almost all of them have been
manually categorized and labelled.
Each email is actually a "container" email, with the original email as a
.msg attachment, as if you were doing "forward as attachment", so that all
original email headers are preserved.
Additionally there is a second attachment, headers.txt, which contains the
email headers of the original email.

The emails are labelled thus:
* Phishing (3k) --> these are emails with a direct security threat, like
password stealing
* Spam (2k) --> typical junk mail that is not a direct security threat
* Graymail (3k) --> newsletters, mails from sales people, invitations for
conferences... all somewhat relevant for our industry, but recipients just
aren't interested. This is "it's not actually spam because I subscribed a
long time ago and now I am too lazy to unsubscribe"
* False positive (0.5k) --> emails that were mistakenly reported as spam
* Uncategorized (1.5k) --> these emails have not yet been manually reviewed

I know that Spambayes works with just two buckets: spam and not-spam.
Given the number of manually categorized emails I already have, how
feasible would it be to write something similar but with 4 buckets, and to
have the emails as training data? I am not concerned with 100% accuracy,
even 80% is good enough.
Maybe I could use 4 separate databases instead of just one?

Also good to know: I haven't written anything more than Hello World in
Python, but I'm not afraid to learn.
The machine I'm working on also doesn't have any development tools and I
have no permission to install Python.
I do have another machine where I can do whatever I want. It runs Windows 11
and also has Office, but for security reasons it is not allowed to access
that Exchange mailbox. I guess I could export the folder to a PST and copy
that over, but that wouldn't be allowed either: nothing technical prevents
it, but policy forbids it (PII and such).

Please let me pick your brains!
If anything comes from it, I'll post my code on GitHub.

-- 
Met vriendelijke groeten / Kind regards / Med vänliga hälsningar
Amedee Van Gasse
amedee at vangasse.eu
amedee.be - in/amedee <https://linkedin.com/in/amedee>
+32 485 805 674

From tim.peters at gmail.com  Thu Apr 25 00:59:08 2024
From: tim.peters at gmail.com (Tim Peters)
Date: Wed, 24 Apr 2024 23:59:08 -0500
Subject: [spambayes-dev] A spambayes-esque approach for 4 email
 categories
In-Reply-To: <CALhozi-mumPjaZw_EjSXEL4AnBKmtUV7=r_1r6thE8VmyWM75Q@mail.gmail.com>
References: <CALhozi-mumPjaZw_EjSXEL4AnBKmtUV7=r_1r6thE8VmyWM75Q@mail.gmail.com>
Message-ID: <CAExdVNmn5oBS0r4YwQmezj3JM3GT_Osn_YSamF2Fv8ZLxWrejg@mail.gmail.com>

[Amedee Van Gasse <amedee.vangasse at gmail.com>]

> ...
>

> The emails are labelled thus:
> * Phishing (3k) --> these are emails with a direct security threat, like
> password stealing
> * Spam (2k) --> typical junk mail that is not a direct security threat
> * Graymail (3k) --> newsletters, mails from sales people, invitations for
> conferences... all somewhat relevant for our industry, but recipients just
> aren't interested. This is "it's not actually spam because I subscribed a
> long time ago and now I am too lazy to unsubscribe"
> * False positive (0.5k) --> emails that were mistakenly reported as spam
> * Uncategorized (1.5k) --> these emails have not yet been manually reviewed
>
> I know that Spambayes works with just two buckets: spam and not-spam.
> Given the number of manually categorized emails I already have, how
> feasible would it be to write something similar but with 4 buckets, and to
> have the emails as training data? I am not concerned with 100% accuracy,
> even 80% is good enough.
> Maybe I could use 4 separate databases instead of just one?
>

I haven't done any work in this field for over a decade either, alas. Way
back when, I tried adapting the approach to a multi-category system, but
with scant success. But I didn't have much time to give to it either.

Wikipedia has a good overview of ways to _try_ to use a
2-category system to do N-way classification:

https://en.wikipedia.org/wiki/Multiclass_classification

While it says "naive Bayes" "is naturally extensible to the case of having
more than two classes", the math behind Spambayes is more advanced than in
"naive Bayes", and deeply believes it's a 2-category world.

But the best way to find out about anything is to try it :-) Good luck!
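
In the spirit of trying it: one of the strategies on that Wikipedia page,
one-vs-rest, maps fairly directly onto the "4 separate databases" idea. The
sketch below is only an illustration under stated assumptions. It does not
use Spambayes or its chi-squared scoring; each "database" is a plain
smoothed log-odds token scorer, and the tokenizer, class names, and toy
messages are all made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Set of lowercase word tokens; a stand-in for a real email tokenizer."""
    return set(re.findall(r"[a-z0-9$']+", text.lower()))


class BinaryScorer:
    """One 'database': token counts for one category versus all the others."""

    def __init__(self):
        self.pos = Counter()  # messages in the target category containing a token
        self.neg = Counter()  # messages in any other category containing a token
        self.npos = 0
        self.nneg = 0

    def train(self, tokens, is_target):
        if is_target:
            self.pos.update(tokens)
            self.npos += 1
        else:
            self.neg.update(tokens)
            self.nneg += 1

    def score(self, tokens):
        """Smoothed log-odds that a message belongs to the target category."""
        total = math.log((self.npos + 1) / (self.nneg + 1))
        for tok in tokens:
            p_pos = (self.pos[tok] + 1) / (self.npos + 2)
            p_neg = (self.neg[tok] + 1) / (self.nneg + 2)
            total += math.log(p_pos / p_neg)
        return total


class OneVsRest:
    """Trains one binary scorer per category and picks the highest score."""

    def __init__(self, categories):
        self.scorers = {c: BinaryScorer() for c in categories}

    def train(self, text, category):
        tokens = tokenize(text)
        for c, scorer in self.scorers.items():
            scorer.train(tokens, c == category)

    def classify(self, text):
        tokens = tokenize(text)
        scores = {c: s.score(tokens) for c, s in self.scorers.items()}
        return max(scores, key=scores.get)


# Made-up toy messages; real training data would be the labelled emails.
samples = [
    ("verify your password now or your account will be locked", "phishing"),
    ("your mailbox password expires today click to verify login", "phishing"),
    ("cheap pills and watches huge discount buy now", "spam"),
    ("you won the lottery claim your prize money now", "spam"),
    ("our monthly newsletter and conference invitation for industry professionals", "graymail"),
    ("webinar invitation from our sales team unsubscribe here", "graymail"),
    ("meeting notes attached for the project review tomorrow", "false positive"),
    ("invoice from our supplier for last month please review", "false positive"),
]

clf = OneVsRest(["phishing", "spam", "graymail", "false positive"])
for text, label in samples:
    clf.train(text, label)

print(clf.classify("please verify your login password"))
print(clf.classify("newsletter: conference invitation"))
```

With only eight toy messages this says nothing about accuracy on real mail;
holding back part of the labelled folder as a test set would be the way to
measure that.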

From skip.montanaro at gmail.com  Thu Apr 25 10:42:28 2024
From: skip.montanaro at gmail.com (Skip Montanaro)
Date: Thu, 25 Apr 2024 09:42:28 -0500
Subject: [spambayes-dev] A spambayes-esque approach for 4 email
 categories
In-Reply-To: <CAExdVNmn5oBS0r4YwQmezj3JM3GT_Osn_YSamF2Fv8ZLxWrejg@mail.gmail.com>
References: <CALhozi-mumPjaZw_EjSXEL4AnBKmtUV7=r_1r6thE8VmyWM75Q@mail.gmail.com>
 <CAExdVNmn5oBS0r4YwQmezj3JM3GT_Osn_YSamF2Fv8ZLxWrejg@mail.gmail.com>
Message-ID: <CANc-5Uwc6CphwBiGpFBcAX5HLqrrAwR_Jse71dWRqppcmV-6Jw@mail.gmail.com>

Just stumbled upon a statement the other day which indicated the
scikit-learn package has a classification subsystem. Might be worth reading
up on it, even if it turns out not to be useful here.

https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
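
For a rough idea of what that looks like, here is a minimal sketch of a
four-category text classifier in scikit-learn. It assumes scikit-learn is
installed, uses a TF-IDF vectorizer feeding a multinomial naive Bayes model
(one reasonable choice among the classifiers compared on that page, not a
recommendation), and trains on made-up toy messages rather than real mail.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy messages; real input would be the text of the labelled emails.
texts = [
    "verify your password now or your account will be locked",
    "your mailbox password expires today click to verify login",
    "cheap pills and watches huge discount buy now",
    "you won the lottery claim your prize money now",
    "our monthly newsletter and conference invitation for industry professionals",
    "webinar invitation from our sales team unsubscribe here",
    "meeting notes attached for the project review tomorrow",
    "invoice from our supplier for last month please review",
]
labels = [
    "phishing", "phishing",
    "spam", "spam",
    "graymail", "graymail",
    "false positive", "false positive",
]

# The vectorizer turns text into token weights; naive Bayes handles
# any number of classes directly, so no one-vs-rest wrapper is needed.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

predicted = model.predict(["please verify your login password"])
print(predicted[0])
```

Swapping MultinomialNB for another estimator in the pipeline is a one-line
change, which makes it easy to compare a few classifiers on the same data.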

Skip