[spambayes-dev] A spambayes-esque approach for 4 email categories

Amedee Van Gasse amedee.vangasse at gmail.com
Mon Apr 15 04:29:57 EDT 2024


It's been over a decade since I last used (or needed) Spambayes, but I have
good memories of it.

I'm currently working on an idea where I think Spambayes, or a
Spambayes-like approach, may come in handy.

The software I'm working with is a local Outlook client (whatever the
current version in Office 365 is) and an on-premises Exchange server.
The OS is Windows 10.
The mailbox is a shared mailbox, not a local PST file.

There is a mail folder with 10k emails, almost all of which have been
manually categorized and labelled.
Each of them is actually a "container" email, with the original email as a
.msg attachment, as if you had used "forward as attachment", so that all
original email headers are preserved.
Additionally, there is a second attachment, headers.txt, which contains the
email headers of the original email.
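
Since headers.txt holds the original headers as plain text, they could
probably be turned into classifier tokens with nothing but the Python
standard library. Below is a rough sketch, assuming headers.txt contains
ordinary RFC 5322 style header lines; the sample content and the
header_tokens helper are made up for illustration, and real files may need
extra cleanup.

```python
from email.parser import HeaderParser

# Hypothetical example of what headers.txt might contain; real files will differ.
SAMPLE_HEADERS = (
    "From: Example Sender <sender@example.com>\n"
    "To: someone@example.org\n"
    "Subject: Your account needs attention\n"
    "\n"
)


def header_tokens(raw_headers):
    """Turn raw header text into simple 'headername:word' tokens."""
    msg = HeaderParser().parsestr(raw_headers)
    tokens = []
    for name, value in msg.items():
        # Lowercase and split on whitespace; a real tokenizer would be smarter.
        for word in str(value).lower().split():
            tokens.append(f"{name.lower()}:{word}")
    return tokens


print(header_tokens(SAMPLE_HEADERS))
```

Prefixing each word with its header name keeps a word in the Subject
distinct from the same word in the From line, which is one simple way to
let headers contribute evidence separately from the body.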

The emails are labelled thus:
* Phishing (3k) --> these are emails with a direct security threat, like
password stealing
* Spam (2k) --> typical junk mail that is not a direct security threat
* Graymail (3k) --> newsletters, mails from sales people, invitations for
conferences... all somewhat relevant for our industry, but recipients just
aren't interested. This is the "it's not actually spam because I subscribed
a long time ago and now I am too lazy to unsubscribe" kind of mail
* False positive (0.5k) --> emails that were mistakenly reported as spam
* Uncategorized (1.5k) --> these emails have not yet been manually reviewed

I know that Spambayes works with just two buckets: spam and not-spam.
Given the number of manually categorized emails I already have, how
feasible would it be to write something similar but with 4 buckets, and to
use the emails as training data? I am not aiming for 100% accuracy;
even 80% is good enough.
Maybe I could use 4 separate databases instead of just one?
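
To make the 4-bucket idea concrete, here is a minimal sketch of one
possible approach: a plain multinomial naive Bayes classifier with add-one
smoothing that keeps a separate token count table per label. Note that this
is not the scoring method Spambayes itself uses (as far as I understand, it
combines token probabilities with a chi-squared test), and the class name,
label names and toy training texts are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


class MultiBucketClassifier:
    """Multinomial naive Bayes over any number of labels (illustrative only)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.doc_counts = Counter()               # label -> number of training emails
        self.vocab = set()                        # every token seen during training

    def train(self, label, tokens):
        self.doc_counts[label] += 1
        self.token_counts[label].update(tokens)
        self.vocab.update(tokens)

    def scores(self, tokens):
        """Return a log-probability score per label (higher means more likely)."""
        total_docs = sum(self.doc_counts.values())
        result = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.token_counts[label]
            total = sum(counts.values())
            score = math.log(n_docs / total_docs)  # prior from label frequency
            for tok in tokens:
                # Add-one smoothing so an unseen token cannot zero out a label.
                score += math.log((counts[tok] + 1) / (total + len(self.vocab)))
            result[label] = score
        return result

    def classify(self, tokens):
        scores = self.scores(tokens)
        return max(scores, key=scores.get)


# Tiny made-up training set; real training would use the labelled emails.
clf = MultiBucketClassifier()
clf.train("phishing", "verify your password now".split())
clf.train("spam", "cheap pills buy now".split())
clf.train("graymail", "newsletter conference invitation unsubscribe".split())
clf.train("false_positive", "meeting agenda attached for tomorrow".split())

print(clf.classify("please verify your password".split()))
```

In this sketch the "4 separate databases" are simply four count tables
inside one object. An alternative would be four independent two-way
classifiers (each label versus everything else), which is closer to running
Spambayes four times; which of the two works better on this data is
something only an experiment would show.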

Also good to know: I haven't written anything more than Hello World in
Python, but I'm not afraid to learn.
The machine I'm working on doesn't have any development tools, and I
have no permission to install Python.
I do have another machine where I can do whatever I want. It runs Windows
11 and also has Office, but for security reasons it is not allowed to
access that Exchange mailbox. I guess I could export the folder to a PST
and copy that over, but that wouldn't be allowed either: nothing technical
prevents it, but policy does (PII and such).

Please let me pick your brains!
If anything comes from it, I'll post my code on GitHub.

-- 
Met vriendelijke groeten / Kind regards / Med vänliga hälsningar
Amedee Van Gasse
amedee at vangasse.eu
amedee.be - in/amedee <https://linkedin.com/in/amedee>
+32 485 805 674