From amedee.vangasse at gmail.com Mon Apr 15 04:29:57 2024
From: amedee.vangasse at gmail.com (Amedee Van Gasse)
Date: Mon, 15 Apr 2024 10:29:57 +0200
Subject: [spambayes-dev] A spambayes-esque approach for 4 email categories
Message-ID:

It's been over a decade since I last used (or needed) Spambayes, but I have good memories of it and I really liked it a lot.

I'm currently working on an idea where I think Spambayes, or a Spambayes-like approach, may come in handy.

The software I'm working with is a local Outlook client (whatever the current version in Office 365 is) and an on-premise Exchange server. The OS is Windows 10. The mailbox is a shared mailbox, not a local PST file.

There is a mail folder with 10k emails. Almost all emails have been manually categorized and labelled. All emails are actually a "container" email, with the original email as a .msg attachment, as if you were doing "forward as attachment", so that all original email headers are preserved. Additionally there is a second attachment, headers.txt, which contains the email headers of the original email.

The emails are labelled thus:
* Phishing (3k) -> these are emails with a direct security threat, like password stealing
* Spam (2k) --> typical junk mail that is not a direct security threat
* Graymail (3k) --> newsletters, mails from sales people, invitations for conferences... all somewhat relevant for our industry, but recipients just aren't interested. This is "it's not actually spam because I subscribed a long time ago and now I am too lazy to unsubscribe"
* False positive (0.5k) --> emails that were mistakenly reported as spam
* Uncategorized (1.5k) --> these emails have not yet been manually reviewed

I know that Spambayes works with just two buckets: spam and not-spam.
Given the number of manually categorized emails I already have, how feasible would it be to write something similar but with 4 buckets, and to have the emails as training data?
I am not concerned with 100% accuracy, even 80% is good enough.
Maybe I could use 4 separate databases instead of just one?

Also good to know: I haven't written anything more than Hello World in Python, but I'm not afraid to learn.

The machine I'm working on also doesn't have any development tools and I have no permission to install Python. I do have another machine where I can do whatever. It is Windows 11, also has Office, but because of security reasons it is not allowed to access that Exchange mailbox. I guess I could export the folder to a PST and copy that over, but that wouldn't be allowed either - not technically, but because of policy reasons. (PII and such)

Please let me pick your brains! If anything comes from it, I'll post my code on GitHub.

--
Met vriendelijke groeten / Kind regards / Med vänliga hälsningar
Amedee Van Gasse
amedee at vangasse.eu
amedee.be - in/amedee
+32 485 805 674

From tim.peters at gmail.com Thu Apr 25 00:59:08 2024
From: tim.peters at gmail.com (Tim Peters)
Date: Wed, 24 Apr 2024 23:59:08 -0500
Subject: [spambayes-dev] A spambayes-esque approach for 4 email categories
In-Reply-To:
References:
Message-ID:

[Amedee Van Gasse]

> ...
>
> The emails are labelled thus:
> * Phishing (3k) -> these are emails with a direct security threat, like
> password stealing
> * Spam (2k) --> typical junk mail that is not a direct security threat
> * Graymail (3k) --> newsletters, mails from sales people, invitations for
> conferences... all somewhat relevant for our industry, but recipients just
> aren't interested. This is "it's not actually spam because I subscribed a
> long time ago and now I am too lazy to unsubscribe"
> * False positive (0.5k) --> emails that were mistakenly reported as spam
> * Uncategorized (1.5k) --> these emails have not yet been manually reviewed
>
> I know that Spambayes works with just two buckets: spam and not-spam.
> Given the number of manually categorized emails I already have, how
> feasible would it be to write something similar but with 4 buckets, and to
> have the emails as training data? I am not concerned with 100% accuracy,
> even 80% is good enough.
> Maybe I could use 4 separate databases instead of just one?
>

I haven't done any work in this field for over a decade either, alas. Way back when, I tried adapting the approach to a multi-category system, but with scant success. But I didn't have much time to give to it either.

Wikipedia has a good overview of ways to _try_ to use a 2-category system to do N-way classification:

https://en.wikipedia.org/wiki/Multiclass_classification

While it says "naive Bayes" "is naturally extensible to the case of having more than two classes", the math behind Spambayes is more advanced than in "naive Bayes", and deeply believes it's a 2-category world.

But the best way to find out about anything is to try it :-) Good luck!

From skip.montanaro at gmail.com Thu Apr 25 10:42:28 2024
From: skip.montanaro at gmail.com (Skip Montanaro)
Date: Thu, 25 Apr 2024 09:42:28 -0500
Subject: [spambayes-dev] A spambayes-esque approach for 4 email categories
In-Reply-To:
References:
Message-ID:

Just stumbled upon a statement the other day which indicated that the scikit-learn package has a classification subsystem. Might be worth reading up on it, even if it turns out not to be useful here.

https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

Skip
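
For anyone who wants something concrete to try: the remark in the thread that naive Bayes "is naturally extensible to the case of having more than two classes" can be sketched in a few dozen lines of standard-library Python. This is plain multinomial naive Bayes with add-one smoothing, not the more advanced math Spambayes uses, and everything named here (the `tokenize` helper, the `MultiClassNaiveBayes` class, the label strings, and the toy messages) is made up for illustration, not taken from Spambayes or from the real mailbox.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase word tokens; a crude stand-in for a real email tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


class MultiClassNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over any number of labels."""

    def __init__(self):
        self.doc_counts = Counter()              # label -> number of training messages
        self.word_counts = defaultdict(Counter)  # label -> token -> occurrences
        self.vocab = set()                       # every token seen during training

    def train(self, label, text):
        tokens = tokenize(text)
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def scores(self, text):
        """Return {label: log P(label) + sum of log P(token | label)}."""
        # Tokens never seen in training carry no evidence, so they are skipped.
        tokens = [tok for tok in tokenize(text) if tok in self.vocab]
        total_docs = sum(self.doc_counts.values())
        result = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            result[label] = score
        return result

    def classify(self, text):
        """Pick the label with the highest log score (needs at least one trained message)."""
        scores = self.scores(text)
        return max(scores, key=scores.get)


# Invented toy messages, two per bucket; real training would use the labelled mail.
TOY_TRAINING = [
    ("phishing", "verify your password now or your account will be suspended"),
    ("phishing", "urgent: confirm your account password via this login link"),
    ("spam", "cheap pills and watches, buy now at a huge discount"),
    ("spam", "win a free prize, huge discount on cheap watches"),
    ("graymail", "our monthly newsletter: conference invitation and webinar schedule"),
    ("graymail", "you are invited to our industry conference, see the newsletter"),
    ("false_positive", "meeting minutes attached, please review the project budget"),
    ("false_positive", "please review the attached invoice for the project"),
]

classifier = MultiClassNaiveBayes()
for label, text in TOY_TRAINING:
    classifier.train(label, text)

print(classifier.classify("please confirm your password"))  # prints: phishing
```

With real data, the text of each labelled email would go through `train()`, and the "Uncategorized" pile would be a natural first set to run `classify()` on. Whether this gets anywhere near the 80% mentioned above is something only an experiment on the actual mail can show.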