From skip.montanaro at gmail.com Tue Jul 10 20:03:46 2018
From: skip.montanaro at gmail.com (Skip Montanaro)
Date: Tue, 10 Jul 2018 19:03:46 -0500
Subject: [spambayes-dev] Anybody still have a test ham/spam database?
Message-ID:

I'm going to take a crack at porting SpamBayes to Python 3. For that I
should probably have some test data. My goal is to replicate existing
behavior, not improve the breed.

I long ago deleted what I used BITD. Does anyone still have their
setup? If so, let me know. I can provide a writable folder on my
Google Drive you can upload to.

Thanks,

Skip

From matt at mondoinfo.com Tue Jul 10 20:18:43 2018
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Tue, 10 Jul 2018 19:18:43 -0500 (CDT)
Subject: [spambayes-dev] Anybody still have a test ham/spam database?
In-Reply-To:
References:
Message-ID: <1531267857.82.965@mint-julep.mondoinfo.com>

Skip,

> Does anyone still have their setup? If so, let me know. I can
> provide a writable folder on my Google Drive you can upload to.

I have both the Ham and Spam directories that I think were Tim's
original data and also a corpus of (as of today) 38,509 spam emails.
I would be happy to give you any of that. Though you'll have to
explain to me how to use this newfangled Google Drive.

The 38,000 are spam messages sent to me and so almost certainly have
unusual characteristics.

Regards,
Matt

From tim.peters at gmail.com Tue Jul 10 20:26:32 2018
From: tim.peters at gmail.com (Tim Peters)
Date: Tue, 10 Jul 2018 19:26:32 -0500
Subject: [spambayes-dev] Anybody still have a test ham/spam database?
In-Reply-To:
References:
Message-ID:

Sorry, Skip - I don't. And I was surprised just now to see that we apparently never checked test data files into the Sourceforge source tree either!

But it shouldn't matter. SB learns pretty quickly, and it would be better to use _current_ examples of spam and ham anyway (their characteristics change over time).
On Tue, Jul 10, 2018 at 7:04 PM Skip Montanaro wrote:

> I'm going to take a crack at porting SpamBayes to Python 3. For that I
> should probably have some test data. My goal is to replicate existing
> behavior, not improve the breed.
>
> I long ago deleted what I used BITD. Does anyone still have their
> setup? If so, let me know. I can provide a writable folder on my
> Google Drive you can upload to.
>
> Thanks,
>
> Skip
> _______________________________________________
> spambayes-dev mailing list
> spambayes-dev at python.org
> https://mail.python.org/mailman/listinfo/spambayes-dev

From skip.montanaro at gmail.com Tue Jul 10 20:29:54 2018
From: skip.montanaro at gmail.com (Skip Montanaro)
Date: Tue, 10 Jul 2018 19:29:54 -0500
Subject: [spambayes-dev] Anybody still have a test ham/spam database?
In-Reply-To: <1531267857.82.965@mint-julep.mondoinfo.com>
References: <1531267857.82.965@mint-julep.mondoinfo.com>
Message-ID:

> I have both the Ham and Spam directories that I think were Tim's
> original data and also a corpus of (as of today) 38,509 spam emails.
> I would be happy to give you any of that. Though you'll have to
> explain to me how to use this newfangled Google Drive.

Thanks, Matt. Sharing email sent. Let me know if you need help.

Skip

From skip.montanaro at gmail.com Tue Jul 10 20:34:36 2018
From: skip.montanaro at gmail.com (Skip Montanaro)
Date: Tue, 10 Jul 2018 19:34:36 -0500
Subject: [spambayes-dev] Anybody still have a test ham/spam database?
In-Reply-To:
References:
Message-ID:

> Sorry, Skip - I don't. And I was surprised just now to see that we apparently never checked test data files into the Sourceforge source tree either!
>
> But it shouldn't matter. SB learns pretty quickly, and it would be better to use _current_ examples of spam and ham anyway (their characteristics change over time).
Sure, but constructing a suitable ham/spam corpus from scratch is a non-trivial task, as you no doubt remember. I could start with the collection on mail.python.org, but I suspect I would probably let a personal email or three leak through into what's ostensibly a public database. (SpamBayes has been doing a pretty good job over the years at its original assigned task.)

I am looking to insure that a Py3 port of SpamBayes works the same as the Py2 code.

Skip

From tim.peters at gmail.com Tue Jul 10 22:57:24 2018
From: tim.peters at gmail.com (Tim Peters)
Date: Tue, 10 Jul 2018 21:57:24 -0500
Subject: [spambayes-dev] Anybody still have a test ham/spam database?
In-Reply-To:
References:
Message-ID:

[Skip Montanaro]
> Sure, but constructing a suitable ham/spam corpus
> from scratch is a non-trivial task, as you no doubt
> remember.

Ah - but we had a much subtler task then: trying to construct a classifier that was _useful_. Your current task is much clearer:

> ... I am looking to insure that a Py3 port of SpamBayes
> works the same as the Py2 code.

For _that_ purpose, you can take any pile of email at all; split it into "ham" and "spam" at random, and "just" ensure you get the same results from the older and newer code. Your criterion for success isn't "closeness to human value judgment", but "same output".

For that purpose, you could synthesize gibberish email from random header & sentence generators. Although it would be easier to use real email ;-)

The point is that you don't have to worry at all about whether this or that is "really ham" or "really spam" or "really unsure" - it was making those value judgments that consumed lots of human time when building the old curated data sets.

From kirebrow at yahoo.com Wed Jul 11 11:38:29 2018
From: kirebrow at yahoo.com (Erik M. Brown)
Date: Wed, 11 Jul 2018 11:38:29 -0400
Subject: [spambayes-dev] Anybody still have a test ham/spam database?
In-Reply-To:
References:
Message-ID: <000001d4192d$385786f0$a90694d0$@yahoo.com>

I look forward to this porting development Skip, thank you! = )

I still use SB with Windows 7 Pro and 2010 Outlook. Current stats below:

Database has 2047 good and 4629 spam.
Messages classified: 182,390
Good: 55,249 (30.3%)
Spam: 123,777 (67.9%)
Unsure: 3364 (1.8%)

6 false positives...LOL! Incredible.....

Please let me know how I can help as well, regarding testing in various environments.

Take care,
Erik

-----Original Message-----
From: spambayes-dev [mailto:spambayes-dev-bounces+kirebrow=yahoo.com at python.org] On Behalf Of Skip Montanaro
Sent: Tuesday, July 10, 2018 8:35 PM
To: Tim Peters
Cc: spambayes-dev at python.org
Subject: Re: [spambayes-dev] Anybody still have a test ham/spam database?

> Sorry, Skip - I don't. And I was surprised just now to see that we apparently never checked test data files into the Sourceforge source tree either!
>
> But it shouldn't matter. SB learns pretty quickly, and it would be better to use _current_ examples of spam and ham anyway (their characteristics change over time).

Sure, but constructing a suitable ham/spam corpus from scratch is a non-trivial task, as you no doubt remember. I could start with the collection on mail.python.org, but I suspect I would probably let a personal email or three leak through into what's ostensibly a public database. (SpamBayes has been doing a pretty good job over the years at its original assigned task.)

I am looking to insure that a Py3 port of SpamBayes works the same as the Py2 code.

Skip

From tony.meyer at gmail.com Thu Jul 12 03:42:01 2018
From: tony.meyer at gmail.com (Tony Meyer)
Date: Thu, 12 Jul 2018 19:42:01 +1200
Subject: [spambayes-dev] Anybody still have a test ham/spam database?
In-Reply-To:
References:
Message-ID:

Hi Skip,

I have various corpora still but they're on drives that I would have to dig out of storage. If the existing offers fall through let me know and I can pull them out.

I have a very large amount of mail that I can't share but can definitely run tests against, that's quite varied so should manage to trigger all the different tokens. Let me know if I can help with that, or anything else.

Ngā mihi,
Tony

From skip.montanaro at gmail.com Thu Jul 12 21:30:09 2018
From: skip.montanaro at gmail.com (Skip Montanaro)
Date: Thu, 12 Jul 2018 20:30:09 -0500
Subject: [spambayes-dev] Anybody still have a test ham/spam database?
In-Reply-To:
References:
Message-ID:

> Your current task is much clearer:
>
> > ... I am looking to insure that a Py3 port of SpamBayes
> > works the same as the Py2 code.
>
> For _that_ purpose, you can take any pile of email at all; split it into
> "ham" and "spam" at random, and "just" ensure you get the same results from
> the older and newer code. Your criterion for success isn't "closeness to
> human value judgment", but "same output".

Right you are, of course.

Skip

From skip.montanaro at gmail.com Thu Jul 12 21:39:41 2018
From: skip.montanaro at gmail.com (Skip Montanaro)
Date: Thu, 12 Jul 2018 20:39:41 -0500
Subject: [spambayes-dev] Anybody still have a test ham/spam database?
In-Reply-To:
References:
Message-ID:

> I have various corpora still but they're on drives that I would have to
> dig out of storage. If the existing offers fall through let me know and I
> can pull them out.

I think I'm good, at least for now. Between what I had already and what Matt had, I should have enough. Of course, if I need a ton more, I can always use Tim's suggestion of just dividing a collection of emails up randomly into ham and spam.
> I have a very large amount of mail that I can't share but can definitely
> run tests against, that's quite varied so should manage to trigger all the
> different tokens. Let me know if I can help with that, or anything else.

If I get to the point where I push to the python3 branch, I will announce here...

Skip
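
Tim's suggestion in this thread (take any pile of email, split it into "ham" and "spam" arbitrarily, and require identical output from the old and new code) might be sketched roughly as below. This is only an illustration under stated assumptions, not SpamBayes code: the function names are made up for the example, and `score` is a hypothetical callable standing in for whatever classifier entry point is being compared. The split is derived from a hash of each message's bytes rather than from the `random` module, on the assumption that a content-based split is easier to reproduce identically under both Python 2 and Python 3.

```python
import hashlib


def assign_label(raw_message):
    """Label a message 'ham' or 'spam' from a hash of its raw bytes.

    The label carries no human judgment; it only has to be the same
    every time, on every interpreter.
    """
    digest = hashlib.sha256(raw_message).hexdigest()
    return "spam" if int(digest, 16) % 2 else "ham"


def split_corpus(messages):
    """Return (ham, spam) lists, preserving the input order."""
    ham, spam = [], []
    for msg in messages:
        if assign_label(msg) == "spam":
            spam.append(msg)
        else:
            ham.append(msg)
    return ham, spam


def score_report(messages, score):
    """Return sorted (message digest, formatted score) pairs.

    `score` is a hypothetical callable mapping raw message bytes to a
    float. Scores are formatted to six decimals so the report can be
    written to a file and diffed against a report from the other
    interpreter.
    """
    return sorted(
        (hashlib.sha256(msg).hexdigest(), "%.6f" % score(msg))
        for msg in messages
    )


def reports_match(old_report, new_report):
    """True when both runs produced exactly the same report."""
    return old_report == new_report
```

One possible use: run the same script under each interpreter, train on the `ham` and `spam` lists from `split_corpus`, dump `score_report` to a text file, and compare the two files. Any difference then points at a porting change rather than at a disagreement over what is "really" spam.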