[spambayes-bugs] [ spambayes-Bugs-1053223 ] Blackberry faster than SpamBayes
SourceForge.net
noreply at sourceforge.net
Tue Oct 26 03:14:35 CEST 2004
Bugs item #1053223, was opened at 2004-10-24 08:06
Message generated for change (Comment added) made by benslivka
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=1053223&group_id=61702
Category: Outlook
Group: Binary 1.0
Status: Open
Resolution: None
Priority: 5
Submitted By: Benjamin W. Slivka (benslivka)
Assigned to: Nobody/Anonymous (nobody)
Summary: Blackberry faster than SpamBayes
Initial Comment:
I've been running SpamBayes for over a year on Outlook
(now XP SP3). I've also had a BlackBerry for three years
(have had a 7780 since the summer). Over time, I've
noticed ever more Spam hitting my BlackBerry, even
though when I get home and check my Outlook Inbox it
is not there, but (correctly) in the Spam folder.
And recently I've noticed that when I select an email in
the "maybe SPAM" folder and click the "Delete as Spam"
button, it takes several seconds to move it to the SPAM
folder.
So I'm guessing SpamBayes has a linear (or worse than
linear) time algorithm for updating the spam database.
SpamBayes Outlook Addin Binary Version 1.0 (July 2004)
reports this training database status: Database has
30326 good and 16916 spam.
I assume one solution is for me to delete the databases
and retrain.
Have you also considered some kind of automatic
pruning to keep the databases manageable?
Here are the SB DB file sizes:
1,316 default_bayes_customize.ini
42,049,536 default_bayes_database.db
2,506,752 default_message_database.db
3,521 MS Exchange Settings.ini
Of course the other possibility is to modify your DB
algorithms so their running time is O(log N) or faster...
Thank you!
--Ben Slivka
----------------------------------------------------------------------
>Comment By: Benjamin W. Slivka (benslivka)
Date: 2004-10-25 18:14
Message:
Logged In: YES
user_id=856287
Dear Tony,
Thank you for the thorough response.
1) I didn't create the database -- that is happening "under
the covers". I assume SpamBayes built up that database all
on its own? I'm just an "end user" and I get a lot of email
(and, unfortunately, a lot of spam -- ben **at** slivka
**dot** com is too easy to figure out, I guess). And there is
no obvious way in the user interface for me to shrink or prune
the database.
2) I deleted the database files -- it didn't change the
classification speed at all -- it's still noticeably slower than it
was when I first installed SpamBayes (in 2003). Are there
some other files or registry settings I should reset/delete?
3) So you would have to redesign your database schema -- I
get that. But what advice do you have for users when the DB
gets "too large"?
4) I'm all for you all experimenting with new techniques and
training regimes!
5) I installed the Windows binary -- I'm not looking to
experiment with other database choices!
Thank you!
--Ben
----------------------------------------------------------------------
Comment By: Tony Meyer (anadelonbrin)
Date: 2004-10-25 18:03
Message:
Logged In: YES
user_id=552329
1. ~47000 messages is a very large database. Generally, it
seems that the best results can be obtained from quite small
databases (under 1000 messages), which would remove this problem.
The wiki has a lot of stuff about training strategies.
2. Classification time shouldn't be particularly related to
the db size (training certainly is). I don't know what the
system is that sends mail to the BlackBerry, but perhaps
adjusting the background filtering options could help with
this problem?
3. There has been a little investigation into expiring
messages, but the research hasn't shown that it's
particularly helpful. One major problem is that SpamBayes
relies on bags of tokens being added/removed as a set. This
means that if we were to prune the database we would want to
remove whole messages, not individual tokens. At the moment
we don't store this information, so it would mean a whole
new database/table.
4. Alternate training regimes, which keep the database size
small, like 'train to exhaustion', are likely to be the best
solution for this sort of problem. The 1.1 release of
SpamBayes will almost certainly have some sort of support
for (more easily) trying out different training regimes with
the Outlook plug-in/sb_server.
5. The database is user-selected. By default bsddb is used
(I have no idea what the access times for bsddb are meant to
be, but I'm sure Google could pull up something). You can,
however, use a pickle, MySQL, or PostgreSQL. Any of these
might help, depending on your exact requirements. A pickle
takes a lot more memory, and will be slow to load/save, but
very fast otherwise.
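
To illustrate point 3, here is a minimal sketch in Python of what storing
whole messages could look like. It is not SpamBayes's actual classifier or
database schema: the TokenStore class, its method names, and the message
ids are all hypothetical, and it keeps everything in memory rather than in
bsddb. The idea is only that each message's bag of tokens is recorded
alongside the per-token counts, so a message can later be removed as a set.

```python
from collections import Counter


class TokenStore:
    """Toy token-count store (hypothetical, not the SpamBayes API).

    Besides per-token counts, it remembers each trained message's
    token set, which is the extra information pruning would need.
    """

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        # msg_id -> (is_spam, frozenset of tokens), in training order
        self.messages = {}

    def train(self, msg_id, tokens, is_spam):
        if msg_id in self.messages:
            raise ValueError("message already trained: %r" % (msg_id,))
        token_set = frozenset(tokens)  # each token counts once per message
        counts = self.spam_counts if is_spam else self.ham_counts
        counts.update(token_set)
        self.messages[msg_id] = (is_spam, token_set)

    def untrain(self, msg_id):
        """Remove one message's whole bag of tokens as a set."""
        is_spam, token_set = self.messages.pop(msg_id)
        counts = self.spam_counts if is_spam else self.ham_counts
        for token in token_set:
            counts[token] -= 1
            if counts[token] == 0:
                del counts[token]  # drop tokens no message uses any more

    def prune_oldest(self, keep):
        """Untrain the oldest messages until at most `keep` remain."""
        # dicts preserve insertion order (Python 3.7+)
        while len(self.messages) > keep:
            self.untrain(next(iter(self.messages)))
```

The cost is the one described above: the store has to hold every trained
message's tokens in addition to the counts, which is roughly what a new
database/table would have to do.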
----------------------------------------------------------------------