[Spambayes] sharing wordlists - better numbers

Tue May 27 10:32:29 EDT 2003

Brad Clements wrote:
> You expressed this sentiment last week, so I think you're up to 4 cents now. ;-)
ok, i'll give you three :)

> My excuse continues to be, let's pass the first stage before worrying about the 
> technical issues of deployment. We may never get that far anyway.

hey, in terms of problem definition i am with you all the way. there are 
many who would love to see a centralized solution that provides 
acceptable results. however, can you determine this by directly 
comparing word counts for redundancy? it would seem that the only way to 
effectively test for this would be to run an 'averaged' db against 
numerous spam/ham profiles and see if it works well enough.
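
to make that kind of test concrete, here is a minimal sketch of running 
one shared db against several users' labeled mail and reporting error 
rates per profile. everything in it is an assumption for illustration: 
the names (classify, error_rates, evaluate_shared_db), the toy word 
probabilities, the 0.9 cutoff, and the naive product rule used for 
scoring, which is a stand-in and not the combining scheme spambayes 
actually uses.

```python
def classify(words, db, cutoff=0.9, unknown=0.5):
    """Return True if the message looks like spam under `db`.

    `db` maps word -> estimated spam probability. Scoring is a naive
    product rule (hypothetical stand-in, not the Spambayes scheme).
    """
    spam = ham = 1.0
    for w in words:
        p = db.get(w, unknown)  # unseen words count as neutral
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham) >= cutoff


def error_rates(db, labeled):
    """`labeled` is a list of (words, is_spam) pairs for one profile.

    Returns (false_positive_rate, false_negative_rate).
    """
    hams = [w for w, is_spam in labeled if not is_spam]
    spams = [w for w, is_spam in labeled if is_spam]
    fp = sum(classify(w, db) for w in hams)       # ham flagged as spam
    fn = sum(not classify(w, db) for w in spams)  # spam let through
    return fp / max(len(hams), 1), fn / max(len(spams), 1)


def evaluate_shared_db(db, profiles):
    """Score one shared db against every profile's own labeled mail."""
    return {name: error_rates(db, msgs) for name, msgs in profiles.items()}


# Toy data, invented for the example.
shared_db = {"mortgage": 0.99, "meeting": 0.05}
profiles = {
    "alice": [(["mortgage"], True), (["meeting"], False)],
    # bob works at a lender, so "mortgage" shows up in his ham
    "bob": [(["python", "mortgage"], False), (["mortgage"], True)],
}
results = evaluate_shared_db(shared_db, profiles)
for name, (fp_rate, fn_rate) in results.items():
    print(f"{name}: fp={fp_rate:.2f} fn={fn_rate:.2f}")
```

in this toy run the shared db is fine for alice but flags bob's 
legitimate mail, which is the kind of per-profile difference a straight 
word-count comparison would not show.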

the reason that i brought this up is that it seems like there is an 
increasing amount of work now going on to determine commonality, yet i 
cannot fathom an outcome that will solve your original question (a 
shallow thought pool to be sure ;-)

> Another thought.. In the case of 7000 users, how many are really going to bother to 
> train? We know that a single person's weights probably don't speak for the whole 
> community, but does an average of weights of a few members of the community 
> represent the average of the weights of the entire community?
> 
> In other words, for those orgs who want some control over their spam, could the 
> average weighting of 10 members out of 1000 reasonably represent the average of 
> all 1000 members?
>
> Heh, I know there's a technical name for this.. the mean of a sub-sample approaches 
> the mean of the entire sample .. something like that.
>
> So I'm thinking .. suppose you allow people to keep their private weights, but for 
> those who just want "good enough" filtering, they use a "synthesized database" which 
> represents the "average" of the private database weights.
> 
> Do you average the word weights across private databases before scoring, or do you 
> average the scores?

<caveat>i am a stats numbskull</caveat> i think you are going to find 
some unexpected results when you start using averages as the basis for 
decision making (average = dilution). i'd be willing to bet a peanut 
butter sandwich that as your 'sub sample' grows your results 
deteriorate, and that one person may actually be able to offer the best 
representation. :-P
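
on the 'average the weights or average the scores' question, a small 
sketch shows that the two are not interchangeable. the assumptions here 
are loud ones: two made-up user dbs, a naive product rule for scoring 
(not what spambayes really does), unknown words treated as 0.5, and 
hypothetical helper names (score, average_dbs, and so on). it needs 
python 3.8 or later for math.prod.

```python
from math import prod


def score(words, db, unknown=0.5):
    """Naive product-rule spam score in [0, 1] (illustrative only)."""
    ps = [db.get(w, unknown) for w in words]
    spam = prod(ps)
    ham = prod(1.0 - p for p in ps)
    return spam / (spam + ham)


def average_dbs(dbs, unknown=0.5):
    """Average each word's spam probability across the private dbs."""
    words = set().union(*dbs)
    return {w: sum(db.get(w, unknown) for db in dbs) / len(dbs) for w in words}


def score_with_averaged_weights(words, dbs):
    """Option 1: build one synthesized db, then score once."""
    return score(words, average_dbs(dbs))


def average_of_scores(words, dbs):
    """Option 2: score against each private db, then average the scores."""
    return sum(score(words, db) for db in dbs) / len(dbs)


# Two invented users who disagree about the word "python".
user_a = {"python": 0.01, "mortgage": 0.99}
user_b = {"python": 0.99, "mortgage": 0.99}
message = ["python", "mortgage"]

by_weights = score_with_averaged_weights(message, [user_a, user_b])
by_scores = average_of_scores(message, [user_a, user_b])
print(f"averaged weights: {by_weights:.4f}")
print(f"averaged scores:  {by_scores:.4f}")
```

with these numbers, averaging the weights washes the disagreement over 
"python" out to a neutral 0.5 and the message scores about 0.99, while 
averaging the scores gives roughly 0.75 because user_a's own db was 
undecided. neither says which is better for real mail; only testing will 
tell.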

taken one step further, it may make sense to have a 'profile db' for each 
of a variety of user types (student, teacher, staff, IS, guest, etc.), 
whereby a single user's db is used to make decisions for those who are 
not inclined to train but have similar interests. then again, maybe not :) 
as has been pointed out to me on a number of occasions, 'only testing 
will tell'. ;-)

bottom line: i am not trying to dissuade anyone from pursuing the 
solution, just openly 'musing' as well.

b