[spambayes-dev] Help with research project

Thu Aug 14 19:52:00 CEST 2008

Hi,
I'm doing a research project on Bayesian spam filtering and have a few
questions about spambayes. I'm trying to write a script that creates a
db containing every word I give it as input, with nham=0 and nspam=0
set in each word's wordinfo. Currently, my
plan to do this is to take the set of words and put them in an mbox with
the "to" and "subject" headers set to some arbitrary value and the message
set to the words I gave it as input. I then pass this mbox to sbmboxtrain
as the spam/ham file, creating the db. Then I iterate through the words
and set each word's nham and nspam to 0, remembering to remove the tokens
generated by the arbitrary "to" and "subject" headers. Would this work? Is
there an easier way to do this? I'm pretty sure that using "h", the object
returned by hammie.open(), could make this much easier, but tracing
through the code is a bit hard. Is there an easy way to create a blank db
and add new wordinfos to it? Further, I'm not sure how headers are
tokenized. From the output I usually see, it seems that they're
tokenized as header:headername:headercontent. If the body of a message
contains the same string header:headername:headercontent, would spambayes
treat it as the same token as the one produced by a header with that
name and content?
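
To make the mbox step of the plan above concrete, here is a rough sketch
using only Python's standard mailbox and email modules. It does not call
any spambayes code. The function name write_words_to_mbox, the
placeholder header values, and the one-word-per-line body are all
assumptions made for illustration.

```python
import mailbox
from email.message import EmailMessage


def write_words_to_mbox(words, mbox_path):
    """Append one message to an mbox; its body holds the given words.

    Hypothetical helper, not part of spambayes.
    """
    msg = EmailMessage()
    # Arbitrary placeholder headers (assumed values). Any tokens they
    # produce during training would have to be removed from the db later.
    msg["From"] = "placeholder@example.com"
    msg["To"] = "placeholder@example.com"
    msg["Subject"] = "placeholder"
    # One word per line, so whitespace splitting recovers each word.
    msg.set_content("\n".join(words))

    box = mailbox.mbox(mbox_path)  # creates the file if it is missing
    box.lock()
    try:
        box.add(msg)
        box.flush()  # write the pending message to disk
    finally:
        box.unlock()
        box.close()
    return mbox_path
```

Calling write_words_to_mbox(["word1", "word2"], "words.mbox") would leave
a single-message mbox that could then be given to the training script as
described above.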
Thanks for your time,
Anthony