[Tutor] about a program

Abel Daniel abli@freemail.hu
Sun Jun 8 12:23:10 2003


Abdirizak abdi wrote:
> I was working on a program that verifies whether a given message is spam or
> not. the program uses statistical analysis based on Paul Graham's plan for
> spam. 
I hope you know that others are already working on similar projects.
Google for "spambayes" for example.

[...snipped code ...]

I didn't really check the mathematical part where you calculate the
probabilities.

The code you posted has an obvious problem in that build_corpus()
creates an instance of Classifier, fills it with the data, and returns
it.
However, in main(), where you call build_corpus(), you don't store the
return value anywhere, so it is lost. Then, when you are testing the
message, you create a new instance of Classifier, which won't have any
data about the probabilities. This in itself will render the program
useless.
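
Since the original code was snipped, here is only a minimal sketch of
the problem and the fix. The names Classifier and build_corpus come
from your post; the spam_counts attribute and the hard-coded word are
assumptions made up for illustration.

```python
# Hypothetical stand-in for the snipped code. Only the names
# Classifier and build_corpus come from the original post.
class Classifier:
    def __init__(self):
        # word -> number of occurrences in spam (assumed layout)
        self.spam_counts = {}


def build_corpus():
    c = Classifier()
    c.spam_counts["viagra"] = 3  # stands in for the real training step
    return c


def main_broken():
    build_corpus()      # the trained instance is thrown away here
    c = Classifier()    # a fresh instance that knows nothing
    return c.spam_counts


def main_fixed():
    c = build_corpus()  # keep the trained instance and test with it
    return c.spam_counts
```

main_broken() ends up with an empty dictionary, while main_fixed()
still has the data collected during training.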

Other remarks:
1) The code for save_Data() is missing. Classifier inherits from
ClassifierI, which is also missing. (These two problems most likely have
nothing to do with your current problem.)

2) Why don't you make build_corpus a method of Classifier? That way you
could use it like:
c = Classifier('spam/*.txt', 'non-spam/*.txt')  # __init__ calls build_corpus
c.isSpam('file.txt')

3) There is no reason to use that ugly hack of storing the number of
words as the number of occurrences of '*'. Put it somewhere else,
like Classifier.num_spam_words . (I would guess that the odds of
finding a * as a word are pretty high, so that will break your numbers,
not to mention that using such special cases is simply stupid.)

4) If you put the word-counting in a separate function, you could
also place the iteration over files there, too.
(The "for file in glob("NonSpam/1000*-*.txt"):" loop )

5) If you use split() in build_corpus(), why do you use a regex in
isSpam? Wouldn't you want to use the same method?
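
Remarks 2) to 5) could fit together roughly like this. It is a sketch
under assumptions, not a drop-in replacement: tokenize(), count_words(),
spam_counts, nonspam_counts and num_nonspam_words are names invented
here, and the probability calculation in isSpam() is left out because I
didn't check that part.

```python
import glob
from collections import Counter


def tokenize(text):
    # One tokenizer for both training and testing (remark 5).
    # Plain split() is an assumption; any rule will do as long as
    # both sides use the same one.
    return text.split()


def count_words(pattern):
    # Word counting and the loop over files live together (remark 4).
    counts = Counter()
    for filename in glob.glob(pattern):
        with open(filename) as f:
            counts.update(tokenize(f.read()))
    return counts


class Classifier:
    def __init__(self, spam_pattern, nonspam_pattern):
        # Building the corpus happens on construction (remark 2).
        self.spam_counts = count_words(spam_pattern)
        self.nonspam_counts = count_words(nonspam_pattern)
        # Totals get their own attributes instead of hiding under
        # the '*' key (remark 3).
        self.num_spam_words = sum(self.spam_counts.values())
        self.num_nonspam_words = sum(self.nonspam_counts.values())
```

With this layout a literal '*' in a message is counted like any other
word and cannot clash with the totals.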

That's all for now,
Abel Daniel