[Mailman-Developers] GSOC 2013 project discussion
Avik Pal
avikpal.me at gmail.com
Thu Apr 18 06:03:46 CEST 2013
Thanks a lot, Stephen, for all the suggestions :)
Avik Pal
Bengal Engineering & Science University, Shibpur
github:https://github.com/avikpal
IRC:- irc://freenode/avikp,isnick
twitter:-https://twitter.com/avikpalme
On 17 April 2013 22:36, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Avik Pal writes:
>
> > Meanwhile, it would be much appreciated if someone could direct me
> > to a labeled dataset available online.
>
> By "labelled" you mean pre-classified into spam vs ham? I see you
> already found one, but you could also check the SpamBayes and
> SpamAssassin distributions.
>
> > Here I have a suggestion: after submitting, whenever an email is
> > classified as spam, we store it in a separate archive and at the
> > end of the day send the subscriber a mail saying "this is the
> > digest of all the mails that Mailman thinks are spam". The
> > subscriber can go there, view them, and mark them as not spam,
>
> I suggest that you present this as an option for users who want to
> tune the filters, and as something that can be used pre-release to
> develop the initial parameters for the distributed classifier.
> Although Bayesian classifiers do offer the option to train or tune
> your personal classifier on a local corpus, most users just stick with
> the distribution parameters plus self-training. It's pretty effective
> (surprisingly so to me). I guess the logic is that spammers aren't
> terribly creative.
>
> > Emails which stay marked as spam will be dropped after a month
>
> Let's think carefully about that. Everybody deletes the spam; that's
> why you started by asking for a labelled dataset, because nobody keeps
> one around. Somebody really ought to do the public service of
> collecting a corpus. Of course, if you do arrange to keep it around,
> it's going to need to be an option that sites and list owners can
> disable.
>
>
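To make the points above concrete, here is a rough sketch of how the
quarantine could behave: hold what the classifier flags, list it in a daily
digest, let a subscriber release false positives, purge the rest after a
month, and let a site or list owner turn retention off entirely. Everything
in it (the Quarantine class, its method names, the enabled flag, the 30-day
default) is a hypothetical illustration, not existing Mailman code or API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Quarantine:
    """Hypothetical holding area for messages classified as spam."""
    enabled: bool = True                        # owners can disable retention
    retention: timedelta = timedelta(days=30)   # assumed "a month"
    held: dict = field(default_factory=dict)    # msg id -> (subject, time held)

    def hold(self, msg_id, subject, now):
        # With retention disabled nothing is stored; the message is just dropped.
        if self.enabled:
            self.held[msg_id] = (subject, now)

    def digest(self, now):
        # Subjects of everything held during the 24 hours up to `now`.
        since = now - timedelta(days=1)
        return [subj for subj, t in self.held.values() if since <= t <= now]

    def mark_not_spam(self, msg_id):
        # Release a false positive; returns its subject, or None if unknown.
        entry = self.held.pop(msg_id, None)
        return entry[0] if entry else None

    def purge(self, now):
        # Drop whatever is still held once the retention period has passed.
        expired = [m for m, (_, t) in self.held.items()
                   if now - t >= self.retention]
        for m in expired:
            del self.held[m]
        return len(expired)


if __name__ == "__main__":
    start = datetime(2013, 4, 18, 6, 0)
    q = Quarantine()
    q.hold("id-1", "Cheap pills", start)
    q.hold("id-2", "Meeting notes", start)
    print(q.digest(start + timedelta(hours=12)))
    q.mark_not_spam("id-2")
    print(q.purge(start + timedelta(days=30)))
```

Messages released with mark_not_spam, together with those left to expire,
would also form the kind of labelled spam/ham corpus discussed above, but
only on lists where the owner has left retention enabled.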