[spambayes-dev] effective tokenizer for wiki text
Tony Meyer
tameyer at ihug.co.nz
Tue Oct 31 01:37:17 CET 2006
[Skip]
> Why not just create an "email message" out of the input? If the headers are
> identical in every message they won't generate any useful tokens and the
> message body will be all that yields useful clues. OTOH, if you have login
> or IP address information for the spammers, you might suitably populate the
> From: field.
ISTM that it would be just as little work to create a Tokenizer
subclass that tokenizes wiki pages as to write a "wiki-page to email"
module. With a subclass you can skip all of the header tokenization
(and any email-specific tokenization in the body, if there is any, but
I can't think of any) and generate additional tokens from whatever
metadata is available (maybe comment, author, etc.).
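For comparison, the "wiki-page to email" route Skip describes could look roughly like the sketch below. It uses only the standard library's email package; the function name, the choice of From: for the author or IP address, and the use of Subject: for the edit comment are assumptions made for illustration, not part of SpamBayes.

```python
from email.message import Message


def wiki_page_to_message(text, author=None, comment=None):
    """Wrap wiki page text in a minimal email message (hypothetical helper)."""
    msg = Message()
    if author:
        # Collapse whitespace so the value cannot break out of the header.
        msg["From"] = " ".join(author.split())
    if comment:
        msg["Subject"] = " ".join(comment.split())
    # The page text (or diff) becomes the message body.
    msg.set_payload(text, "utf-8")
    return msg
```

The resulting object can be handed to anything that expects an email message, or serialized with msg.as_string().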
[Matt]
>> Are there examples from other people that have written custom tokenizers
>> that may be helpful, or do you have any hints on what to take into
>> account for writing an effective tokenizer for Wiki text?
What exactly gets passed to the tokenizer? Anything more than just
the content (complete? diff?) of the wiki page? If it's just the
content/diff then other than the words themselves, URLs are probably
the most useful content. You could try enabling (or improving) the
URL slurping code, perhaps.
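As a rough illustration of getting clues out of URLs, the sketch below pulls http/https links out of the text and yields the scheme, host, and path pieces as separate tokens. It does not fetch the linked pages, which is what slurping would add. The regular expression, the function name, and the "proto:" and "url:" prefixes are assumptions made for this example.

```python
import re
from urllib.parse import urlsplit

# Stop at whitespace and at characters wiki markup commonly wraps links in.
URL_RE = re.compile(r"https?://[^\s<>\"'\[\]|]+", re.IGNORECASE)


def url_tokens(text):
    """Yield tokens describing each URL found in text (hypothetical helper)."""
    for match in URL_RE.finditer(text):
        url = match.group(0).rstrip(".,;:!?)")
        try:
            parts = urlsplit(url)
        except ValueError:
            continue  # skip anything urlsplit refuses to parse
        yield "proto:" + parts.scheme.lower()
        if parts.hostname:
            yield "url:" + parts.hostname
        for piece in parts.path.split("/"):
            if piece:
                yield "url:" + piece.lower()
```

For example, a page containing [http://Spam.Example.com/buy/now.html cheap] would produce proto:http, url:spam.example.com, url:buy and url:now.html.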
> So far, I think most of us have bent our input to look like email. I think
> that would be a lot easier than writing and debugging a new tokenizer.
A tokenizer's pretty simple, really - all it has to do is take the
object you want to tokenize and yield a series of strings. It's been
a couple of years, but I wrote some non-email tokenizers at one point.
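A minimal sketch of that idea, assuming the tokenizer is handed the page text plus optional author and comment metadata: a plain generator that yields strings. The function name, the token prefixes, and the word-length limits are illustrative choices, not SpamBayes defaults.

```python
import string


def tokenize_wiki_page(text, author=None, comment=None):
    """Yield string tokens for a wiki page (hypothetical stand-alone tokenizer)."""
    # Metadata tokens get a prefix so they never collide with body words.
    if author:
        yield "author:" + author.lower()
    for word in (comment or "").lower().split():
        word = word.strip(string.punctuation)
        if word:
            yield "comment:" + word
    # Body tokens: lowercased words with surrounding punctuation removed.
    for word in text.lower().split():
        word = word.strip(string.punctuation)
        if 3 <= len(word) <= 12:
            yield word
```

Calling list(tokenize_wiki_page("Cheap pills, buy NOW!", author="Bob", comment="Fixed typo")) gives author:bob, comment:fixed, comment:typo, cheap, pills, buy, now.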
=Tony.Meyer