[spambayes-dev] Results for DNS lookup in tokenizer

Kenny Pitt kennypitt at hotmail.com
Tue Apr 13 10:28:39 EDT 2004


Tony Meyer wrote:
> Have you tried using the x-slurp_urls option as a solution for this
> problem? (I'm not saying it's a better solution, just curious if you
> have, and if so, what the results were).
> 
>> In case anyone would like to play with it, I'll append my trivial
>> patch. It requires pydns from:
>> 
>> http://sourceforge.net/projects/pydns/
> 
> This concerns me a bit.  I'd want to see really dramatic results
> before something in the core distribution required non-standard
> libraries to be installed.

Any reason why socket.gethostbyname(hostname) wouldn't work?  I wrote a
patch a while back that used that function to query a DNSBL server and
create additional tokens based on the results.
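
For anyone who wants to experiment, here is a minimal sketch of that
kind of lookup using only the standard library.  It is not the original
patch: the zone name "dnsbl.example.org", the token format, and the
dnsbl_tokens() helper are all made up for illustration, and the
resolver is passed in as a parameter so it can be swapped out.

```python
import socket

def dnsbl_tokens(ip, zone="dnsbl.example.org", resolve=socket.gethostbyname):
    """Yield one token saying whether IPv4 address `ip` is listed in `zone`.

    The zone name and the token format are hypothetical placeholders.
    """
    # A DNSBL is queried by reversing the IPv4 octets and appending the zone,
    # e.g. 1.2.3.4 -> 4.3.2.1.dnsbl.example.org
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        answer = resolve(query)
    except OSError:
        # socket.gaierror is a subclass of OSError.  No address came back,
        # so the IP is not listed (or the lookup itself failed).
        yield "dnsbl:%s:unlisted" % zone
    else:
        # Listed addresses conventionally resolve to something in 127.0.0.x.
        yield "dnsbl:%s:listed:%s" % (zone, answer)
```

Calling list(dnsbl_tokens("1.2.3.4")) performs a real DNS query, so the
result depends on the zone you point it at.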

There are two problems with doing DNS queries during tokenization.  The
first is performance: you have to wait for network operations to
complete instead of just manipulating local data.  My DNSBL queries
worked well, but didn't improve the overall accuracy enough to justify
the performance hit.
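
One common way to soften that cost, which my patch did not do, is to
remember answers so each distinct name is only looked up once per run.
A rough sketch, assuming an in-memory dict is good enough;
make_cached_resolver() is a hypothetical helper, not anything in
SpamBayes:

```python
import socket

def make_cached_resolver(resolve=socket.gethostbyname):
    """Wrap `resolve` so each hostname hits the network at most once."""
    cache = {}

    def cached(hostname):
        if hostname not in cache:
            try:
                cache[hostname] = resolve(hostname)
            except OSError:
                # Remember failures too, so a dead name is not retried.
                cache[hostname] = None
        return cache[hostname]

    return cached
```

This only helps when the same names recur across messages; the first
lookup for each name still waits on the network.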

The second is training.  DNS lookups are by nature dynamic, so a query
does not necessarily return the same result every time you run it.
Training (in particular, correcting the training of a message that was
previously trained incorrectly) relies on the tokens generated for a
particular message being identical every time the message is tokenized.
If some of the tokens depend on the result of a DNS query, those tokens
may be different by the time the user gets around to retraining the
message.
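
To make the problem concrete, here is a toy illustration.  It is not
SpamBayes code: tokenize(), train() and untrain() are simplified
stand-ins, and the lookups are simulated with lambdas returning
documentation-range addresses rather than real DNS.

```python
from collections import Counter

def tokenize(message, resolve):
    """Hypothetical tokenizer: the words plus one token from a lookup."""
    tokens = set(message.split())
    tokens.add("dns:" + resolve("example.com"))
    return tokens

def train(counts, tokens):
    counts.update(tokens)      # add one to the count of each token

def untrain(counts, tokens):
    counts.subtract(tokens)    # remove one from the count of each token

spam_counts = Counter()
message = "cheap pills"

# At training time the simulated lookup returns one address...
train(spam_counts, tokenize(message, lambda host: "192.0.2.1"))
# ...but by the time the user corrects the mistake, the record has changed.
untrain(spam_counts, tokenize(message, lambda host: "192.0.2.99"))

# Tokens whose counts did not return to zero.
leftover = {tok: n for tok, n in spam_counts.items() if n != 0}
```

The word tokens cancel out, but leftover still holds the old DNS token
with a count of 1 and the new one with a count of -1, so the untraining
did not fully undo the training.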

-- 
Kenny Pitt



