[spambayes-dev] Results for DNS lookup in tokenizer

Tue Apr 13 13:52:41 EDT 2004

>>> http://sourceforge.net/projects/pydns/

[Tony Meyer]
>> This concerns me a bit.  I'd want to see really dramatic results
>> before something in the core distribution required non-standard
>> libraries to be installed.

I don't necessarily disagree. Still, even if it went into the core
distribution, it would surely be sensible to have it turned off by
default, and distutils makes installing PyDNS pretty simple.
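
To make "turned off by default" concrete, here is a minimal sketch of
how an optional dependency could be gated. The option name
"use_dns_tokens" and the helper dns_tokens_enabled() are made up for
illustration and are not existing SpamBayes names; the only real name
assumed is the top-level DNS package that PyDNS installs.

```python
# Optional dependency: PyDNS installs a top-level package named DNS.
try:
    import DNS
except ImportError:
    DNS = None

# Hypothetical option table; the key name is an assumption, and the
# feature stays off unless the user explicitly turns it on.
DEFAULT_OPTIONS = {"use_dns_tokens": False}


def dns_tokens_enabled(options=DEFAULT_OPTIONS):
    """Return True only if the user opted in AND PyDNS is importable."""
    return bool(options.get("use_dns_tokens")) and DNS is not None
```

With a check like this, users who never install PyDNS see no change
in behaviour, and the core distribution keeps working without it.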

I've thought for a while that it would be good to get some DNS module
into Python's standard library, but I've never thought that I had a
strong enough argument to bring it up publicly. Using it in SpamBayes
might be a start.

[Kenny Pitt]
> Any reason why socket.gethostbyname(hostname) wouldn't work?  I
> wrote a patch a while back using that function to do DNS queries
> against a DNSBL blacklist server and create additional tokens based
> on the results.

As far as I can tell, socket.gethostbyname() doesn't respect the
timeout set by socket.setdefaulttimeout(), so a slow lookup can block
for as long as the system resolver takes. That's apt to make the
performance hit rather worse.
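
For anyone who wants to stay with the standard library, one possible
workaround is to run the blocking call in a helper thread and stop
waiting after a deadline. This is only a sketch, not what my patch
does: lookup_with_timeout() and its "resolver" parameter are
hypothetical names, and the abandoned thread keeps running in the
background until the system resolver gives up.

```python
import socket
import threading


def lookup_with_timeout(hostname, timeout=2.0,
                        resolver=socket.gethostbyname):
    """Resolve hostname, giving up after `timeout` seconds.

    Returns the resolver's result, or None on failure or timeout.
    """
    result = []

    def worker():
        try:
            result.append(resolver(hostname))
        except OSError:
            # socket.gaierror and socket.herror are OSError subclasses;
            # a failed lookup simply leaves `result` empty.
            pass

    # Daemon thread: a lookup that is still stuck will not keep the
    # interpreter alive at exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)  # wait at most `timeout` seconds
    return result[0] if result else None
```

Usage would be something like lookup_with_timeout("example.com",
timeout=1.0). The cost is one extra thread per lookup, and threads
piling up if many lookups hang at once.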

> There are two problems with doing DNS queries during tokenization.
> The first is performance because you're having to wait for the
> result of network operations instead of just manipulating local
> data.  My DNSBL queries worked well, but didn't improve the overall
> accuracy enough to justify the performance hit.

Personally, as long as I set the timeout pretty low, I barely notice
the difference. When my mail client fetches a couple of emails,
they're scored quickly enough that I don't notice an additional
delay. If it fetches 100 or so, that's going to take a while in
either case. No doubt, other people would have different experiences.

> The second is training.  DNS lookups are by nature dynamic, so the
> results generated are not necessarily the same every time you do
> it. Training (in particular, correcting the training of a message
> that was previously trained incorrectly) relies on the tokens that
> get generated for a particular message being identical every time
> the message is tokenized.  If some of the tokens rely on additional
> data from a DNS query, those tokens may be different when the user
> gets around to retraining the message.

That's certainly a disadvantage. I think that legitimate servers
don't move around all that much, so it may turn out to be a
relatively small one, but it would be nice to know for sure.
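
One way to find out would be to tokenize the same messages at two
different times and compare the token sets. A minimal sketch of such
a comparison follows; token_drift() is a hypothetical helper, not
part of SpamBayes, and it assumes the tokens have already been
collected from the two runs.

```python
def token_drift(tokens_then, tokens_now):
    """Fraction of distinct tokens that appear in only one of two runs.

    0.0 means both runs produced the same token set; 1.0 means they
    share no tokens at all.
    """
    then, now = set(tokens_then), set(tokens_now)
    union = then | now
    if not union:
        return 0.0  # two empty token sets count as identical
    # Symmetric difference: tokens present in exactly one run.
    return len(then ^ now) / len(union)
```

Averaging this over a corpus, with the runs some days or weeks
apart, would give a rough idea of how often retraining sees
different DNS-derived tokens.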

Regards,
Matt