[Spambayes] URL parsing improvement ideas
Meyer, Tony
T.A.Meyer at massey.ac.nz
Fri Aug 29 14:44:32 EDT 2003
> I wonder why 1) was a loss.
I suspect it is because the comment is right, and the presence of %
escapes is a clue.
> Perhaps it should add a special
> token when it finds any % escapes, and then replace them.
> Care to try this as well? :-)
I'm not sure exactly what you mean. If I have "read%20me.html", do you
mean there is a token "read", a token "me", a token "html", and a "url:
has_escape" token?
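To make the question concrete, here is a minimal sketch of that reading of the suggestion: flag the escape first, then decode and split. It is not the tokenizer.py code. The helper name tokenize_url_piece, the split rule, and the exact token spellings are assumptions for illustration, and it uses Python 3's urllib.parse.unquote rather than the urllib.unquote used in this thread.

```python
import re
from urllib.parse import unquote  # Python 3 location of urllib.unquote

# A % followed by two hex digits, e.g. %20.
ESCAPE_RE = re.compile(r"%[0-9A-Fa-f]{2}")
# Split on any run of non-alphanumeric characters (an assumed rule).
SPLIT_RE = re.compile(r"[^A-Za-z0-9]+")


def tokenize_url_piece(piece):
    """Hypothetical helper: emit a clue for % escapes, then decode and split."""
    tokens = []
    if ESCAPE_RE.search(piece):
        # Record that escapes were present before they are replaced.
        tokens.append("url:has_escape")
    decoded = unquote(piece)
    tokens.extend("url:" + part for part in SPLIT_RE.split(decoded) if part)
    return tokens
```

With this sketch, "read%20me.html" gives the escape clue plus one token each for "read", "me" and "html", while "index.html" gives only the two word tokens.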
> Please send me the source code
Note that this might not be (and for 2 *is* not) the fastest/best way to
do these things. I was just going for a quick implementation to test
the concepts.
> > 1) Replace % escapes.
I added this after line 985 of tokenizer.py.
"""
import urllib
piece = urllib.unquote(piece)
"""
> > 2) Find server names for ip addresses.
(Results for this one are still coming.) I added this after line 984 of tokenizer.py.
"""
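if '.' in piece:
    import socket
    try:
        # Reverse DNS lookup; element 0 of the result is the host name.
        piece = socket.gethostbyaddr(piece)[0]
    except socket.error:
        # Lookup failed (e.g. no reverse record): keep the original piece.
        pass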
"""
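A slightly more careful version might only do the lookup when the piece really is an IP address, since '.' also appears in ordinary host names. The following is a sketch under that assumption, written for Python 3; the names is_ip_literal and resolve_piece and the injectable lookup parameter are hypothetical and only there so the function can be exercised without touching the network.

```python
import ipaddress
import socket


def is_ip_literal(piece):
    """Return True if piece parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(piece)
    except ValueError:
        return False
    return True


def resolve_piece(piece, lookup=socket.gethostbyaddr):
    """Replace an IP address with its reverse-DNS name, if one is found."""
    if not is_ip_literal(piece):
        return piece
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist).
        return lookup(piece)[0]
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses in Python 3.
        return piece
```

Each lookup can block for a while, so this is still not the fast way to do it; caching results per address would be the obvious next step.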
> > 3) Remove numbers from the end of domain names (experimental).
> > www.buythis123.com => url:buythis
I added this after line 985 of tokenizer.py.
"""
while chunk and chunk[-1] in '0123456789':
    chunk = chunk[:-1]
"""
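The same stripping can be written with str.rstrip, which avoids the loop. This is only an equivalent sketch; the function name strip_trailing_digits is hypothetical and is not part of tokenizer.py.

```python
import string


def strip_trailing_digits(chunk):
    """Remove any run of ASCII digits from the end of chunk."""
    # rstrip treats its argument as a set of characters to remove.
    return chunk.rstrip(string.digits)
```

Digits in the middle of a chunk are left alone, and a chunk that is all digits comes back as an empty string, so the caller would probably want to skip empty results.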
> > Or add a special token for domains ending with a number.
I added this after line 985 of tokenizer.py.
"""
if chunk and chunk[-1] in '0123456789':
    pushclue("url: ends_in_number")
"""
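As a standalone illustration of the special-token variant, the sketch below walks the dot-separated chunks of a domain and collects the clues in a list. The function domain_clues is hypothetical, and returning a list instead of calling pushclue is an assumption made so the example runs on its own.

```python
def domain_clues(domain):
    """Return url: clues for each chunk, plus a marker for trailing digits."""
    clues = []
    for chunk in domain.lower().split("."):
        if not chunk:
            continue  # skip empty chunks from doubled or trailing dots
        clues.append("url:" + chunk)
        if chunk[-1] in "0123456789":
            # The chunk is kept as-is; the marker is an extra clue.
            clues.append("url: ends_in_number")
    return clues
```

Unlike option 3, this keeps the original chunk and adds the marker alongside it, so the classifier can weigh both.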
=Tony Meyer