[Spambayes] URL parsing improvement ideas

Fri Aug 29 14:44:32 EDT 2003

> I wonder why 1) was a loss.

I suspect it is because the comment is right, and the presence of %
escapes is a clue.

> Perhaps it should add a special 
> token when it finds any % escapes, and then replace them. 
> Care to try this as well? :-)

I'm not sure exactly what you mean.  If I have "read%20me.html", do you
mean there is a token "read", a token "me", a token "html", and a
"url: has_escape" token?
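
To make that reading concrete, here is a minimal sketch of it in
Python 3.  The helper name tokens_with_escape_flag and the rule of
splitting on non-alphanumeric characters are assumptions made for
illustration; they are not taken from tokenizer.py.

```python
import re
from urllib.parse import unquote


def tokens_with_escape_flag(piece):
    """Hypothetical helper: flag % escapes, decode them, then split."""
    tokens = []
    # A % followed by two hex digits is treated as an escape.
    if re.search(r"%[0-9A-Fa-f]{2}", piece):
        tokens.append("url: has_escape")
        piece = unquote(piece)
    # Assumed splitting rule: break on runs of non-alphanumerics.
    tokens.extend(chunk for chunk in re.split(r"[^A-Za-z0-9]+", piece) if chunk)
    return tokens


print(tokens_with_escape_flag("read%20me.html"))
```

Under those assumptions the special token comes first, followed by
"read", "me" and "html".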

> Please send me the source code

Note that this might not be (and for 2 *is* not) the fastest/best way to
do these things.  I was just going for a quick implementation to test
the concepts.

> > 1) Replace % escapes.

I added this after line 985 of tokenizer.py.
"""
            import urllib
            piece = urllib.unquote(piece)
"""
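
For anyone trying this on Python 3, where urllib.unquote has moved to
urllib.parse.unquote, a rough equivalent looks like the following.  The
sample value of piece is only an illustration.

```python
from urllib.parse import unquote  # Python 3 location of unquote

piece = "read%20me.html"  # illustrative input, not from the test corpus
piece = unquote(piece)    # %20 is decoded to a space
print(piece)
```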

> > 2) Find server names for ip addresses.

(Results are still coming.)  I added this after line 984 of tokenizer.py.
"""
            if '.' in piece:
                import socket
                try:
                    piece = socket.gethostbyaddr(piece)[0]
                except socket.error:
                    # Lookup failed; keep the original piece.
                    pass
"""
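
Since the '.' test also matches ordinary host names, a slightly more
careful variant would look up only pieces that parse as IPv4 literals.
This is a sketch in Python 3, not what was tested above; the helper name
hostname_for and its lookup parameter (there so the DNS call can be
swapped out when experimenting) are assumptions.

```python
import ipaddress
import socket


def hostname_for(piece, lookup=socket.gethostbyaddr):
    """Hypothetical helper: map an IPv4 literal to its primary host name."""
    try:
        ipaddress.IPv4Address(piece)
    except ValueError:
        return piece  # not an IPv4 literal, so skip the DNS lookup
    try:
        return lookup(piece)[0]  # first element is the primary host name
    except OSError:
        return piece  # lookup failed; keep the address as-is
```

Each lookup can block on the network, so any real use would probably
want caching or a timeout.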

> > 3) Remove numbers from the end of domain names (experimental).
> >    www.buythis123.com => url:buythis

I added this after line 985 of tokenizer.py.
"""
                # Strip any trailing digits in one call.
                chunk = chunk.rstrip('0123456789')
"""

> > Or add a special token for domains ending with a number.

I added this after line 985 of tokenizer.py.
"""
                if chunk and chunk[-1] in '0123456789':
                    pushclue("url: ends_in_number")
"""
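
The two variants could also be combined, so that the digits are removed
and the special token is still generated.  A minimal standalone sketch
follows; the helper name strip_trailing_digits is made up for
illustration, and this combination has not been tested against any
corpus.

```python
def strip_trailing_digits(chunk):
    """Hypothetical helper: return (chunk without trailing digits, had_digits)."""
    stripped = chunk.rstrip("0123456789")
    return stripped, stripped != chunk


stripped, had_digits = strip_trailing_digits("buythis123")
print(stripped, had_digits)
```

A caller would then push "url: ends_in_number" whenever the second
value is true.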

=Tony Meyer