[Spambayes] Latest spammer trick stymied - QUESTION

T. Alexander Popiel popiel at wolfskeep.com
Mon Mar 31 16:06:06 EST 2003


In message:  <3E88CD74.4050405 at parducci.net>
             bill parducci <bill at parducci.net> writes:

>currently, does spambayes treat a URL as a single token or is it parsed 
>somehow?

URLs are parsed with the following code:

| urlsep_re = re.compile(r"[;?:@&=+,$.]")
| 
| class URLStripper(Stripper):
|     def __init__(self):
|         # The empty regexp matches anything at once.
|         Stripper.__init__(self, url_re.search, re.compile("").search)
| 
|     def tokenize(self, m):
|         proto, guts = m.groups()
|         tokens = ["proto:" + proto]
|         pushclue = tokens.append
| 
|         # Lose the trailing punctuation for casual embedding, like:
|         #     The code is at http://mystuff.org/here?  Didn't resolve.
|         # or
|         #     I found it at http://mystuff.org/there/.  Thanks!
|         assert guts
|         while guts and guts[-1] in '.:?!/':
|             guts = guts[:-1]
|         for piece in guts.split('/'):
|             for chunk in urlsep_re.split(piece):
|                 pushclue("url:" + chunk)
|         return tokens

>take the example: http://check.myspam.com/ad/junk?random=fsldkjflksj

That example would yield the tokens:

  proto:http
  url:check
  url:myspam
  url:com
  url:ad
  url:junk
  url:random
  url:fsldkjflksj
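
To check that list by hand, here is a minimal standalone sketch of the
same splitting logic. It is not the spambayes code itself: URL_RE is a
simplified stand-in I am assuming for url_re (whose definition is not
quoted above), and url_tokens is a hypothetical helper name.

```python
import re

# ASSUMPTION: simplified stand-in for url_re, which is not quoted in this
# message. It captures the protocol and everything up to whitespace or <>.
URL_RE = re.compile(r"(https?|ftp)://([^\s<>]+)")

# Same separator class as urlsep_re in the quoted code.
URLSEP_RE = re.compile(r"[;?:@&=+,$.]")


def url_tokens(text):
    """Return proto:/url: tokens for every URL found in text (hypothetical helper)."""
    tokens = []
    for proto, guts in URL_RE.findall(text):
        tokens.append("proto:" + proto)
        # Drop trailing punctuation picked up from casual embedding.
        while guts and guts[-1] in ".:?!/":
            guts = guts[:-1]
        # Split on '/' first, then on the URL separator characters.
        for piece in guts.split("/"):
            for chunk in URLSEP_RE.split(piece):
                tokens.append("url:" + chunk)
    return tokens


if __name__ == "__main__":
    for tok in url_tokens("http://check.myspam.com/ad/junk?random=fsldkjflksj"):
        print(tok)
```

Run on the example URL, this prints the eight tokens listed above.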

>it would seem that the most accurate way to evaluate this would be to 
>parse using '/' (starting after 'http://'). that would allow spambayes 
>to evaluate the domain (check.myspam.com) while giving it the ability to 
>differentiate between directories (which may map to users on ISP 
>systems: http://user.aol.com/niceguy vs. http://user.aol.com/spammer).

This already happens to some extent, though I think the composite
hostname and directory path could be handled better. To wit, I
suspect that adding the following tokens would help:

  url:myspam.com
  url:check.myspam.com
  url:check.myspam.com/ad
  url:check.myspam.com/ad/junk

I haven't tested this yet, but I further suspect that I will have
Tim Peters' problem: my results are already good enough that I won't
be able to say anything conclusive about it.
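
For concreteness, the composite tokens suggested above could be
generated along these lines. This is only a sketch under assumptions:
composite_url_tokens is a hypothetical helper, not part of spambayes,
and dropping the query string before building the path tokens is my
guess at the intent, since none of the suggested tokens include it.

```python
def composite_url_tokens(guts):
    """Build hostname-suffix and cumulative-path tokens (hypothetical helper).

    guts is the part of the URL after 'proto://'.
    """
    # ASSUMPTION: the query string is ignored for composite tokens.
    path = guts.split("?", 1)[0].rstrip(".:?!/")
    host, *dirs = path.split("/")
    labels = host.split(".")

    tokens = []
    # Hostname suffixes, from the last two labels up to the full hostname.
    for i in range(max(len(labels) - 2, 0), -1, -1):
        tokens.append("url:" + ".".join(labels[i:]))

    # Hostname plus each successively longer directory prefix.
    prefix = host
    for d in dirs:
        prefix += "/" + d
        tokens.append("url:" + prefix)
    return tokens


if __name__ == "__main__":
    for tok in composite_url_tokens("check.myspam.com/ad/junk?random=fsldkjflksj"):
        print(tok)
```

For the example URL this yields exactly the four tokens listed above;
whether they actually improve classification remains untested.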

- Alex


