[Spambayes] idea for tokenizer.crack_filename change

Neale Pickett neale@woozle.org
18 Sep 2002 16:34:31 -0700


In going over some of my spam, I was surprised to see that the following
wasn't penalized:

  ------=_NextPart_000_0039_0173A692.99A692D0
  Content-Type: application/octet-stream; name="Video.pif"
  Content-Transfer-Encoding: base64
  Content-Disposition: attachment; filename="Video.pif"

I can guarantee you that I've never been emailed a single .pif file by
an actual human being :)  But tokenizer.crack_filename only splits
filenames on path separators, so the ".pif" extension never got scored
on its own.

I suggest changing fname_sep_re to include ".", like so:

  fname_sep_re = re.compile(r'[./\\:]')
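
A rough sketch of what the change would do to the filename above.  The
"old" pattern here is an assumption (the proposed pattern minus "."),
and components() is a hypothetical helper for illustration, not the
actual crack_filename code:

```python
import re

# Assumed current pattern: the proposed one without "." (not quoted in this post).
old_sep_re = re.compile(r'[/\\:]')
# Proposed pattern from this message.
new_sep_re = re.compile(r'[./\\:]')

def components(sep_re, fname):
    """Hypothetical helper: split fname on the separators, dropping empty pieces."""
    return [c for c in sep_re.split(fname) if c]

print(components(old_sep_re, "Video.pif"))  # ['Video.pif']
print(components(new_sep_re, "Video.pif"))  # ['Video', 'pif']
```

With "." as a separator the extension becomes a component of its own,
so a classifier could learn that "pif" shows up mostly in spam.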

Unfortunately, I can't back up my suspicion that this is a good idea, as
it results in an across-the-board tie on my corpora.  Maybe someone with
larger corpora could try it out.  (Tim?)

Neale