[Spambayes] idea for tokenizer.crack_filename change

Wed, 18 Sep 2002 20:38:32 -0400

[Neale Pickett]
> In going over some of my spam, I was surprised to see that the following
> wasn't penalized:
>
>   ------=_NextPart_000_0039_0173A692.99A692D0
>   Content-Type: application/octet-stream; name="Video.pif"
>   Content-Transfer-Encoding: base64
>   Content-Disposition: attachment; filename="Video.pif"

The most likely reason for this is that you simply don't have many pif
files in your spam training set.  Check it out.

> I can guarantee you that I've never been emailed a single .pif file
> from an actual human being :)  But tokenizer.crack_filename only
> splits up filenames by path elements, so ".pif" never got scored.

Not so:

def crack_filename(fname):
    yield "fname:" + fname
    components = fname_sep_re.split(fname)
    morethan1 = len(components) > 1
    for component in components:
        if morethan1:
            yield "fname comp:" + component
        pieces = urlsep_re.split(component)
        if len(pieces) > 1:
            for piece in pieces:
                yield "fname piece:" + piece

fname_sep_re only splits on *path* components:  forward slash, backward
slash, and colon.  Each component in turn is then split on urlsep_re, which
includes a wide variety of de jure and de facto URL metacharacters.  '.' is
among them, and the pif here should be yielding a

    'fname piece:pif'

token.  If it isn't, there's some sort of bug.  The filename as a whole
should have been extracted via the

    fname = msg.get_filename()
    if fname is not None:
        for x in crack_filename(fname):
            yield 'filename:' + x

portion of crack_content_xyz().  A

    'content-disposition:attachment'

token should have been produced by the code just before that.  Similarly for
Content-Type.  As the comments say, though, Content-Transfer-Encoding is
ignored because test results showed that including it changed results in
minor ways, for both better and worse, across distinct test runs.
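
A minimal sketch for checking the tokens described above on the attachment
headers from the quoted message.  The two regexp definitions are assumptions
reconstructed from the description (path separators for fname_sep_re, URL
metacharacters including '.' for urlsep_re); the real tokenizer may define
them differently.  crack_filename and the get_filename() loop are copied from
the snippets above, and the message is built with the standard email package.

```python
import email
import re

# Assumed definitions; the post describes these patterns but does not quote them.
fname_sep_re = re.compile(r'[/\\:]')        # path separators: / \ :
urlsep_re = re.compile(r'[;?:@&=+,$.]')     # assumed URL metacharacters, '.' included

def crack_filename(fname):
    yield "fname:" + fname
    components = fname_sep_re.split(fname)
    morethan1 = len(components) > 1
    for component in components:
        if morethan1:
            yield "fname comp:" + component
        pieces = urlsep_re.split(component)
        if len(pieces) > 1:
            for piece in pieces:
                yield "fname piece:" + piece

# Headers of the attachment part from the quoted spam.
raw = (
    'Content-Type: application/octet-stream; name="Video.pif"\n'
    'Content-Transfer-Encoding: base64\n'
    'Content-Disposition: attachment; filename="Video.pif"\n'
    '\n'
)
msg = email.message_from_string(raw)

fname = msg.get_filename()
tokens = []
if fname is not None:
    for x in crack_filename(fname):
        tokens.append('filename:' + x)

for tok in tokens:
    print(tok)
```

Under these assumptions the extension shows up as 'filename:fname piece:pif'
once the 'filename:' prefix from the calling code is applied.  If no such
token appears for a real message, that would point at the bug mentioned above.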

>
> I suggest changing fname_sep_re to include ".", like so:
>
>   fname_sep_re = re.compile(r'[./\\:]')

Nope.  That's not what this regexp is for.  If you're not seeing the tokens
mentioned above, there *is* a bug here, and I'd like to know about that.
But the mere fact that a pif token didn't make it into the list of best
discriminators for this message doesn't mean anything.
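
For comparison, here is a rough sketch of what the suggested regexp would do
to the tokens for "Video.pif".  crack_filename_with is a hypothetical
parameterized copy of crack_filename, written only for this illustration, and
both the current fname_sep_re and urlsep_re are assumed definitions based on
the description above rather than quotes from the tokenizer source.

```python
import re

def crack_filename_with(fname, sep_re, piece_re):
    # Hypothetical variant of crack_filename taking the regexps as arguments.
    yield "fname:" + fname
    components = sep_re.split(fname)
    morethan1 = len(components) > 1
    for component in components:
        if morethan1:
            yield "fname comp:" + component
        pieces = piece_re.split(component)
        if len(pieces) > 1:
            for piece in pieces:
                yield "fname piece:" + piece

urlsep_re = re.compile(r'[;?:@&=+,$.]')      # assumed; includes '.'
current_sep_re = re.compile(r'[/\\:]')       # assumed from the description
proposed_sep_re = re.compile(r'[./\\:]')     # from the quoted suggestion

before = list(crack_filename_with("Video.pif", current_sep_re, urlsep_re))
after = list(crack_filename_with("Video.pif", proposed_sep_re, urlsep_re))

print(before)
print(after)
```

With these definitions the extension is tokenized either way: as
'fname piece:pif' with the path-only separator, and as 'fname comp:pif' with
the suggested one.  Only the prefix changes, and the path-versus-piece
distinction is lost.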

> Unfortunately, I can't back up my suspicion that this is a good idea, as
> it results in an across-the-board tie on my corpora.  Maybe someone with
> larger corpora could try it out.  (Tim?)

I did find value in what crack_filename did, else I wouldn't have added the
code <wink>.  I don't know how much value I got specifically from finding
pif tokens, but I noticed once that finding .exe extensions seemed valuable
on a test run.  As always, though, the idea is to tokenize everything and not
think too much <wink -- this is ironic given how much sweat has gone into
tokenizing in effective ways!>.