[spambayes-dev] saving attachments

Mon Mar 8 16:05:30 EST 2004

> -----Original Message-----
> From: spambayes-dev-bounces at python.org
> [mailto:spambayes-dev-bounces at python.org]On Behalf Of Tim Peters
> Sent: Monday, March 08, 2004 2:11 PM
> To: sethg at GoodmanAssociates.com
> Cc: SpamBayes-dev Forum
> Subject: RE: [spambayes-dev] saving attachments
>
>
> [Seth Goodman]
> > I have been accumulating a message corpus for testing that is now
> > becoming alarmingly large.  My cup doth runneth over.  AFAIK,
> > SpamBayes does nothing with attachments.  Neither the existence of
> > one nor its name, size or contents are considered.
>
> That's unique to the Outlook addin, and is because Outlook destroys the
> original MIME structure.  In other ways of using spambayes, all and only
> attachments of MIME type text/* are tokenized, and tokens are synthesized
> for all MIME sections, recording (from a comment in tokenizer.py):
>
> # Generate tokens for:
> #    Content-Type
> #        and its type= param
> #    Content-Disposition
> #        and its filename= param
> #    all the charsets
> #
> # This has huge benefit for the f-n rate, and virtually no effect on
> # the f-p rate, although it does reduce the variance of the f-p rate
> # across different training sets (really marginal msgs, like a brief
> # HTML msg saying just "unsubscribe me", are almost always tagged as
> # spam now; before they were right on the edge, and now the
> # multipart/alternative pushes them over it more consistently).
>
>
> > While most of the spam in my corpus is attachment-free, the ham has
> > lots of them and many are quite large (engineering drawing packages
> > for review).
>
> They wouldn't have MIME type text/*, so only the synthesized tokens above
> would be generated for them.
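
For readers who want to see what "synthesized tokens" might look like outside
Outlook, here is a minimal sketch using Python's standard email package.  It
is not the SpamBayes tokenizer: the function name tokenize, the sample
message, and the exact token spellings are assumptions made for illustration.
Only the general idea comes from the comment quoted above: record the
Content-Type, its type= parameter, the Content-Disposition and its filename=
parameter, and the charset of every MIME section, and tokenize the body of
text/* parts only.

```python
from email import message_from_string

# Hypothetical two-part message: a short text body plus a non-text attachment.
SAMPLE = """\
From: a@example.com
To: b@example.com
Subject: drawings
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XX"

--XX
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline

Please review.
--XX
Content-Type: application/pdf
Content-Disposition: attachment; filename="Drawing.PDF"
Content-Transfer-Encoding: base64

JVBERi0=
--XX--
"""


def tokenize(msg):
    """Return structure tokens for every MIME part, plus words of text/* parts."""
    tokens = []
    for part in msg.walk():
        tokens.append("content-type:" + part.get_content_type())
        type_param = part.get_param("type")
        if isinstance(type_param, str):
            tokens.append("content-type/type:" + type_param.lower())
        disposition = part.get_content_disposition()
        if disposition:
            tokens.append("content-disposition:" + disposition)
        filename = part.get_filename()
        if filename:
            tokens.append("filename:" + filename.lower())
        charset = part.get_content_charset()
        if charset:
            tokens.append("charset:" + charset)
        # Only text/* bodies are split into words; other payloads are skipped.
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            # Simplification: a real tokenizer would honor the declared charset.
            tokens.extend(payload.decode("latin-1").lower().split())
    return tokens


for token in tokenize(message_from_string(SAMPLE)):
    print(token)
```

In this sketch the PDF part contributes only its content-type,
content-disposition and filename tokens; its base64 payload is never looked
at, which matches the behavior described above for non-text attachments.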

Unless I am mistaken, most of these synthesized tokens are not generated
by the Outlook plug-in.  I did an experiment with a message that had an
HTML attachment.  I copied the message, deleted the attachment, marked
it as unread and filtered it again (I wasn't sure if "show spam clues"
retokenizes and reclassifies each time).  It had the same number of
total tokens and significant tokens as the copy with the attachment.
The only token that I noticed that relates to message structure was:

'content-disposition:inline'

Perhaps I missed the others.  I've zipped up the message with and
without the HTML attachment and the spam clues page for each one.  The
message headers still seem to include the multipart structure after
removing the attachment, but I'm not sure if it is still good enough for
other uses of SpamBayes.  Could someone peruse these and offer an
opinion?
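
The same with/without comparison can be sketched in Python for a tokenizer
that does see the original MIME structure.  This is an illustration only, not
SpamBayes code: mime_tokens, build and the token spellings are hypothetical,
and the messages are built with the standard email package rather than taken
from Outlook.

```python
from email.message import EmailMessage


def mime_tokens(msg):
    """Collect structure-only tokens (no body words) for every MIME part."""
    tokens = set()
    for part in msg.walk():
        tokens.add("content-type:" + part.get_content_type())
        disposition = part.get_content_disposition()
        if disposition:
            tokens.add("content-disposition:" + disposition)
        filename = part.get_filename()
        if filename:
            tokens.add("filename:" + filename.lower())
    return tokens


def build(with_attachment):
    """Build a small test message, optionally with an HTML attachment."""
    msg = EmailMessage()
    msg["Subject"] = "test"
    msg.set_content("Body text.")
    if with_attachment:
        msg.add_attachment(
            "<html><body>hi</body></html>",
            subtype="html",
            filename="Page.html",
        )
    return msg


with_tokens = mime_tokens(build(True))
without_tokens = mime_tokens(build(False))

# Tokens that exist only because of the attachment, and the reverse.
only_with = with_tokens - without_tokens
only_without = without_tokens - with_tokens

print(sorted(only_with))
print(sorted(only_without))
```

When the MIME structure is available, the two copies produce different token
sets, so identical token counts for both copies would suggest that the
attachment's structure never reached the tokenizer.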

>
> > It would reduce the size of the corpus .pst file considerably if I
> > could delete all attachments.  I have an inexpensive commercial tool
> > that can do this, however, I don't want to if anyone is considering
> > using attachments in future versions.
> >
> > FWIW, I don't see attachments as having much potential for spam
> > detection.
>
> Tests before said that their MIME types, file names, and charsets did
> help.

I stand corrected.  In that case, it's a pity that the Outlook plug-in
can't at least take advantage of those items, though if Outlook destroys
them, that's impossible unless I switch over to the proxy.

>
> > The number of tokens could easily dwarf the original message and need
> > not be related to it in any way.  The last thing we want to do is to
> > encourage spammers to tack on huge attachments,
>
> They won't -- bandwidth is a primary cost for bulk emailers, and big
> messages limit the rate at which they can send spam out.

This makes sense.  Thanks for correcting my misconceptions.

>
> > though word salad attacks have been totally ineffective on my machine
> > and for most others who mentioned them on this list.  However,
> > including the full text of actual natural language works might have
> > better luck, and I wouldn't want to be responsible for encouraging
> > that practice, i.e. really bad Karma, hate mail and death threats, so
> > I would think that continuing to ignore attachments is a good
> > strategy.
>
> The Outlook addin ignores them only because nobody has endured the pain
> necessary to try to guess what the original MIME structure might have
> been.

That does sound painful, especially since the Outlook internals are not
documented.  I guess the only realistic way around this is a proxy.  I
actually wouldn't mind using a proxy if I could keep some semblance of
the present Outlook integration, but that's a whole separate project.  I
don't know if tokenizing and classifying the messages before Outlook
mangles them has other advantages.  I can see a mountain of problems,
though, such as storing an RFC-compliant copy of the message in addition
to Outlook's .pst store and keeping track of both.  It could also cause
inconsistencies if you want to train on a message at some point in the
future when you only have the .pst version available.  It sounds like a
worse idea the more I think about it.

Thanks for the replies.

--

Seth Goodman
-------------- next part --------------
A non-text attachment was scrubbed...
Name: messages.zip
Type: application/x-zip-compressed
Size: 31067 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040308/8f94701a/messages-0001.bin