[spambayes-dev] saving attachments

Tim Peters tim.one at comcast.net
Mon Mar 15 15:34:33 EST 2004


[Mark Hammond]
> Note that another, possibly more sane way of approaching this would
> be to manually synthesize tokens with the relevant information.

Unless we cut a special back door for Outlook, and propagate that throughout
the code, the only thing the Outlook addin can deliver to the tokenizer is a
vanilla email-package message, so the only info it can communicate must live
in the email headers, the synthesized MIME armor, or the message body.  We
can put anything into the message body, but since it then *looks* like
ordinary message body text, it's subject to the limitations of any token in the
message body (split on whitespace, replaced with a "skip" token if it's "too
long", and so on).
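
To make those body-token limitations concrete, here is a minimal sketch.  It
is not the code in tokenizer.py: the function name, the length cutoff, and the
exact shape of the "skip" token are assumptions chosen for illustration.

```python
MAX_WORD_LEN = 12  # assumed cutoff for illustration; not taken from tokenizer.py


def sketch_body_tokens(text, max_len=MAX_WORD_LEN):
    """Yield whitespace-split tokens; overly long ones become a "skip" token.

    Hypothetical stand-in for body tokenization, showing why structured
    information smuggled into the body does not survive intact.
    """
    for word in text.split():
        if len(word) > max_len:
            # Only the first character and a coarse length bucket survive.
            yield "skip:%s %d" % (word[0], len(word) // 10 * 10)
        else:
            yield word.lower()


# A long "structured" word placed in the body collapses to a skip token.
body_tokens = list(sketch_body_tokens("attachment-name=invoice_2004.pif has pif"))
print(body_tokens)
```

In this sketch the 32-character first word is too long, so the file name and
its extension never reach the classifier as usable tokens.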

> ie, if it really is as simple as "is there an attachment?" (or even
> tokens for the filenames/extensions),

It's everything reachable from this part of tokenizer.py's tokenize_headers:

        # Content-{Type, Disposition} and their params, and charsets.
        # This is done for all MIME sections.
        for x in msg.walk():
            for w in crack_content_xyz(x):
                yield w

That synthesizes tokens for all the MIME sections throughout the email,
covering their

    content-type
    content-type/type
    charset
    content-disposition

params, plus fancier tokenization of any target file names.  The latter is
where a token is normally generated for, among other things, "this email
had an attachment with a .pif extension".
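
A rough sketch of that kind of per-section token synthesis, using only the
standard email package, might look like the following.  It is not
crack_content_xyz: the function name sketch_content_tokens, the token
spellings, and the sample message are all assumptions for illustration, and
the real file-name tokenization is fancier than the bare extension used here.

```python
import email
import os


def sketch_content_tokens(part):
    """Yield illustrative tokens for one MIME part (hypothetical format)."""
    yield "content-type:" + part.get_content_type()
    yield "content-type/type:" + part.get_content_maintype()
    charset = part.get_content_charset()
    if charset:
        yield "charset:" + charset
    disposition = part.get_content_disposition()
    if disposition:
        yield "content-disposition:" + disposition
    filename = part.get_filename()
    if filename:
        # Simplified: emit only the lowercased extension of the target file.
        ext = os.path.splitext(filename)[1].lower()
        if ext:
            yield "filename-ext:" + ext


# Made-up sample message: one text part and one attachment.
RAW = (
    "From: a@example.com\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    'Content-Type: text/plain; charset="us-ascii"\n'
    "\n"
    "hello\n"
    "--XYZ\n"
    "Content-Type: application/octet-stream\n"
    'Content-Disposition: attachment; filename="photo.PIF"\n'
    "\n"
    "data\n"
    "--XYZ--\n"
)

msg = email.message_from_string(RAW)
# Same shape as the loop in tokenize_headers: walk every MIME section.
tokens = [tok for part in msg.walk() for tok in sketch_content_tokens(part)]
print(tokens)
```

Note that nothing here looks at the attachment payload itself; the tokens come
entirely from the headers of each MIME section.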

> I expect it would be quite simple to implement a) without attempting
> to re-create the MIME just so the tokenizer can unpack it and
> b) without needing to extract the attachments themselves.
>
> I'm happy to offer guidance with this...

Even better, just do it <wink>.



