[Email-SIG] Does anyone need an un-parseable message <wink>?

Matthew Dixon Cowles matt at mondoinfo.com
Mon Jul 12 21:29:08 CEST 2004


Dear Tony,

> It's fairly likely, IMO, that it already includes an example of
> this, if the malformation is the one I suspect it is.

I'm sure that it is; it's just a:

Content-Type: multipart/alternative;

at the top of a message that's not MIME-y at all.

>> In playing around with SpamBayes, I've come across a spam that
>> SpamBayes can't deal with because the email module can't parse it.

> What version of SpamBayes is this?  There was a reasonably common
> malformed message like this recently, but 1.0rc2 ought to handle it
> fine.

I shouldn't have implied that it was SpamBayes's problem. In my
application, I'm creating email.Message objects and handing them over
to SpamBayes, so the parsing error stopped me before SpamBayes saw
the message.

> It's included in Python 2.4, so you could download the 2.4a1
> release and use that.  I believe that as a result, using SpamBayes
> with Python 2.4 will eliminate malformation problems (I have still
> to test that theory).  I'm also fairly sure that we (SpamBayes) can
> use the information that it generates from malformed messages to
> generate additional clues (haven't tested this yet, either).

The FeedParser does indeed parse the message correctly. (Thanks to
Anthony too for pointing to it.)
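
For anyone who wants to reproduce this, here is a minimal sketch. It is written against the modern standard-library email package rather than the 2.4a1 release discussed here, and the sample message and the parse_lenient helper are made up for illustration; they are not the actual spam.

```python
from email import errors
from email.feedparser import FeedParser

# Hypothetical sample: the header claims multipart/alternative, but there
# is no boundary parameter and the body is ordinary text.
RAW = (
    "From: sender@example.com\n"
    "Subject: example\n"
    "Content-Type: multipart/alternative;\n"
    "\n"
    "This is just plain text.\n"
)


def parse_lenient(raw):
    """Feed the whole message to a FeedParser and return the Message."""
    parser = FeedParser()
    parser.feed(raw)
    return parser.close()


msg = parse_lenient(RAW)

# The parser does not raise; it records what it found wrong instead.
print("content type:", msg.get_content_type())
print("is_multipart:", msg.is_multipart())
print("defects:", [type(d).__name__ for d in msg.defects])
print("payload:", repr(msg.get_payload()))
```

The defects list is where the parser notes the missing boundary, which is presumably the kind of information Tony has in mind as a source of extra clues.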

The problem that remains is that the multipart/alternative
content-type header is preserved and so SpamBayes's tokenizer ignores
the text in the payload. The relevant part of SpamBayes's tokenizer.py
is:

return Set(filter(lambda part: part.get_content_maintype() == 'text',
  msg.walk()))
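
To make the effect concrete, here is a rough restatement of that filter in modern Python, run against two made-up messages that differ only in their Content-Type header. The text_parts name and both sample strings are my own assumptions, not SpamBayes code.

```python
import email


def text_parts(msg):
    """Return the parts whose main content type is 'text'.

    A restatement of the filter quoted above, using a list instead of a Set.
    """
    return [part for part in msg.walk()
            if part.get_content_maintype() == "text"]


# Hypothetical samples: same body, different Content-Type header.
BROKEN = "Content-Type: multipart/alternative;\n\nBuy now!\n"
PLAIN = "Content-Type: text/plain\n\nBuy now!\n"

broken_parts = text_parts(email.message_from_string(BROKEN))
plain_parts = text_parts(email.message_from_string(PLAIN))

# The broken message has no sub-parts, so walk() visits only the root,
# and the root still reports a 'multipart' main type.
print("text parts in broken message:", len(broken_parts))
print("text parts in plain message:", len(plain_parts))
```

With the broken header the filter finds nothing to tokenize, even though the payload is sitting there as a string.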

To come back to the email package, I wonder if it would make sense
for the FeedParser to set the content-type to text/something, since
text is what the payload of the returned message actually is.

It seems possible to me that trying to fix the content-type would
turn into a mess of heuristics. On the other hand, if we suppose that
the people who are using the FeedParser are coders who are trying to
make some sense of a message that's likely to be incorrectly formed,
and we leave an incorrect content-type, we're just pushing the mess
of heuristics onto the user code.
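
As an illustration of what that user-side heuristic might look like, here is a sketch in modern Python. The coerce_to_text helper, its fallback argument, and the sample messages are all hypothetical; this is one possible rule, not something the email package or SpamBayes does.

```python
import email


def coerce_to_text(msg, fallback="text/plain"):
    """Relabel a message that claims to be multipart but has no parts.

    Assumption: if the parser could not split the body into parts, the
    payload is best treated as plain text.
    """
    if msg.get_content_maintype() == "multipart" and not msg.is_multipart():
        del msg["Content-Type"]
        msg["Content-Type"] = fallback
    return msg


# Hypothetical samples: one broken header, one well-formed multipart.
BROKEN = "Content-Type: multipart/alternative;\n\nBuy now!\n"
GOOD = (
    'Content-Type: multipart/alternative; boundary="b"\n'
    "\n"
    "--b\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--b--\n"
)

fixed = coerce_to_text(email.message_from_string(BROKEN))
untouched = coerce_to_text(email.message_from_string(GOOD))

print("fixed:", fixed.get_content_type())
print("untouched:", untouched.get_content_type())
```

This covers only the one malformation described in this thread; any other kind of wrong content-type would need its own rule, which is exactly how the mess of heuristics would grow.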

Regards,
Matt


