[spambayes-dev] bug in imap filter or in email package

Mon Aug 2 22:32:30 CEST 2004

I noticed that I had way too many Unsures so I did some investigating.
One message I looked at carefully was a pure HTML message (i.e. not a
multipart/alternative) which was encoded with base64.  Ordinarily
Spambayes should decode that and tokenize the decoded message.
However, I noticed that this message had a bunch of tokens of the form
	'skip:d 60': 0.01; 'skip:l 60': 0.01; 'skip:m 60': 0.03;
and no tokens that came from the decoded message.
But when I use the web interface of sb_imapfilter.py and tokenize a
locally saved copy of the message, I don't get these tokens, but instead
I get tokens which come from the decoded message.
I went through the steps of what sb_imapfilter.py does by hand and I
noticed a few things:

Message.asTokens is defined as follows:
    def asTokens(self):
        return tokenize(self.as_string())
and tokenize (which is really Tokenizer.tokenize) does this:
    def tokenize(self, obj):
        msg = self.get_message(obj)
        [...]
and finally, self.get_message (which is really get_message in
tokenizer.py) creates a Message instance of the argument string.

I have the feeling that this can be made more efficient by having
    def asTokens(self):
        return tokenize(self)
instead.  get_message just returns its argument if it is a Message
instance (which self in Message.asTokens is).
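
The shortcut relies on get_message passing Message instances through
untouched.  A minimal sketch of that behaviour follows.  It assumes a
simplified stand-in for the function in tokenizer.py (the real one may
handle more cases) and uses the current spelling of the email package API:

```python
import email
from email.message import Message


def get_message(obj):
    """Simplified stand-in for get_message in tokenizer.py (an assumption)."""
    if isinstance(obj, Message):
        # Already parsed: hand it back without a serialise/parse round trip.
        return obj
    # Otherwise treat the argument as message text and parse it.
    return email.message_from_string(obj)


raw = "Subject: test\n\nbody\n"
msg = email.message_from_string(raw)

same = get_message(msg)    # the very same object comes back
parsed = get_message(raw)  # a string gets parsed into a new Message
```

With this shape, passing self instead of self.as_string() would skip one
flatten-and-reparse cycle per message.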

But this is not the bug.

tokenize calls tokenize_body, which goes through the text parts (only one
here) and calls part.get_payload(decode=True) (where part is a Message
instance as returned by msg.walk()).  get_payload in email/Message.py
looks up the Content-Transfer-Encoding header, but the lookup (and here
the bug manifests itself) returns the string 'base64\r', i.e. with a
trailing \r.  Since this is not equal to any of the known encodings,
get_payload doesn't decode and just returns the base64-encoded data.
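
To make the failure mode concrete, here is a toy model of an exact-match
check on the transfer encoding.  The function decode_payload is
hypothetical and is not the email package's code; it only illustrates why
a stray \r is enough to skip decoding:

```python
import base64


def decode_payload(cte, payload):
    """Toy model of the transfer-encoding check (not the email package's code)."""
    if cte.lower() == "base64":
        return base64.b64decode(payload)
    # Unknown encoding: hand the payload back undecoded.
    return payload


html = b"<html><body>hello</body></html>"
encoded = base64.b64encode(html)

decoded_ok = decode_payload("base64", encoded)        # decoded HTML
decoded_cr = decode_payload("base64\r", encoded)      # still base64 text
decoded_stripped = decode_payload("base64\r".strip(), encoded)  # decoded HTML
```

The undecoded base64 text is then what the tokenizer sees, which would
explain the long 'skip:' tokens and the absence of tokens from the HTML.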

The question is, is this a bug in the email package in that it should
convert \r\n to \n, or is this a bug somewhere else in that the message
given to the email package should never have included those \r\n?

The message instance is created with email.Parser.Parser().parsestr(...)
where the argument to parsestr is the data as returned by the IMAP
server (which of course uses \r\n line endings).
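
One possible workaround, whichever side the bug turns out to be on, would
be to normalise the line endings before parsing.  The sketch below assumes
a hypothetical helper parse_imap_text and uses email.message_from_string
from the current email package rather than email.Parser.Parser().parsestr;
it sidesteps the question above and does not answer it:

```python
import base64
import email


def parse_imap_text(data):
    """Hypothetical helper: normalise CRLF to LF, then parse the message."""
    return email.message_from_string(data.replace("\r\n", "\n"))


html = b"<html><body>hello</body></html>"
body = base64.b64encode(html).decode("ascii")

# Message text with the \r\n line endings an IMAP server would send.
wire = (
    "Content-Type: text/html\r\n"
    "Content-Transfer-Encoding: base64\r\n"
    "\r\n"
    + body + "\r\n"
)

msg = parse_imap_text(wire)
cte = msg["Content-Transfer-Encoding"]   # no trailing \r after normalising
decoded = msg.get_payload(decode=True)   # the original HTML bytes
```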

By the way, Windows is not involved anywhere in the process, so the \r\n
aren't OS artifacts.

My Python is almost fully up-to-date, the email package is completely
up-to-date (my last cvs update was after the last change to the email
component).

--
Sjoerd Mullender <sjoerd at acm.org>