From esj at harvee.org Sat May 1 08:05:50 2004
From: esj at harvee.org (Eric S. Johansson)
Date: Sat May 1 08:05:03 2004
Subject: [Email-SIG] Maybe a bug, maybe not
Message-ID: <4093929E.7000909@harvee.org>

found a very common form of spam that triggers an exception. don't know
if you consider it a bug or not. I've enclosed a sample message and a
very simple program to trigger the bug. From my limited understanding,
the payload type is correct but somehow it is dispatched to the wrong
handler. When I was writing the test program, I also copied some of the
generator code so I could see what method was being requested etc. then
I ran into limits of my knowledge and time.

let me know if this should be a bug report. In the meantime, I need to
go figure out where to add exception traps and how to handle them

[root@harvee emailbug]# python test_bug.py
Traceback (most recent call last):
  File "test_bug.py", line 12, in ?
    print message.as_string()
  File "/usr/lib/python2.2/site-packages/email/Message.py", line 113, in as_string
    g.flatten(self, unixfrom=unixfrom)
  File "/usr/lib/python2.2/site-packages/email/Generator.py", line 103, in flatten
    self._write(msg)
  File "/usr/lib/python2.2/site-packages/email/Generator.py", line 131, in _write
    self._dispatch(msg)
  File "/usr/lib/python2.2/site-packages/email/Generator.py", line 157, in _dispatch
    meth(msg)
  File "/usr/lib/python2.2/site-packages/email/Generator.py", line 201, in _handle_text
    raise TypeError, 'string payload expected: %s' % (type(payload))
TypeError: string payload expected:
-------------- next part --------------
import sys
sys.path.insert(1,"/usr/local/camram/modules/")
sys.path.insert(1,"/usr/local/camram/web-ui/cgi-exec/")
#sys.path.insert(1,"../modules/")

UNDERSCORE = "_"

import email

handle_2 = file("broken_format.msg")
message = email.message_from_file(handle_2)
print message.as_string()

main = message.get_content_maintype()
sub = message.get_content_subtype()
specific = UNDERSCORE.join((main, sub)).replace('-', '_')
meth = getattr(message, '_handle_' + specific, None)
print meth
-------------- next part --------------
Return-Path:
Received: from red.harvee.home (red [192.168.25.1] (may be forged))
    by harvee.org (8.12.8/8.12.8) with ESMTP id i3Q59FCe002252
    for ; Mon, 26 Apr 2004 01:09:15 -0400
Received: from c-a80873d5.026-12-6f736c4.cust.bredband.no
    (c-a80873d5.026-12-6f736c4.cust.bredband.no [213.115.8.168])
    by red.harvee.home (8.11.6/8.11.6) with SMTP id i3Q59Ep29530
    for ; Mon, 26 Apr 2004 01:09:14 -0400
Received: from 194.155.112.200 by 213.115.8.168; Mon, 26 Apr 2004 03:10:23 -0300
Message-ID:
From: "Noelle Field"
Reply-To: "Noelle Field"
To: esj@inguide.com
Subject: Mort.gage ra.tes are decreasing
Date: Mon, 26 Apr 2004 08:08:23 +0200
MIME-Version: 1.0
Content-Type: text/html;
    boundary="--359297558671409"
X-Originating-IP: 66.93.191.107

----359297558671409
Content-Type: text/html;
Content-Transfer-Encoding: 7Bit

You're PRE-APPROVED for a for a F.ree Mort.gage Quote. We know that you are paying over 6% and we can reduce it down to 3% which would save you thousands of dollars.

Apply Now

dialogue twentieth humanitarian alcestis cranberry blank succubus sweater emulate irrational sunflower waller desiderata honeydew explicable volterra rook barnabas atypic powerhouse helmut marc kalmuk barberry formal golden greece machine pueblo sushi triumphant twirl froze navy olympic arcane cranford treatise vasectomy boon systemic bonaparte savannah dragon

re.move

----359297558671409--

From alex at gabuzomeu.net Mon May 3 15:45:55 2004
From: alex at gabuzomeu.net (Alexandre Ratti)
Date: Mon May 3 15:43:37 2004
Subject: [Email-SIG] Maybe a bug, maybe not
Message-ID: <4096A173.4060704@gabuzomeu.net>

[Resent because first message was bounced by the email-sig list.]

Hi Eric,

[Eric S. Johansson wrote]
> found a very common form of spam that triggers an exception. don't know
> if you consider it a bug or not. I've enclosed a sample message and a
> very simple program to trigger the bug. From my limited understanding,
> the payload type is correct but somehow it is dispatched to the wrong
> handler. When I was writing the test program, I also copied some of the
> generator code so I could see what method was being requested etc. then
> I ran into limits of my knowledge and time

[http://mail.python.org/pipermail/email-sig/2004-May/000101.html]

I also received several junk emails that crash the email package. They
are a pain because they also crash spambayes since it uses this package.
I'm copying the spambayes list since people started reporting this
problem on this list too.

I suspect that the crash occurs because these messages have multipart
boundaries but have a text content type header. This causes the
"_handle_text" method of the Generator class (in email/Generator.py) to
be called. This method expects get_payload() to return a string, which
doesn't happen since the message is multipart.

This seems similar to a known issue:

http://sourceforge.net/tracker/index.php?func=detail&aid=846938&group_id=5470&atid=105470

I'm not sure at which level in the email package this problem should be
fixed.
For now, I applied this simple fix in the Generator.py module:
replace the _handle_text method with this code:

    def _handle_text(self, msg):
        payload = msg.get_payload()
        if payload is None:
            return
        cset = msg.get_charset()
        if cset is not None:
            payload = cset.body_encode(payload)
        if not _isstring(payload):
            # Changed to handle malformed messages with a text base
            # type and a multipart content.
            if type(payload) == type([]) and msg.is_multipart():
                return self._handle_multipart(msg)
            else:
                raise TypeError, 'string payload expected: %s' % type(payload)
        if self._mangle_from_:
            payload = fcre.sub('>From ', payload)
        self._fp.write(payload)

or use this diff (against the 2.5.4 version of the email package):

--- Generator.orig.py Mon May 3 20:41:27 2004
+++ Generator.py Mon May 3 20:43:46 2004
@@ -197,7 +197,12 @@
         if cset is not None:
             payload = cset.body_encode(payload)
         if not _isstring(payload):
-            raise TypeError, 'string payload expected: %s' % type(payload)
+            # Changed to handle malformed messages with a text base
+            # type and a multipart content.
+            if type(payload) == type([]) and msg.is_multipart():
+                return self._handle_multipart(msg)
+            else:
+                raise TypeError, 'string payload expected: %s' % type(payload)
         if self._mangle_from_:
             payload = fcre.sub('>From ', payload)
         self._fp.write(payload)

This change seems to fix the problem. I fed a mailbox with several of
these messages to spambayes and they were parsed OK and flagged as spam
as expected.

Cheers.

Alexandre

From esj at harvee.org Mon May 3 16:33:41 2004
From: esj at harvee.org (Eric S. Johansson)
Date: Mon May 3 16:35:15 2004
Subject: [Email-SIG] Maybe a bug, maybe not
In-Reply-To: <4096A173.4060704@gabuzomeu.net>
References: <4096A173.4060704@gabuzomeu.net>
Message-ID: <4096ACA5.1020503@harvee.org>

Alexandre Ratti wrote:

> This change seems to fix the problem. I fed a mailbox with several of
> these messages to spambayes and they were parsed OK and flagged as spam
> as expected.

thank you very much for the fix.
I may just make it a part of the installation process for camram.

Seems to me that since the message body is of type list and not string
when it enters the _handle_text method, that the problem lies further
upstream in the dispatch method. Unfortunately, I haven't had enough
time to sit down and puzzle out how it works. I think a better solution
would be one further upstream directing the message to the appropriate
type of handler.

But, for now, a less than ideal solution that works is a far sight
better than all these exceptions popping up. Thank you again for the
effort.

---eric

From alex at gabuzomeu.net Mon May 3 16:55:18 2004
From: alex at gabuzomeu.net (Alexandre Ratti)
Date: Mon May 3 16:52:59 2004
Subject: [Email-SIG] Maybe a bug, maybe not
In-Reply-To: <4096ACA5.1020503@harvee.org>
References: <4096A173.4060704@gabuzomeu.net> <4096ACA5.1020503@harvee.org>
Message-ID: <4096B1B6.7050202@gabuzomeu.net>

Hi Eric,

Eric S. Johansson wrote:

> thank you very much for the fix.

You're welcome.

> Seems to me that since the message body is of type list and not string
> when it enters the _handle_text method, that the problem lies further
> upstream in the dispatch method. Unfortunately, I haven't had enough
> time to sit down and puzzle out how it works. I think a better solution
> would be one further upstream directing the message to the appropriate
> type of handler.

I agree, but I'm not sure how far upstream the fix should be applied.
Fixing the dispatch method should be simple. However, maybe we should
change the message parser instead so that no such message is generated
in the first place. I don't understand email formats and this package
well enough to decide which solution makes more sense.

> But, for now, a less than ideal solution that works is a far sight
> better than all these exceptions popping up.

Yes, this should do as a stopgap. I'll add data to the bug report I
quoted before so that the problem can be fixed properly in the Python
library.

Cheers.
Alexandre

From t-meyer at ihug.co.nz Mon May 3 21:29:34 2004
From: t-meyer at ihug.co.nz (Tony Meyer)
Date: Mon May 3 21:29:46 2004
Subject: [Email-SIG] Maybe a bug, maybe not
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13062699B7@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz>

> I'm copying the spambayes list
> since people started reporting this problem on this list too.

I've moved this to cc spambayes-dev instead, because we're already
discussing this there, and it'll just get lost in the bug reports on the
main list.

> I suspect that the crash occurs because these messages have
> multipart boundaries but have a text content type header.

That seems to be correct.

Two additional notes:

Skip Montanaro thinks that he had a message like this fail with Python 2.2.3
and email 2.5.3, but work fine with Python from CVS and version 2.5.5 of the
email package, so that might be worth looking into. He's going to check
whether this is the case or not.

For SpamBayes (and so presumably other apps that use the email package like
this) we're either going to (again) include a more up-to-date/patched
version of the email package, or handle the exception in our code. Adding
something like this:

>>> try:
...     print msg.as_string()
... except TypeError:
...     parts = []
...     for part in msg.get_payload():
...         parts.append(part.as_string())
...     print "\n".join(parts)
...

works for me (obviously msg is an email.Message or similar, and you change
print to whatever you want it to be). Adding this to the two spambayes
modules that need it may be simpler for us than including a patched email
package.
=Tony Meyer

From anthony at interlink.com.au Tue May 4 01:01:59 2004
From: anthony at interlink.com.au (Anthony Baxter)
Date: Tue May 4 01:02:31 2004
Subject: [Email-SIG] Maybe a bug, maybe not
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz>
Message-ID: <409723C7.6050606@interlink.com.au>

Please try out the new FeedParser in the current Python-CVS. It should
be considerably more robust than the old parser. In addition, it can be
fed the message text "in chunks" and it will do the correct thing.

http://cvs.sourceforge.net/viewcvs.py/python/python/dist/src/Lib/email/FeedParser.py

--
Anthony Baxter
It's never too late to have a happy childhood.

From alex at gabuzomeu.net Tue May 4 04:02:41 2004
From: alex at gabuzomeu.net (Alexandre Ratti)
Date: Tue May 4 04:00:21 2004
Subject: [spambayes-dev] RE: [Email-SIG] Maybe a bug, maybe not
In-Reply-To: <16535.1393.304440.918139@montanaro.dyndns.org>
References: <1ED4ECF91CDED24C8D012BCF2B034F13062699B7@its-xchg4.massey.ac.nz>
    <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz>
    <16535.1393.304440.918139@montanaro.dyndns.org>
Message-ID: <40974E21.5090905@gabuzomeu.net>

Skip Montanaro wrote:

> Better yet, I have a message (attached) which works w/ Python CVS (email
> 2.5.5), fails w/ Python 2.3.3 (email 2.5.4), and prints as expected with
> your loop-over-get_payload trick. I'm offline at the moment but will try to
> get a change checked in later this evening or tomorrow morning.
In case you need more test data, I have saved 3 messages that crashed
Spambayes and the email package (2.5.4):

http://alexandre.ratti.free.fr/python/email/

From anthony at interlink.com.au Tue May 4 06:44:41 2004
From: anthony at interlink.com.au (Anthony Baxter)
Date: Tue May 4 06:47:26 2004
Subject: [spambayes-dev] RE: [Email-SIG] Maybe a bug, maybe not
In-Reply-To: <40974E21.5090905@gabuzomeu.net>
References: <1ED4ECF91CDED24C8D012BCF2B034F13062699B7@its-xchg4.massey.ac.nz>
    <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz>
    <16535.1393.304440.918139@montanaro.dyndns.org>
    <40974E21.5090905@gabuzomeu.net>
Message-ID: <40977419.1050709@interlink.com.au>

Alexandre Ratti wrote:
>
> Skip Montanaro wrote:
>
>> Better yet, I have a message (attached) which works w/ Python CVS (email
>> 2.5.5), fails w/ Python 2.3.3 (email 2.5.4), and prints as expected with
>> your loop-over-get_payload trick. I'm offline at the moment but will
>> try to
>> get a change checked in later this evening or tomorrow morning.
>
>
> In case you need more test data, I have saved 3 messages that crashed
> Spambayes and the email package (2.5.4):
>
> http://alexandre.ratti.free.fr/python/email/

These are all correctly parsed by the current-CVS version of the
email package. Well, "correct" in this case means that they're
considered a single text/html part. The boundary tag is (correctly)
ignored.

I'll be making a release of my email-torture-test package this evening
with these tests and more.
Anthony

From anthony at interlink.com.au Tue May 4 08:01:48 2004
From: anthony at interlink.com.au (Anthony Baxter)
Date: Tue May 4 08:02:59 2004
Subject: [spambayes-dev] RE: [Email-SIG] Maybe a bug, maybe not
In-Reply-To: <40974E21.5090905@gabuzomeu.net>
References: <1ED4ECF91CDED24C8D012BCF2B034F13062699B7@its-xchg4.massey.ac.nz>
    <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz>
    <16535.1393.304440.918139@montanaro.dyndns.org>
    <40974E21.5090905@gabuzomeu.net>
Message-ID: <4097862C.40004@interlink.com.au>

Ok, I've made a tarball up of my MIME torture tests. They're
available from

http://www.interlink.com.au/anthony/tech/mime/

See the ABOUT.txt file there for more details. If you have examples
of horror that aren't already covered (in particular, anything that
breaks the current-CVS python parser!) please send them my way. If
you'd prefer I sanitise them to remove your email addresses, let me
know.

--
Anthony Baxter
It's never too late to have a happy childhood.

From barry at python.org Sun May 9 11:51:57 2004
From: barry at python.org (Barry Warsaw)
Date: Sun May 9 11:52:07 2004
Subject: [Email-SIG] New FeedParser and other updates
Message-ID: <1084117916.1735.336.camel@anthem.wooz.org>

Last night I checked in a new FeedParser, along with tons of other
updates to rip out compatibility with older Pythons. email 3.0 will
only support Python 2.3 and above.

I spent a lot of time poring over RFC 2046 and the BNF grammar and
tried to get our Parser, Message, and Generator classes to be more
compliant. The trickiest bit of all this is caused by the RFC's
assertion that the newline preceding a multipart boundary actually
belongs to the boundary, not to the body. This is tricky because the
FeedParser really wants a read-a-line-at-a-time abstraction, so by the
time you've seen the boundary, you've already consumed the preceding
newline. The hack then is to try to track where that newline lives and
clean it up afterward.
I think it's mostly going to show up in the encapsulated message body,
or in the case where the inner message is a multipart, in the epilogue.
For leading boundaries, you need to clean the newline out of the
preamble.

Another big change is that the FeedParser will not throw parsing errors
any more. Instead, if it finds a problem, it will populate a .defects
attribute on the current message. This will be a list of instances of
subclasses of the new email.Errors.MessageDefect class. The Generator
isn't currently set up to consult .defects, so that should be added.

You should also check out the BufferedSubFile abstraction in
FeedParser.py (this used to be called FeedableLumpOfText :).

In any event, the FeedParser passes all the test_email.py tests,
although some had to be modified. I also added a bunch more tests to
flex the current semantics of the .preamble and .epilogue. Everything's
checked in now so please feel free to test it yourself. The old Parser
class was rewritten to use the FeedParser, so now it's basically just a
backward compatible front-end.

I haven't thrown Anthony's huge stress test at it yet, but I hope I'll
find time soon to do so.

One thing I know won't 'work' is parsing of a nested multipart with the
same boundary on the inner and outer messages. That's because of the
BufferedSubFile abstraction, since the outer boundary matching regexp
will cause it to return EOF on the first inner boundary. The message
will get a StartBoundaryNotFound defect and the rest of the message will
be parsed as its body.

I think a better solution can be found, along the lines of unreadline()
what's read up until then, pushing a different EOF matcher onto the
BufferedSubFile and trying again. You'd still want to push a .defect
onto the message so that you knew the inner and outer messages had the
same boundary.
Also, the Generator would have to be modified to look for that defect
and calculate a different inner boundary for the generated message
(meaning it wouldn't be idempotent).

Enough babbling, enjoy.
-Barry

From anthony at interlink.com.au Tue May 11 01:08:20 2004
From: anthony at interlink.com.au (Anthony Baxter)
Date: Tue May 11 01:09:24 2004
Subject: [Email-SIG] New FeedParser and other updates
In-Reply-To: <1084117916.1735.336.camel@anthem.wooz.org>
References: <1084117916.1735.336.camel@anthem.wooz.org>
Message-ID: <40A05FC4.7070903@interlink.com.au>

Barry Warsaw wrote:
> Another big change is that the FeedParser will not throw parsing errors
> any more.

Not quite ;)

Failed to parse:
    test_spam20103
    test_spam20121
    test_spam20124
    test_spam20127
    test_spam20130
    test_spam203
    test_spam2072
    test_spam2073
    test_spam_no_trailing_nl
    test_zero-length-boundary

These all raised exceptions.

Parsed incorrectly:
    test_multi-text+ext
    test_multi-text+ext.2
    test_multiple_same_boundary
    test_spam13
    test_spam4
    test_spam5
    test_spam8
    test_spam9

These were just parsed wrong.

--
Anthony Baxter
It's never too late to have a happy childhood.

From barry at python.org Tue May 11 14:12:53 2004
From: barry at python.org (Barry Warsaw)
Date: Tue May 11 14:13:00 2004
Subject: [Email-SIG] New FeedParser and other updates
In-Reply-To: <40A05FC4.7070903@interlink.com.au>
References: <1084117916.1735.336.camel@anthem.wooz.org>
    <40A05FC4.7070903@interlink.com.au>
Message-ID: <1084299172.28228.269.camel@anthem.wooz.org>

On Tue, 2004-05-11 at 01:08, Anthony Baxter wrote:
> Barry Warsaw wrote:
> > Another big change is that the FeedParser will not throw parsing errors
> > any more.
>
> Not quite ;)
>
> Failed to parse:
>     test_spam20103
>     test_spam20121
>     test_spam20124
>     test_spam20127
>     test_spam20130
>     test_spam203
>     test_spam2072
>     test_spam2073
>     test_spam_no_trailing_nl
>     test_zero-length-boundary
>
> These all raised exceptions.

Okay, that was shallow.
> Parsed incorrectly:
>     test_multi-text+ext
>     test_multi-text+ext.2
>     test_multiple_same_boundary
>     test_spam13
>     test_spam4
>     test_spam5
>     test_spam8
>     test_spam9
>
> These were just parsed wrong.

Using email package, version 3.0a0
Out of 75 files:
FeedParser parsed 75, 67 correctly
Parsed incorrectly:
    test_multi-text+ext
    test_multi-text+ext.2
    test_multiple_same_boundary
    test_spam13
    test_spam4
    test_spam5
    test_spam8
    test_spam9

Bastard.
-Barry

P.S. I'm going to check this stuff into nondist somewhere.

From barry at python.org Tue May 11 17:06:54 2004
From: barry at python.org (Barry Warsaw)
Date: Tue May 11 17:07:01 2004
Subject: [Email-SIG] Interesting requirement of RFC 2046
Message-ID: <1084309614.28228.340.camel@anthem.wooz.org>

Here's the quote:

    5.1.2. Handling Nested Messages and Multiparts

    The "message/rfc822" subtype defined in a subsequent section of this
    document has no terminating condition other than running out of data.
    Similarly, an improperly truncated "multipart" entity may not have
    any terminating boundary marker, and can turn up operationally due to
    mail system malfunctions.

    It is essential that such entities be handled correctly when they are
    themselves imbedded inside of another "multipart" structure. MIME
    implementations are therefore required to recognize outer level
    boundary markers at ANY level of inner nesting. It is not sufficient
    to only check for the next expected marker or other terminating
    condition.

Which tells me that BufferedSubFile's implementation isn't quite right,
since it only matches the next line against the top of the EOF stack.
It should probably check against /every/ predicate on that stack,
returning EOF if any of them match. I think it should also keep the
current line in the buffer so that for nested parts, it'll return from
the recursion and possibly immediately get another EOF.

I'll have to work out a test case for that, after I get closer on the
torture tests.
-Barry

From barry at python.org Tue May 11 17:25:43 2004
From: barry at python.org (Barry Warsaw)
Date: Tue May 11 17:25:54 2004
Subject: [Email-SIG] Double boundaries
Message-ID: <1084310743.28228.350.camel@anthem.wooz.org>

One of Anthony's torture tests includes a double boundary, e.g.

    ...
    Content-type: multipart/x-foo; boundary=BBB

    ...
    --BBB
    --BBB

    ...

    --BBB--

Now, his expected output would ignore the second of the double
boundaries. The current FeedParser injects basically an empty
text/plain Message in there.

The best justification I could find for Anthony's expected output is in
the RFC 2046 BNF (see Appendix A):

    body-part := <"message" as defined in RFC 822, with all
                  header fields optional, not starting with the
                  specified dash-boundary, and with the
                  delimiter not occurring anywhere in the
                  body part. Note that the semantics of a
                  part differ from the semantics of a message,
                  as described in the text.>

    dash-boundary := "--" boundary
                     ; boundary taken from the value of
                     ; boundary parameter of the
                     ; Content-Type field.

So because the dash-boundary is the first line, this can't be a body
part.

Google didn't turn up any further discussion about what the intent of
the RFC is in cases like this. If anybody is aware of further
references on this subject, please follow up. Barring further
information, I'll try to make the FeedParser ignore subsequent double
boundaries.

-Barry

From barry at python.org Tue May 11 18:29:42 2004
From: barry at python.org (Barry Warsaw)
Date: Tue May 11 18:29:50 2004
Subject: [Email-SIG] New FeedParser and other updates
In-Reply-To: <40A05FC4.7070903@interlink.com.au>
References: <1084117916.1735.336.camel@anthem.wooz.org>
    <40A05FC4.7070903@interlink.com.au>
Message-ID: <1084314582.28228.363.camel@anthem.wooz.org>

On Tue, 2004-05-11 at 01:08, Anthony Baxter wrote:
> Barry Warsaw wrote:
> > Another big change is that the FeedParser will not throw parsing errors
> > any more.
>
> Not quite ;)

Thank you sir, may I have another:

Using email package, version 3.0a0
Out of 75 files:
Parser parsed 75, 74 correctly
Parsed incorrectly:
    test_multiple_same_boundary
FeedParser parsed 75, 74 correctly
Parsed incorrectly:
    test_multiple_same_boundary

Now, I know this one will fail. I outlined an approach that should
work, but I want to think more about what output we really want for
messages like this, in light of RFC 2046, Section 5.1.2.

-Barry

P.S. I might not add the torture test to nondist after all, since I've
added boiled down test cases for each of the previous failures. I'll
add one for this too, once we figure out what we really want the
FeedParser to do.

From anthony at interlink.com.au Wed May 12 09:49:25 2004
From: anthony at interlink.com.au (Anthony Baxter)
Date: Wed May 12 09:50:08 2004
Subject: [Email-SIG] Double boundaries
In-Reply-To: <1084310743.28228.350.camel@anthem.wooz.org>
References: <1084310743.28228.350.camel@anthem.wooz.org>
Message-ID: <40A22B65.4010009@interlink.com.au>

Barry Warsaw wrote:
> One of Anthony's torture tests includes a double boundary, e.g.
>
> ...
> Content-type: multipart/x-foo; boundary=BBB
>
> ...
> --BBB
> --BBB
>
> ...
>
> --BBB--
>
> Now, his expected output would ignore the second of the double
> boundaries. The current FeedParser injects basically an empty
> text/plain Message in there.
>
> The best justification I could find for Anthony's expected output is in
> the RFC 2046 BNF (see Appendix A):

I guess, in this case, my "expected output" is derived from pragmatism
more than the RFC. I see these all too often, and it's always from
broken mailers. I see absolutely no benefit to creating an empty
text/plain in there. My driver for a lot of these tests was "what am
I seeing in the wild?" Absolute strict conformance to the MIME rfcs,
while a good thing in theory, should be secondary to producing the
correct result.

Anthony

--
Anthony Baxter
It's never too late to have a happy childhood.
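
The double-boundary case discussed above can be tried out with a short script. This is only a minimal sketch under some assumptions: it targets a current Python 3 interpreter, where the parser is importable as email.feedparser.FeedParser (the thread itself concerns email 3.0a0 on Python 2), and the sample message text and the parse_in_chunks helper are made up for illustration rather than taken from Anthony's torture tests. It also feeds the text in small chunks, as mentioned earlier in the thread.

```python
from email.feedparser import FeedParser

# Hypothetical sample message: the boundary line appears twice in a row
# before the first (and only) body part.
RAW = (
    "MIME-Version: 1.0\n"
    "Content-Type: multipart/mixed; boundary=BBB\n"
    "\n"
    "--BBB\n"
    "--BBB\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--BBB--\n"
)


def parse_in_chunks(text, size=16):
    """Feed the text to a FeedParser a few characters at a time."""
    parser = FeedParser()
    for start in range(0, len(text), size):
        parser.feed(text[start:start + size])
    return parser.close()


msg = parse_in_chunks(RAW)
parts = msg.get_payload()

# Show how many subparts were produced and what type each one has.
print(len(parts), [part.get_content_type() for part in parts])

# Any problems the parser noticed are recorded rather than raised.
for defect in msg.defects:
    print("defect:", type(defect).__name__)
```

On the interpreters this was sketched against, the repeated boundary line does not produce an extra empty text/plain part; other versions of the package may behave differently, so treat the printed result as something to check rather than rely on.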
From barry at python.org Thu May 13 09:51:41 2004
From: barry at python.org (Barry Warsaw)
Date: Thu May 13 09:51:46 2004
Subject: [Email-SIG] Double boundaries
In-Reply-To: <40A22B65.4010009@interlink.com.au>
References: <1084310743.28228.350.camel@anthem.wooz.org>
    <40A22B65.4010009@interlink.com.au>
Message-ID: <1084456299.28228.671.camel@anthem.wooz.org>

On Wed, 2004-05-12 at 09:49, Anthony Baxter wrote:
> Barry Warsaw wrote:
> > One of Anthony's torture tests includes a double boundary, e.g.
> >
> > ...
> > Content-type: multipart/x-foo; boundary=BBB
> >
> > ...
> > --BBB
> > --BBB
> >
> > ...
> >
> > --BBB--
> >
> > Now, his expected output would ignore the second of the double
> > boundaries. The current FeedParser injects basically an empty
> > text/plain Message in there.
> >
> > The best justification I could find for Anthony's expected output is in
> > the RFC 2046 BNF (see Appendix A):
>
> I guess, in this case, my "expected output" is derived from pragmatism
> more than the RFC. I see these all too often, and it's always from
> broken mailers. I see absolutely no benefit to creating an empty
> text/plain in there. My driver for a lot of these tests was "what am
> I seeing in the wild?" Absolute strict conformance to the MIME rfcs,
> while a good thing in theory, should be secondary to producing the
> correct result.

Except that where the RFC defines things, it defines the expected
correct result. In this particular instance though, I agree with you;
the spec is far from clear about what is "correct" so your
interpretation is as good as any, good enough for me, and what today's
FeedParser implements.

I was primarily looking for additional published material to reference
in comments. Lacking that, we'll just do what we think is right (and I
agree with you here about not creating that extra empty text/plain).
-Barry

From barry at python.org Thu May 13 18:03:45 2004
From: barry at python.org (Barry Warsaw)
Date: Thu May 13 18:03:57 2004
Subject: [Email-SIG] State of the FeedParser
Message-ID: <1084485823.28228.790.camel@anthem.wooz.org>

I feel like the new FeedParser is in pretty good shape, but I wanted
to bring up two cases where what it parses is different than what
Anthony's MIME tests expect.

The two tests in question are test_multiple_same_boundary (in the email
test suite, msg_39.txt), and
test_nested-multiples-with-internal-boundary-bastard (msg_38.txt).

By my interpretation of RFC 2046, I believe that if you encounter an
outer multipart's boundary inside an inner part, you should treat that as
the inner part being truncated, with the boundary separating parts in
the outer multipart. This is implemented in the FeedParser as
BufferedSubFile.readline() testing all EOF predicates in its stack
against every line read.

Anthony's tests expect different behavior -- I believe it wants outer
boundaries in inner parts to be ignored. You can implement that in
.readline() by changing the line

    for ateof in self._eofstack[::-1]:

to

    for ateof in self._eofstack[-1::]:

Under the former, the above two tests are not parsed as Anthony's output
expects. Under the latter,
test_nested-multiples-with-internal-boundary-bastard gets parsed as
expected, but test_multiple_same_boundary still does not. For that
case, more complications will have to be added to the FeedParser.

I know Anthony will disagree with me, but I'm inclined to leave the
FeedParser as it now stands in CVS. I'm convinced it's closer to the
intent of the RFC. None of the data is lost, and the messages all get
.defects added to them, so you will at least /know/ something's wrong
with them.

If anybody is motivated to make the FeedParser agree with Anthony's
output, please generate a patch.
I'd probably want to see some kind of flag in the FeedParser that would
get propagated to BufferedSubFile, which switched between RFC-compliance
mode and 'ignore-outer-boundaries' mode. It's kind of distasteful to
have such a flag, but I really don't want to lose the current behavior.
I also don't have much more stomach for trying to add all that to the
current FeedParser.

-Barry

From barry at python.org Thu May 13 19:00:14 2004
From: barry at python.org (Barry Warsaw)
Date: Thu May 13 19:00:23 2004
Subject: [Email-SIG] Maybe a bug, maybe not
In-Reply-To: <4096A173.4060704@gabuzomeu.net>
References: <4096A173.4060704@gabuzomeu.net>
Message-ID: <1084489213.28228.826.camel@anthem.wooz.org>

I'm finally getting around to replying to this thread...

On Mon, 2004-05-03 at 15:45, Alexandre Ratti wrote:

> [Eric S. Johansson wrote]
> > found a very common form of spam that triggers an exception.

> I suspect that the crash occurs because these messages have multipart
> boundaries but have a text content type header. This causes the
> "_handle_text" method of the Generator class (in email/Generator.py) to
> be called. This method expects get_payload() to return a string, which
> doesn't happen since the message is multipart.
>
> This seems similar to a known issue:
>
> http://sourceforge.net/tracker/index.php?func=detail&aid=846938&group_id=5470&atid=105470

I think it's the same issue and I don't believe this is fixed in email
2.5.5 (Python 2.3.4). I know later on in this thread Skip says it is
fixed, but unless my release23-maint branch is messed up, I don't think
it is.

I honestly don't know if we have the time to get this into a 2.3.4
final, since 1) I probably won't have time to do it, 2) I'm not certain
what the right fix is. Basically, the parser should not be parsing such
messages such that is_multipart() would return true. That's not going
to happen for email 2.5 so perhaps your workaround is the best we can
do.
Note that email 3.0 (Python 2.4) definitely does not suffer from this
problem.

> or use this diff (against the 2.5.4 version of the email package):
>
> --- Generator.orig.py Mon May 3 20:41:27 2004
> +++ Generator.py Mon May 3 20:43:46 2004
> @@ -197,7 +197,12 @@
>          if cset is not None:
>              payload = cset.body_encode(payload)
>          if not _isstring(payload):
> -            raise TypeError, 'string payload expected: %s' % type(payload)
> +            # Changed to handle malformed messages with a text base
> +            # type and a multipart content.
> +            if type(payload) == type([]) and msg.is_multipart():
> +                return self._handle_multipart(msg)
> +            else:
> +                raise TypeError, 'string payload expected: %s' %
> type(payload)
>          if self._mangle_from_:
>              payload = fcre.sub('>From ', payload)
>          self._fp.write(payload)
>
> This change seems to fix the problem. I fed a mailbox with several of
> these messages to spambayes and they were parsed OK and flagged as spam
> as expected.

Could you please attach this patch (not cut-n-paste it) to Jason's bug
report:

http://sourceforge.net/tracker/index.php?func=detail&aid=846938&group_id=5470&atid=105470

That's so much better than letting it get buried in this thread!

-Barry

P.S. I don't think you need to test for both type(payload) == type([])
and msg.is_multipart(). Just the latter will do, since that's all
is_multipart() does. Besides, the right way to spell the former (in
Python 2.1-speak) would be isinstance(payload, ListType).

From barry at python.org Thu May 13 19:12:32 2004
From: barry at python.org (Barry Warsaw)
Date: Thu May 13 19:12:39 2004
Subject: [Email-SIG] Maybe a bug, maybe not
In-Reply-To: <4096B1B6.7050202@gabuzomeu.net>
References: <4096A173.4060704@gabuzomeu.net> <4096ACA5.1020503@harvee.org>
    <4096B1B6.7050202@gabuzomeu.net>
Message-ID: <1084489952.28228.840.camel@anthem.wooz.org>

On Mon, 2004-05-03 at 16:55, Alexandre Ratti wrote:

> I agree, but I'm not sure how far upstream the fix should be applied.
> Fixing the dispatch method should be simple.
> However, maybe we should
> change the message parser instead so that no such message is generated
> in the first place

Actually, ignore my last message. I think the right fix is for
_parsebody() to not simply test whether there's a boundary, but to also
test that container.get_content_maintype() == 'multipart'.

I will add a fix (and test case) for this to email 2.5.5 / Python 2.3.4
and close that bug report.

-Barry

From t-meyer at ihug.co.nz Thu May 13 19:12:24 2004
From: t-meyer at ihug.co.nz (Tony Meyer)
Date: Thu May 13 19:12:44 2004
Subject: [Email-SIG] Maybe a bug, maybe not
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306556E94@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677DEB@its-xchg4.massey.ac.nz>

> > This change seems to fix the problem. I fed a mailbox with
> > several of these messages to spambayes and they were parsed
> > OK and flagged as spam as expected.

FWIW, SpamBayes handles this particular problem (outside the email
package) with versions:

  * 1.0rc1 (sb_filter, maybe sb_mboxtrain?)
  * 1.0rc2/1.0 (sb_server/sb_imapfilter/sb_pop3dnd)

After SpamBayes 1.0, and once Python 2.4a1 is out (i.e. the middle of
the year), I'll try and patch SpamBayes so that it uses FeedParser if
email 3.0 is available (and falls back to current behaviour for those
using 2.2 or whatever).

=Tony Meyer

From barry at python.org Thu May 13 19:21:47 2004
From: barry at python.org (Barry Warsaw)
Date: Thu May 13 19:22:00 2004
Subject: [Email-SIG] Maybe a bug, maybe not
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz>
Message-ID: <1084490505.28228.843.camel@anthem.wooz.org>

On Mon, 2004-05-03 at 21:29, Tony Meyer wrote:

> > I'm copying the spambayes list
> > since people started reporting this problem on this list too.
>
> I've moved this to cc spambayes-dev instead, because we're already
> discussing this there, and it'll just get lost in the bug reports on the
> main list.
>
> > I suspect that the crash occurs because these messages have
> > multipart boundaries but have a text content type header.
>
> That seems to be correct.
>
> Two additional notes:
>
> Skip Montanaro thinks that he had a message like this fail with Python 2.2.3
> and email 2.5.3, but work fine with Python from CVS and version 2.5.5 of the
> email package, so that might be worth looking into. He's going to check
> whether this is the case or not.

It didn't until about 5 minutes ago, but it does (work fine) in email
2.5.5 now. Fortunately, we snuck it in under the Python 2.3.4 wire.

-Barry

From barry at python.org Thu May 13 19:25:49 2004
From: barry at python.org (Barry Warsaw)
Date: Thu May 13 19:26:00 2004
Subject: [spambayes-dev] RE: [Email-SIG] Maybe a bug, maybe not
In-Reply-To: <40974E21.5090905@gabuzomeu.net>
References: <1ED4ECF91CDED24C8D012BCF2B034F13062699B7@its-xchg4.massey.ac.nz>
 <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz>
 <16535.1393.304440.918139@montanaro.dyndns.org>
 <40974E21.5090905@gabuzomeu.net>
Message-ID: <1084490749.28228.845.camel@anthem.wooz.org>

On Tue, 2004-05-04 at 04:02, Alexandre Ratti wrote:

> In case you need more test data, I have saved 3 messages that crashed
> Spambayes and the email package (2.5.4):
>
> http://alexandre.ratti.free.fr/python/email/

None of these crash email 2.5.5 now.

-Barry

From barry at python.org Thu May 13 23:39:08 2004
From: barry at python.org (Barry Warsaw)
Date: Thu May 13 23:39:18 2004
Subject: [Email-SIG] RELEASED email 2.5.5
Message-ID: <1084505948.28228.882.camel@anthem.wooz.org>

I've tagged and uploaded email package version 2.5.5 which fixes a
number of bugs since version 2.5.4. This version matches what will be
in Python 2.3.4 final, and is therefore only useful for use with older
versions of Python.
It is compatible with Python 2.1.3 and newer.

Please see the email-sig for more information:

http://www.python.org/sigs/email-sig/

Enjoy,
-Barry

From anthony at interlink.com.au Sat May 15 05:12:35 2004
From: anthony at interlink.com.au (Anthony Baxter)
Date: Sat May 15 05:12:54 2004
Subject: [Email-SIG] State of the FeedParser
In-Reply-To: <1084485823.28228.790.camel@anthem.wooz.org>
References: <1084485823.28228.790.camel@anthem.wooz.org>
Message-ID: <40A5DF03.5000504@interlink.com.au>

Barry Warsaw wrote:
> I know Anthony will disagree with me, but I'm inclined to leave the
> FeedParser as it now stands in CVS. I'm convinced it's closer to the
> intent of the RFC. None of the data is lost, and the messages all get
> .defects added to them, so you will at least /know/ something's wrong
> with them.

Yep, I guess we just have different viewpoints - my opinion is that
we should do the utmost possible to attempt to reconstruct the message
as it was originally sent, and strict MIME compliance (in the reading
side) be damned. Note, though, that we should always make sure that
the output is completely correct, even if this means we can't do simple
tests that msg = Generator(Parser(msg)).

The current code is, however, a vast improvement on the old. The
primary goal should be, of course, that we never ever fall over
on the reading of a message. The new code manages this, as far as
I can tell -- although I have a bunch of ideas for various forms
of horribleness to try and put together. These aren't, by the way,
just because I have a nasty suspicious mind and like making Barry
cry - it's more that spammers are going out of their way to construct
bogus MIME messages that work in Outlook, but cause anti-spam filters
to choke and die. If we can anticipate these and make sure we don't
fall over, it makes Python a much stronger choice for writing
anti-(spam,virus,worm) software.
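
The "never fall over, but record defects" goal can be tried out with a
short sketch. It assumes the email package as shipped with current
Python 3, not the 2004 code under discussion, and both sample messages
are made up for illustration. The first has a text/plain header with a
multipart-looking body (the shape that triggered the original
traceback); the second claims to be multipart but never shows its
boundary.

```python
import email

# Made-up sample: text/plain header, but a body laid out like multipart.
MISLABELED = (
    "From: spammer@example.com\n"
    "Subject: mislabeled\n"
    'Content-Type: text/plain; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--XYZ--\n"
)

# Made-up sample: claims multipart, but the boundary never appears.
NO_BOUNDARY = (
    "From: spammer@example.com\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "just some text\n"
)


def parse_and_report(raw):
    """Parse raw text, flatten it again, and list the recorded defects."""
    msg = email.message_from_string(raw)
    flat = msg.as_string()  # the call that raised TypeError in the report
    defect_names = [type(d).__name__ for d in msg.defects]
    return msg, flat, defect_names


mislabeled, mislabeled_flat, mislabeled_defects = parse_and_report(MISLABELED)
broken, broken_flat, broken_defects = parse_and_report(NO_BOUNDARY)

print(mislabeled.is_multipart(), mislabeled_defects)
print(broken.is_multipart(), broken_defects)
```

Neither message should raise on parsing or flattening; the body text is
kept as a string payload, and the defects list is where a filter would
look for signs of a deliberately malformed message.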
> If anybody is motivated to make the FeedParser agree with Anthony's
> output, please generate a patch. I'd probably want to see some kind of
> flag in the FeedParser that would get propagated to BufferedSubFile,
> which switched between RFC-compliance mode and 'ignore-outer-boundaries'
> mode. It's kind of distasteful to have such a flag, but I really don't
> want to lose the current behavior. I also don't have much more stomach
> for trying to add all that to the current FeedParser.

I might have a bash at it, but it won't be any time soon.

Thanks for getting the FP done!

-- 
Anthony Baxter
It's never too late to have a happy childhood.

From barry at python.org Sat May 15 13:01:10 2004
From: barry at python.org (Barry Warsaw)
Date: Sat May 15 13:01:20 2004
Subject: [Email-SIG] State of the FeedParser
In-Reply-To: <40A5DF03.5000504@interlink.com.au>
References: <1084485823.28228.790.camel@anthem.wooz.org>
 <40A5DF03.5000504@interlink.com.au>
Message-ID: <1084640469.1350.262.camel@anthem.wooz.org>

On Sat, 2004-05-15 at 05:12, Anthony Baxter wrote:

> Yep, I guess we just have different viewpoints - my opinion is that
> we should do the utmost possible to attempt to reconstruct the message
> as it was originally sent, and strict MIME compliance (in the reading
> side) be damned. Note, though, that we should always make sure that
> the output is completely correct, even if this means we can't do simple
> tests that msg = Generator(Parser(msg)).

I totally get what you're saying. One way to approach reconstruction
though is to provide tools to slurp over the object tree after the fact.

> The current code is, however, a vast improvement on the old. The
> primary goal should be, of course, that we never ever fall over
> on the reading of a message.

Agreed! And we should never lose data, if at all possible. For
example, non-compliant data might be left in a preamble, epilogue or in
a nested body.
A post-parsing reconstructor could find that and attempt to guess what
the intent of the original sending agent was.

> The new code manages this, as far as
> I can tell -- although I have a bunch of ideas for various forms
> of horribleness to try and put together. These aren't, by the way,
> just because I have a nasty suspicious mind and like making Barry
> cry - it's more that spammers are going out of their way to construct
> bogus MIME messages that work in Outlook, but cause anti-spam filters
> to choke and die.

I support getting medieval on the FeedParser's ass. :)

> If we can anticipate these and make sure we don't
> fall over, it makes Python a much stronger choice for writing anti-
> (spam,virus,worm) software.

+1

> > If anybody is motivated to make the FeedParser agree with Anthony's
> > output, please generate a patch. I'd probably want to see some kind of
> > flag in the FeedParser that would get propagated to BufferedSubFile,
> > which switched between RFC-compliance mode and 'ignore-outer-boundaries'
> > mode. It's kind of distasteful to have such a flag, but I really don't
> > want to lose the current behavior. I also don't have much more stomach
> > for trying to add all that to the current FeedParser.
>
> I might have a bash at it, but it won't be any time soon.
>
> Thanks for getting the FP done!

Sure thing!
-Barry

From menno at netbox.biz Mon May 24 21:15:41 2004
From: menno at netbox.biz (Menno Smits)
Date: Mon May 24 21:15:54 2004
Subject: [Email-SIG] Handling large emails: DiskMessage and DiskFeedParser
Message-ID: <40B29E3D.5010004@netbox.biz>

Hi all,

FeedParser is great because it doesn't load the entire message into
memory during parsing (yes, I realise there are other reasons for
FeedParser existing too). However, once the message is parsed the
attachment bodies are still loaded entirely into memory when Message
instances are created and populated. This is a big problem for real
world environments where large messages are possible.
All available memory is consumed and the machine grinds to a halt. We
see large (40MB+) emails all the time and problems start to occur when
several of these are being processed simultaneously.

To cope with this problem I've created 2 classes DiskMessage and
DiskFeedParser (see http://oss.netboxblue.com).

DiskMessage is a simple subclass of Message that stores message payloads
to temporary files instead of RAM. Its API is compatible with the
standard Message class although to truly avoid loading the entire
message into memory you need to use some extra methods. See the source
for details.

DiskFeedParser is a hack of the current FeedParser that uses the extra
methods of DiskMessage to avoid ever loading message payloads into
memory. If anyone wants to try cleanly subclassing FeedParser for this
purpose instead of just hacking it I'd like to see the results.

Some informal tests of memory usage after parsing a 25MB email (2 large
attachments), Python 2.3.3:

                                    VSZ     RSS
Parser with Message:              31840   25088
DiskFeedParser with DiskMessage:  12372    6128

Note that these classes haven't been tested extensively but seem to
work. Any feedback would be greatly appreciated.

Regards,
Menno

-- 
Menno Smits, Senior Development Engineer
NetBox http://netbox.biz | Voice +61 500 555 357
Oxcoda http://oxcoda.com | Fax +61 500 555 358

From matt at mondoinfo.com Mon May 24 22:39:27 2004
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Mon May 24 22:41:58 2004
Subject: [Email-SIG] Handling large emails: DiskMessage and DiskFeedParser
In-Reply-To: <40B29E3D.5010004@netbox.biz>
References: <40B29E3D.5010004@netbox.biz>
Message-ID: <1085452256.44.1435@mint-julep.mondoinfo.com>

Dear Menno,

> To cope with this problem I've created 2 classes DiskMessage and
> DiskFeedParser (see http://oss.netboxblue.com).

The only feedback I have for you is: Cool!
Back when the email module was still called mimelib, I suggested to
Barry that the old email-processing modules (multifile, etc) not be
deprecated until the email module had the ability to keep large
payloads on disk, since the old modules did work that way. Since that
time, I've found it easier to throw RAM than coding time at the
problem. But I'm glad that you've been able to code what I think is
one of the last few things the module needs.

Regards,
Matt

From barry at python.org Tue May 25 06:20:56 2004
From: barry at python.org (Barry Warsaw)
Date: Tue May 25 06:21:07 2004
Subject: [Email-SIG] Handling large emails: DiskMessage and DiskFeedParser
In-Reply-To: <1085452256.44.1435@mint-julep.mondoinfo.com>
References: <40B29E3D.5010004@netbox.biz>
 <1085452256.44.1435@mint-julep.mondoinfo.com>
Message-ID: <1085480455.21753.9.camel@anthem.wooz.org>

On Mon, 2004-05-24 at 22:39, Matthew Dixon Cowles wrote:
> Dear Menno,
>
> > To cope with this problem I've created 2 classes DiskMessage and
> > DiskFeedParser (see http://oss.netboxblue.com).
>
> The only feedback I have for you is: Cool!

Indeed! Unfortunately, I don't have time right now to look at the code,
but I'm +1 on including a feature "like this" in email 3.0.

What I'd like you to think about is a better API for hooking in features
like this. I'm not sure inheritance (of either the FeedParser, but
especially of the Message object) is the right way to go. One reason
for that is that many applications already pass Message subclasses to
the parser to get some extra application-specific functionality. This
isn't going to mix well if they'd also like to get disk storage of big
attachments.

Perhaps we can start with some use cases and then propose a better API
for addressing these use cases? I'll see if I can find some time over
the weekend to think about this.
-Barry

From menno at netbox.biz Wed May 26 03:51:48 2004
From: menno at netbox.biz (Menno Smits)
Date: Wed May 26 03:52:01 2004
Subject: [Email-SIG] Handling large emails: DiskMessage and DiskFeedParser
In-Reply-To: <1085480455.21753.9.camel@anthem.wooz.org>
References: <40B29E3D.5010004@netbox.biz>
 <1085452256.44.1435@mint-julep.mondoinfo.com>
 <1085480455.21753.9.camel@anthem.wooz.org>
Message-ID: <40B44C94.1090607@netbox.biz>

Barry Warsaw wrote:

>>The only feedback I have for you is: Cool!
>
> Indeed! Unfortunately, I don't have time right now to look at the code,
> but I'm +1 on including a feature "like this" in email 3.0.

Great to hear.

> What I'd like you to think about is a better API for hooking in features
> like this. I'm not sure inheritance (of either the FeedParser, but
> especially of the Message object) is the right way to go. One reason
> for that is that many applications already pass Message subclasses to
> the parser to get some extra application-specific functionality. This
> isn't going to mix well if they'd also like to get disk storage of big
> attachments.

I definitely agree that subclassing FeedParser isn't the way to go,
especially with the way it stands now. I'm not convinced that
subclassing Message is a bad idea but I'm happy to bounce around other
ideas.

What do Message subclasses typically do? One example I can think of
(MailMan) just adds some convenience functions to simplify access to
message headers for that particular application. In this case changing
the app to subclass from DiskMessage instead of Message in order to get
disk caching would work and would be simple. Of course I'm probably
missing some more complex subclasses of Message here. Any other
examples?

> Perhaps we can start with some use cases and then propose a better API
> for addressing these use cases? I'll see if I can find some time over
> the weekend to think about this.
The main things I care about from my perspective:

- message payloads are never loaded entirely into memory
- message payloads are easily accessible for manipulation and searching
- same functionality and flexibility wrt message manipulation and
  generation as is available with the current email package

Looking forward to your thoughts.

Menno

-- 
Menno Smits, Senior Development Engineer
NetBox http://netbox.biz | Voice +61 500 555 357
Oxcoda http://oxcoda.com | Fax +61 500 555 358
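
The first two requirements in the list above (payloads kept out of
memory, yet still searchable) can be illustrated with a small
standalone sketch. This is not the DiskMessage or DiskFeedParser code
from the thread, which is not reproduced here; the DiskPayload class,
its method names and the 64 KB threshold are all hypothetical, and the
sketch relies only on the standard tempfile module.

```python
import os
import tempfile


class DiskPayload:
    """Hypothetical payload holder that spills to a temp file when large."""

    def __init__(self, max_memory=64 * 1024):
        # Stays in RAM up to max_memory bytes, then rolls over to disk.
        self._file = tempfile.SpooledTemporaryFile(max_size=max_memory)
        self.size = 0

    def write(self, chunk):
        """Append one chunk of payload bytes."""
        self._file.seek(0, os.SEEK_END)  # reads may have moved the position
        self._file.write(chunk)
        self.size += len(chunk)

    def chunks(self, chunk_size=8192):
        """Yield the payload in blocks of at most chunk_size bytes."""
        self._file.seek(0)
        while True:
            block = self._file.read(chunk_size)
            if not block:
                break
            yield block

    def contains(self, needle, chunk_size=8192):
        """Search for needle without reading the whole payload at once."""
        tail = b""
        for block in self.chunks(chunk_size):
            window = tail + block
            if needle in window:
                return True
            # Keep just enough bytes to catch a match split across blocks.
            keep = len(needle) - 1
            tail = window[-keep:] if keep > 0 else b""
        return False


payload = DiskPayload(max_memory=16)
payload.write(b"x" * 10 + b"nee")
payload.write(b"dle" + b"y" * 10)
print(payload.size, payload.contains(b"needle", chunk_size=4))
```

A parser built along these lines would call write() as body lines
arrive instead of joining them into one string, which is the part that
needs a hook in the parser itself.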