From esj at harvee.org Thu Jul 1 11:28:37 2004
From: esj at harvee.org (Eric S. Johansson)
Date: Thu Jul 1 11:29:26 2004
Subject: [Email-SIG] HeaderParser
Message-ID: <40E42DA5.8060807@harvee.org>

switched from the parser class to the headerparser class for one of my
projects and someone pointed out that the messages now have a little bit
of garbage on the end which is just replicated information from the last
line. It's about the right length to be a header which was probably
added and then removed.

My suspicion is that something/someone isn't tracking end of message
properly. I'm working on reproducing the problem in a compact form so I
can tell if it's my code or the python module that's messing up

fedora 1
Python 2.2.3

does this sound familiar to anyone?

---eric

From anthony at interlink.com.au Fri Jul 2 00:24:44 2004
From: anthony at interlink.com.au (Anthony Baxter)
Date: Fri Jul 2 00:25:26 2004
Subject: [Email-SIG] HeaderParser
In-Reply-To: <40E42DA5.8060807@harvee.org>
References: <40E42DA5.8060807@harvee.org>
Message-ID: <40E4E38C.1060700@interlink.com.au>

Eric S. Johansson wrote:
> switched from the parser class to the headerparser class for one of my
> projects and someone pointed out that the messages now have a little bit
> of garbage on the end which is just replicated information from the last
> line. It's about the right length to be a header which was probably
> added and then removed.
>
> My suspicion is that something/someone isn't tracking end of message
> properly. I'm working on reproducing the problem in a compact form so I
> can tell if it's my code or the python module that's messing up
>
> fedora 1
> Python 2.2.3
>
> does this sound familiar to anyone?

It's possible.
I don't know that anyone's going to be looking at fixing
that, though, as it's a very old version - have you tried installing the
2.5.5 version of the standalone email package from
http://www.python.org/sigs/email-sig/

Anthony

From esj at harvee.org Mon Jul 5 10:09:13 2004
From: esj at harvee.org (Eric S. Johansson)
Date: Mon Jul 5 10:07:57 2004
Subject: [Email-SIG] HeaderParser
In-Reply-To: <40E4E38C.1060700@interlink.com.au>
References: <40E42DA5.8060807@harvee.org> <40E4E38C.1060700@interlink.com.au>
Message-ID: <40E96109.2040007@harvee.org>

Anthony Baxter wrote:
> It's possible. I don't know that anyone's going to be looking at fixing
> that, though, as it's a very old version - have you tried installing the
> 2.5.5 version of the standalone email package from
> http://www.python.org/sigs/email-sig/

found the problem and it was of my own making. It had been there in my
code all along, it was just hidden by the fact that the message file
expanded every time it went through that bit of code. Changing to
parsing only headers made the file shrink occasionally. Without a file
truncate, I had "replicated" information at the end of the message.

sorry for the false alarm.

---eric

From matt at mondoinfo.com Sun Jul 11 20:02:29 2004
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Sun Jul 11 20:02:41 2004
Subject: [Email-SIG] Does anyone need an un-parseable message ?
Message-ID: <1089565861.58.2717@mint-julep.mondoinfo.com>

Please forgive me for not keeping track of just where development of
the email module is at the moment.

In playing around with SpamBayes, I've come across a spam that
SpamBayes can't deal with because the email module can't parse it.
I'm pretty sure that the problem is shallow; the message just lies
about being multipart. (Whether the solution is equally shallow is
another question.)

I expect that that's the sort of thing the new feed parser is meant
to be able to deal with.
If the new parser exists and I can try it out,
I'd be glad if someone could point me to it. If it's not ready for
testing and someone would like an example of a troublesome message,
I'd be glad to put it somewhere convenient.

Regards,
Matt

From t-meyer at ihug.co.nz Mon Jul 12 05:36:35 2004
From: t-meyer at ihug.co.nz (Tony Meyer)
Date: Mon Jul 12 05:36:45 2004
Subject: [Email-SIG] Does anyone need an un-parseable message ?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13070AE8D6@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13064C02D1@its-xchg4.massey.ac.nz>

In the on-topic bit of this message:

I believe that there is a collection of 'difficult' email messages
stored somewhere (Python CVS, perhaps?). There has been some discussion
of it recently-ish, so the archives here would have it. It's fairly
likely, IMO, that it already includes an example of this, if the
malformation is the one I suspect it is.

> In playing around with SpamBayes, I've come across a spam
> that SpamBayes can't deal with because the email module can't
> parse it.

What version of SpamBayes is this? There was a reasonably common
malformed message like this recently, but 1.0rc2 ought to handle it
fine.

> I expect that that's the sort of thing the new feed parser is
> meant to be able to deal with. If the new parser exists and I
> can try it out, I'd be glad if someone could point me to it.

It's included in Python 2.4, so you could download the 2.4a1 release
and use that. I believe that as a result, using SpamBayes with Python
2.4 will eliminate malformation problems (I have still to test that
theory). I'm also fairly sure that we (SpamBayes) can use the
information that it generates from malformed messages to generate
additional clues (haven't tested this yet, either).
=Tony Meyer

From anthony at interlink.com.au Mon Jul 12 09:59:14 2004
From: anthony at interlink.com.au (Anthony Baxter)
Date: Mon Jul 12 09:59:49 2004
Subject: [Email-SIG] Does anyone need an un-parseable message ?
In-Reply-To: <1089565861.58.2717@mint-julep.mondoinfo.com>
References: <1089565861.58.2717@mint-julep.mondoinfo.com>
Message-ID: <40F244D2.60404@interlink.com.au>

Python 2.4a1 has the new email parser. Grab it and try it out.

(You can use 'make altinstall' instead of 'make install', and it
will install python as 'python2.4' and leave 'python' pointing at
your existing python install).

From matt at mondoinfo.com Mon Jul 12 21:29:08 2004
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Mon Jul 12 21:29:39 2004
Subject: [Email-SIG] Does anyone need an un-parseable message ?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13064C02D1@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13070AE8D6@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13064C02D1@its-xchg4.massey.ac.nz>
Message-ID: <1089658575.55.530@mint-julep.mondoinfo.com>

Dear Tony,

> It's fairly likely, IMO, that it already includes an example of
> this, if the malformation is the one I suspect it is.

I'm sure that it is; it's just a:

Content-Type: multipart/alternative;

at the top of a message that's not MIME-y at all.

>> In playing around with SpamBayes, I've come across a spam that
>> SpamBayes can't deal with because the email module can't parse it.

> What version of SpamBayes is this? There was a reasonably common
> malformed message like this recently, but 1.0rc2 ought to handle it
> fine.

I shouldn't have implied that it was SpamBayes's problem. In my
application, I'm creating email.Message objects and handing them over
to SpamBayes, so the parsing error stopped me before SpamBayes saw
the message.

> It's included in Python 2.4, so you could download the 2.4a1
> release and use that.
> I believe that as a result, using SpamBayes
> with Python 2.4 will eliminate malformation problems (I have still
> to test that theory). I'm also fairly sure that we (SpamBayes) can
> use the information that it generates from malformed messages to
> generate additional clues (haven't tested this yet, either).

The FeedParser does indeed parse the message correctly. (Thanks to
Anthony too for pointing to it.) The problem that remains is that the
multipart/alternative content-type header is preserved and so
SpamBayes's tokenizer ignores the text in the payload. The relevant
part of SpamBayes's tokenizer.py is:

    return Set(filter(lambda part: part.get_content_maintype() == 'text',
                      msg.walk()))

To come back to the email package, I wonder if it would make sense
for the FeedParser to set the content-type to text/something since
that's what it is in the message that's returned.

It seems possible to me that trying to fix the content-type would
turn into a mess of heuristics. On the other hand, if we suppose that
the people who are using the FeedParser are coders who are trying to
make some sense of a message that's likely to be incorrectly formed,
and we leave an incorrect content-type, we're just pushing the mess
of heuristics onto the user code.

Regards,
Matt
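
[Archive note] The situation described in the last message can be
sketched roughly as follows. This is only an illustration and makes
several assumptions: it uses the email package from a current Python 3
rather than the 2.4a1 FeedParser discussed in the thread, the sample
message is made up, and the helper names text_parts_strict and
text_parts_lenient are hypothetical (the first mirrors the tokenizer.py
filter quoted above, the second is one possible user-side heuristic,
not anything SpamBayes or the email package is known to do).

```python
from email import message_from_string
from email.errors import NoBoundaryInMultipartDefect

# Made-up sample: the header claims multipart, the body is plain text.
RAW = (
    "From: someone@example.com\n"
    "Subject: hello\n"
    "Content-Type: multipart/alternative;\n"
    "\n"
    "Plain body text, no MIME parts at all.\n"
)


def text_parts_strict(msg):
    """Keep only parts whose declared main type is 'text'."""
    return [p for p in msg.walk() if p.get_content_maintype() == "text"]


def text_parts_lenient(msg):
    """Also keep parts that claim multipart but carry no sub-parts."""
    parts = []
    for part in msg.walk():
        maintype = part.get_content_maintype()
        lies_about_multipart = (
            maintype == "multipart" and not part.is_multipart()
        )
        if maintype == "text" or lies_about_multipart:
            parts.append(part)
    return parts


msg = message_from_string(RAW)

# The parser records the problem as a defect instead of raising.
has_defect = any(
    isinstance(d, NoBoundaryInMultipartDefect) for d in msg.defects
)

strict = text_parts_strict(msg)    # the declared type hides the body
lenient = text_parts_lenient(msg)  # the heuristic recovers it
```

The lenient helper is exactly the kind of heuristic that would end up
in user code if the parser leaves the declared content-type untouched;
checking msg.defects is another way for a caller to notice that the
header should not be trusted.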