From georg.graf at wu-wien.ac.at Fri Jul 7 10:52:40 2006
From: georg.graf at wu-wien.ac.at (Georg Graf)
Date: Fri, 7 Jul 2006 10:52:40 +0200
Subject: [Email-SIG] Problem Report for email.Utils.decode_rfc2231
Message-ID: <20060707085240.GB4724@wu-wien.ac.at>

Hi Email Gurus!

We have been running a Python milter for some 2 years now, and this is
the first time I've run into problems with the Python email parser.
And we don't have a small mail volume ;) So great work, but see my
problematic message below:

There are 2 assumptions in email.Utils.decode_rfc2231 I do not
understand.

Assumption 1: The string passed either has zero single-quotes or
more than 1.

Assumption 2: If the string has two or more single-quotes the
meaning of the parts is different. I don't know RFC 2231, so what
do I know. But still it seems funny to me.

The fact is that in this mail (generated by a recent Thunderbird
version) there is only one single quote in the filename and the
function fails; see below.

My fix would be to write "if len(parts) != 3", but I'm interested
in what you say (it fixes this specific problem, I'd say).

regards and thanks,
George

Ok, so many words, such a small problem, here is the data:

------- message -------
From nobody Thu Jul 6 14:29:50 2006
Content-Type: application/pdf;
 name*0="LZ zu AB 481284, getronics 4500247115 + 4500219041, WU, SSU's.pd";
 name*1="f"
Content-Transfer-Encoding: base64
Content-Disposition: inline;
 filename*0="LZ zu AB 481284, getronics 4500247115 + 4500219041, WU, SSU'";
 filename*1="s.pdf"

JVBERi0xLjQNCiX/////DQoxIDAgb2JqDTw8DS9UeXBlIC9DYXRhbG9nDS9QYWdlcyAzNiAw
IFINPj4NZW5kb2JqDTIgMCBvYmoNPDwNL1R5cGUgL1BhZ2UNL1BhcmVudCAzNiAwIFINL01l
ZGlhQm94IFswIDAgNTk1IDg0MV0NL1Jlc291cmNlcyA8PA0vUHJvY1NldCBbL1BERiAvVGV4
dCAvSW1hZ2VCIC9JbWFnZUMgL0ltYWdlSV0NL0NvbG9yU3BhY2UgPDwgL0NTMSA1IDAgUiAv
Q1MyIDYgMCBSID4+DS9Gb250IDw8IC9GMTcgNyAwIFIgL0YxOCAxMyAwIFIgL0YxOSAxOSAw
IFIgL0YyMCAyNSAwIFIgPj4NL1hPYmplY3QgPDwgL0ltOSAzMSAwIFIgL0ltMTMgMzMgMCBS
ID4+DT4+DS9Db250ZW50cyBbMyAwIFJdDT4+DWVuZG9iag0zIDAgb2JqDTw8IC9MZW5ndGgg
NCAwIFIgL0ZpbHRlciAvRmxhdGVEZWNvZGUgPj4Nc3RyZWFtDQp4XtVaWW8cNxJ+N6D/QM1M
a0YtdYts9i3bSmLLsZN4fURx1sm85QKCVRabF//9rSpW8Wj1jEaLYIHAgDycKX6s42Oxit1G
------- end message -------

------- traceback -------
# save mail above as rfc2231-crash.txt
>>> x = email.message_from_file(file ("rfc2231-crash.txt"))
>>> x.get_filename()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.4/email/Message.py", line 707, in get_filename
    `filename' parameter, and it is unquoted. If that header is missing
  File "/usr/local/lib/python2.4/email/Message.py", line 590, in get_param
    """
  File "/usr/local/lib/python2.4/email/Message.py", line 537, in _get_params_preserve
    name = p.strip()
  File "/usr/local/lib/python2.4/email/Utils.py", line 275, in decode_params
    charset, language, value = decode_rfc2231(EMPTYSTRING.join(value))
  File "/usr/local/lib/python2.4/email/Utils.py", line 222, in decode_rfc2231
    charset, language, s = parts
ValueError: need more than 2 values to unpack
------- end traceback -------

------- debug session -------
> /usr/local/lib/python2.4/email/Message.py(537)_get_params_preserve()
-> params = Utils.decode_params(params)
(Pdb) params
[('inline', ''), ('filename*0', '"LZ zu AB 481284, getronics 4500247115 + 4500219041, WU, SSU\'"'), ('filename*1', '"s.pdf"')]
[...]
> /usr/local/lib/python2.4/email/Utils.py(275)decode_params()
-> charset, language, value = decode_rfc2231(EMPTYSTRING.join(value))
(Pdb) value
["LZ zu AB 481284, getronics 4500247115 + 4500219041, WU, SSU'", 's.pdf']
(Pdb) EMPTYSTRING.join(value)
"LZ zu AB 481284, getronics 4500247115 + 4500219041, WU, SSU's.pdf"
[...]
> /usr/local/lib/python2.4/email/Utils.py(222)decode_rfc2231()
-> charset, language, s = parts
(Pdb) parts
['LZ zu AB 481284, getronics 4500247115 + 4500219041, WU, SSU', 's.pdf']
(Pdb) s
ValueError: 'need more than 2 values to unpack'
------- end debug session -------

------- culprit -------
def decode_rfc2231(s):
    """Decode string according to RFC 2231"""
    import urllib
    parts = s.split("'", 2)
    if len(parts) == 1:
    ^^^^^^^^^^^^^^^^^^^ ------------------ <<<<<<<<<<<<
        return None, None, urllib.unquote(s)
    charset, language, s = parts
    return charset, language, urllib.unquote(s)
------- end culprit -------

--
Vienna University of Economics and Business Administration
Central and Internet Services Section
Center for Computer Services
UNIX Server Administration
PGP/GPG Key ID: 0xa5232ad5
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 187 bytes
Desc: not available
Url : http://mail.python.org/pipermail/email-sig/attachments/20060707/f08298d5/attachment.pgp

From johannes at dds.nl Fri Jul 7 13:37:33 2006
From: johannes at dds.nl (Johannes Gijsbers)
Date: Fri, 07 Jul 2006 13:37:33 +0200
Subject: [Email-SIG] Barry: New versions of standalone email module not available on the Cheese Shop
Message-ID: <1152272253.9579.0.camel@localhost>

I was looking through the website bugs today, and somehow I stumbled on
the email sig page. It claims that all standalone versions are on the
Cheese Shop (with a broken link - I'll fix this). However, only 2.5.7 is
found, because 3.0 and 4.0 are marked as hidden in the Cheese Shop
database. I don't think this is intentional. Barry?

Johannes

From barry at python.org Fri Jul 7 14:28:33 2006
From: barry at python.org (Barry Warsaw)
Date: Fri, 7 Jul 2006 08:28:33 -0400
Subject: [Email-SIG] Barry: New versions of standalone email module not available on the Cheese Shop
In-Reply-To: <1152272253.9579.0.camel@localhost>
References: <1152272253.9579.0.camel@localhost>
Message-ID:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 7, 2006, at 7:37 AM, Johannes Gijsbers wrote:

> I was looking through the website bugs today, and somehow I stumbled on
> the email sig page. It claims that all standalone versions are on the
> Cheese Shop (with a broken link - I'll fix this). However, only 2.5.7 is
> found, because 3.0 and 4.0 are marked as hidden in the Cheese Shop
> database. I don't think this is intentional. Barry?

Definitely not. Try it now, I think I've successfully unhid both
versions. Thanks!

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (Darwin)

iQCVAwUBRK5TcnEjvBPtnXfVAQIshAP/aM/J6NBVQDNAZRrpLaT21f6NmNL5qHmI
HaAkk6gmu9yOXdOlq1xdfAd+G0m0zKgS1bSOpIk7utemiRaFcb886t2Klc5QRL0y
YWg7juLxLEFw7O2YDZ2x5wE3lUgOkmE4f6xoWX6UTzmcpCX0K4UiOEcdVLIsFvH+
iomqglXv/So=
=MKon
-----END PGP SIGNATURE-----

From johannes at dds.nl Fri Jul 7 14:49:39 2006
From: johannes at dds.nl (Johannes Gijsbers)
Date: Fri, 07 Jul 2006 14:49:39 +0200
Subject: [Email-SIG] Barry: New versions of standalone email module not available on the Cheese Shop
In-Reply-To:
References: <1152272253.9579.0.camel@localhost>
Message-ID: <1152276579.9579.3.camel@localhost>

On Fri, 2006-07-07 at 08:28 -0400, Barry Warsaw wrote:
> Definitely not. Try it now, I think I've successfully unhid both
> versions.

As far as I can tell, 3.0.1 is still hidden. 4.0 is fine now.

Johannes

From barry at python.org Fri Jul 7 15:19:28 2006
From: barry at python.org (Barry Warsaw)
Date: Fri, 7 Jul 2006 09:19:28 -0400
Subject: [Email-SIG] Barry: New versions of standalone email module not available on the Cheese Shop
In-Reply-To: <1152276579.9579.3.camel@localhost>
References: <1152272253.9579.0.camel@localhost> <1152276579.9579.3.camel@localhost>
Message-ID: <2534A858-586F-480B-ABB0-F494EF385D8F@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 7, 2006, at 8:49 AM, Johannes Gijsbers wrote:

> On Fri, 2006-07-07 at 08:28 -0400, Barry Warsaw wrote:
>> Definitely not. Try it now, I think I've successfully unhid both
>> versions.
>
> As far as I can tell, 3.0.1 is still hidden. 4.0 is fine now.

Heh, fifth time's the charm I guess. Try it now.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (Darwin)

iQCVAwUBRK5fZXEjvBPtnXfVAQJQHQQAkyvn7Flb6TlQdgdpH1R4ZqTyPVMupi0J
I7VhOR7Cw6gyLeT0xh47a1+7CritEiuRbS/68hB68eZt9BygEm99MTE0SjgX0+Oe
7n3sQf8vvf2VOD8GADSp3UJ/j5ynJcrRhxcTuymbHO4ZiMiEhjHGpNVl1ye9vMkn
gEyD1PNA1bc=
=+dCO
-----END PGP SIGNATURE-----

From msapiro at value.net Sat Jul 8 00:45:04 2006
From: msapiro at value.net (Mark Sapiro)
Date: Fri, 7 Jul 2006 15:45:04 -0700
Subject: [Email-SIG] Problem Report for email.Utils.decode_rfc2231
In-Reply-To: <20060707085240.GB4724@wu-wien.ac.at>
Message-ID:

Georg Graf wrote:
>
>There are 2 assumptions in email.Utils.decode_rfc2231 I do not
>understand.
>
>Assumption 1: The string passed either has zero single-quotes or
>more than 1.
>
>Assumption 2: If the string has two or more single-quotes the
>meaning of the parts is different. I dont know rfc2231, so what
>do I write. But still it seems funny to me.

See

The single quotes are delimiters for character-set and language fields
for extended-parameters.

>Fact is in this mail (generated by a recent thunderbird version)
>there is only one single quote in the filename and the function
>fails, see below.
>
>My fix would be to write "if len(parts) != 3", but I'm interested
>what you say (it fixes this specific problem, I'd say).

Yes it does, but it doesn't fix the general problem because there could
be two single-quote (') characters in a non-extended parameter value.
The standard is clear that the single-quote (') character is not
allowed in extended-values, but I think it is OK in non-extended values.

The issue in email.Utils is that decode_rfc2231 should only be called
for extended parameters of the form

name*=charset'language'value

or

name*0*=charset'language'value

I.e. only when the '=' is immediately preceded by '*'.

The attached patch.txt file contains a very lightly tested patch that I
think will fix the problem.

--
Mark Sapiro                            The highway is for gamblers,
San Francisco Bay Area, California     better use your sense - B. Dylan
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch.txt
Url: http://mail.python.org/pipermail/email-sig/attachments/20060707/65c68395/attachment.txt

From menno at freshfoo.com Fri Jul 14 14:49:01 2006
From: menno at freshfoo.com (Menno Smits)
Date: Fri, 14 Jul 2006 13:49:01 +0100
Subject: [Email-SIG] [RFC] Payload class
Message-ID: <44B792BD.3020600@freshfoo.com>

Hi list,

This post follows on from something I brought up here almost 2 years
ago:

http://mail.python.org/pipermail/email-sig/2004-November/000181.html

I finally have time to look at this again and have some (hopefully)
better ideas on how to accomplish custom email payload storage.

Where I work we need to be able to handle huge email messages which
don't always fit in RAM. We are using some pretty awful hacks on the
Python email libs to store payloads on disk instead of in memory.

The attached patch against 4.0a2 is a rough sketch of a relatively clean
way to solve the problem. It is rough and incomplete; I've posted it
here to get some feedback before I head too far down the path of
implementing a particular solution.
A simple demo script is also included.

The Message class has been modified so it can handle payloads that are
either a string (as now) or an instance of a new Payload class. An
iter_payload() method has been added to Message to allow streaming out
of payload data (regardless of the payload type underneath).

I've included 2 sample Payload classes. One is a simple memory store.
The other caches payloads to temporary files on disk; the payload
doesn't sit in RAM. Future payload classes could:

- use mixed memory/disk storage, storing only large payloads on disk so
  there's minimal I/O overhead for small payloads
- cache the decoded copy of a payload so that decoding is only done
  once if the decoded payload is required multiple times
- do crazy things like storing payloads across a network.

The possibilities are endless :)

More work is required on the parsing side. The FeedParser needs to
accept an optional Payload factory class and generate payloads of that
type as it parses. This should be an easy change.

The Generator class also needs to be modified. It should use the new
iter_payload() method so that payloads are not loaded into RAM if the
payload is stored in a memory-efficient way.

These changes are backwards compatible with the existing email API.

Thoughts/questions/flames?

Regards,
Menno Smits
-------------- next part --------------
A non-text attachment was scrubbed...
Name: email-payloads.patch
Type: text/x-patch
Size: 6356 bytes
Desc: not available
Url : http://mail.python.org/pipermail/email-sig/attachments/20060714/3cd657c4/attachment.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_payload.py
Type: text/x-python
Size: 360 bytes
Desc: not available
Url : http://mail.python.org/pipermail/email-sig/attachments/20060714/3cd657c4/attachment.py

From barry at python.org Tue Jul 18 01:13:16 2006
From: barry at python.org (Barry Warsaw)
Date: Mon, 17 Jul 2006 19:13:16 -0400
Subject: [Email-SIG] Problem Report for email.Utils.decode_rfc2231
In-Reply-To: <20060707085240.GB4724@wu-wien.ac.at>
References: <20060707085240.GB4724@wu-wien.ac.at>
Message-ID: <1A357BC3-9BAD-4EC6-8C2D-EACA5E79CFC2@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 7, 2006, at 4:52 AM, Georg Graf wrote:

>
> Assumption 1: The string passed either has zero single-quotes or
> more than 1.
>
> Assumption 2: If the string has two or more single-quotes the
> meaning of the parts is different. I dont know rfc2231, so what
> do I write. But still it seems funny to me.
>
> Fact is in this mail (generated by a recent thunderbird version)
> there is only one single quote in the filename and the function
> fails, see below.
>
> My fix would be to write "if len(parts) != 3", but I'm interested
> what you say (it fixes this specific problem, I'd say).

FWIW, this is now fixed in email 4.0.1 (Python 2.5 trunk), and I will
be back porting the fix to email 3.0 (Python 2.4) and email 2.5
(Python 2.3).

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (Darwin)

iQCVAwUBRLwZkXEjvBPtnXfVAQKMAAP+KPLDVINLz5Av+8tIjbjhcosfio7bsK1F
EGcqgFO1yvIQOhq7nEkCD8s4Eaks1vcvuE6/VedmDM+Knx7j4e0G+Ycf22kxRVgf
+Ll2mXn/FuZpCPINBV2LHXVWgrKcI849CcEcIgaSszgs0DPqMjmv5eiDpvWBGGVv
dNUgfjZ0LBY=
=w525
-----END PGP SIGNATURE-----

From msapiro at value.net Tue Jul 18 02:35:59 2006
From: msapiro at value.net (Mark Sapiro)
Date: Mon, 17 Jul 2006 17:35:59 -0700
Subject: [Email-SIG] Problem Report for email.Utils.decode_rfc2231
In-Reply-To: <1A357BC3-9BAD-4EC6-8C2D-EACA5E79CFC2@python.org>
Message-ID:

Barry Warsaw wrote:
>
>FWIW, this is now fixed in email 4.0.1 (Python 2.5 trunk), and I will
>be back porting the fix to email 3.0 (Python 2.4) and email 2.5
>(Python 2.3).

I just looked at the fix in SVN, and I think there is still a problem.
I don't think the RFC 2231 encodings that produce the error are
'buggy'. There are two independent things going on in RFC 2231 - the
charset and language encoding and the splitting of the parameter into
multiple pieces, e.g. filename*0=, filename*1=, etc.

The problem with email.utils.decode_params() is it doesn't distinguish
between these cases. The charset/language information is only present
if there is a * immediately preceding the = as in

filename*=charset'language'value

or

filename*0*=charset'language'value
...

in these cases, a compliant value must not contain '

However, if the parameter is

filename*0=value_part_0
filename*1=value_part_1
...

these value_parts may contain any number of ' characters and they don't
delimit charset and language information.

See my suggested patch attached to
.

--
Mark Sapiro                            The highway is for gamblers,
San Francisco Bay Area, California     better use your sense - B. Dylan

From barry at python.org Wed Jul 19 06:16:03 2006
From: barry at python.org (Barry Warsaw)
Date: Wed, 19 Jul 2006 00:16:03 -0400
Subject: [Email-SIG] Problem Report for email.Utils.decode_rfc2231
In-Reply-To:
References:
Message-ID: <9389898F-FCC5-4C63-A375-18C13A354FE7@python.org>

On Jul 17, 2006, at 8:35 PM, Mark Sapiro wrote:

> I just looked at the fix in SVN, and I think there is still a problem.
> I don't think the RFC 2231 encodings that produce the error are
> 'buggy'. There are two independent things going on in RFC 2231 - the
> charset and language encoding and the splitting of the parameter into
> multiple pieces, e.g. filename*0=, filename*1=, etc.
>
> The problem with email.utils.decode_params() is it doesn't distinguish
> between these cases. The charset/language information is only present
> if there is a * immediately preceding the = as in
>
> filename*=charset'language'value
>
> or
>
> filename*0*=charset'language'value
> ...
>
> in these cases, a compliant value must not contain '
>
> However, if the parameter is
>
> filename*0=value_part_0
> filename*1=value_part_1
> ...
>
> these value_parts may contain any number of ' characters and they
> don't delimit charset and language information.
>
> See my suggested patch attached to
> .

Mark, I think you're right in your diagnosis. I've gone back and
re-read RFC 2231 and I agree that we need to distinguish between the
two segment types, which I'll call encoded (name ends in *) and
non-encoded (no * at end of name).

The way I read the RFC however, I don't think the patch is quite right.
Specifically, you can mix encoded and non-encoded segments in an
extended parameter, like so:

filename*0*="This is%20encoded"
filename*1="This is%20not encoded"

I believe this should end up with a 'filename' parameter with a value:

This is encodedThis is%20not encoded

Further, if any segment ends in a * then the charset and language
information must appear at the front of the string, but this is
decoded after segments are %-decoded and all the segments are
concatenated together. (The RFC appears to be a bit ambiguous here,
but this is the only interpretation that makes sense to me.)

Both of these changes caused many failures in the test suite, but I
believe that's because many of the tests were incorrect. Some broke
because they were using all non-encoded segments yet were expecting
Message.get_param() to return a 3-tuple. That interface, while yucky,
seems clear that when all non-encoded segments are used, the return
value should be a simple string.

The other breakage was that non-encoded segments should not be
%-decoded, but there were many cases where they were still being
decoded.

I believe the attached patch fixes all these cases, and yet retains the
failsafe checks in decode_rfc2231() -- be liberal in what you accept,
blah, blah, blah. The patch also updates all the affected tests.

This patch is against the Python trunk. Please let me know what you
think! If it looks good, I'll commit it and back port the whole schmere
to the earlier email package versions.

-Barry
-------------- next part --------------
A non-text attachment was scrubbed...
Name: email.diff
Type: application/octet-stream
Size: 10097 bytes
Desc: not available
Url : http://mail.python.org/pipermail/email-sig/attachments/20060719/42621e24/attachment.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PGP.sig
Type: application/pgp-signature
Size: 304 bytes
Desc: This is a digitally signed message part
Url : http://mail.python.org/pipermail/email-sig/attachments/20060719/42621e24/attachment.pgp

From msapiro at value.net Wed Jul 19 07:02:22 2006
From: msapiro at value.net (Mark Sapiro)
Date: Tue, 18 Jul 2006 22:02:22 -0700
Subject: [Email-SIG] Problem Report for email.Utils.decode_rfc2231
In-Reply-To: <9389898F-FCC5-4C63-A375-18C13A354FE7@python.org>
Message-ID:

Barry Warsaw wrote:
>
>The way I read the RFC however, I don't think the patch is quite
>right. Specifically, you can mix encoded and non-encoded segments in
>an extended parameter, like so:
>
>filename*0*="This is%20encoded"
>filename*1="This is%20not encoded"
>
>I believe this should end up with a 'filename' parameter with a value:
>
>This is encodedThis is%20not encoded
>
>Further, if any segment ends in a * then the charset and language
>information must appear at the front of the string, but this is
>decoded after segments are %-decoded and all the segments are
>concatenated together. (The RFC appears to be a bit ambiguous here,
>but this is the only interpretation that makes sense to me.)

I agree with the above. I considered that there could be mixed encoded
and non-encoded segments, but I was focused only on not trying to split
the charset and language info when it wasn't there, so my patch is
over-zealous and will decode non-encoded segments in the mixed case.

I'll take a detailed look at your patch tomorrow I hope and let you
know what I think.

--
Mark Sapiro                            The highway is for gamblers,
San Francisco Bay Area, California     better use your sense - B. Dylan

From msapiro at value.net Fri Jul 21 07:11:21 2006
From: msapiro at value.net (Mark Sapiro)
Date: Thu, 20 Jul 2006 22:11:21 -0700
Subject: [Email-SIG] Problem Report for email.Utils.decode_rfc2231
In-Reply-To:
Message-ID:

Mark Sapiro wrote:
>
>I'll take a detailed look at your patch tomorrow I hope and let you
>know what I think.

Well, it's day after tomorrow, but I've looked at the patch and tried a
couple of test cases of my own, and I think it is doing the right
thing.

--
Mark Sapiro                            The highway is for gamblers,
San Francisco Bay Area, California     better use your sense - B. Dylan

From barry at python.org Fri Jul 21 16:56:50 2006
From: barry at python.org (Barry Warsaw)
Date: Fri, 21 Jul 2006 10:56:50 -0400
Subject: [Email-SIG] Problem Report for email.Utils.decode_rfc2231
In-Reply-To:
References:
Message-ID: <51FBE3E8-AF46-4BEA-AFA4-72FBFA958605@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 21, 2006, at 1:11 AM, Mark Sapiro wrote:

> Mark Sapiro wrote:
>>
>> I'll take a detailed look at your patch tomorrow I hope and let you
>> know what I think.
>
>
> Well, it's day after tomorrow, but I've looked at the patch and tried a
> couple of test cases of my own, and I think it is doing the right
> thing.

Cool. The patch is now committed to the Python trunk. I'll work on
back porting it to the other branches and then getting updated
packages out to the cheeseshop.

Cheers,
- -barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (Darwin)

iQCVAwUBRMDrMnEjvBPtnXfVAQJR8gP+Ok/9slyZFsHTdy6faxdkJVJtSDsXTFf1
cX3ZUyzQeB7yzkYmnMfP4FkxbubRqGz9CMax+fbmBfG+V4QqpiRmZ2mUswYnzVtM
RHgdNDKT7/hnCWRiUxxeeoMtTluEyhAf4NEhogAfVFLNk1ytXIfpqC4WBmyAOFtH
0aZFTg9L8fw=
=z9cP
-----END PGP SIGNATURE-----
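
For readers of this thread, the segment handling that Mark and Barry
converged on can be sketched roughly as follows. This is only an
illustration in present-day Python 3, not the patch that was committed:
the helper name join_rfc2231_segments and its input format (a list of
(name, value) pairs for one parameter, with the surrounding double
quotes already stripped) are assumptions made for the example, and
converting the value to the named charset is left out. It follows the
rule as stated in the thread: only segments whose name ends in * are
%-decoded, the segments are concatenated in section order, and the
charset'language' prefix is split off only if at least one segment was
encoded.

```python
import re
from urllib.parse import unquote

# name, optional "*N" section number, optional trailing "*" (encoded marker)
_SEGMENT = re.compile(r"^(?P<name>[^*]+)(?:\*(?P<num>\d+))?(?P<enc>\*)?$")


def join_rfc2231_segments(params):
    """Combine the segments of one extended parameter (hypothetical helper).

    Returns (charset, language, value). charset and language are None
    unless an encoded segment was seen and the joined value starts with
    a charset'language' prefix.
    """
    segments = []
    any_encoded = False
    for name, value in params:
        match = _SEGMENT.match(name)
        if match is None:
            raise ValueError("not a parameter name: %r" % (name,))
        number = int(match.group("num") or 0)
        if match.group("enc") is not None:
            # Encoded segment: undo the %XX escapes. latin-1 maps each
            # escaped byte to exactly one code point, so nothing is lost.
            value = unquote(value, encoding="latin-1")
            any_encoded = True
        # Non-encoded segments are taken literally, quotes and % included.
        segments.append((number, value))

    segments.sort(key=lambda segment: segment[0])
    joined = "".join(value for _, value in segments)

    if any_encoded:
        parts = joined.split("'", 2)
        if len(parts) == 3:
            charset, language, value = parts
            return charset, language, value
    # Failsafe: no encoded segment, or no well-formed prefix.
    return None, None, joined


# The filename from the original problem report: two non-encoded
# segments, one single quote, so nothing is split off.
report = [
    ("filename*0", "LZ zu AB 481284, getronics 4500247115 + 4500219041, WU, SSU'"),
    ("filename*1", "s.pdf"),
]
charset, language, filename = join_rfc2231_segments(report)
```

With the filename from the report, the single quote stays part of the
value and no charset or language is returned, which is the behaviour
the thread settled on for non-encoded segments.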