From nicholas.cole at gmail.com Fri Mar 2 11:44:54 2007
From: nicholas.cole at gmail.com (Nicholas Cole)
Date: Fri, 2 Mar 2007 10:44:54 +0000
Subject: [Email-SIG] Python's quoprimime encoder
Message-ID:

I don't pretend to be an expert on quoted printable encoding - far from
it, in fact - but I can't understand why Python's encoder gives
different output to encoders online.

If I do:

email.quoprimime.encode("""This is a test""")

python gives me:

'This is =\n\na test.'

Whereas online encoders, such as
http://www.motobit.com/util/quoted-printable-encoder.asp, give:

This is=20 a =20 test

(i.e. with =20 rather than just = protecting whitespace before the end
of the line)

Can anyone explain what is going on?

Best wishes,

Nicholas

From nicholas.cole at gmail.com Fri Mar 2 16:46:56 2007
From: nicholas.cole at gmail.com (Nicholas Cole)
Date: Fri, 2 Mar 2007 15:46:56 +0000
Subject: [Email-SIG] Python's quoprimime encoder
In-Reply-To:
References:
Message-ID:

> Can anyone explain what is going on?

Sorry if the above was not clear. I suppose I should have asked: why
does the python quoted printable encoder not encode the trailing
whitespace on each line - a procedure which is (AFAICS) standard.

Best wishes,

Nicholas

From msapiro at value.net Fri Mar 2 23:25:04 2007
From: msapiro at value.net (Mark Sapiro)
Date: Fri, 2 Mar 2007 14:25:04 -0800
Subject: [Email-SIG] Python's quoprimime encoder
In-Reply-To:
Message-ID:

Nicholas Cole wrote:
>
>Sorry if the above was not clear. I suppose I should have asked: why
>does the python quoted printable encoder not encode the trailing
>whitespace on each line - a procedure which is (AFAICS) standard.

First of all, it is difficult to see what your input in the OP was in
terms of what explicit trailing whitespace there was. Also, that
notwithstanding, it is clear that your 'result' didn't come exactly from
your input as your result has a period after 'test' that wasn't in the
input.
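
The question above comes down to whether a soft line break and an
explicit =20 decode to the same text. A minimal sketch of that check,
assuming Python 3's standard quopri module; the input strings are
illustrative and are not the exact input from the original post, which
is unclear:

```python
import quopri

# Two quoted-printable encodings of the same two lines, "xxx " and "yyy".
# These strings are illustrative, not the poster's exact input.
soft_break = b"xxx =\n\nyyy\n"  # literal space kept via a soft line break
escaped = b"xxx=20\nyyy\n"      # trailing space written as =20

decoded_soft = quopri.decodestring(soft_break)
decoded_escaped = quopri.decodestring(escaped)

# If both are valid encodings of the same text, this prints True.
print(decoded_soft == decoded_escaped)
```

Comparing decoded output, rather than the encoded strings, is the
useful test here because an encoder is free to choose either form.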
Anyway, the 'standard' that defines quoted-printable encoding is RFC 2045, and it is flexible in that it allows more than one way to encode the same thing. The only thing that determines whether a particular encoded string is correct is whether it decodes properly back to the original. In particular "xxx =\n\nyyy\n" and "xxx=20\nyyy\n" are two equivalent ways to quoted-printable encode the two lines "xxx " followed by "yyy". In the first case, the trailing space is literally a space character, but since the standard doesn't allow a line ending with a literal space, it is joined to a following empty line. In the second case, the trailing space is encoded as =20. One may be more common than the other, although I see lots of email encoded the former way, but they are both correct encodings and neither is shorter than the other. -- Mark Sapiro The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan From janssen at parc.com Fri Mar 9 20:35:50 2007 From: janssen at parc.com (Bill Janssen) Date: Fri, 9 Mar 2007 11:35:50 PST Subject: [Email-SIG] getting Message.as_string() to put CR-LF on end of each line? Message-ID: <07Mar9.113559pst."57996"@synergy1.parc.xerox.com> I'm trying to use the email package (4.x) as the underpinning for an IMAP server. As part of this, I need to send each line of the message from the server to the client with a CR-LF attached. My mail is in files with CR-LF as line separators. I create a Message from "open(message_path_name, 'rb')". I then call "as_string", and get the message back with only LF. Looking at the code, I see that the default header-printing methods in Generator just call "print" (which strikes me as a bad idea), and get the default newline handling, which the body-printing calls file.write(), and gets whatever get_payload() returned, and I don't see why this doesn't preserve CR-LF pairs in the file. 
So: how can Generator.flatten(), or as_string(), be fixed to specify CR-LF line endings for each line? Bill From janssen at parc.com Fri Mar 9 20:44:53 2007 From: janssen at parc.com (Bill Janssen) Date: Fri, 9 Mar 2007 11:44:53 PST Subject: [Email-SIG] getting Message.as_string() to put CR-LF on end of each line? In-Reply-To: <07Mar9.113559pst."57996"@synergy1.parc.xerox.com> References: <07Mar9.113559pst."57996"@synergy1.parc.xerox.com> Message-ID: <07Mar9.114454pst."57996"@synergy1.parc.xerox.com> > which > the body-printing calls file.write(), and gets whatever get_payload() > returned, and I don't see why this doesn't preserve CR-LF pairs in the > file. Apologies. I meant to say, "while" the body-printing code... I see that CR-LF pairs are in fact being preserved in the body, they are just being stripped in the headers, due to the use of "print" in the Generator code. Bill From barry at python.org Fri Mar 9 21:27:25 2007 From: barry at python.org (Barry Warsaw) Date: Fri, 9 Mar 2007 15:27:25 -0500 Subject: [Email-SIG] getting Message.as_string() to put CR-LF on end of each line? In-Reply-To: <07Mar9.114454pst."57996"@synergy1.parc.xerox.com> References: <07Mar9.113559pst."57996"@synergy1.parc.xerox.com> <07Mar9.114454pst."57996"@synergy1.parc.xerox.com> Message-ID: <20237583-12E3-4A0B-A94A-F89032E115ED@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Mar 9, 2007, at 2:44 PM, Bill Janssen wrote: > I see that CR-LF pairs are in fact being preserved in the body, they > are just being stripped in the headers, due to the use of "print" in > the Generator code. We have two uses cases we need to support: native line endings and network line endings. I think it would be a good idea to preserve or otherwise be able to specify the line endings for headers, so that indeed you'd get CRLF for direct output to a network protocol sink. It's been a long standing request. 
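
For readers on later releases, a hedged sketch of asking for CR-LF at
serialization time. It assumes Python 3's email.policy module, which
did not exist in the email 4.x package discussed in this thread; the
address and message text are made up for illustration:

```python
from email import message_from_string, policy

# Illustrative message with bare LF line endings.
raw = "From: a@example.com\nSubject: test\n\nline one\nline two\n"
msg = message_from_string(raw, policy=policy.default)

# policy.SMTP is the default policy with linesep set to "\r\n", so
# headers and body lines are both written with network line endings.
wire = msg.as_string(policy=policy.SMTP)
print(repr(wire))
```

The line ending is chosen when the message is flattened, so the same
parsed message can still be written with native line endings by
passing a different policy.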
- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Darwin)

iQCVAwUBRfHDLXEjvBPtnXfVAQIxgwP8Dvjqp591vgqO0uV09TiOVAzobrI1XbtN
ga9DpFFp3GzZLAK+aI/pIm2oCtQIvlrFQXS3ScFdcZrdUOFAvWyMvcqLRbVc2nMD
Z50erFLX0TZOt41VHhRz9ldk7pc1ZMkYwk1u6jP9ZDpgoM1XROM174zWjAKJPYWF
Moa4MBKJ6qA=
=yagV
-----END PGP SIGNATURE-----

From jasper at vs19.net Tue Mar 27 01:39:13 2007
From: jasper at vs19.net (Jasper Spaans)
Date: Tue, 27 Mar 2007 01:39:13 +0200
Subject: [Email-SIG] email.header.decode_header eats my spaces
Message-ID:

Hello SIG,

Today I was playing around with the decode_header function of the
email.header module, and it is eating my spaces.
Some people have filed bugs about this [1] [2] and have proposed the
following patch, which to me seems to be obviously correct:

etchy:/usr/lib/python2.5/email# diff -u header.py{~,}
--- header.py~ 2007-03-27 01:10:31.000000000 +0200
+++ header.py 2007-03-27 01:10:31.000000000 +0200
@@ -77,7 +77,7 @@
             continue
         parts = ecre.split(line)
         while parts:
-            unenc = parts.pop(0).strip()
+            unenc = parts.pop(0).rstrip()
             if unenc:
                 # Should we continue a long line?
                 if decoded and decoded[-1][1] is None:

(Doing a test-run on a corpus of about 23k messages posted to a public
mailing list with these two variants shows that several (imho) bugs
disappear and no new bugs appear; typical example:

-RenéPfeiffer <> vs =?utf-8?B?UmVuw6k=?= Pfeiffer <>
+René Pfeiffer <> vs =?utf-8?B?UmVuw6k=?= Pfeiffer <>
)

Is there any reason for this not to be incorporated into the package?

Cheers,
Jasper

[1] http://aspn.activestate.com/ASPN/Mail/Message/mimelib-devel/1292338
[2] http://sourceforge.net/tracker/index.php?func=detail&aid=1467619&group_id=5470&atid=105470
--
Jasper Spaans http://jsp.vs19.net/
This line was last modified 0 seconds ago.
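
A minimal sketch of decoding the header from the example without
joining the decoded pieces by hand. It assumes Python 3, where str()
takes the place of the unicode() call used on Python 2; the header
value is the one from the test run above:

```python
from email.header import decode_header, make_header

raw = "=?utf-8?B?UmVuw6k=?= Pfeiffer <>"

# make_header() rebuilds a Header from the (text, charset) pairs that
# decode_header() returns, so spacing between encoded and unencoded
# parts is decided by the package rather than by manual concatenation.
name = str(make_header(decode_header(raw)))
print(name)
```

Going through make_header() avoids depending on whether the individual
tuples returned by decode_header() keep their surrounding whitespace.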
From tkikuchi at is.kochi-u.ac.jp Tue Mar 27 02:40:30 2007
From: tkikuchi at is.kochi-u.ac.jp (Tokio Kikuchi)
Date: Tue, 27 Mar 2007 09:40:30 +0900
Subject: [Email-SIG] email.header.decode_header eats my spaces
In-Reply-To:
References:
Message-ID: <460867FE.9050506@is.kochi-u.ac.jp>

Jasper Spaans wrote:
> Hello SIG,
>
> Today I was playing around with the decode_header function of the
> email.header module, and it is eating my spaces.
> Some people have filed bugs about this [1] [2] and have proposed the
> following patch, which to me seems to be obviously correct:
>
> etchy:/usr/lib/python2.5/email# diff -u header.py{~,}
> --- header.py~ 2007-03-27 01:10:31.000000000 +0200
> +++ header.py 2007-03-27 01:10:31.000000000 +0200
> @@ -77,7 +77,7 @@
>              continue
>          parts = ecre.split(line)
>          while parts:
> -            unenc = parts.pop(0).strip()
> +            unenc = parts.pop(0).rstrip()
>              if unenc:
>                  # Should we continue a long line?
>                  if decoded and decoded[-1][1] is None:
>
> (Doing a test-run on a corpus of about 23k messages posted to a
> public mailing list with these two variants shows that several (imho)
> bugs disappear and no new bugs appear; typical example:
> -RenéPfeiffer <> vs =?utf-8?B?UmVuw6k=?= Pfeiffer <>
> +René Pfeiffer <> vs =?utf-8?B?UmVuw6k=?= Pfeiffer <>
> )

What program makes this output?

Python 2.5 (r25:51908, Feb 7 2007, 19:53:49)
[GCC 3.3.5 (Debian 1:3.3.5-13)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import email.header
>>> t = email.header.decode_header('=?utf-8?B?UmVuw6k=?= Pfeiffer <>')
>>> t
[('Ren\xc3\xa9', 'utf-8'), ('Pfeiffer <>', None)]
>>> h = email.header.make_header(t)
>>> unicode(h)
u'Ren\xe9 Pfeiffer <>'
>>> unicode(h).encode('iso-8859-1')
'Ren\xe9 Pfeiffer <>'

Use email.header module to re-construct your header from the decoded
tuple list.
HTH BTW, there is another space-eating problem in the current email package and a patch is in the tracker: http://sourceforge.net/tracker/index.php?func=detail&aid=1681333&group_id=5470&atid=305470 -- Tokio Kikuchi, tkikuchi at is.kochi-u.ac.jp http://weather.is.kochi-u.ac.jp/ From barry at python.org Tue Mar 27 04:58:12 2007 From: barry at python.org (Barry Warsaw) Date: Mon, 26 Mar 2007 22:58:12 -0400 Subject: [Email-SIG] email.header.decode_header eats my spaces In-Reply-To: References: Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Mar 26, 2007, at 7:39 PM, Jasper Spaans wrote: > Today I was playing around with the decode_header function of the > email.header module, and it is eating my spaces. > Some people have filed bugs about this [1] [2] and have proposed the > following patch, which to me seems to be obviously correct: > > Is there any reason for this not to be incorporated into the package? Have you run the test suite with this change? I've been working on a branch since Pycon, which tries to fix this and pass all the unit tests. ISTR that this patch causes several tests to fail. However, resolving the tests was like pulling a thread from a sweater. It now leads me to think that we really aren't true to RFC 2822 wrt folding whitespace. However, I haven't been able to fix that without breaking some current assumptions in the email package. I've been trying to get my branch to a point where it passes all the tests before I posted a message here, but I haven't had a chance to finish it yet. I'll try to get to that and follow up here in a day or so. 
- -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (Darwin) iQCVAwUBRgiIRXEjvBPtnXfVAQLWeQP/XbjReQ6fODoRTnkBe2DDGl6IUQLPpcSg WHrzh39X9oi4VDnGCECUPYHaM8mIBHTlYD4eVeMKAHnh0Wyo+Gq5TUEOu/44YCgp 1aJugVzrVMuUmgBR8IjKG4oumnSuuvHLLI0q5j6NlEKPk33vNwcd4gES791I0XFc XOC+fS+XL5I= =5dPZ -----END PGP SIGNATURE----- From tkikuchi at is.kochi-u.ac.jp Tue Mar 27 09:06:03 2007 From: tkikuchi at is.kochi-u.ac.jp (Tokio Kikuchi) Date: Tue, 27 Mar 2007 16:06:03 +0900 Subject: [Email-SIG] email.header.decode_header eats my spaces In-Reply-To: References: Message-ID: <4608C25B.7030608@is.kochi-u.ac.jp> Hi, Barry Warsaw wrote: > >> Today I was playing around with the decode_header function of the >> email.header module, and it is eating my spaces. >> Some people have filed bugs about this [1] [2] and have proposed the >> following patch, which to me seems to be obviously correct: >> >> Is there any reason for this not to be incorporated into the package? > > Have you run the test suite with this change? Don't commit this patch in. As I've written earlier, header manipulation should be done through email.header module and current code is not broken with regard this person's example. > > I've been working on a branch since Pycon, which tries to fix this > and pass all the unit tests. ISTR that this patch causes several > tests to fail. However, resolving the tests was like pulling a > thread from a sweater. It now leads me to think that we really > aren't true to RFC 2822 wrt folding whitespace. However, I haven't > been able to fix that without breaking some current assumptions in > the email package. I've been trying to get my branch to a point > where it passes all the tests before I posted a message here, but I > haven't had a chance to finish it yet. In my opinion (may not be true to RFC2822 in detail), ascii strings in header object should be strip()ped and separated by FWS (including '\r\n ' or '\r\n\t'). If you like to see [('Hi! 
', None), ('there.', None)] to be represented by 'Hi!  there.' (note two
spaces between '!' and 't'), you may have to use workaround like:

>>> h = email.Header.Header('Hi! ', 'iso-8859-1')
>>> h.append('there.', 'us-ascii')
>>> print h
=?iso-8859-1?q?Hi!_?= there.
>>> print str(unicode(h))
Hi!  there.

Use of space between the encoded/unencoded words should be:

lastcs \ nextcs | ascii | other |
ascii           | sp    | sp    |
other           | sp    | nosp  |

Current code for generating unicode string breaks this for ascii/ascii
case, see:

>>> h = email.Header.Header('Hi!', 'us-ascii')
>>> h.append('there.', 'us-ascii')
>>> print h
Hi! there.
>>> unicode(h)
u'Hi!there.'

Cheers,
--
Tokio Kikuchi, tkikuchi at is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/

From barry at python.org Tue Mar 27 15:39:57 2007
From: barry at python.org (Barry Warsaw)
Date: Tue, 27 Mar 2007 09:39:57 -0400
Subject: [Email-SIG] email.header.decode_header eats my spaces
In-Reply-To: <4608C25B.7030608@is.kochi-u.ac.jp>
References: <4608C25B.7030608@is.kochi-u.ac.jp>
Message-ID: <078C328A-263F-4221-A6B1-58D4AB4C6C22@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Tokio,

On Mar 27, 2007, at 3:06 AM, Tokio Kikuchi wrote:

> In my opinion (may not be true to RFC2822 in detail), ascii strings
> in header object should be strip()ped and separated by FWS
> (including '\r\n ' or '\r\n\t').

I actually think we should be doing the opposite, namely preserving
any FWS in the existing text and /not/ substituting continuation_ws
for it when we re-break the headers. This is the only way to
maintain idempotency short of saving the original header intact (but
then memory usage doubles). continuation_ws should be used only when
we're forced to break at a non-existing FWS location, e.g. if we've
split a non-ascii header or at a non-whitespace header-specific
syntactic break. In the case of RFC 2047 headers, the FWS gets
consumed anyway so it isn't idempotentially (?!) significant.

That's where my patch is headed anyway.
I have one test case failure left to resolve. It's a bear, but when I get that working I'll submit a patch for review. My gut is telling me not to apply this to Python 2.5 but only Python 2.6 since enough of the semantics of continuation_ws and folding has changed that it isn't appropriate for a patch release. Cheers, - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (Darwin) iQCVAwUBRgkernEjvBPtnXfVAQILlQP/ehJ6raVYLZwd1Pb8ZIuq2+KkGM04JsDd WwHw1mbfijHaft00bKa7j7dQK9XewicDW9cAuOEQ1SgzfOCOWO+EodHdGbTq3he1 rNlaRQZ2MFaCmQLWYwbwv2zkogu0m9tpSRupwlcdoOzYMNJb0KhLQiVb3GCMHx45 I2IgJdkFV2s= =XzA4 -----END PGP SIGNATURE----- From jasper at vs19.net Tue Mar 27 16:07:51 2007 From: jasper at vs19.net (Jasper Spaans) Date: Tue, 27 Mar 2007 16:07:51 +0200 Subject: [Email-SIG] email.header.decode_header eats my spaces In-Reply-To: <460867FE.9050506@is.kochi-u.ac.jp> References: <460867FE.9050506@is.kochi-u.ac.jp> Message-ID: <8D2C382B-F8FC-4DFF-A197-4EA5D8847F4C@vs19.net> Op 27-mrt-2007, om 2:40 heeft Tokio Kikuchi het volgende geschreven: > What program make this output ? > > Python 2.5 (r25:51908, Feb 7 2007, 19:53:49) > [GCC 3.3.5 (Debian 1:3.3.5-13)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> import email.header > >>> t = email.header.decode_header('=?utf-8?B?UmVuw6k=?= Pfeiffer <>') > >>> t > [('Ren\xc3\xa9', 'utf-8'), ('Pfeiffer <>', None)] > >>> h = email.header.make_header(t) > >>> unicode(h) > u'Ren\xe9 Pfeiffer <>' > >>> unicode(h).encode('iso-8859-1') > 'Ren\xe9 Pfeiffer <>' > > Use email.header module to re-construct your header from the > decoded tuple list. 
Hmm - that looks like a better solution, I was using the following
(part is an instance of email.Message.Message):

    (from_name, from_addr) = parseaddr(part.get('from'))
    f = ''
    for (piece, charset) in decode_header(from_name):
        if charset:
            f += piece.decode(charset, 'replace')
        else:
            f += piece
    from_name = f
    print "%s <%s> vs %s" % (from_name, from_addr, part.get('from'))

which feels like reinventing wheels.. (I was expecting a helper function
to do all of this in the email.header package, and
unicode(part.get('from')) can't work..)

Jasper
--
Jasper Spaans http://jsp.vs19.net/
This line was last modified 0 seconds ago.

From janssen at parc.com Tue Mar 27 19:20:31 2007
From: janssen at parc.com (Bill Janssen)
Date: Tue, 27 Mar 2007 10:20:31 PDT
Subject: [Email-SIG] email.header.decode_header eats my spaces
In-Reply-To: <078C328A-263F-4221-A6B1-58D4AB4C6C22@python.org>
References: <4608C25B.7030608@is.kochi-u.ac.jp>
 <078C328A-263F-4221-A6B1-58D4AB4C6C22@python.org>
Message-ID: <07Mar27.092032pst."57996"@synergy1.parc.xerox.com>

> I actually think we should be doing the opposite, namely preserving
> any FWS in the existing text and /not/ substituting continuation_ws
> for it when we re-break the headers. This is the only way to
> maintain idempotency short of saving the original header intact (but
> then memory usage doubles). continuation_ws should be used only when
> we're forced to break at a non-existing FWS location, e.g. if we've
> split a non-ascii header or at a non-whitespace header-specific
> syntactic break. In the case of RFC 2047 headers, the FWS gets
> consumed anyway so it isn't idempotentially (?!) significant.

Barry, this seems correct to me, too.
Bill From tkikuchi at is.kochi-u.ac.jp Wed Mar 28 02:06:49 2007 From: tkikuchi at is.kochi-u.ac.jp (Tokio Kikuchi) Date: Wed, 28 Mar 2007 09:06:49 +0900 Subject: [Email-SIG] email.header.decode_header eats my spaces In-Reply-To: <078C328A-263F-4221-A6B1-58D4AB4C6C22@python.org> References: <4608C25B.7030608@is.kochi-u.ac.jp> <078C328A-263F-4221-A6B1-58D4AB4C6C22@python.org> Message-ID: <4609B199.5090208@is.kochi-u.ac.jp> Barry Warsaw wrote: > On Mar 27, 2007, at 3:06 AM, Tokio Kikuchi wrote: > >> In my opinion (may not be true to RFC2822 in detail), ascii strings in >> header object should be strip()ped and separated by FWS (including >> '\r\n ' or '\r\n\t'). > > I actually think we should be doing the opposite, namely preserving any > FWS in the existing text and /not/ substituting continuation_ws for it > when we re-break the headers. This is the only way to maintain > idempotency short of saving the original header intact (but then memory > usage doubles). continuation_ws should be used only when we're forced > to break at a non-existing FWS location, e.g. if we've split a non-ascii > header or at a non-whitespace header-specific syntactic break. In the > case of RFC 2047 headers, the FWS gets consumed anyway so it isn't > idempotentially (?!) significant. Well, this will surely break my contribution on Mailman 2.2 CookHeaders.py where unifying the code for subject prefix munging for both ascii and rfc2047. :-( Almost all the MUAs do subject munging by adding 'Re:' and adjusting the header length. This direction of patching means Python email package can't no more be used for eg. webmail application. If I understand correctly of course. > > That's where my patch is headed anyway. I have one test case failure > left to resolve. It's a bear, but when I get that working I'll submit a > patch for review. 
My gut is telling me not to apply this to Python 2.5 > but only Python 2.6 since enough of the semantics of continuation_ws and > folding has changed that it isn't appropriate for a patch release. > May be we should add a option for email.header.Header(), like idempotent=Ture/False. ;-) -- Tokio Kikuchi, tkikuchi at is.kochi-u.ac.jp http://weather.is.kochi-u.ac.jp/ From stephen at xemacs.org Wed Mar 28 17:25:18 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Thu, 29 Mar 2007 00:25:18 +0900 Subject: [Email-SIG] email.header.decode_header eats my spaces In-Reply-To: <4609B199.5090208@is.kochi-u.ac.jp> References: <4608C25B.7030608@is.kochi-u.ac.jp> <078C328A-263F-4221-A6B1-58D4AB4C6C22@python.org> <4609B199.5090208@is.kochi-u.ac.jp> Message-ID: <87ps6txue9.fsf@uwakimon.sk.tsukuba.ac.jp> Tokio Kikuchi writes: > Barry Warsaw wrote: > > > On Mar 27, 2007, at 3:06 AM, Tokio Kikuchi wrote: > > > >> In my opinion (may not be true to RFC2822 in detail), ascii strings in > >> header object should be strip()ped and separated by FWS (including > >> '\r\n ' or '\r\n\t'). > > > > I actually think we should be doing the opposite, namely preserving any > > FWS in the existing text and /not/ substituting continuation_ws for it > > when we re-break the headers. This is the only way to maintain > > idempotency short of saving the original header intact (but then memory > > usage doubles). Idempotency is a test, not a requirement. The requirement is "first, do no harm". Ie, if you process the header, the result should be as much "like" the original as possible. This is not actually implementable (different people will have different opinions about what that means, except only *really different* people will have the opinion that idempotency is undesirable), but the email package should make it possible for people to get pretty close without rewriting the package. > > continuation_ws should be used only when we're forced > > to break at a non-existing FWS location, e.g. 
if we've split a non-ascii > > header or at a non-whitespace header-specific syntactic break. In the > > case of RFC 2047 headers, the FWS gets consumed anyway so it isn't > > idempotentially (?!) significant. Only in RFC 2047 conformant MUAs. IMHO, RFC 2047 conformance is a requirement, but it's not sufficient. There are too many MUAs out that that do not correctly handle headers folded between encoded words (eg, Kyle Jones's VM). I don't know if you *should* care, but I think that RFC 2047 is (unfortunately) insufficient grounds for refusing to care at this stage. AFAICS the implication is that you need to make a judicious choice of the default for continuation_ws. > Well, this will surely break my contribution on Mailman 2.2 > CookHeaders.py where unifying the code for subject prefix munging for > both ascii and rfc2047. :-( I don't see why it should, although there might be technical reasons why it would. What I want, and what I think Barry is proposing, is simply that the email package never does anything to disturb FWS by default. If you munge a header (even as trivially as removing a "Re:" prefix), you must accept responsibility for formatting the result. At that point, I see no reason why the email package shouldn't help you "reflow" a header if that's desirable in your application---but the application should have to request that explicitly. It shouldn't be implicit in the setting of continuation_ws. > May be we should add a option for email.header.Header(), like > idempotent=Ture/False. ;-) I think it would be better to add an option, or even a hook function, for formatting. For example, I often use a docstring-like convention for long subject headers, where the gist is in the first line, and the rest is formatted nicely (ie, indented to align with the initial character of the first line of the subject). 
It would be nice if that kind of thing could be done with an
application-supplied function (of course email could provide a number of
common ones itself).

From barry at python.org Wed Mar 28 17:45:31 2007
From: barry at python.org (Barry Warsaw)
Date: Wed, 28 Mar 2007 11:45:31 -0400
Subject: [Email-SIG] email.header.decode_header eats my spaces
In-Reply-To: <4609B199.5090208@is.kochi-u.ac.jp>
References: <4608C25B.7030608@is.kochi-u.ac.jp>
 <078C328A-263F-4221-A6B1-58D4AB4C6C22@python.org>
 <4609B199.5090208@is.kochi-u.ac.jp>
Message-ID: <317F1391-7B55-4BE0-867B-B97EE7C4A3E3@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mar 27, 2007, at 8:06 PM, Tokio Kikuchi wrote:

> Well, this will surely break my contribution on Mailman 2.2
> CookHeaders.py where unifying the code for subject prefix munging
> for both ascii and rfc2047. :-(
>
> Almost all the MUAs do subject munging by adding 'Re:' and
> adjusting the header length. This direction of patching means
> Python email package can't no more be used for eg. webmail
> application. If I understand correctly of course.

Tokio, I'd like to understand more about why you think these two cases
will break. In the meantime, let me explain my understanding of rfc2047
and how and where I think we comply and don't comply. If we get
agreement on that, then we can decide what the right solution is.

So there are 4 cases we need to handle: ascii+ascii, ascii+encoded,
encoded+ascii, encoded+encoded. Here's what the email package currently
does in these cases (slightly out of order):

encoded+encoded:

>>> h = Header()
>>> h.append('hello', 'utf-8')
>>> h.append('world', 'utf-8')
>>> print h
=?utf-8?q?hello?= =?utf-8?q?world?=
>>> print unicode(h)
helloworld

I think we can all agree that we do this correctly. The rfc is
explicitly clear that all "linear-white-space" between the two encoded
parts must be ignored. Clearly we could split the line on that
linear-white-space and it would make no difference.
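
A small sketch of the encoded+encoded rule from the decoding side,
assuming Python 3 (so str() stands in for unicode()). It starts from
the wire form shown above rather than from Header.append(), and also
tries a folded variant to illustrate the point about splitting the
line at the separating whitespace:

```python
from email.header import decode_header, make_header

# Wire form from the session above, plus a folded variant of it.
one_line = "=?utf-8?q?hello?= =?utf-8?q?world?="
folded = "=?utf-8?q?hello?=\n =?utf-8?q?world?="

# Whitespace that only separates two encoded-words is dropped on decode.
text_one_line = str(make_header(decode_header(one_line)))
text_folded = str(make_header(decode_header(folded)))

print(text_one_line)
print(text_folded)
```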
ascii+encoded:

>>> h = Header()
>>> h.append('hello', 'us-ascii')
>>> h.append('world', 'utf-8')
>>> print h
hello =?utf-8?q?world?=
>>> print unicode(h)
hello world

Here again, I think we're doing the right thing, although IMO the rfc is
somewhat ambiguous. While it's clear about whitespace between encoded
words, it is /not/ explicit about linear-white-space between unencoded
and encoded parts. However, if you look at the second example in
section 8 of the rfc, this implies that linear-white-space is /not/
ignored when decoding and concatenating.

To me, this is a flaw in the rfc because there's no way to /avoid/
whitespace between unencoded and encoded parts! The separating
whitespace is required in order to comply with the parsing rules in the
rfc, but then you're left with whitespace that is in some undefined way
significant. The only way to avoid that space between the words is to
encode both parts.

But maybe that example is wrong. Personally, I'd prefer to interpret
unicode(h) above as 'helloworld' so that the rules about
linear-white-space between unencoded and encoded parts are exactly the
same as for between two encoded parts. I really have no way of knowing
what the intention of the rfc is here, so perhaps we need a flag on the
Header class (or in the .append() method) to specify which
interpretation the user wants.

If the separating space is treated the same in this case, then our
folding rules can be exactly the same. Otherwise things get more
complicated because we probably ought to be preserving the whitespace
for when we unfold (more on that in a separate followup).

encoded+ascii:

>>> h.append('hello', 'utf-8')
>>> h = Header()
>>> h.append('hello', 'utf-8')
>>> h.append('world', 'us-ascii')
>>> print h
=?utf-8?q?hello?= world
>>> print unicode(h)
hello world

More of the same.

ascii+ascii:

>>> h = Header()
>>> h.append('hello', 'us-ascii')
>>> h.append('world', 'us-ascii')
>>> print h
hello world
>>> print unicode(h)
helloworld

I think we're nearly correct here.
The unicode version is what I'd expect, but the string version is not. I think in both cases we should print 'helloworld'. - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (Darwin) iQCVAwUBRgqNnHEjvBPtnXfVAQIFmQP+J4ud9R/hvBupIcUpZNWFntzdcVPPHGPq vTNycMm+9pvaU7KFbIU2LabnQGUGZ+yycFGl8WTTtIddad6DGPBGfeGX2jSOk4XB MpakU5JBO1/uP5zB1wC13yzZlTXVBqyKntNr8Z1VsAHUtzC9EIJhp3xlbUEyqWgW WuhUS4wcMgI= =nffu -----END PGP SIGNATURE----- From barry at python.org Wed Mar 28 18:02:25 2007 From: barry at python.org (Barry Warsaw) Date: Wed, 28 Mar 2007 12:02:25 -0400 Subject: [Email-SIG] email.header.decode_header eats my spaces In-Reply-To: <87ps6txue9.fsf@uwakimon.sk.tsukuba.ac.jp> References: <4608C25B.7030608@is.kochi-u.ac.jp> <078C328A-263F-4221-A6B1-58D4AB4C6C22@python.org> <4609B199.5090208@is.kochi-u.ac.jp> <87ps6txue9.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Mar 28, 2007, at 11:25 AM, Stephen J. Turnbull wrote: > Idempotency is a test, not a requirement. The requirement is "first, > do no harm". Ie, if you process the header, the result should be as > much "like" the original as possible. This is not actually > implementable (different people will have different opinions about > what that means, except only *really different* people will have the > opinion that idempotency is undesirable), but the email package > should make it possible for people to get pretty close without > rewriting the package. I agree that idempotency can't be a hard requirement; there are too many constraints, too much variability in the inputs, and too many ambiguities in the rfcs. This is exactly like our stance on MIME parsing and generating, where broken MIME can break idempotency. But I think we can do better than we currently do by opting to preserve whitespace when we break lines instead of substituting existing whitespace for continuation_ws. 
>>> continuation_ws should be used only when we're forced >>> to break at a non-existing FWS location, e.g. if we've split a >>> non-ascii >>> header or at a non-whitespace header-specific syntactic break. >>> In the >>> case of RFC 2047 headers, the FWS gets consumed anyway so it isn't >>> idempotentially (?!) significant. > > Only in RFC 2047 conformant MUAs. IMHO, RFC 2047 conformance is a > requirement, but it's not sufficient. There are too many MUAs out > that that do not correctly handle headers folded between encoded words > (eg, Kyle Jones's VM). I don't know if you *should* care, but I think > that RFC 2047 is (unfortunately) insufficient grounds for refusing to > care at this stage. Oh, I know all about VM. I think the first bug I sent to Kyle on that has got to be approaching its 10th anniversary. :) It's a no-win situation if we try to care about broken MUAs. OTOH, let's have some pity on the poor MUA authors, 'cause the rfcs don't make it easy for them. ;). Still, I think there's no perfect solution if we try to also support non-conformant MUAs. > AFAICS the implication is that you need to make a judicious choice of > the default for continuation_ws. Combined with the preference to preserve existing fws when present, and not insert continuation_ws unless absolutely necessary. >> Well, this will surely break my contribution on Mailman 2.2 >> CookHeaders.py where unifying the code for subject prefix munging for >> both ascii and rfc2047. :-( > > I don't see why it should, although there might be technical reasons > why it would. What I want, and what I think Barry is proposing, is > simply that the email package never does anything to disturb FWS by > default. Correct. > If you munge a header (even as trivially as removing a "Re:" prefix), > you must accept responsibility for formatting the result. 
At that > point, I see no reason why the email package shouldn't help you > "reflow" a header if that's desirable in your application---but the > application should have to request that explicitly. It shouldn't be > implicit in the setting of continuation_ws. > >> May be we should add a option for email.header.Header(), like >> idempotent=Ture/False. ;-) > > I think it would be better to add an option, or even a hook function, > for formatting. For example, I often use a docstring-like convention > for long subject headers, where the gist is in the first line, and the > rest is formatted nicely (ie, indented to align with the initial > character of the first line of the subject). It would be nice if that > kind of thing could be done with an application-supplied function (of > course email could provide a number of common ones itself). I've been thinking about something like this too, not just for headers, but also for message bodies. One of the things that comes up often is the request to use wire-protocol line separators for lines within the body, so you could take the output of a Message and spew it directly on a port-25 socket for example. I've always taken the position that the email package should use native line endings and that protocol modules such as smtplib and nntplib would do the line-ending transformation. But for a variety of reasons, this isn't satisfying, and it's a use case I think email package should handle. Of course, doing this means a radical redesign of some of the classes in the email package. I'd be happy to go down that road because I think it will give people important options, though of course we're now talking new features (i.e. Python 2.6) not bug fixes (Python 2.5 and earlier). For example, an rfc2047 formatter could get involved during Header.append(). It might accept two word chunks and return the whitespace to insert between them. Different formatters could be used for different interpretations of rfc2047. 
Similarly, the formatter could get involved for breaking long lines.
It could decide not to break them at all, or would return the two
lines broken and formatted.

We'd need a mini-library of useful formatters, and we'd need to
choose some reasonable defaults. We'd need to design a good api,
figuring out where the hook points ought to be. I'm up for it, but
it's a lot of work, so I'd need to get help from this group on
getting there. Who's up for some pair programming? :)

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Darwin)

iQCVAwUBRgqRknEjvBPtnXfVAQJ4JwQAkHo07eF5i3EawH5RN0MyduNrYyJBPjeK
5qU9uxRdPYMLlIIMDUk5PILryobzyomWwsXjzPuPjDcOFAuUN5Md5leKh/KHyJ0+
oeevd/tHZJXY2qxAK6VnmrFFYLelwmFWvk+/1QORAgaPJld+wmbVbS0NeSZ2BkZg
NwYx+fbTkxE=
=lPkF
-----END PGP SIGNATURE-----

From tkikuchi at is.kochi-u.ac.jp Thu Mar 29 02:13:23 2007
From: tkikuchi at is.kochi-u.ac.jp (Tokio Kikuchi)
Date: Thu, 29 Mar 2007 09:13:23 +0900
Subject: [Email-SIG] email.header.decode_header eats my spaces
In-Reply-To: <317F1391-7B55-4BE0-867B-B97EE7C4A3E3@python.org>
References: <4608C25B.7030608@is.kochi-u.ac.jp>
    <078C328A-263F-4221-A6B1-58D4AB4C6C22@python.org>
    <4609B199.5090208@is.kochi-u.ac.jp>
    <317F1391-7B55-4BE0-867B-B97EE7C4A3E3@python.org>
Message-ID: <460B04A3.2070300@is.kochi-u.ac.jp>

Barry Warsaw wrote:

> ascii+encoded
>
> >>> h = Header()
> >>> h.append('hello', 'us-ascii')
> >>> h.append('world', 'utf-8')
> >>> print h
> hello =?utf-8?q?world?=
> >>> print unicode(h)
> hello world
>
> Here again, I think we're doing the right thing, although IMO the rfc is
> somewhat ambiguous. While it's clear about whitespace between encoded
> words, it is /not/ explicit about linear-white-space between unencoded
> and encoded parts. However, if you look at the second example in
> section 8 of the rfc, this implies that linear-white-space is /not/
> ignored when decoding and concatenating.
>
> To me, this is a flaw in the rfc because there's no way to /avoid/
> whitespace between unencoded and encoded parts!

Well, it looks to me that RFC2047 prohibits this at least in header
text. An example for comment text in section 8 states:

   (=?ISO-8859-1?Q?a?= b)                       (a b)

           Within a 'comment', white space MUST appear between an
           'encoded-word' and surrounding text. [Section 5,
           paragraph (2)]. However, white space is not needed between
           the initial "(" that begins the 'comment', and the
           'encoded-word'.

The word MUST means there is no way omitting spaces between
encoded-word and surrounding ascii text. The '(' before the
encoded-word appears to violate this but it is a higher syntax token.

Current email.header violate this example because we have no class
which recognizes comment in a structured header.

>>> from email.header import *
>>> s = '(=?ISO-8859-1?Q?a?= b)'
>>> l = decode_header(s)
>>> l
[('(', None), ('a', 'iso-8859-1'), ('b)', None)]
>>> h = make_header(l)
>>> print h
( =?iso-8859-1?q?a?= b)
 ^ notice this extra space.

This current behavior is correct if '(' is in a *text field and the
example is not appropriate. The problem in email.header module is it
cannot distinguish between the structured and unstructured (text
only) headers. The Header class may have a member function like
'add_comment', IMHO.

> But maybe that example is wrong. Personally, I'd prefer to interpret
> unicode(h) above as 'helloworld' so that the rules about
> linear-white-space between unencoded and encoded parts is exactly the
> same as for between two encoded parts. I really have no way of knowing
> what the intention of the rfc is here, so perhaps we need a flag on the
> Header class (or in the .append() method) to specify which
> interpretation the user wants.

RFC2047 is clear in that 'encoded-word' should be treated as a plain
english word which is separated by space (or higher syntactic token
like '(', ')', ';' etc.)
The only exception is an 'encoded-word'--'encoded-word' sequence,
which may result from wrapping a long line, because text tends to
become longer when encoded.

> >>> h = Header()
> >>> h.append('hello', 'us-ascii')
> >>> h.append('world', 'us-ascii')
> >>> print h
> hello world
> >>> print unicode(h)
> helloworld
>
> I think we're nearly correct here. The unicode version is what I'd
> expect, but the string version is not. I think in both cases we should
> print 'helloworld'.

No. email.header module is not a word processor. Because RFC2047 is
dealing with 'word's, we should treat these parts as 'word's for
consistency. unicode() function should be fixed. If these words are
to be concatenated without a space, it should be done outside header
module. Remember we are not making an almighty word processor but an
RFC compliant module.

Cheers,
--
Tokio Kikuchi, tkikuchi at is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/

From stephen at xemacs.org Thu Mar 29 04:58:43 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Thu, 29 Mar 2007 11:58:43 +0900
Subject: [Email-SIG] email.header.decode_header eats my spaces
In-Reply-To: <460B04A3.2070300@is.kochi-u.ac.jp>
References: <4608C25B.7030608@is.kochi-u.ac.jp>
    <078C328A-263F-4221-A6B1-58D4AB4C6C22@python.org>
    <4609B199.5090208@is.kochi-u.ac.jp>
    <317F1391-7B55-4BE0-867B-B97EE7C4A3E3@python.org>
    <460B04A3.2070300@is.kochi-u.ac.jp>
Message-ID: <87k5x0ycv0.fsf@uwakimon.sk.tsukuba.ac.jp>

Tokio Kikuchi writes:
> Barry Warsaw wrote:

> > To me, this is a flaw in the rfc because there's no way to /avoid/
> > whitespace between unencoded and encoded parts!
>
> Well, it looks to me that RFC2047 prohibits [deleting whitespace] at
> least in header text.

That's my understanding as well, for the reasons Tokio gave. If you
want "unicode(h)" ==> "helloworld", you need to encode the whole
string. (Giving 'hello' the charset 'utf-8' would be a hackish way
of doing this.)
> The problem in email.header module is it cannot distinguish between
> the structured and unstructured (text only) headers.

Yikes! I didn't think of it that way before, but now that you mention
it, my spine is freezing.

> The Header class may have a member function like 'add_comment',
> IMHO.

IMHO, the Header class should be abstract, and there should be
subclasses that handle dates, lists of addresses, lists of
message-ids, etc. as appropriate to header fields structured in each
particular way. Only those object handlers appropriate to a given
field would be exposed. StarTextHeader would be the unstructured
derivative of the (implicitly structured) Header class.

Barry again:

> > I really have no way of knowing what the intention of the rfc is here,
> > so perhaps we need a flag on the Header class (or in the .append()
> > method) to specify which interpretation the user wants.

I really don't think that users should be allowed to "specify
interpretation". RFC 2047 is a "transfer encoding". Users should
never need to deal with that kind of thing, and it is dangerous to
allow them to do so. Users (including software clients of the
package, of course) should simply hand objects and text to the Header
class to format according to 2822, 2047, and the definition of each
field's structure.

Nevertheless, given that RFC 2047 (and 2822, for that matter) is
explicitly intended to allow headers to be human-readably formatted
but still machine-parsable, the user should be allowed to express
*preferences* for the formatting, for example qp_preference_function
would be a function of the header contents such that if it returns
true, QP encoding should be used, otherwise BASE64. But the decision
to use encoded words would not be a user choice.
There might be a preserve_whitespace_literally preference, in which
case the whole header would have to be RFC 2047 encoded -- but in the
case of structured headers (eg, address lists), you can't simply
BASE64 the whole thing, only the *text components!

And the email package needs to be free to deal with structured headers
appropriately (for example, breaking very long addresses to try to
keep line-length to a reasonable level). It may not be feasible to
respect user preferences in all cases.

Maybe there could be an escape to allow Sufficiently Smart Users to
format headers "by hand", but its use should be discouraged in favor
of a structured Header subclass that DTRTs.

> email.header module is not a word processor.

Good slogan!

From barry at python.org Thu Mar 29 06:24:42 2007
From: barry at python.org (Barry Warsaw)
Date: Thu, 29 Mar 2007 00:24:42 -0400
Subject: [Email-SIG] email.header.decode_header eats my spaces
In-Reply-To: <460B04A3.2070300@is.kochi-u.ac.jp>
References: <4608C25B.7030608@is.kochi-u.ac.jp>
    <078C328A-263F-4221-A6B1-58D4AB4C6C22@python.org>
    <4609B199.5090208@is.kochi-u.ac.jp>
    <317F1391-7B55-4BE0-867B-B97EE7C4A3E3@python.org>
    <460B04A3.2070300@is.kochi-u.ac.jp>
Message-ID: <7BD4BB93-3972-4025-9537-984F09D935CE@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mar 28, 2007, at 8:13 PM, Tokio Kikuchi wrote:

> Well, it looks to me that RFC2047 prohibits this at least in header
> text. An example for comment text in section 8 states:
>
>    (=?ISO-8859-1?Q?a?= b)                       (a b)
>
>            Within a 'comment', white space MUST appear between an
>            'encoded-word' and surrounding text. [Section 5,
>            paragraph (2)]. However, white space is not needed between
>            the initial "(" that begins the 'comment', and the
>            'encoded-word'.
>
> The word MUST means there is no way omitting spaces between
> encoded-word and surrounding ascii text. The '(' before the
> encoded-word appears to violate this but it is a higher syntax token.
>
> Current email.header violate this example because we have no class
> which recognizes comment in a structured header.

Thanks Tokio, I agree with all of this. I think you're right in
identifying that the problem here is that we don't really have any
way to understand the semantics of a particular header's body.

> This current behavior is correct if '(' is in a *text field and the
> example is not appropriate. The problem in email.header module is
> it cannot distinguish between the structured and unstructured (text
> only) headers. The Header class may have a member function like
> 'add_comment', IMHO.

I think we might want to try to address this in a more general and
extensible way, so that we can support future semantically meaningful
headers.

>> >>> h = Header()
>> >>> h.append('hello', 'us-ascii')
>> >>> h.append('world', 'us-ascii')
>> >>> print h
>> hello world
>> >>> print unicode(h)
>> helloworld
>> I think we're nearly correct here. The unicode version is what
>> I'd expect, but the string version is not. I think in both cases
>> we should print 'helloworld'.
>
> No. email.header module is not a word processor. Because RFC2047
> is dealing with 'word's, we should treat these parts as 'word's for
> consistency. unicode() function should be fixed. If these words
> are to be concatenated without a space, it should be done outside
> header module.

Right, but these parts aren't being encoded, and yet we've still
stuck a space between the parts that didn't exist there before. I'd
feel better about it if we encoded these chunks too.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Darwin)

iQCVAwUBRgs/k3EjvBPtnXfVAQLv3gQAl3598ge8qge7epkdqqjBq4F+478374z6
DuvfcBWeBGNZ/b4PEesPbtOwUKprz9mp988N1aoiMWiBa3p5OMQvhIl6q0w1d7Tj
Gm2aCxrXa2JRfkFsj+VygDalK8aYT0XcDxh+56vCjfwhTvKHz1MmkAEwWLbJ6Cp/
GxGfW4l6a6g=
=7akO
-----END PGP SIGNATURE-----

From barry at python.org Thu Mar 29 06:56:19 2007
From: barry at python.org (Barry Warsaw)
Date: Thu, 29 Mar 2007 00:56:19 -0400
Subject: [Email-SIG] email.header.decode_header eats my spaces
In-Reply-To: <87k5x0ycv0.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <4608C25B.7030608@is.kochi-u.ac.jp>
    <078C328A-263F-4221-A6B1-58D4AB4C6C22@python.org>
    <4609B199.5090208@is.kochi-u.ac.jp>
    <317F1391-7B55-4BE0-867B-B97EE7C4A3E3@python.org>
    <460B04A3.2070300@is.kochi-u.ac.jp>
    <87k5x0ycv0.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mar 28, 2007, at 10:58 PM, Stephen J. Turnbull wrote:

> If you want "unicode(h)" ==> "helloworld", you need to encode the
> whole string. (Giving 'hello' the charset 'utf-8' would be a hackish
> way of doing this.)

>>> decode_header('=?us-ascii?q?hello?= =?us-ascii?q?world?=')
[('helloworld', 'us-ascii')]

>> The problem in email.header module is it cannot distinguish between
>> the structured and unstructured (text only) headers.
>
> Yikes! I didn't think of it that way before, but now that you mention
> it, my spine is freezing.

Indeed.

>> The Header class may have a member function like 'add_comment',
>> IMHO.
>
> IMHO, the Header class should be abstract, and there should be
> subclasses that handle dates, lists of addresses, lists of
> message-ids, etc. as appropriate to header fields structured in each
> particular way. Only those object handlers appropriate to a given
> field would be exposed. StarTextHeader would be the unstructured
> derivative of the (implicitly structured) Header class.

I'm not sure inheritance is the right way to organize this.
I think instead you might want an interface that allows you to specify
header body 'interpreters' which can be associated with Header
instances. These interpreters would handle things like splitting and
folding header bodies, handling appending of words, etc.

Or, maybe inheritance is right. In any case, I think you also want
to have a registry of some sort so that something like the Parser
could find a header subclass or header interpreter to use as it was
parsing a message text. You want this so an application could
register x-headers.

[Steve says lots of other intelligent things that Barry agrees with]

I'm pretty tired right now so the only thing I can add is...

> And the email package needs to be free to deal with structured headers
> appropriately (for example, breaking very long addresses to try to
> keep line-length to a reasonable level). It may not be feasible to
> respect user preferences in all cases.

...that I want to keep in mind that because of x-headers, we have to
have an extensible architecture.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Darwin)

iQCVAwUBRgtG9HEjvBPtnXfVAQKgnQP/dts4njMjpB495yNvokiPBBPbPCSajTfH
hN5eYh5/3RJP6bOg+ABJlt56ya7QWD0J60cYWJJM2USRqDomGsWq24+TZ+5hsfTr
jX87K0qy1vplViyILKbLl1Kfn+hvAQiAPH+57+do+0VX8fQM36QUosy19abaiOod
mv9r8QQ/aBA=
=jwSp
-----END PGP SIGNATURE-----

From stephen at xemacs.org Thu Mar 29 08:35:06 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Thu, 29 Mar 2007 15:35:06 +0900
Subject: [Email-SIG] email.header.decode_header eats my spaces
In-Reply-To:
References: <4608C25B.7030608@is.kochi-u.ac.jp>
    <078C328A-263F-4221-A6B1-58D4AB4C6C22@python.org>
    <4609B199.5090208@is.kochi-u.ac.jp>
    <317F1391-7B55-4BE0-867B-B97EE7C4A3E3@python.org>
    <460B04A3.2070300@is.kochi-u.ac.jp>
    <87k5x0ycv0.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <87ircky2ud.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:
> Steve writes:

> > IMHO, the Header class should be abstract, and there should be
> > subclasses that handle dates, lists of addresses, lists of
> > message-ids, etc.

> I'm not sure inheritance is the right way to organize this.

I picked inheritance because I see the header "type" as being fixed at
Header instantiation (I can't think of a use-case for changing a
"From" header to a "Subject" header, while "Message-ID" and
"Resent-Message-ID" would be handled by the same class), but there
are some things (handling folding, parsing the field name and body)
that are common to all headers.

I would be happy with any scheme that has the property that given a
field name, the semantics of its contents are fixed according to the
field if it is registered, or treated as "*text with caution" (maybe
extra warnings? etc) if the field is not registered.

> Or, maybe inheritance is right. In any case, I think you also want
> to also have a registry of some sort

Indeed I do!
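
For illustration only, here is a minimal sketch of the kind of registry
being discussed, written in present-day Python. Every name in it
(UnstructuredHandler, MessageIDListHandler, register, handler_for) is
hypothetical and is not part of the email package; it only shows the
property described above, that a registered field name fixes the
handler, while an unregistered one falls back to "*text".

```python
# Hypothetical sketch; none of these names exist in the email package.

class UnstructuredHandler:
    """Fallback handler: treat the header body as plain *text."""
    structured = False

    def __init__(self, name, body):
        self.name = name
        self.body = body

    def parse(self):
        # Unstructured text is returned as-is, minus surrounding whitespace.
        return self.body.strip()


class MessageIDListHandler(UnstructuredHandler):
    """Handler for fields whose body is a list of message-ids."""
    structured = True

    def parse(self):
        # Message-ids contain no whitespace, so a plain split is enough here.
        return self.body.split()


_registry = {}


def register(field_name, handler_cls):
    """Associate a field name (case-insensitively) with a handler class."""
    _registry[field_name.lower()] = handler_cls


def handler_for(field_name):
    """Return the registered handler class, or the *text fallback."""
    return _registry.get(field_name.lower(), UnstructuredHandler)


# "Message-ID" and "Resent-Message-ID" share one class, as suggested above.
register('Message-ID', MessageIDListHandler)
register('Resent-Message-ID', MessageIDListHandler)
# An application could register its own x-header the same way.
register('X-Related-IDs', MessageIDListHandler)

ids = handler_for('x-related-ids')('X-Related-IDs', '<a@example> <b@example>')
text = handler_for('X-Unknown')('X-Unknown', ' free-form text ')
```

A parser could call handler_for() on each field name as it reads a
message; whether the handlers are Header subclasses or separate
"interpreter" objects is left open, as it is in the thread.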