From barry at python.org Thu Apr 2 14:54:08 2009 From: barry at python.org (Barry Warsaw) Date: Thu, 2 Apr 2009 07:54:08 -0500 Subject: [Email-SIG] Plans for email 6.0 Message-ID: <0DC79F9C-450F-484F-BBB0-28B69EB879F9@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hello everyone. Today's the last day of Pycon 2009 sprints and I'm eager to return home and see my family. Chris Withers and I had a good day sprinting on the email package before he had to jet out, and although we only closed one bug in Python 2.7 (this is where Chris's mantra "backport, backport" begins :) we had a lot of good discussions about how and where to fix outstanding problems in email. I have lots of ideas on how to improve the email package. I plan on creating a bit of space on the Python wiki to consolidate my thoughts and to coordinate implementation. I'm hoping some of you will be interested enough to help with design, testing, use cases, and coding. We have a few older pages in the wiki covering the email package: http://wiki.python.org/moin/EmailSigSprint http://wiki.python.org/moin/EmailSprint Some of this we've accomplished. Here's a rambling of some of my thoughts on things we should do. * Turn all header values into Header instances. It's difficult and error prone to have to manage both strings and Headers as values, so they should always be Header instances. We should add a registry of Header subclasses, based on the lower cased header name, for allowing higher level semantic folding of header strings. * Implement a Message subclass registry for parsing. This would allow the parser to create custom subclasses based on the Content-Type found while parsing the message. * Bytes and string interfaces. This is the trickiest one. I think that internally, header names and values, and payloads should all be represented as bytes. But APIs should accept bytes and strings, converting to bytes on input, and provide APIs to extract information as either bytes or strings. 
I've thought about a few ways to do this cleanly, but haven't found anything I particularly like yet. Remember that in email in Py2 is horribly broken in its discrimination between bytes and strings, but Py3 forces us to make a choice (which is a good thing). * Clean up the API. Where possible, simple attribute access should be the norm. Let's get rid of dumb API decisions (like str(msg) including the Unix-From). Let's fix the whole get_payload(decode=True) debacle. Let's fix stuff like needing to specify unicode encodings twice in the same call. Etc. * Add an external storage API so that messages with huge binary payloads don't need to be fully stored in memory. * Let's target Python 3.1 (coming very soon) if possible, or Python 3.2 if not. We should back port email 6.0 to Python 2.x, though we'll have to decide how far back we should go (my suggestion: no earlier than Python 2.5). * Fix the myriad of bugs in the tracker! That's it for now. I'll figure out a place in the wiki for this and we can start capturing our thoughts there. One thing I've heard pretty consistently is that while the email package has its problems, it's one of the best email packages available for any language. Let's make it rock. Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (Darwin) iQCVAwUBSdS1cHEjvBPtnXfVAQL7egQAk4LQpdfruSdW3R+Egz7dqAWfbftBnQio dGdyZT/X8cyjGVO9wwcwo2u2c7+JPElpnvBnYZc9oMSFErfUvgumXZo3mEORaGpm hj/+s0vG8c79SzA9Jz5wB1sBj50c7xN1L7kDCR3Ncwhz4vJSkO8nLvOqaJiccuF8 7s76zNewnO8= =Dayc -----END PGP SIGNATURE----- From tonynelson at georgeanelson.com Thu Apr 2 16:16:43 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Thu, 2 Apr 2009 10:16:43 -0400 Subject: [Email-SIG] Plans for email 6.0 In-Reply-To: <0DC79F9C-450F-484F-BBB0-28B69EB879F9@python.org> References: <0DC79F9C-450F-484F-BBB0-28B69EB879F9@python.org> Message-ID: At 07:54 -0500 2009/04/02, Barry Warsaw wrote: ... >...Here's a rambling of some of my thoughts on things we should do. ... 
>* Bytes and string interfaces. This is the trickiest one. I think >that internally, header names and values, and payloads should all be >represented as bytes. But APIs should accept bytes and strings, >converting to bytes on input, and provide APIs to extract information >as either bytes or strings. I've thought about a few ways to do this >cleanly, but haven't found anything I particularly like yet. Remember >that in email in Py2 is horribly broken in its discrimination between >bytes and strings, but Py3 forces us to make a choice (which is a good >thing). AIUI, this or something like it must be done soon, as the email package is broken on 3.x now. >* Clean up the API. Where possible, simple attribute access should be >the norm. Let's get rid of dumb API decisions (like str(msg) >including the Unix-From). Let's fix the whole >get_payload(decode=True) debacle. Let's fix stuff like needing to >specify unicode encodings twice in the same call. Etc. Sounds good. I'd like __setitem__ (msg[hdr] = foo) to act more like a mapping, and not just append new header fields, with .replace_header() and .add_header() folded together as .set_header(). >* Add an external storage API so that messages with huge binary >payloads don't need to be fully stored in memory. > >* Let's target Python 3.1 (coming very soon) if possible, or Python >3.2 if not. We should back port email 6.0 to Python 2.x, though we'll >have to decide how far back we should go (my suggestion: no earlier >than Python 2.5). Python 3.1 should have a working email package, and a simple way for users needing more to get a better replacement (which they'd install as a site-package). I think that a sane split between bytes and string (or string and Unicode on 2.x) is most needed. >* Fix the myriad of bugs in the tracker! Sure, I'm game! We 2.x users would benefit. Again, a place for users to get an "official" current package is needed, as 2.7 is a ways off. 
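On the __setitem__ point above, a minimal sketch of the behaviour in question, assuming the standard-library email.message.Message class with its default policy. The set_header() function here is hypothetical (a free function standing in for the proposed method, not an existing part of the email package); it shows one way .replace_header() and .add_header() could be folded together.

```python
from email.message import Message


def set_header(msg, name, value):
    """Hypothetical helper: leave exactly one `name` field holding `value`."""
    # __delitem__ removes every occurrence and is silent if none exist.
    del msg[name]
    # __setitem__ appends, so after the delete this is the only instance.
    msg[name] = value


msg = Message()
msg["Subject"] = "first"
msg["Subject"] = "second"           # appends a second Subject: field
duplicates = msg.get_all("Subject")

set_header(msg, "Subject", "only one")
after = msg.get_all("Subject")
```

Unlike the existing replace_header(), which keeps the field in its original position and raises KeyError when the field is absent, this sketch moves the field to the end of the header block.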
-- ____________________________________________________________________ TonyN.:' ' From barry at python.org Sun Apr 5 19:26:52 2009 From: barry at python.org (Barry Warsaw) Date: Sun, 5 Apr 2009 13:26:52 -0400 Subject: [Email-SIG] Email 6.0 Message-ID: <74AB269B-B7E8-4706-B066-E2AA662EF3DB@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I've started a branch for the email package version 6.0.0. Given that we have until May 2nd to solidify this thing for Python 3.1, I honestly don't think we'll make it. I would rather concentrate on getting this right, and usable as a standalone package, then work toward getting the new version into Python 3.2 and backported to 2.7. I'm working on a branch in Bazaar, at lp:~barry/python/email6 % bzr branch lp:~barry/python/email6 This is a branch of the Py3k trunk. I'm starting by refactoring the huge test_email.py file into smaller separate tests, then fixing thing as I go. After the tests are working I plan on starting to fix the API and other problems we've talked about. For now, let's coordinate on this branch. IOW, if you'd like to contribute (and I hope you do!) please branch the above and let us know about it here. I'll keep the above branch as (for now) the master copy. Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (Darwin) iQCVAwUBSdjp3HEjvBPtnXfVAQJcMgQApgnYaX34Au1AhFgOdbRlxbgxN7kRcB/N F+LK0IsPsrk8nqUoTpCcsNyZA/ErNUqeNctikZprdOz28xPnndrwaFNDHsWwbMfn NbzacfTP/2R106wOwNANc68dj7jfco7R6fp8Qa3i4vo1S59SiDuyQy7zMstiql/T nUhCIwijS/Q= =NLTE -----END PGP SIGNATURE----- From barry at python.org Sun Apr 5 19:30:24 2009 From: barry at python.org (Barry Warsaw) Date: Sun, 5 Apr 2009 13:30:24 -0400 Subject: [Email-SIG] Plans for email 6.0 In-Reply-To: References: <0DC79F9C-450F-484F-BBB0-28B69EB879F9@python.org> Message-ID: <88F420F3-4CA1-4BE6-B696-3CDB066B314B@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Apr 2, 2009, at 10:16 AM, Tony Nelson wrote: >> * Bytes and string interfaces. 
This is the trickiest one. I think >> that internally, header names and values, and payloads should all be >> represented as bytes. But APIs should accept bytes and strings, >> converting to bytes on input, and provide APIs to extract information >> as either bytes or strings. I've thought about a few ways to do this >> cleanly, but haven't found anything I particularly like yet. >> Remember >> that in email in Py2 is horribly broken in its discrimination between >> bytes and strings, but Py3 forces us to make a choice (which is a >> good >> thing). > > AIUI, this or something like it must be done soon, as the email > package is > broken on 3.x now. Indeed. >> * Clean up the API. Where possible, simple attribute access should >> be >> the norm. Let's get rid of dumb API decisions (like str(msg) >> including the Unix-From). Let's fix the whole >> get_payload(decode=True) debacle. Let's fix stuff like needing to >> specify unicode encodings twice in the same call. Etc. > > Sounds good. I'd like __setitem__ (msg[hdr] = foo) to act more like a > mapping, and not just append new header fields, > with .replace_header() and > .add_header() folded together as .set_header(). Is there a reason for this? This is one part of the API that I've found where practicality beats purity. >> * Add an external storage API so that messages with huge binary >> payloads don't need to be fully stored in memory. >> >> * Let's target Python 3.1 (coming very soon) if possible, or Python >> 3.2 if not. We should back port email 6.0 to Python 2.x, though >> we'll >> have to decide how far back we should go (my suggestion: no earlier >> than Python 2.5). > > Python 3.1 should have a working email package, and a simple way for > users > needing more to get a better replacement (which they'd install as a > site-package). I think that a sane split between bytes and string (or > string and Unicode on 2.x) is most needed. Unfortunately, it's a /very/ tricky problem. 
This pervades every aspect of the package. I'm slowly byte-ifying the internals as I refactor the tests. That's the first step IMO, but it doesn't make for a very convenient API. >> * Fix the myriad of bugs in the tracker! > > Sure, I'm game! We 2.x users would benefit. Again, a place for > users to > get an "official" current package is needed, as 2.7 is a ways off. We will definitely make standalone packages available on the Cheeseshop for Python 2.x and 3.x. The question of what goes into 3.1 is still up in the air I think. Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (Darwin) iQCVAwUBSdjqsHEjvBPtnXfVAQJZSwP/fABeQG7Q1c4LOZhwCZBcb41Gh4ybZVoK tZFM2Q1UTdq0bvaEG5xKMkGPHd1S/+AovrwtC4qTIL531p/RJZp3KaDvucGLfWJ3 w61Mk75Zj6yTEbg2GtJwKiY1Zj7oYZgod0NEQ6vgaBAchLAWrnwsE52ap3w+9K7M wzmppfl/r/I= =sxwD -----END PGP SIGNATURE----- From tonynelson at georgeanelson.com Sun Apr 5 21:04:54 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Sun, 5 Apr 2009 15:04:54 -0400 Subject: [Email-SIG] Plans for email 6.0 In-Reply-To: <88F420F3-4CA1-4BE6-B696-3CDB066B314B@python.org> References: <0DC79F9C-450F-484F-BBB0-28B69EB879F9@python.org> <88F420F3-4CA1-4BE6-B696-3CDB066B314B@python.org> Message-ID: Traffic! At 13:30 -0400 04/05/2009, Barry Warsaw wrote: >-----BEGIN PGP SIGNED MESSAGE----- >Hash: SHA1 > >On Apr 2, 2009, at 10:16 AM, Tony Nelson wrote: >>>* Clean up the API. Where possible, simple attribute access should be >>>the norm. Let's get rid of dumb API decisions (like str(msg) including >>>the Unix-From). Let's fix the whole get_payload(decode=True) debacle. >>>Let's fix stuff like needing to specify unicode encodings twice in the >>>same call. Etc. >> >>Sounds good. I'd like __setitem__ (msg[hdr] = foo) to act more like a >>mapping, and not just append new header fields, with .replace_header() >>and .add_header() folded together as .set_header(). > >Is there a reason for this? This is one part of the API that I've >found where practicality beats purity. 
What part of saying:

    msg["Subject"] = "new subject line"

and getting a second Subject: header field is practical?  For those times
when you really want more than one instance of a header field:

    msg.append_header("Subject", "new subject line")

In general, users of the email package must currently be familiar with all
the mail RFCs in order to properly use the package to create or manipulate
any but the simplest messages, and having "[]" mean "append" isn't helping.
Your suggestion that header fields should always be represented as Header
objects is urgently needed.  Those Header objects will need to be smart
about the header field they represent, and apply all the various encodings
etc. as necessary.
 ...

>>>* Let's target Python 3.1 (coming very soon) if possible, or Python 3.2
>>>if not. We should back port email 6.0 to Python 2.x, though we'll have
>>>to decide how far back we should go (my suggestion: no earlier than
>>>Python 2.5).
>>
>>Python 3.1 should have a working email package, and a simple way for
>>users needing more to get a better replacement (which they'd install as a
>>site-package). I think that a sane split between bytes and string (or
>>string and Unicode on 2.x) is most needed.
>
>Unfortunately, it's a /very/ tricky problem.

I assume you mean "working email package", not "a simple way for users ...
to get a better replacement".

>This pervades every
>aspect of the package. I'm slowly byte-ifying the internals as I
>refactor the tests. That's the first step IMO, but it doesn't make
>for a very convenient API.

So it goes.  It may make more sense as you get farther along.

What parts of that work can you farm out?  Do you need an RFC-compliant
header parser?  I could write one in a few days, I think.

>>> * Fix the myriad of bugs in the tracker!
>>
>>Sure, I'm game! We 2.x users would benefit. Again, a place for users to
>>get an "official" current package is needed, as 2.7 is a ways off.
>
>We will definitely make standalone packages available on the
>Cheeseshop for Python 2.x and 3.x. The question of what goes into 3.1
>is still up in the air I think.

Well, I think that the bugs I've worked on so far should go into 2.6, 2.7,
and 3.1 (unless 3.1 makes a lot of progress and renders some of the bugs
obsolete).

[issue5610] email feedparser.py CRLFLF bug: $ vs \Z
[issue5638] test_httpservers fails CGI tests if --enable-shared
[issue1555570] email parser incorrectly breaks headers with a CRLF at 8192
[issue3169] email/header.py doesn't handle Base64 headers that have been
            insufficiently padded.
[issue4487] Add utf8 alias for email charsets
[issue1079] decode_header does not follow RFC 2047

(There's some argument on the last one, where R. David Murray doesn't want
any header that might not conform to the RFCs to be decoded, and I want any
header that might conform to be decoded -- I cite Postel's law in another
issue, and I think it applies here as well.  A full header parser and
Header implementation would solve the problem properly, but only for Python
3.2 or later.)
--
____________________________________________________________________
TonyN.:' '

From tonynelson at georgeanelson.com Sun Apr 5 22:01:26 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Sun, 5 Apr 2009 16:01:26 -0400
Subject: [Email-SIG] Email 6.0
In-Reply-To: <74AB269B-B7E8-4706-B066-E2AA662EF3DB@python.org>
References: <74AB269B-B7E8-4706-B066-E2AA662EF3DB@python.org>
Message-ID:

At 13:26 -0400 04/05/2009, Barry Warsaw wrote:
>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>I've started a branch for the email package version 6.0.0. Given that
>we have until May 2nd to solidify this thing for Python 3.1, I
>honestly don't think we'll make it. I would rather concentrate on
>getting this right, and usable as a standalone package, then work
>toward getting the new version into Python 3.2 and backported to 2.7.

Let's also fix some existing bugs, for 2.6, 2.7, and possibly 3.1 if it can
get healthy enough to use.

>I'm working on a branch in Bazaar, at lp:~barry/python/email6
>
>% bzr branch lp:~barry/python/email6

I'm not able to check out that branch:

    bzr ERROR: Not a branch: "bzr+ssh"//bazaar.lanchpad.net//~barry/python/email6/".

Probably it is because it has not been pushed.

>This is a branch of the Py3k trunk. I'm starting by refactoring the
>huge test_email.py file into smaller separate tests, then fixing thing
>as I go. After the tests are working I plan on starting to fix the
>API and other problems we've talked about. For now, let's coordinate
>on this branch. IOW, if you'd like to contribute (and I hope you do!)
>please branch the above and let us know about it here. I'll keep the
>above branch as (for now) the master copy.

As I'm new to bzr and launchpad, I'm not sure what all that means.  Does it
mean that I should create a branch at my own launchpad account, based on a
checkout of lp:~barry/python/email6?
--
____________________________________________________________________
TonyN.:' '

From stephen at xemacs.org Tue Apr 7 07:22:19 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 07 Apr 2009 14:22:19 +0900
Subject: [Email-SIG] Plans for email 6.0
In-Reply-To:
References: <0DC79F9C-450F-484F-BBB0-28B69EB879F9@python.org>
    <88F420F3-4CA1-4BE6-B696-3CDB066B314B@python.org>
Message-ID: <8763hht4vo.fsf@xemacs.org>

Tony Nelson writes:

 > In general, users of the email package must currently be familiar with all
 > the mail RFCs in order to properly use the package to create or manipulate
 > any but the simplest messages,

IMHO, that's a problem with the mail RFCs, not with the email
package.  Internet messaging is inherently complex because of the
backward and Microsoft compatibility requirements.

 > and having "[]" mean "append" isn't helping.

That's probably true, but that's because in Python mapping semantics
are invariably replace rather than append in this circumstance.  It
has nothing to do with the RFCs per se.

 > Your suggestion that header fields should always be represented as
 > Header objects is urgently needed. Those Header objects will need
 > to be smart about the header field they represent, and apply all
 > the various encodings etc. as necessary.

That's not a good idea.  Header methods should be strict about what
encodings are allowed, but all too often the decisions between
quoted-printable and base64 transfer encodings, and among various
possible text encodings (Japanese alone has 4 major ones in *daily*
use, with different ones typically used in the header and body! and
Chinese isn't much better) are dependent on content or receiver and/or
sender.  It's reasonable for email to have "recommendations", perhaps
implemented as defaults, for each situation, but programmers should be
reminded that the text they provide to the Header class etc is being
munged as it gets inserted into the message.

For simple situations, of course it makes sense to provide a
high-level interface, such as a string:contents dictionary for
headers.

    headers = {
        "From" : [("Stephen J. Turnbull", "stephen at xemacs.org")],
        "To" : [("Email SIG", "email-sig at python.org"),
                ("da FLUFL", "barry at python.org")],
        "Subject" : "Don't DO that!",
        "Summary" : "This could go on forever but doesn't."
        }
    body = """I just wanted you to know that I don't think it's a good idea.

    Just-yer-neighborhood-busybody-ly y'rs
    """
    # format_simple_message is the interface being proposed here,
    # not an existing function in the email package.
    ready_for_sendmail = email.format_simple_message(headers, body)

And that would be encoded in some lowest-common-denominator charset
like ASCII, ISO-8859-15, ISO-8859-1, or UTF-8 with the earliest
feasible one used, and some heuristic like minimum encoded size or
fraction of non-ASCII used to determine content-transfer-encoding.
But it should be implemented by .format_simple_message, not Header,
IMHO.
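
A rough sketch of what such a helper could look like, assuming the EmailMessage API found in current Python 3 (which postdates this thread). format_simple_message, pick_charset and the CHARSETS tuple are hypothetical names for illustration; the charset list is a shortened version of the one suggested above, and the choice of content-transfer-encoding is simply left to the library's own defaults.

```python
from email.message import EmailMessage
from email.utils import formataddr

# Candidate charsets, tried in order; the earliest one able to encode
# the text wins (a shortened, illustrative list).
CHARSETS = ("us-ascii", "iso-8859-1", "utf-8")


def pick_charset(text):
    """Return the first charset in CHARSETS that can encode `text`."""
    for charset in CHARSETS:
        try:
            text.encode(charset)
            return charset
        except UnicodeEncodeError:
            continue
    return "utf-8"


def format_simple_message(headers, body):
    """Hypothetical high-level helper: dict of headers + text body -> bytes."""
    msg = EmailMessage()
    for name, value in headers.items():
        # Address headers are given as lists of (real name, address) pairs.
        if isinstance(value, list):
            value = ", ".join(formataddr(pair) for pair in value)
        msg[name] = value
    # The transfer encoding is chosen by the library, not by this sketch.
    msg.set_content(body, charset=pick_charset(body))
    return msg.as_bytes()
```

The result is bytes rather than a string, on the view that what gets handed to sendmail is a wire-format message.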

From v+python at g.nevcal.com Tue Apr 7 07:44:22 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Mon, 06 Apr 2009 22:44:22 -0700
Subject: [Email-SIG] Plans for email 6.0
In-Reply-To: <8763hht4vo.fsf@xemacs.org>
References: <0DC79F9C-450F-484F-BBB0-28B69EB879F9@python.org>
    <88F420F3-4CA1-4BE6-B696-3CDB066B314B@python.org>
    <8763hht4vo.fsf@xemacs.org>
Message-ID: <49DAE836.3030107@g.nevcal.com>

On approximately 4/6/2009 10:22 PM, came the following characters from
the keyboard of Stephen J. Turnbull:
> IMHO, that's a problem with the mail RFCs, not with the email
> package. Internet messaging is inherently complex because of the
> backward and Microsoft compatibility requirements.

I agree that Internet messaging, particularly some of the character
encodings, is inherently complex due to backward compatibility
requirements.

I'm not surprised that you mention Microsoft issues, as I've found quite
a few cases of messages from Microsoft email clients that do not conform
to the RFCs.  Apple Mail violates a number of them, also, especially
with MIME constructions.  But I've never attempted to track the
Microsoft violations of the RFCs... do you have or know of a list of
such?

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

From stephen at xemacs.org Tue Apr 7 13:42:44 2009
From: stephen at xemacs.org (Stephen J.
Turnbull) Date: Tue, 07 Apr 2009 20:42:44 +0900 Subject: [Email-SIG] Plans for email 6.0 In-Reply-To: <49DAE836.3030107@g.nevcal.com> References: <0DC79F9C-450F-484F-BBB0-28B69EB879F9@python.org> <88F420F3-4CA1-4BE6-B696-3CDB066B314B@python.org> <8763hht4vo.fsf@xemacs.org> <49DAE836.3030107@g.nevcal.com> Message-ID: <87vdpgsn9n.fsf@xemacs.org> Glenn Linderman writes: > I'm not surprised that you mention Microsoft issues, as I've found > quite a few cases of messages from Microsoft email clients that do > not conform to the RFCs. Apple Mail violates a number of them, > also, especially with MIME constructions. But I've never attempted > to track the Microsoft violations of the RFCs... do you have or > know of a list of such? No, I don't. For me it's not been worth keeping one, but if email is going to be the world-beating email library, it might be worth keeping one. I mean, just how many people would fall in love with Mailman if there were a "select your broken MUA here" in the personal user's page, and selecting actually got you a personalized message that didn't display Sender in the From field in Outlook Express? :-) From tonynelson at georgeanelson.com Thu Apr 9 17:05:38 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Thu, 9 Apr 2009 11:05:38 -0400 Subject: [Email-SIG] email package Bytes vs Unicode (was Re: [Python-Dev] Dropping bytes "support" in json) In-Reply-To: References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> Message-ID: (email-sig added) At 08:07 -0400 04/09/2009, Steve Holden wrote: >Barry Warsaw wrote: ... >> This is an interesting question, and something I'm struggling with for >> the email package for 3.x. It turns out to be pretty convenient to have >> both a bytes and a string API, both for input and output, but I think >> email really wants to be represented internally as bytes. Maybe. Or >> maybe just for content bodies and not headers, or maybe both. 
Anyway,
>> aside from that decision, I haven't come up with an elegant way to allow
>> /output/ in both bytes and strings (input is I think theoretically
>> easier by sniffing the arguments).
>>
>The real problem I came across in storing email in a relational database
>was the inability to store messages as Unicode. Some messages have a
>body in one encoding and an attachment in another, so the only ways to
>store the messages are either as a monolithic bytes string that gets
>parsed when the individual components are required or as a sequence of
>components in the database's preferred encoding (if you want to keep the
>original encoding most relational databases won't be able to help unless
>you store the components as bytes).
 ...

I found it confusing myself, and did it wrong for a while.  Now, I
understand that messages come over the wire as bytes, either 7-bit US-ASCII
or 8-bit whatever, and are parsed at the receiver.  I think of the database
as a wire to the future, and store the data as bytes (a BLOB), letting the
future receiver parse them as it did the first time, when I cleaned the
message.  Data I care to query is extracted into fields (in UTF-8, what I
usually use for char fields).  I have no need to store messages as Unicode,
and they aren't Unicode anyway.  I have no need ever to flatten a message
to Unicode, only to US-ASCII or, for messages (spam) that are corrupt, raw
8-bit data.

If you need the data from the message, by all means extract it and store it
in whatever form is useful to the purpose of the database.  If you need the
entire message, store it intact in the database, as the bytes it is.  Email
isn't Unicode any more than a JPEG or other image types (often payloads in
a message) are Unicode.
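
A small sketch of that arrangement, assuming SQLite as the database and the policy-based parser from current Python 3 (neither is what was actually used in the setup described above). The table layout and the store_message/load_message names are made up for illustration: the raw message goes into a BLOB column untouched, and only the field to be queried is decoded to text.

```python
import sqlite3
from email import message_from_bytes, policy


def store_message(conn, raw):
    """Store the message bytes intact, plus a decoded Subject for querying."""
    msg = message_from_bytes(raw, policy=policy.default)
    # With policy.default, RFC 2047 encoded words come back already decoded.
    subject = str(msg["Subject"] or "")
    cur = conn.execute(
        "INSERT INTO mail (subject, raw) VALUES (?, ?)", (subject, raw)
    )
    return cur.lastrowid


def load_message(conn, rowid):
    """Re-parse the stored bytes, as the original receiver did."""
    (raw,) = conn.execute(
        "SELECT raw FROM mail WHERE rowid = ?", (rowid,)
    ).fetchone()
    return message_from_bytes(raw, policy=policy.default)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mail (subject TEXT, raw BLOB)")
```

Because the bytes are never re-encoded, a corrupt or oddly encoded message comes back out exactly as it went in.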
-- ____________________________________________________________________ TonyN.:' ' From barry at python.org Fri Apr 10 04:26:22 2009 From: barry at python.org (Barry Warsaw) Date: Thu, 9 Apr 2009 22:26:22 -0400 Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json In-Reply-To: References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> Message-ID: <1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org> On Apr 9, 2009, at 8:07 AM, Steve Holden wrote: > The real problem I came across in storing email in a relational > database > was the inability to store messages as Unicode. Some messages have a > body in one encoding and an attachment in another, so the only ways to > store the messages are either as a monolithic bytes string that gets > parsed when the individual components are required or as a sequence of > components in the database's preferred encoding (if you want to keep > the > original encoding most relational databases won't be able to help > unless > you store the components as bytes). > > All in all, as you might expect from a system that's been growing up > since 1970 or so, it can be quite intractable. There are really two ways to look at an email message. It's either an unstructured blob of bytes, or it's a structured tree of objects. Those objects have headers and payload. The payload can be of any type, though I think it generally breaks down into "strings" for text/ * types and bytes for anything else (not counting multiparts). The email package isn't a perfect mapping to this, which is something I want to improve. That aside, I think storing a message in a database means storing some or all of the headers separately from the byte stream (or text?) of its payload. That's for non-multipart types. It would be more complicated to represent a message tree of course. It does seem to make sense to think about headers as text header names and text header values. 
Of course, header values can contain almost anything and there's an encoding to bring it back to 7-bit ASCII, but again, you really have two views of a header value. Which you want really depends on your application. Maybe you just care about the text of both the header name and value. In that case, I think you want the values as unicodes, and probably the headers as unicodes containing only ASCII. So your table would be strings in both cases. OTOH, maybe your application cares about the raw underlying encoded data, in which case the header names are probably still strings of ASCII-ish unicodes and the values are bytes. It's this distinction (and I think the competing use cases) that make a true Python 3.x API for email more complicated. Thinking about this stuff makes me nostalgic for the sloppy happy days of Python 2.x -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 304 bytes Desc: This is a digitally signed message part URL: From barry at python.org Fri Apr 10 04:38:11 2009 From: barry at python.org (Barry Warsaw) Date: Thu, 9 Apr 2009 22:38:11 -0400 Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json In-Reply-To: References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> Message-ID: <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> On Apr 9, 2009, at 11:55 AM, Daniel Stutzbach wrote: > On Thu, Apr 9, 2009 at 6:01 AM, Barry Warsaw wrote: > Anyway, aside from that decision, I haven't come up with an elegant > way to allow /output/ in both bytes and strings (input is I think > theoretically easier by sniffing the arguments). > > Won't this work? (assuming dumps() always returns a string) > > def dumpb(obj, encoding='utf-8', *args, **kw): > s = dumps(obj, *args, **kw) > return s.encode(encoding) So, what I'm really asking is this. Let's say you agree that there are use cases for accessing a header value as either the raw encoded bytes or the decoded unicode. 
What should this return: >>> message['Subject'] The raw bytes or the decoded unicode? Okay, so you've picked one. Now how do you spell the other way? The Message class probably has these explicit methods: >>> Message.get_header_bytes('Subject') >>> Message.get_header_string('Subject') (or better names... it's late and I'm tired ;). One of those maps to message['Subject'] but which is the more obvious choice? Now, setting headers. Sometimes you have some unicode thing and sometimes you have some bytes. You need to end up with bytes in the ASCII range and you'd like to leave the header value unencoded if so. But in both cases, you might have bytes or characters outside that range, so you need an explicit encoding, defaulting to utf-8 probably. >>> Message.set_header('Subject', 'Some text', encoding='utf-8') >>> Message.set_header('Subject', b'Some bytes') One of those maps to >>> message['Subject'] = ??? I'm open to any suggestions here! -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 304 bytes Desc: This is a digitally signed message part URL: From barry at python.org Fri Apr 10 04:40:30 2009 From: barry at python.org (Barry Warsaw) Date: Thu, 9 Apr 2009 22:40:30 -0400 Subject: [Email-SIG] [Python-Dev] email package Bytes vs Unicode (was Re: Dropping bytes "support" in json) In-Reply-To: References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> Message-ID: <657BFEEA-04E3-418F-86C0-D2F80C75DB96@python.org> On Apr 9, 2009, at 12:20 PM, Steve Holden wrote: > PostgreSQL strongly encourages you to store text as encoded columns. > Because emails lack an encoding it turns out this is a most > inconvenient > storage type for it. Sadly BLOBs are such a pain in PostgreSQL that > it's > easier to store the messages in external files and just use the > relational database to index those files to retrieve content, so > that's > what I ended up doing. That's not insane for other reasons. 
Do you really want to store 10MB of mp3 data in your database? Which of course reminds me that I want to add an interface, probably to the parser and message class, to allow an application to store message payloads in other than memory. Parsing and holding onto messages with huge payloads can kill some applications, when you might not care too much about the actual payload content. Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 304 bytes Desc: This is a digitally signed message part URL: From barry at python.org Fri Apr 10 05:03:35 2009 From: barry at python.org (Barry Warsaw) Date: Thu, 9 Apr 2009 23:03:35 -0400 Subject: [Email-SIG] [Python-Dev] the email module, text, and bytes (was Re: Dropping bytes "support" in json) In-Reply-To: <20090410031151.12555.724184150.divmod.xquotient.7482@weber.divmod.com> References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org> <20090410031151.12555.724184150.divmod.xquotient.7482@weber.divmod.com> Message-ID: On Apr 9, 2009, at 11:11 PM, glyph at divmod.com wrote: > I think this is a problematic way to model bytes vs. text; it gives > text a special relationship to bytes which should be avoided. > > IMHO the right way to think about domains like this is a multi-level > representation. The "low level" representation is always bytes, > whether your MIME type is text/whatever or application/x-i-dont-know. This is a really good point, and I really should be clearer when describing my current thinking (sleep would help :). > The thing that's "special" about text is that it's a "high level" > representation that the standard library can know about. But the > 'email' package ought to support being extended to support other > types just as well. For example, I want to ask for image/png > content as PIL.Image objects, not bags of bytes. 
Of course this > presupposes some way for PIL itself to get at some bytes, but then > you need the email module itself to get at the bytes to convert to > text in much the same way. There also needs to be layering at the > level of bytes->base64->some different bytes->PIL->Image. There are > mail clients that will base64-encode unusual encodings so you have > to do that same layering for text sometimes. > > I'm also being somewhat handwavy with talk of "low" and "high" level > representations; of course there are actually multiple levels beyond > that. I might want text/x-python content to show up as an AST, but > the intermediate DOM-parsing representation really wants to operate > on characters. Similarly for a DOM and text/html content. (Modulo > the usual encoding-detection weirdness present in parsers.) When I was talking about supporting text/* content types as strings, I was definitely thinking about using basically the same plug-in or higher level or whatever API to do that as you might use to get PIL images from an image/gif. > So, as long as there's a crisp definition of what layer of the MIME > stack one is operating on, I don't think that there's really any > ambiguity at all about what type you should be getting. In that case, we really need the bytes-in-bytes-out-bytes-in-the-chewy- center API first, and build things on top of that. -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig
Type: application/pgp-signature
Size: 304 bytes
Desc: This is a digitally signed message part
URL:

From barry at python.org Fri Apr 10 05:05:37 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 9 Apr 2009 23:05:37 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <20090410025203.GA199@panix.com>
References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<20090410025203.GA199@panix.com>
Message-ID: <663162E3-D2EB-4417-93D0-4764BC94646C@python.org>

On Apr 9, 2009, at 10:52 PM, Aahz wrote:

> On Thu, Apr 09, 2009, Barry Warsaw wrote:
>>
>> So, what I'm really asking is this. Let's say you agree that there are
>> use cases for accessing a header value as either the raw encoded bytes or
>> the decoded unicode. What should this return:
>>
>> >>> message['Subject']
>>
>> The raw bytes or the decoded unicode?
>
> Let's make that the raw bytes by default -- we can add a parameter to
> Message() to specify that the default where possible is unicode for
> returned values, if that isn't too painful.

I don't know whether the parameter thing will work or not, but you're
probably right that we need to get the bytes-everywhere API first.

-Barry

-------------- next part --------------
A non-text attachment was scrubbed...
Name: PGP.sig
Type: application/pgp-signature
Size: 304 bytes
Desc: This is a digitally signed message part
URL:

From barry at python.org Fri Apr 10 05:23:40 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 9 Apr 2009 23:23:40 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <49DEBB21.70305@gmail.com>
References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<20090410025203.GA199@panix.com>
	<663162E3-D2EB-4417-93D0-4764BC94646C@python.org>
	<49DEBB21.70305@gmail.com>
Message-ID: <0047AD0A-7B5B-4703-96D6-BD26B9752E7D@python.org>

On Apr 9, 2009, at 11:21 PM, Nick Coghlan wrote:

> Barry Warsaw wrote:
>> I don't know whether the parameter thing will work or not, but you're
>> probably right that we need to get the bytes-everywhere API first.
>
> Given that json is a wire protocol, that sounds like the right approach
> for json as well. Once bytes-everywhere works, then a text API can be
> built on top of it, but it is difficult to build a bytes API on top of a
> text one.

Agreed!

> So I guess the IO library *is* the right model: bytes at the bottom of
> the stack, with text as a wrapper around it (mediated by codecs).

Yes, that's a very interesting (and proven?) model. I don't quite see
how we could apply that to email and json, but it seems like there's a
good idea there. ;)

-Barry

-------------- next part --------------
A non-text attachment was scrubbed...
Name: PGP.sig Type: application/pgp-signature Size: 304 bytes Desc: This is a digitally signed message part URL: From tonynelson at georgeanelson.com Fri Apr 10 05:41:58 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Thu, 9 Apr 2009 23:41:58 -0400 Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json In-Reply-To: <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> Message-ID: At 22:38 -0400 04/09/2009, Barry Warsaw wrote: ... >So, what I'm really asking is this. Let's say you agree that there >are use cases for accessing a header value as either the raw encoded >bytes or the decoded unicode. What should this return: > > >>> message['Subject'] > >The raw bytes or the decoded unicode? That's an easy one: Subject: is an unstructured header, so it must be text, thus Unicode. We're looking at a high-level representation of an email message, with parsed header fields and a MIME message tree. >Okay, so you've picked one. Now how do you spell the other way? message.get_header_bytes('Subject') Oh, I see that's what you picked. >The Message class probably has these explicit methods: > > >>> Message.get_header_bytes('Subject') > >>> Message.get_header_string('Subject') > >(or better names... it's late and I'm tired ;). One of those maps to >message['Subject'] but which is the more obvious choice? Structured header fields are more of a problem. Any header with addresses should return a list of addresses. I think the default return type should depend on the data type. To get an explicit bytes or string or list of addresses, be explicit; otherwise, for convenience, return the appropriate type for the particular header field name. >Now, setting headers. Sometimes you have some unicode thing and >sometimes you have some bytes. You need to end up with bytes in the >ASCII range and you'd like to leave the header value unencoded if so. 
>But in both cases, you might have bytes or characters outside that >range, so you need an explicit encoding, defaulting to utf-8 probably. Never for header fields. The default is always RFC 2047, unless it isn't, say for params. The Message class should create an object of the appropriate subclass of Header based on the name (or use the existing object, see other discussion), and that should inspect its argument and DTRT or complain. > > >>> Message.set_header('Subject', 'Some text', encoding='utf-8') > >>> Message.set_header('Subject', b'Some bytes') > >One of those maps to > > >>> message['Subject'] = ??? The expected data type should depend on the header field. For Subject:, it should be bytes to be parsed or verbatim text. For To:, it should be a list of addresses or bytes or text to be parsed. The email package should be pythonic, and not require deep understanding of dozens of RFCs to use properly. Users don't need to know about the raw bytes; that's the whole point of MIME and any email package. It should be easy to set header fields with their natural data types, and doing it with bad data should produce an error. This may require a bit more care in the message parser, to always produce a parsed message with defects. -- ____________________________________________________________________ TonyN.:' ' From tonynelson at georgeanelson.com Fri Apr 10 05:59:54 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Thu, 9 Apr 2009 23:59:54 -0400 Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json In-Reply-To: <1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org> References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org> Message-ID: At 22:26 -0400 04/09/2009, Barry Warsaw wrote: >There are really two ways to look at an email message. It's either an >unstructured blob of bytes, or it's a structured tree of objects. >Those objects have headers and payload. 
The payload can be of any >type, though I think it generally breaks down into "strings" for text/ >* types and bytes for anything else (not counting multiparts). > >The email package isn't a perfect mapping to this, which is something >I want to improve. That aside, I think storing a message in a >database means storing some or all of the headers separately from the >byte stream (or text?) of its payload. That's for non-multipart >types. It would be more complicated to represent a message tree of >course. Storing an email message in a database does mean storing some of the header fields as database fields, but the set of email header fields is open, so any "unused" fields in a message must be stored elsewhere. It isn't useful to just have a bag of name/value pairs in a table. General message MIME payload trees don't map well to a database either, unless one wants to get very relational. Sometimes the database needs to represent the entire email message, header fields and MIME tree, but only if it is an email program and usually not even then. Usually, the database has a specific purpose, and can be designed for the data it cares about; it may choose to keep the original message as bytes. >It does seem to make sense to think about headers as text header names >and text header values. Of course, header values can contain almost >anything and there's an encoding to bring it back to 7-bit ASCII, but >again, you really have two views of a header value. Which you want >really depends on your application. I think of header fields as having text-like names (the set of allowed characters is more than just text, though defined headers don't make use of that), but the data is either bytes or it should be parsed into something appropriate: text for unstructured fields like Subject:, a list of addresses for address fields like To:. Many of the structured header fields have a reasonable mapping to text; certainly this is true for adress header fields. 
Content-Type header fields are barely text, they can be so convolutedly structured, but I suppose one could flatten one of them to text instead of bytes if the user wanted. It's not very useful, though, except for debugging (either by the programmer or the recipient who wants to know what was cleaned from the message). >Maybe you just care about the text of both the header name and value. >In that case, I think you want the values as unicodes, and probably >the headers as unicodes containing only ASCII. So your table would be >strings in both cases. OTOH, maybe your application cares about the >raw underlying encoded data, in which case the header names are >probably still strings of ASCII-ish unicodes and the values are >bytes. It's this distinction (and I think the competing use cases) >that make a true Python 3.x API for email more complicated. If a database stores the Subject: header field, it would be as text. The various recipient address fields are a one message to many names and addresses mapping, and need a related table of name/address fields, with each field being text. The original message (or whatever part of it one preserves) should be bytes. I don't think this complicates the email package API; rather, it just shows where generality is needed. >Thinking about this stuff makes me nostalgic for the sloppy happy days >of Python 2.x You now have the opportunity to finally unsnarl that mess. It is not an insurmountable opportunity. -- ____________________________________________________________________ TonyN.:' ' From turnbull at sk.tsukuba.ac.jp Fri Apr 10 07:22:04 2009 From: turnbull at sk.tsukuba.ac.jp (Stephen J. 
Turnbull)
Date: Fri, 10 Apr 2009 14:22:04 +0900
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org>
References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org>
Message-ID: <87zlepf5hf.fsf@xemacs.org>

Barry Warsaw writes:

 > There are really two ways to look at an email message. It's either an
 > unstructured blob of bytes, or it's a structured tree of objects.

Indeed!

 > Those objects have headers and payload. The payload can be of any
 > type, though I think it generally breaks down into "strings" for text/*
 > types and bytes for anything else (not counting multiparts).

*sigh* Why are you back-tracking?

The payload should be of an appropriate *object* type. Atomic object
types will have their content stored as string or bytes [nb I use
Python 3 terminology throughout]. Composite types (multipart/*) won't
need string or bytes attributes AFAICS.

Start by implementing the application/octet-stream and
text/plain;charset=utf-8 object types, of course.

 > It does seem to make sense to think about headers as text header names
 > and text header values.

I disagree. IMHO, structured header types should have object values,
and something like

message['to'] = "Barry 'da FLUFL' Warsaw "

should be smart enough to detect that it's a string and attempt to
(flexibly) parse it into a fullname and a mailbox adding escapes, etc.
Whether these should be structured objects or they can be strings or
bytes, I'm not sure (probably bytes, not strings, though -- see next
example). OTOH

message['to'] = b'''"Barry 'da.FLUFL' Warsaw" '''

should assume that the client knows what they are doing, and should
parse it strictly (and I mean "be a real bastard", eg, raise an
exception on any non-ASCII octet), merely dividing it into fullname
and mailbox, and caching the bytes for later insertion in a
wire-format message.
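
The str-versus-bytes assignment behaviour described above could be
sketched roughly as follows. This is only an illustration, not part of
any proposal in the thread: the AddressHeader class, its from_value()
and as_text() methods, and the example.org address are all hypothetical
names made up here. Only email.utils.parseaddr and formataddr are real
standard library calls.

```python
from email.utils import formataddr, parseaddr


class AddressHeader:
    """Hypothetical structured header value: a full name plus a mailbox."""

    def __init__(self, fullname, mailbox, wire=None):
        self.fullname = fullname
        self.mailbox = mailbox
        # Verbatim bytes, cached only when the caller supplied bytes.
        self.wire = wire

    @classmethod
    def from_value(cls, value):
        if isinstance(value, bytes):
            # Strict path: any non-ASCII octet raises UnicodeDecodeError.
            text = value.decode("ascii")
            fullname, mailbox = parseaddr(text)
            if not mailbox:
                raise ValueError("no mailbox found in %r" % (value,))
            return cls(fullname, mailbox, wire=value)
        # Lenient path: tolerate stray whitespace around a str value.
        fullname, mailbox = parseaddr(value.strip())
        if not mailbox:
            raise ValueError("no mailbox found in %r" % (value,))
        return cls(fullname, mailbox)

    def as_text(self):
        # Re-serialise from the parsed parts, quoting where needed.
        return formataddr((self.fullname, self.mailbox))
```

With this sketch, AddressHeader.from_value("Barry Warsaw
<barry@example.org>") yields a parsed name and mailbox with no cached
bytes, while the bytes form keeps the original octets in .wire so they
could later be written back unchanged.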
> In that case, I think you want the values as unicodes, and probably > the headers as unicodes containing only ASCII. So your table would be > strings in both cases. OTOH, maybe your application cares about the > raw underlying encoded data, in which case the header names are > probably still strings of ASCII-ish unicodes and the values are > bytes. It's this distinction (and I think the competing use cases) > that make a true Python 3.x API for email more complicated. I don't see why you can't have the email API be specific, with message['to'] always returning a structured_header object (or maybe even more specifically an address_header object), and methods like message['to'].build_header_as_text() which returns """To: "Barry 'da.FLUFL' Warsaw" """ and message['to'].build_header_in_wire_format() which returns b"""To: "Barry 'da.FLUFL' Warsaw" """ Then have email.textview.Message and email.wireview.Message which provide a simple interface where message['to'] would invoke .build_header_as_text() and .build_header_in_wire_format() respectively. > Thinking about this stuff makes me nostalgic for the sloppy happy days > of Python 2.x Er, yeah. Nostalgic-for-the-BITNET-days-where-everything-was-Just-EBCDIC-ly y'rs, From janssen at parc.com Fri Apr 10 18:35:44 2009 From: janssen at parc.com (Bill Janssen) Date: Fri, 10 Apr 2009 09:35:44 PDT Subject: [Email-SIG] [Python-Dev] the email module, text, and bytes (was Re: Dropping bytes "support" in json) In-Reply-To: References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org> <20090410031151.12555.724184150.divmod.xquotient.7482@weber.divmod.com> Message-ID: <92023.1239381344@parc.com> Barry Warsaw wrote: > In that case, we really need the > bytes-in-bytes-out-bytes-in-the-chewy- > center API first, and build things on top of that. Yep. 
Bill From barry at python.org Fri Apr 10 18:56:09 2009 From: barry at python.org (Barry Warsaw) Date: Fri, 10 Apr 2009 12:56:09 -0400 Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json In-Reply-To: <20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com> References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com> Message-ID: On Apr 10, 2009, at 1:19 AM, glyph at divmod.com wrote: > On 02:38 am, barry at python.org wrote: >> So, what I'm really asking is this. Let's say you agree that there >> are use cases for accessing a header value as either the raw >> encoded bytes or the decoded unicode. What should this return: >> >> >>> message['Subject'] >> >> The raw bytes or the decoded unicode? > > My personal preference would be to just get deprecate this API, and > get rid of it, replacing it with a slightly more explicit one. > > message.headers['Subject'] > message.bytes_headers['Subject'] This is pretty darn clever Glyph. Stop that! :) I'm not 100% sure I like the name .bytes_headers or that .headers should be the decoded header (rather than have .headers return the bytes thingie and say .decoded_headers return the decoded thingies), but I do like the general approach. >> Now, setting headers. Sometimes you have some unicode thing and >> sometimes you have some bytes. You need to end up with bytes in >> the ASCII range and you'd like to leave the header value unencoded >> if so. But in both cases, you might have bytes or characters >> outside that range, so you need an explicit encoding, defaulting to >> utf-8 probably. > > message.headers['Subject'] = 'Some text' > > should be equivalent to > > message.headers['Subject'] = Header('Some text') Yes, absolutely. I think we're all in general agreement that header values should be instances of Header, or subclasses thereof. 
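
As a rough sketch of the two-view idea discussed so far: one bytes-level
mapping, plus a text-level view that always hands back Header instances
and wraps plain strings on assignment. The Message and TextHeaders
classes below are hypothetical, and storing RFC 2047 encoded bytes as
the underlying representation is an assumption of this sketch, not
something the thread has settled. Header, decode_header and make_header
are the existing email.header APIs.

```python
from email.header import Header, decode_header, make_header


class TextHeaders:
    """Hypothetical decoded view layered over a bytes-valued mapping."""

    def __init__(self, raw):
        self._raw = raw  # shared with Message.bytes_headers

    def __getitem__(self, name):
        # Decode any RFC 2047 encoded words into a Header object.
        encoded = self._raw[name].decode("ascii")
        return make_header(decode_header(encoded))

    def __setitem__(self, name, value):
        if not isinstance(value, Header):
            # 'Some text' is treated the same as Header('Some text').
            value = Header(value)
        # Header.encode() returns an ASCII-only str; store it as bytes.
        self._raw[name] = value.encode().encode("ascii")


class Message:
    """Hypothetical message exposing both views of the same headers."""

    def __init__(self):
        self.bytes_headers = {}
        self.headers = TextHeaders(self.bytes_headers)
```

Here message.headers['Subject'] = 'Some text' stores b'Some text' in
message.bytes_headers, and reading message.headers['Subject'] gives a
Header whose str() is the decoded text. Swapping which view gets the
shorter name, as suggested above, would not change the structure.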
> My preference would be that > > message.headers['Subject'] = b'Some Bytes' > > would simply raise an exception. If you've got some bytes, you > should instead do > > message.bytes_headers['Subject'] = b'Some Bytes' > > or > > message.headers['Subject'] = Header(bytes=b'Some Bytes', > encoding='utf-8') > > Explicit is better than implicit, right? Yes. Again, I really like the general idea, if I might quibble about some of the details. Thanks for a great suggestion. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 304 bytes Desc: This is a digitally signed message part URL: From barry at python.org Fri Apr 10 19:08:26 2009 From: barry at python.org (Barry Warsaw) Date: Fri, 10 Apr 2009 13:08:26 -0400 Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json In-Reply-To: References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> Message-ID: <595A42B2-0D3B-4886-960B-F16D50D0CC5A@python.org> On Apr 9, 2009, at 11:41 PM, Tony Nelson wrote: > At 22:38 -0400 04/09/2009, Barry Warsaw wrote: > ... >> So, what I'm really asking is this. Let's say you agree that there >> are use cases for accessing a header value as either the raw encoded >> bytes or the decoded unicode. What should this return: >> >>>>> message['Subject'] >> >> The raw bytes or the decoded unicode? > > That's an easy one: Subject: is an unstructured header, so it must be > text, thus Unicode. We're looking at a high-level representation of > an > email message, with parsed header fields and a MIME message tree. I'm liking Glyph's suggestion here. We'll probably have to support the message['Subject'] API for backward compatibility, but in that case it really should be a bytes API. >> (or better names... it's late and I'm tired ;). One of those maps to >> message['Subject'] but which is the more obvious choice? > > Structured header fields are more of a problem. 
Any header with addresses
> should return a list of addresses. I think the default return type should
> depend on the data type. To get an explicit bytes or string or list of
> addresses, be explicit; otherwise, for convenience, return the appropriate
> type for the particular header field name.

Yes, structured headers are trickier. In a separate message, James
Knight makes some excellent points, which I agree with. However the
email package obviously cannot support every type of structured header
possible. It must support this through extensibility.

The obvious way is through inheritance (i.e. subclasses of Header),
but in my experience, using inheritance of the Message class really
doesn't work very well. You need to pass around factories to parsing
functions and your application tends to have its own hierarchy of
subclasses for whatever extra things it needs. ISTM that subclassing
is simply not the right pattern to support extensibility in the
Message objects or Header objects. Yes, this leads me to think that
all the MIME* subclasses are essentially /wrong/.

Having said all that, the email package must support structured
headers. Look at the insanity which is the current folding whitespace
splitting and the impossibility of the current code to do the right
thing for say Subject headers and Received headers, and you begin to
see why it must be possible to extend this stuff.

>> Now, setting headers. Sometimes you have some unicode thing and
>> sometimes you have some bytes. You need to end up with bytes in the
>> ASCII range and you'd like to leave the header value unencoded if so.
>> But in both cases, you might have bytes or characters outside that
>> range, so you need an explicit encoding, defaulting to utf-8 probably.
>
> Never for header fields. The default is always RFC 2047, unless it isn't,
> say for params.
> > The Message class should create an object of the appropriate > subclass of > Header based on the name (or use the existing object, see other > discussion), and that should inspect its argument and DTRT or > complain. >>>>> Message.set_header('Subject', 'Some text', encoding='utf-8') >>>>> Message.set_header('Subject', b'Some bytes') >> >> One of those maps to >> >>>>> message['Subject'] = ??? > > The expected data type should depend on the header field. For > Subject:, it > should be bytes to be parsed or verbatim text. For To:, it should > be a > list of addresses or bytes or text to be parsed. At a higher level, yes. At the low level, it has to be bytes. > The email package should be pythonic, and not require deep > understanding of > dozens of RFCs to use properly. Users don't need to know about the > raw > bytes; that's the whole point of MIME and any email package. It > should be > easy to set header fields with their natural data types, and doing > it with > bad data should produce an error. This may require a bit more care > in the > message parser, to always produce a parsed message with defects. I agree that we should have some higher level APIs that make it easy to compose email messages, and probably easy-ish to parse a byte stream into an email message tree. But we can't build those without the lower level raw support. I'm also convinced that this lower level will be the domain of those crazy enough to have the RFCs tattooed to the back of their eyelids. -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig
Type: application/pgp-signature
Size: 304 bytes
Desc: This is a digitally signed message part
URL:

From barry at python.org Fri Apr 10 19:12:48 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 10 Apr 2009 13:12:48 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To:
References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org>
Message-ID: <50EC006F-CF96-45F4-AD71-73B9DE7E510E@python.org>

On Apr 9, 2009, at 11:59 PM, Tony Nelson wrote:

>> Thinking about this stuff makes me nostalgic for the sloppy happy days
>> of Python 2.x
>
> You now have the opportunity to finally unsnarl that mess. It is not an
> insurmountable opportunity.

No, it's just a full time job. Now where did I put that
hack-drink-coffee-twitter clone?

-Barry

-------------- next part --------------
A non-text attachment was scrubbed...
Name: PGP.sig
Type: application/pgp-signature
Size: 304 bytes
Desc: This is a digitally signed message part
URL:

From barry at python.org Fri Apr 10 19:21:45 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 10 Apr 2009 13:21:45 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <87zlepf5hf.fsf@xemacs.org>
References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org>
	<87zlepf5hf.fsf@xemacs.org>
Message-ID: <67879F1D-B386-4B9B-8203-86DB977BD7FF@python.org>

On Apr 10, 2009, at 1:22 AM, Stephen J. Turnbull wrote:

>> Those objects have headers and payload. The payload can be of any
>> type, though I think it generally breaks down into "strings" for text/*
>> types and bytes for anything else (not counting multiparts).
>
> *sigh* Why are you back-tracking?

I'm not. Sleep deprivation makes it seem like that.

> The payload should be of an appropriate *object* type.
Atomic object > types will have their content stored as string or bytes [nb I use > Python 3 terminology throughout]. Composite types (multipart/*) won't > need string or bytes attributes AFAICS. Yes, agreed. > Start by implementing the application/octet-stream and > text/plain;charset=utf-8 object types, of course. Yes. See my lament about using inheritance for this. >> It does seem to make sense to think about headers as text header >> names >> and text header values. > > I disagree. IMHO, structured header types should have object values, > and something like While I agree, there's still a need for a higher level API that make it easy to do the simple things. > message['to'] = "Barry 'da FLUFL' Warsaw " > > should be smart enough to detect that it's a string and attempt to > (flexibly) parse it into a fullname and a mailbox adding escapes, etc. > Whether these should be structured objects or they can be strings or > bytes, I'm not sure (probably bytes, not strings, though -- see next > exampl). OTOH > > message['to'] = b'''"Barry 'da.FLUFL' Warsaw" ''' > > should assume that the client knows what they are doing, and should > parse it strictly (and I mean "be a real bastard", eg, raise an > exception on any non-ASCII octet), merely dividing it into fullname > and mailbox, and caching the bytes for later insertion in a > wire-format message. I agree that the Message class needs to be strict. A parser needs to be lenient; see the .defects attribute introduced in the current email package. Oh, and this reminds me that we still haven't talked about idempotency. That's an important principle in the current email package, but do we need to give up on that? >> In that case, I think you want the values as unicodes, and probably >> the headers as unicodes containing only ASCII. So your table would >> be >> strings in both cases. 
OTOH, maybe your application cares about the >> raw underlying encoded data, in which case the header names are >> probably still strings of ASCII-ish unicodes and the values are >> bytes. It's this distinction (and I think the competing use cases) >> that make a true Python 3.x API for email more complicated. > > I don't see why you can't have the email API be specific, with > message['to'] always returning a structured_header object (or maybe > even more specifically an address_header object), and methods like > > message['to'].build_header_as_text() > > which returns > > """To: "Barry 'da.FLUFL' Warsaw" """ > > and > > message['to'].build_header_in_wire_format() > > which returns > > b"""To: "Barry 'da.FLUFL' Warsaw" """ > > Then have email.textview.Message and email.wireview.Message which > provide a simple interface where message['to'] would invoke > .build_header_as_text() and .build_header_in_wire_format() > respectively. This seems similar to Glyph's basic idea, but with a different spelling. >> Thinking about this stuff makes me nostalgic for the sloppy happy >> days >> of Python 2.x > > Er, yeah. > > Nostalgic-for-the-BITNET-days-where-everything-was-Just-EBCDIC-ly > y'rs, Can I have my uucp address back now? -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig Type: application/pgp-signature Size: 304 bytes Desc: This is a digitally signed message part URL: From v+python at g.nevcal.com Fri Apr 10 20:00:54 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Fri, 10 Apr 2009 11:00:54 -0700 Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json In-Reply-To: References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com> Message-ID: <49DF8956.5050501@g.nevcal.com> On approximately 4/10/2009 9:56 AM, came the following characters from the keyboard of Barry Warsaw: > On Apr 10, 2009, at 1:19 AM, glyph at divmod.com wrote: >> On 02:38 am, barry at python.org wrote: >>> So, what I'm really asking is this. Let's say you agree that there >>> are use cases for accessing a header value as either the raw encoded >>> bytes or the decoded unicode. What should this return: >>> >>> >>> message['Subject'] >>> >>> The raw bytes or the decoded unicode? >> >> My personal preference would be to just get deprecate this API, and >> get rid of it, replacing it with a slightly more explicit one. >> >> message.headers['Subject'] >> message.bytes_headers['Subject'] > > This is pretty darn clever Glyph. Stop that! :) > > I'm not 100% sure I like the name .bytes_headers or that .headers > should be the decoded header (rather than have .headers return the > bytes thingie and say .decoded_headers return the decoded thingies), > but I do like the general approach. If one name has to be longer than the other, it should be the bytes version. Real user code is more likely to want to use the text version, and hopefully there will be more of that type of code than implementations using bytes. Of course, one could use message.header and message.bythdr and they'd be the same length. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. 
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

From barry at python.org Fri Apr 10 20:55:23 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 10 Apr 2009 14:55:23 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <49DF8956.5050501@g.nevcal.com>
References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>
	<49DF8956.5050501@g.nevcal.com>
Message-ID: <71E1EA03-6E24-4A28-A47A-4EA2D501CC6D@python.org>

On Apr 10, 2009, at 2:00 PM, Glenn Linderman wrote:

> If one name has to be longer than the other, it should be the bytes
> version. Real user code is more likely to want to use the text
> version, and hopefully there will be more of that type of code than
> implementations using bytes.

I'm not sure we know that yet, actually. Nothing written for Python 2
counts, and email is too broken in 3 for any sane person to be writing
such code for Python 3.

> Of course, one could use message.header and message.bythdr and
> they'd be the same length.

I was trying to figure out what a 'thdr' was that we'd want to index
'by' it. :)

-Barry

-------------- next part --------------
A non-text attachment was scrubbed...
Name: PGP.sig
Type: application/pgp-signature
Size: 304 bytes
Desc: This is a digitally signed message part
URL:

From barry at python.org Fri Apr 10 20:55:56 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 10 Apr 2009 14:55:56 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <49DF8A95.4010700@voidspace.org.uk>
References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>
	<49DF8956.5050501@g.nevcal.com>
	<49DF8A95.4010700@voidspace.org.uk>
Message-ID:

On Apr 10, 2009, at 2:06 PM, Michael Foord wrote:

> Shouldn't headers always be text?

/me weeps

-------------- next part --------------
A non-text attachment was scrubbed...
Name: PGP.sig
Type: application/pgp-signature
Size: 304 bytes
Desc: This is a digitally signed message part
URL:

From stephen at xemacs.org Fri Apr 10 21:04:22 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sat, 11 Apr 2009 04:04:22 +0900
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <67879F1D-B386-4B9B-8203-86DB977BD7FF@python.org>
References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org>
	<87zlepf5hf.fsf@xemacs.org>
	<67879F1D-B386-4B9B-8203-86DB977BD7FF@python.org>
Message-ID: <87prfkfhzd.fsf@xemacs.org>

Shouldn't this thread move lock stock and .signature to email-sig?

Barry Warsaw writes:

 > >> It does seem to make sense to think about headers as text header
 > >> names and text header values.
 > >
 > > I disagree. IMHO, structured header types should have object values,
 > > and something like
 >
 > While I agree, there's still a need for a higher level API that makes
 > it easy to do the simple things.

Sure. I'm suggesting that the way to determine whether something is
simple or not is by whether it falls out naturally from correct
structure.
Ie, no operations that only a Cirque du Soleil juggler can perform are allowed. > I agree that the Message class needs to be strict. A parser needs to > be lenient; Not always. The Postel Principle only applies to stuph coming in off the wire. But we're *also* going to be parsing pseudo-email components that are being handed to us by applications (eg, the perennial control-character-in-the-unremovable-address Mailman bug). Our parser should Just Say No to that crap. > see the .defects attribute introduced in the current email > package. Oh, and this reminds me that we still haven't talked about > idempotency. That's an important principle in the current email > package, but do we need to give up on that? "Idempotency"? I'm not sure what that means in the context of the email package ... multiplication by zero? Do you mean that .parse().to_wire() should be idempotent? Yes, I think that's a good idea, and it shouldn't be too hard to implement by (optionally?) caching the whole original message or individual components (headers with all whitespace including folding cached verbatim, etc). I think caching has to be done, since stuff like "did the original fold with a leading tab or a leading space, and at what column" and so on seems kind of pointless to encode as attributes on Header objects. [Description of MessageTextView and MessageWireView elided.] > This seems similar to Glyph's basic idea, but with a different spelling. Yes. I don't much care which way it's done, and Glyph's style of spelling is more explicit. But I was thinking in terms of the number of people who are surely going to sing "Mama don' 'low no Unicodes roun' here" and squeal "codec WTF?! outta mah face, man!" From stephen at xemacs.org Fri Apr 10 21:06:59 2009 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Sat, 11 Apr 2009 04:06:59 +0900 Subject: [Email-SIG] [Python-Dev] the email module, text, and bytes (was Re: Dropping bytes "support" in json) In-Reply-To: <92023.1239381344@parc.com> References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org> <20090410031151.12555.724184150.divmod.xquotient.7482@weber.divmod.com> <92023.1239381344@parc.com> Message-ID: <87ocv4fhv0.fsf@xemacs.org> Bill Janssen writes: > Barry Warsaw wrote: > > > In that case, we really need the > > bytes-in-bytes-out-bytes-in-the-chewy- > > center API first, and build things on top of that. > > Yep. Uh, I hate to rain on a parade, but isn't that how we arrived at the *current* email package? From barry at python.org Fri Apr 10 21:04:01 2009 From: barry at python.org (Barry Warsaw) Date: Fri, 10 Apr 2009 15:04:01 -0400 Subject: [Email-SIG] [Python-Dev] the email module, text, and bytes (was Re: Dropping bytes "support" in json) In-Reply-To: <87ocv4fhv0.fsf@xemacs.org> References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org> <20090410031151.12555.724184150.divmod.xquotient.7482@weber.divmod.com> <92023.1239381344@parc.com> <87ocv4fhv0.fsf@xemacs.org> Message-ID: On Apr 10, 2009, at 3:06 PM, Stephen J. Turnbull wrote: > Bill Janssen writes: >> Barry Warsaw wrote: >> >>> In that case, we really need the >>> bytes-in-bytes-out-bytes-in-the-chewy- >>> center API first, and build things on top of that. >> >> Yep. > > Uh, I hate to rain on a parade, but isn't that how we arrived at the > *current* email package? Not really. We got here because we were too damn sloppy about the distinction. I'm going to remove python-dev from subsequent follow ups. Please join us at email-sig for further discussion. Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig Type: application/pgp-signature Size: 304 bytes Desc: This is a digitally signed message part URL: From mark at msapiro.net Fri Apr 10 21:34:41 2009 From: mark at msapiro.net (Mark Sapiro) Date: Fri, 10 Apr 2009 12:34:41 -0700 Subject: [Email-SIG] Dropping bytes "support" in json In-Reply-To: <87prfkfhzd.fsf@xemacs.org> Message-ID: Stephen J. Turnbull wrote: >Shouldn't this thread move lock stock and .signature to email-sig? I'm doing my part :) >"Idempotency"? I'm not sure what that means in the context of the >email package ... multiplication by zero? Do you mean that >.parse().to_wire() should be idempotent? Yes, I think that's a good >idea, and it shouldn't be too hard to implement by (optionally?) >caching the whole original message or individual components (headers >with all whitespace including folding cached verbatim, etc). I think >caching has to be done, since stuff like "did the original fold with a >leading tab or a leading space, and at what column" and so on seems >kind of pointless to encode as attributes on Header objects. My response here is probably OT, but RFC 822 is the only RFC that talks about folding by *inserting* whitespace. Both RFC 2822 and RFC 5322 say folding is done by inserting <CRLF> ahead of *existing* whitespace and unfolding is done by removing the <CRLF> (only). Thus, the question of whether folding was with <CRLF><SP> or <CRLF><TAB> should not arise. Of course, in terms of trying to reconstruct the original on_the_wire message exactly, the question of where the folding occurred is still relevant, but if we're doing the right thing, the question of what character should follow the <CRLF> is not. -- Mark Sapiro The highway is for gamblers, San Francisco Bay Area, California better use your sense - B.
Dylan From barry at python.org Fri Apr 10 21:39:37 2009 From: barry at python.org (Barry Warsaw) Date: Fri, 10 Apr 2009 15:39:37 -0400 Subject: [Email-SIG] Dropping bytes "support" in json In-Reply-To: References: Message-ID: On Apr 10, 2009, at 3:34 PM, Mark Sapiro wrote: > My response here is probably OT, but RFC 822 is the only RFC that talks > about folding by *inserting* whitespace. Both RFC 2822 and RFC 5322 > say folding is done by inserting <CRLF> ahead of *existing* whitespace > and unfolding is done by removing the <CRLF> (only). Thus, the > question of whether folding was with <CRLF><SP> or <CRLF><TAB> should not > arise. > > Of course, in terms of trying to reconstruct the original on_the_wire > message exactly, the question of where the folding occurred is still > relevant, but if we're doing the right thing, the question of what > character should follow the <CRLF> is not. +1 I /think/ the email package in Python 3.0 DTRT here, or well, at least does better than the one in 2.6. Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 304 bytes Desc: This is a digitally signed message part URL: From barry at python.org Fri Apr 10 23:57:17 2009 From: barry at python.org (Barry Warsaw) Date: Fri, 10 Apr 2009 17:57:17 -0400 Subject: [Email-SIG] Append behavior of __setitem__ Message-ID: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org> So I'm just starting to read RFC 5322 and I'm starting by skimming over Appendix A (differences between RFC 5322 and 2822). I see this: 26. No multiple occurrences of fields (except resent and received).* Which I find very interesting, and possibly relevant to the discussion about changing the semantics of Message.__setitem__() to not append to the list of headers, as well as some of the other semantics of message headers (e.g. get_all()). thinking-out-loud-ly y'rs, -Barry -------------- next part -------------- A non-text attachment was scrubbed...
Name: PGP.sig Type: application/pgp-signature Size: 304 bytes Desc: This is a digitally signed message part URL: From tonynelson at georgeanelson.com Sat Apr 11 00:39:02 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Fri, 10 Apr 2009 18:39:02 -0400 Subject: [Email-SIG] Append behavior of __setitem__ In-Reply-To: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org> References: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org> Message-ID: At 17:57 -0400 04/10/2009, Barry Warsaw wrote: ... >So I'm just starting to read RFC 5322 and I'm starting by skimming >over Appendix A (differences between RFC 5322 and 2822). Oh, bother! >I see this: > >26. No multiple occurrences of fields (except resent and received).* > >Which i find very interesting, and possibly relevant to the discussion >about changing the semantics of Message.__setitem__() to not append to >the list of headers, as well as some of the other semantics of message >headers (e.g. get_all()). Thank you for mentioning this. Darn it. -- ____________________________________________________________________ TonyN.:' ' From tonynelson at georgeanelson.com Sat Apr 11 00:46:44 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Fri, 10 Apr 2009 18:46:44 -0400 Subject: [Email-SIG] Append behavior of __setitem__ In-Reply-To: References: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org> Message-ID: (Fired too fast.) At 18:39 -0400 04/10/2009, Tony Nelson wrote: >At 17:57 -0400 04/10/2009, Barry Warsaw wrote: > ... >>So I'm just starting to read RFC 5322 and I'm starting by skimming >>over Appendix A (differences between RFC 5322 and 2822). > >Oh, bother! Appendix B? ... >Thank you for mentioning this. Darn it. I note that there is also RFC 5321, "Simple Mail Transfer Protocol", which obsoletes RFC 2821 and updates RFC 1123, "Registration of Mail and MIME Header Fields". 
-- ____________________________________________________________________ TonyN.:' ' From barry at python.org Sat Apr 11 00:55:34 2009 From: barry at python.org (Barry Warsaw) Date: Fri, 10 Apr 2009 18:55:34 -0400 Subject: [Email-SIG] Append behavior of __setitem__ In-Reply-To: References: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org> Message-ID: On Apr 10, 2009, at 6:46 PM, Tony Nelson wrote: > (Fired too fast.) > > At 18:39 -0400 04/10/2009, Tony Nelson wrote: >> At 17:57 -0400 04/10/2009, Barry Warsaw wrote: >> ... >>> So I'm just starting to read RFC 5322 and I'm starting by skimming >>> over Appendix A (differences between RFC 5322 and 2822). >> >> Oh, bother! > > Appendix B? Oops, yep! > ... >> Thank you for mentioning this. Darn it. > > I note that there is also RFC 5321, "Simple Mail Transfer Protocol", > which > obsoletes RFC 2821 and updates RFC 1123, "Registration of Mail and > MIME Yeah. We'll let the smtplib.py people worry about that one . -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 304 bytes Desc: This is a digitally signed message part URL: From stephen at xemacs.org Sat Apr 11 09:43:56 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sat, 11 Apr 2009 16:43:56 +0900 Subject: [Email-SIG] Append behavior of __setitem__ In-Reply-To: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org> References: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org> Message-ID: <87ocv3tz2b.fsf@xemacs.org> Barry Warsaw writes: > So I'm just starting to read RFC 5322 and I'm starting by skimming > over Appendix A (differences between RFC 5322 and 2822). I know Barry's a big supporter of the Postel Principle. As a guideline[1], how far back should we be lenient? RFC 822 (no leading "2" ;-)? 
Footnotes: [1] Presumably over time we'll accrete definitely non-conforming practices that we need to accept and do something sane with (eg, we can't just raise ArmageddonException because we get a header with 8-bit characters in it). But I think we also should have a plan for formerly acceptable syntax that has been restricted in more recent RFCs, etc. From tonynelson at georgeanelson.com Sat Apr 11 23:17:13 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Sat, 11 Apr 2009 17:17:13 -0400 Subject: [Email-SIG] Append behavior of __setitem__ In-Reply-To: <87ocv3tz2b.fsf@xemacs.org> References: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org> <87ocv3tz2b.fsf@xemacs.org> Message-ID: At 16:43 +0900 04/11/2009, Stephen J. Turnbull wrote: >Barry Warsaw writes: > > > So I'm just starting to read RFC 5322 and I'm starting by skimming > > over Appendix A (differences between RFC 5322 and 2822). > >I know Barry's a big supporter of the Postel Principle. As a >guideline[1], how far back should we be lenient? RFC 822 (no leading "2" >;-)? Sure. The header field should be parsed, if possible, and the parser should possibly add a defect to the message. For some header fields, the data should be added to the previous Header instance; for others, an extra Header instance might need to be created. Message /generation/ should comply with what was in RFC 2822, where this requirement was added, and also the new RFC 5322. >Footnotes: >[1] Presumably over time we'll accrete definitely non-conforming >practices that we need to accept and do something sane with (eg, we >can't just raise ArmageddonException because we get a header with >8-bit characters in it). But I think we also should have a plan for >formerly acceptable syntax that has been restricted in more recent >RFCs, etc. Any email parser must cope with both obsolete-* syntax and common bad practices. Python's parser already does in various places.
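A minimal sketch of the lenient-parsing idea in this exchange, assuming a toy parser over already-unfolded "Name: value" lines. The names parse_headers and SINGLETON_FIELDS and the defect strings are hypothetical and are not part of the email package; the point is only that a duplicate or malformed field is recorded as a defect and parsing carries on, rather than an exception being raised.

```python
# Hypothetical sketch; none of these names exist in the email package.
# Fields that RFC 5322 allows at most once per message (partial list).
SINGLETON_FIELDS = {"date", "from", "to", "cc", "subject", "message-id"}


def parse_headers(lines):
    """Parse unfolded 'Name: value' lines leniently.

    Returns (headers, defects) and never raises on bad input.
    """
    headers = []   # (name, value) pairs in wire order
    defects = []   # human-readable descriptions of problems found
    seen = set()
    for line in lines:
        name, sep, value = line.partition(":")
        name = name.strip()
        if not sep or not name:
            # Not a header line at all: note it and keep going.
            defects.append("malformed header line: %r" % line)
            continue
        key = name.lower()
        if key in SINGLETON_FIELDS and key in seen:
            # Keep the duplicate, but flag it for the application.
            defects.append("duplicate %s field" % name)
        seen.add(key)
        headers.append((name, value.strip()))
    return headers, defects


headers, defects = parse_headers([
    "Subject: first",
    "Received: from a.example by b.example",
    "Received: from b.example by c.example",
    "Subject: second",
    "no colon here",
])
print(defects)
```

An application such as a list manager could then look at the defect list and decide what to do with the message, which is the disposition-by-defects approach argued for in this thread. Repeated Received: fields produce no defect because they are allowed to occur more than once.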
-- ____________________________________________________________________ TonyN.:' ' From barry at python.org Mon Apr 13 16:04:51 2009 From: barry at python.org (Barry Warsaw) Date: Mon, 13 Apr 2009 10:04:51 -0400 Subject: [Email-SIG] Append behavior of __setitem__ In-Reply-To: <87ocv3tz2b.fsf@xemacs.org> References: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org> <87ocv3tz2b.fsf@xemacs.org> Message-ID: On Apr 11, 2009, at 3:43 AM, Stephen J. Turnbull wrote: > Barry Warsaw writes: > >> So I'm just starting to read RFC 5322 and I'm starting by skimming >> over Appendix A (differences between RFC 5322 and 2822). > > I know Barry's a big supporter of the Postel Principle. As a > guideline[1], how far back should we be lenient? RFC 822 (no > leading "2" > ;-)? > > Footnotes: > [1] Presumably over time we'll accrete definitely non-conforming > practices that we need to accept and do something sane with (eg, we > can't just raise ArmageddonException because we get a header with > 8-bit characters in it). But I think we also should have a plan for > formerly acceptable syntax that has been restricted in more recent > RFCs, etc. We could potentially have strict and lenient modes, or possible RFC 822, 2822, 5322 modes. OTOH, I feel very strongly that the parser should accept just about any stream of bytes without throwing an exception. Thinking about an application like Mailman, it's rather inconvenient for the parsing phase to throw any exception. Much better is to register defects and then decide the disposition of messages based on the defect list. OTOH, when creating messages from whole cloth, I think it's okay to raise exception. You just have to be careful because often the same APIs are used by the parser. Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig Type: application/pgp-signature Size: 304 bytes Desc: This is a digitally signed message part URL: From barry at python.org Mon Apr 13 16:05:56 2009 From: barry at python.org (Barry Warsaw) Date: Mon, 13 Apr 2009 10:05:56 -0400 Subject: [Email-SIG] Append behavior of __setitem__ In-Reply-To: References: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org> <87ocv3tz2b.fsf@xemacs.org> Message-ID: <24040806-9EE5-421E-A699-BEDB627CF8D1@python.org> On Apr 11, 2009, at 5:17 PM, Tony Nelson wrote: > Sure. The header field should be parsed, if possible, and possibly > add a > defect to the message. For some header fields, the data should be > added to > the previous Header instance; for others, an extra Header instance > might > need to be created. I don't follow this part. > Message /generation/ should comply with what was in RFC 2822, where > this > requirement was added, and also the new RFC 5322. +1 -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 304 bytes Desc: This is a digitally signed message part URL: From barry at python.org Mon Apr 13 16:11:09 2009 From: barry at python.org (Barry Warsaw) Date: Mon, 13 Apr 2009 10:11:09 -0400 Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json In-Reply-To: References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> Message-ID: On Apr 10, 2009, at 11:08 AM, James Y Knight wrote: > Until you write a parser for every header, you simply cannot decode > to unicode. The only sane choices are: > 1) raw bytes > 2) parsed structured data The email package does not need a parser for every header, but it should provide a framework that applications (or third party libraries) can use to extend the built-in header parsers. A bare minimum for functionality requires a Content-Type parser. 
I think the email package should also include an address header (Originator, Destination) parser, and a Message-ID header parser. Possibly others. The default would probably be some unstructured parser for headers like Subject. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 304 bytes Desc: This is a digitally signed message part URL: From barry at python.org Mon Apr 13 16:14:04 2009 From: barry at python.org (Barry Warsaw) Date: Mon, 13 Apr 2009 10:14:04 -0400 Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json In-Reply-To: <49DF8956.5050501@g.nevcal.com> References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com> <49DF8956.5050501@g.nevcal.com> Message-ID: <7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org> On Apr 10, 2009, at 2:00 PM, Glenn Linderman wrote: > If one name has to be longer than the other, it should be the bytes > version. Real user code is more likely to want to use the text > version, and hopefully there will be more of that type of code than > implementations using bytes. > > Of course, one could use message.header and message.bythdr and > they'd be the same length. Actually, thinking about this over the weekend, it's much better for message['subject'] to return a Header instance in all cases. Use bytes(header) to get the raw bytes. A good API for getting the parsed and decoded header values needs to take into account that it won't always be a string. For unstructured headers like Subject, str(header) would work just fine. For an Originator or Destination address, what does str(header) return? And what would be the API for getting the set of realname/addresses out of the header? -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig Type: application/pgp-signature Size: 304 bytes Desc: This is a digitally signed message part URL: From barry at python.org Mon Apr 13 16:18:12 2009 From: barry at python.org (Barry Warsaw) Date: Mon, 13 Apr 2009 10:18:12 -0400 Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json In-Reply-To: <87prfkfhzd.fsf@xemacs.org> References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org> <87zlepf5hf.fsf@xemacs.org> <67879F1D-B386-4B9B-8203-86DB977BD7FF@python.org> <87prfkfhzd.fsf@xemacs.org> Message-ID: On Apr 10, 2009, at 3:04 PM, Stephen J. Turnbull wrote: > Shouldn't this thread move lock stock and .signature to email-sig? Yep. I'll try to be more conscientious about removing python-dev from the CC. > "Idempotency"? I'm not sure what that means in the context of the > email package ... multiplication by zero? Do you mean that > .parse().to_wire() should be idempotent? Yes, I think that's a good > idea, and it shouldn't be too hard to implement by (optionally?) > caching the whole original message or individual components (headers > with all whitespace including folding cached verbatim, etc). I think > caching has to be done, since stuff like "did the original fold with a > leading tab or a leading space, and at what column" and so on seems > kind of pointless to encode as attributes on Header objects. I tend to agree. I'm also happy if there's a way to tell the parser that an application doesn't care about that. All that extra caching will have a memory overhead that you should only pay for if you care. -Barry -------------- next part -------------- A non-text attachment was scrubbed...
Name: PGP.sig Type: application/pgp-signature Size: 304 bytes Desc: This is a digitally signed message part URL: From barry at python.org Mon Apr 13 16:20:19 2009 From: barry at python.org (Barry Warsaw) Date: Mon, 13 Apr 2009 10:20:19 -0400 Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json In-Reply-To: <87myaofh5q.fsf@xemacs.org> References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <1239382031.8682.11.camel@haku> <87myaofh5q.fsf@xemacs.org> Message-ID: <203DDBFE-1B15-454A-95FC-61D863D10B97@python.org> On Apr 10, 2009, at 3:22 PM, Stephen J. Turnbull wrote: > Robert Brewer writes: > >> Syntactically, there's no sense in providing: >> >> Message.set_header('Subject', 'Some text', encoding='utf-16') >> >> ...since you could more clearly write the same as: >> >> Message.set_header('Subject', 'Some text'.encode('utf-16')) > > Which you now must *parse* and guess the encoding to determine how to > RFC-2047-encode the binary mush. I think the encoding parameter is > necessary here. Agreed! In fact, it's redundant to explicitly encode the string. So the first spelling is preferred. >> But it would be far easier to do all the encoding at once in an >> output() or serialize() method. Do different headers need different >> encodings? > > You can have multiple encodings within a single header (and a naïve > algorithm might very well encode "The price of Gödel-Escher-Bach is > €25" as "The price of =?ISO-8859-1?Q?G=F6del-Escher-Bach?= is > =?ISO-8859-15?Q?=A425?="). Isn't email just wonderful? Please, spam and Facebook, kill it off once and for all, won't you? -Barry -------------- next part -------------- A non-text attachment was scrubbed...
Name: PGP.sig Type: application/pgp-signature Size: 304 bytes Desc: This is a digitally signed message part URL: From barry at python.org Mon Apr 13 16:28:32 2009 From: barry at python.org (Barry Warsaw) Date: Mon, 13 Apr 2009 10:28:32 -0400 Subject: [Email-SIG] [Python-Dev] headers api for email package In-Reply-To: <49E08F8C.5030205@simplistix.co.uk> References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <49E08F8C.5030205@simplistix.co.uk> Message-ID: On Apr 11, 2009, at 8:39 AM, Chris Withers wrote: > Barry Warsaw wrote: >> >>> message['Subject'] >> The raw bytes or the decoded unicode? > > A header object. Yep. You got there before I did. :) >> Okay, so you've picked one. Now how do you spell the other way? > > str(message['Subject']) Yes for unstructured headers like Subject. For structured headers... hmm. > bytes(message['Subject']) Yes. >> Now, setting headers. Sometimes you have some unicode thing and >> sometimes you have some bytes. You need to end up with bytes in >> the ASCII range and you'd like to leave the header value unencoded >> if so. But in both cases, you might have bytes or characters >> outside that range, so you need an explicit encoding, defaulting to >> utf-8 probably. >> >>> Message.set_header('Subject', 'Some text', encoding='utf-8') >> >>> Message.set_header('Subject', b'Some bytes') > > Where you just want "a damned valid email and stop making my life > hard!": > > Message['Subject']='Some text' Yes. In which case I propose we guess the encoding as 1) ascii, 2) utf-8, 3) wtf? > Where you care about what encoding is used: > > Message['Subject']=Header('Some text',encoding='utf-8') Yes. > If you have bytes, for whatever reason: > > Message['Subject']=b'some bytes'.decode('utf-8') > > ...because only you know what encoding those bytes use! So you're saying that __setitem__() should not accept raw bytes? 
-Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 304 bytes Desc: This is a digitally signed message part URL: From rdmurray at bitdance.com Mon Apr 13 17:49:35 2009 From: rdmurray at bitdance.com (R. David Murray) Date: Mon, 13 Apr 2009 11:49:35 -0400 (EDT) Subject: [Email-SIG] [Python-Dev] headers api for email package In-Reply-To: References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <49E08F8C.5030205@simplistix.co.uk> Message-ID: On Mon, 13 Apr 2009 at 10:28, Barry Warsaw wrote: > On Apr 11, 2009, at 8:39 AM, Chris Withers wrote: > >> Barry Warsaw wrote: >> > > > > message['Subject'] >> > The raw bytes or the decoded unicode? >> >> A header object. > > Yep. You got there before I did. :) +1 >> > Okay, so you've picked one. Now how do you spell the other way? >> >> str(message['Subject']) > > Yes for unstructured headers like Subject. For structured headers... hmm. Some "reasonable" printable interpretation that has no semantic meaning? >> bytes(message['Subject']) > > Yes. > >> > Now, setting headers. Sometimes you have some unicode thing and >> > sometimes you have some bytes. You need to end up with bytes in the >> > ASCII range and you'd like to leave the header value unencoded if so. >> > But in both cases, you might have bytes or characters outside that range, >> > so you need an explicit encoding, defaulting to utf-8 probably. >> > > > > Message.set_header('Subject', 'Some text', encoding='utf-8') >> > > > > Message.set_header('Subject', b'Some bytes') >> >> Where you just want "a damned valid email and stop making my life hard!": >> >> Message['Subject']='Some text' > > Yes. In which case I propose we guess the encoding as 1) ascii, 2) utf-8, 3) > wtf? Given some usenet postings I've just dealt with, (3) appears to sometimes be spelled 'x-unknown' and sometimes (in the most recent case) 'unknown-8bit'. 
A quick google turns up a hit on RFC1428 for the latter, and a bunch of trouble tickets for the former...so I think 'wtf' is correctly spelled 'unknown-8bit'. However, it's not supposed to be used by mail composers, who are expected to know the encoding. It's for mail gateways that are transforming something and don't know the encoding. I'm not sure what this means for the email module, which certainly will be used in a mail gateways....maybe it's the responsibility of the application code to explicitly say 'unknown encoding'? >> Where you care about what encoding is used: >> >> Message['Subject']=Header('Some text',encoding='utf-8') > > Yes. > >> If you have bytes, for whatever reason: >> >> Message['Subject']=b'some bytes'.decode('utf-8') >> >> ...because only you know what encoding those bytes use! > > So you're saying that __setitem__() should not accept raw bytes? If I'm understanding things correctly, if it did accept bytes the person using that interface would need to do whatever encoding (eg: encoded-word) was needed, so the interface should check that the byte string is 8 bit clean. But having some sort of 'setraw' method on Header might be better for that case. --David From stephen at xemacs.org Mon Apr 13 19:15:20 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Tue, 14 Apr 2009 02:15:20 +0900 Subject: [Email-SIG] [Python-Dev] headers api for email package In-Reply-To: References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <49E08F8C.5030205@simplistix.co.uk> Message-ID: <873accv5jr.fsf@xemacs.org> Barry Warsaw writes: > On Apr 11, 2009, at 8:39 AM, Chris Withers wrote: > > > Barry Warsaw wrote: > >> >>> message['Subject'] > >> The raw bytes or the decoded unicode? > > > > A header object. > > Yep. You got there before I did. :) > > >> Okay, so you've picked one. Now how do you spell the other way? > > > > str(message['Subject']) > > Yes for unstructured headers like Subject. 
For structured headers... > hmm. Well, suppose we get really radical here. *People* see email as (rich-)text. So ... message['Subject'] returns an object, partly to be consistent with more complex headers' APIs, but partly to remind us that nothing in email is as simple as it seems. Now, str(message['Subject']) is really for presentation to the user, right? OK, so let's make it a presentation function! Decode the MIME-words, optionally unfold folded lines, optionally compress spaces, etc. This by default returns the subject field as a single, possibly quite long, line. Then a higher-level API can rewrap it, add fonts etc, for fancy presentation. This also suggests that we don't want the field tag (ie, "Subject") to be part of this value. Of course a *really* smart higher-level API would access structured headers based on their structure, not on the one-size-fits-all str() conversion. Then MTAs see email as a string of octets. So guess what: > > bytes(message['Subject']) gives wire format. Yow! I think I'm just joking. Right? > >> Now, setting headers. Sometimes you have some unicode thing and > >> sometimes you have some bytes. You need to end up with bytes in > >> the ASCII range and you'd like to leave the header value unencoded > >> if so. But in both cases, you might have bytes or characters > >> outside that range, so you need an explicit encoding, defaulting to > >> utf-8 probably. > >> >>> Message.set_header('Subject', 'Some text', encoding='utf-8') > >> >>> Message.set_header('Subject', b'Some bytes') > > > > Where you just want "a damned valid email and stop making my life > > hard!": -1 I mean, yeah, Brother, I feel your pain but it just isn't that easy. If that were feasible, it would be *criminal* to have a .set_header() method at all! In fact, > > Message['Subject']='Some text' is going to (a) need to take *only* unicodes, or (b) raise Exceptions at the slightest provocation when handed bytes.
And things only get worse if you try to provide this interface for say "From" (let alone "Content-Type"). Is it really worth doing the mapping interface if it's only usable with free-form headers (ie, only Subject among the commonly used headers)? > Yes. In which case I propose we guess the encoding as 1) ascii, 2) > utf-8, 3) wtf? Uh, what guessing? If you don't know what you have but you believe it to be a valid header field, then presumably you got it off the wire and it's still in bytes and you just spit it out on the wire without trying to decode or encode it. But as I already said, I think that's a bad idea. Otherwise, you should have a unicode, and you simply look at the range of the string. If it fits in ASCII, Bob's your uncle. If not, Bob's your aunt (and you use UTF-8). > > Where you care about what encoding is used: > > > > Message['Subject']=Header('Some text',encoding='utf-8') > > Yes. > > > If you have bytes, for whatever reason: > > > > Message['Subject']=b'some bytes'.decode('utf-8') > > > > ...because only you know what encoding those bytes use! > > So you're saying that __setitem__() should not accept raw bytes? How do you distinguish "raw" bytes from "encoded bytes"? __setitem__() shouldn't accept bytes at all. There should be an API which sets a .formatted_for_the_wire member, and it should have a "validate" option (ie, when true the API attempts to parse the header and raises an exception if it fails to do so; when false, it assumes you know what you're doing and will send out the bytes verbatim). From tonynelson at georgeanelson.com Mon Apr 13 19:09:36 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Mon, 13 Apr 2009 13:09:36 -0400 Subject: [Email-SIG] Append behavior of __setitem__ In-Reply-To: References: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org> <87ocv3tz2b.fsf@xemacs.org> Message-ID: At 10:04 -0400 04/13/2009, Barry Warsaw wrote: ... 
>We could potentially have strict and lenient modes, or possible RFC >822, 2822, 5322 modes. Is there any need to produce emails that don't conform to the latest spec? Those specs are crafted to produce backward-compatible messages. >OTOH, I feel very strongly that the parser >should accept just about any stream of bytes without throwing an >exception. Thinking about an application like Mailman, it's rather >inconvenient for the parsing phase to throw any exception. Much >better is to register defects and then decide the disposition of >messages based on the defect list. > >OTOH, The second other hand should be the Gripping hand, as it should be the overriding point. >when creating messages from whole cloth, I think it's okay to >raise exception. You just have to be careful because often the same >APIs are used by the parser. APIs raise exceptions, parser catches them, makes into defects? -- ____________________________________________________________________ TonyN.:' ' From tonynelson at georgeanelson.com Mon Apr 13 19:09:41 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Mon, 13 Apr 2009 13:09:41 -0400 Subject: [Email-SIG] Append behavior of __setitem__ In-Reply-To: <24040806-9EE5-421E-A699-BEDB627CF8D1@python.org> References: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org> <87ocv3tz2b.fsf@xemacs.org> <24040806-9EE5-421E-A699-BEDB627CF8D1@python.org> Message-ID: At 10:05 -0400 04/13/2009, Barry Warsaw wrote: >On Apr 11, 2009, at 5:17 PM, Tony Nelson wrote: > >>Sure. The header field should be parsed, if possible, and possibly add a >>defect to the message. For some header fields, the data should be added >>to the previous Header instance; for others, an extra Header instance >>might need to be created. > >I don't follow this part. When a duplicate header field is parsed, errors are made into defects. 
If duplicates are not allowed for that header field, the contents should be added to the previous header if that is possible (Subject:, just append with whitespace; address headers, just append addresses), or a new (improper) Header should be created if it is not possible to add to the previous Header (Message-ID:, Content-Type:). OK, those examples for extra Headers aren't good; it may be that they only produce message defects. -- ____________________________________________________________________ TonyN.:' ' From tonynelson at georgeanelson.com Mon Apr 13 19:09:43 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Mon, 13 Apr 2009 13:09:43 -0400 Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json In-Reply-To: References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> Message-ID: At 10:11 -0400 04/13/2009, Barry Warsaw wrote: >On Apr 10, 2009, at 11:08 AM, James Y Knight wrote: > >> Until you write a parser for every header, you simply cannot decode >> to unicode. The only sane choices are: >> 1) raw bytes >> 2) parsed structured data > >The email package does not need a parser for every header, but it >should provide a framework that applications (or third party >libraries) can use to extend the built-in header parsers. A bare >minimum for functionality requires a Content-Type parser. I think the >email package should also include an address header (Originator, >Destination) parser, and a Message-ID header parser. Possibly >others. The default would probably be some unstructured parser for >headers like Subject. I think the email package should have a parser for every header. All the headers defined in normal mail RFCs should have their own parser, and there would be a default parser for unhandled headers, probably the Unstructured parser. Users could add their own, probably by importing something module that knew how to add its parsing to the email package parsers. 
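A minimal sketch of the "parser for every header, plus a default" idea described above. The registry and the names register_header_parser, parse_header, parse_unstructured and parse_address_list are hypothetical illustrations, not part of the email package; only email.utils.getaddresses() is an existing standard-library call.

```python
from email.utils import getaddresses

# Hypothetical registry: lower-cased header name -> parser callable.
_HEADER_PARSERS = {}


def register_header_parser(name, parser):
    """Register a parser for one header field name (case-insensitive)."""
    _HEADER_PARSERS[name.lower()] = parser


def parse_unstructured(value):
    """Default parser: unfold the value and collapse runs of whitespace."""
    return " ".join(value.split())


def parse_address_list(value):
    """Address parser: return a list of (display name, mailbox) pairs."""
    return getaddresses([value])


def parse_header(name, value):
    """Dispatch on the header name, falling back to the unstructured parser."""
    parser = _HEADER_PARSERS.get(name.lower(), parse_unstructured)
    return parser(value)


# A third-party module could extend the registry in the same way at import time.
for _name in ("from", "to", "cc"):
    register_header_parser(_name, parse_address_list)
```

With this arrangement an unknown header such as X-Foo falls through to the unstructured parser, while To, Cc and From yield address pairs.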
-- ____________________________________________________________________ TonyN.:' ' From tonynelson at georgeanelson.com Mon Apr 13 19:13:25 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Mon, 13 Apr 2009 13:13:25 -0400 Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json In-Reply-To: <7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org> References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com> <49DF8956.5050501@g.nevcal.com> <7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org> Message-ID: At 10:14 -0400 04/13/2009, Barry Warsaw wrote: ... >Actually, thinking about this over the weekend, it's much better for >message['subject'] to return a Header instance in all cases. Use >bytes(header) to get the raw bytes. I don't agree. I'd want it to return the appropriate type for that header: string for Subject:, a list of addresses for To:, and so on. Either the user knows what to expect, or they'll learn immediately. If they get a Header, they have to then extract the appropriate data from it, based on its type (but they only know the name). OK, Header instances could have a .useful field that returned the useful data in all instances. But in any case, the email package should guide users in the correct usage, rather than leaving every choice seeming equal, when only one choice is correct. >A good API for getting the parsed and decoded header values needs to >take into account that it won't always be a string. For unstructured >headers like Subject, str(header) would work just fine. For an >Originator or Destination address, what does str(header) return? And >what would be the API for getting the set of realname/addresses out of >the header? msg[] would be the preferred way. msg.get_header().useful would return the useful data form of any header. 
msg.get_header().addresses would return the address list from any address Header, and raise AttributeError with other Headers. -- ____________________________________________________________________ TonyN.:' ' From tonynelson at georgeanelson.com Mon Apr 13 19:09:23 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Mon, 13 Apr 2009 13:09:23 -0400 Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json In-Reply-To: References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org> <87zlepf5hf.fsf@xemacs.org> <67879F1D-B386-4B9B-8203-86DB977BD7FF@python.org> <87prfkfhzd.fsf@xemacs.org> Message-ID: At 10:18 -0400 04/13/2009, Barry Warsaw wrote: >On Apr 10, 2009, at 3:04 PM, Stephen J. Turnbull wrote: ... >> "Idempotency"? I'm not sure what that means in the context of the >> email package ... multiplication by zero? Do you mean that >> .parse().to_wire() should be idempotent? Yes, I think that's a good >> idea, and it shouldn't be too hard to implement by (optionally?) >> caching the whole original message or individual components (headers >> with all whitespace including folding cached verbatim, etc). I think >> caching has to be done, since stuff like "did the original fold with a >> leading tab or a leading space, and at what column" and so on seems >> kind of pointless to encode as attributes on Header objects. > >I tend to agree. I'm also happy if there's a way to tell, say, the >parser that an application doesn't care about that. All that extra >caching will have a memory overhead that you should only pay for if >you care. I'd expect the caching to have very low overhead. Message bodies will not be cached (an extra time), only some headers (when the Header isn't idempotent already) and the preamble and epilogue around message bodies.
-- ____________________________________________________________________ TonyN.:' ' From stephen at xemacs.org Mon Apr 13 20:38:27 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Tue, 14 Apr 2009 03:38:27 +0900 Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com> <49DF8956.5050501@g.nevcal.com> <7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org> Message-ID: <87y6u4tn4s.fsf@xemacs.org> Tony Nelson writes: > OK, Header instances could have a .useful field that returned the useful > data in all instances. But in any case, the email package should guide > users in the correct usage, rather than leaving every choice seeming equal, > when only one choice is correct. What do you mean by "only one choice is correct?" For example, a Destination field might be used for presentation (in which case the display names are needed), or to compose a list of recipients (when they should be discarded). Some applications might prefer to receive the combination as the original string (although that often is not valid RFC-any), others might prefer it parsed into a pair of display name and mailbox. Quoth Barry Warsaw: > >A good API for getting the parsed and decoded header values needs to > >take into account that it won't always be a string. For unstructured > >headers like Subject, str(header) would work just fine. For an > >Originator or Destination address, what does str(header) return? A string (not folded) of comma-separated addresses in "Display Name" form. > >And what would be the API for getting the set of > >realname/addresses out of the header? Does there need to be one? An AddressHeader object could support indexing: message['To'][0] returns the first displayname,mailbox pair. If you really want a list, what's wrong with list(header)?
(Yes, I recall that you (Barry) said you don't think subclassing worked very well, but I wonder if maybe we can't get it righter this time around.) > msg[] would be the preferred way. This goes against the principle that this returns a Header object. For one thing, I really think that there need to be some common methods all Header objects support, like str() and to_wire_format(). Also, if this returns a list for 'To', then str(msg['To']) won't work right: it will return the list enclosed in square brackets and the mailbox portions will be quoted, which isn't useful. > msg.get_header().useful would return the useful data form of > any header. Er, shouldn't we just throw away the data that is never useful? > msg.get_header().addresses would return the address list from > any address Header, and raise AttributeError with other Headers. Yes, but a list of what? Strings? Bytes? Displayname/mailbox pairs? From tonynelson at georgeanelson.com Tue Apr 14 02:58:40 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Mon, 13 Apr 2009 20:58:40 -0400 Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: <87y6u4tn4s.fsf@xemacs.org> References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com> <49DF8956.5050501@g.nevcal.com> <7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org> <87y6u4tn4s.fsf@xemacs.org> Message-ID: At 03:38 +0900 04/14/2009, Stephen J. Turnbull wrote: >Tony Nelson writes: > > > OK, Header instances could have a .useful field that returned the useful > > data in all instances. But in any case, the email package should guide > > users in the correct usage, rather than leaving every choice seeming equal, > > when only one choice is correct. > >What do you mean by "only one choice is correct?" 
For example, a >Destination field might be used for presentation (in which case the >display names are needed), or to compose a list of recipients (when >they should be discarded). Some applications might prefer to receive >the combination as the original string (although that often is not >valid RFC-any), others might prefer it parsed into a pair of display >name and mailbox. ... Assuming that by "Destination" you mean a class of Address header fields, as there is no Destination: header field, such header fields contain addresses, which can be considered to contain (as the email package does) a list of (name, email address) pairs, or, at a lower level, to also have Comments, there is indeed only one correct choice, which is the one the email package currently provides the diligent user. I wish it to be the one obvious choice, so that less study is needed to properly use the email package. Any use that wishes to discard the email addresses in favor of the friendly names can do so most easily from the parsed [(name, address)], not from the bytes. Parsing Address header fields is hard. Note that Address headers are not Text, as only certain tokens -- not part of the email addresses -- can be RFC 2047-encoded. -- ____________________________________________________________________ TonyN.:' ' From stephen at xemacs.org Tue Apr 14 06:48:52 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Tue, 14 Apr 2009 13:48:52 +0900 Subject: [Email-SIG] [Python-Dev] headers api for email package In-Reply-To: <200904140432.25953.steve@pearwood.info> References: <873accv5jr.fsf@xemacs.org> <200904140432.25953.steve@pearwood.info> Message-ID: <87prffu9fv.fsf@xemacs.org> Removing Python-Dev from the addressees. Steven D'Aprano writes: > On Tue, 14 Apr 2009 03:15:20 am Stephen J. Turnbull wrote: > > > *People* see email as (rich-)text. > > We do? Yup. You don't see the email, you see a *presentation* of that email.
That presentation is usually text, plus possibly some other stuff (fonts, highlighting, active links, images). Thus the "(rich-)". > It's not clear what you actually mean by "(rich-)text". I mean presentation. I mean "human readable". I mean Unicode. I mean "Do Not Feed The Program" (not for machine processing -- so your associations with virii are completely off the mark). > rich-text. I guess you mean Unicode characters. Am I right? No. I mean presentation, which for Python purposes includes but is not limited to Unicode. > Now, correct me if I'm wrong, but I don't think mail headers can > actually be anything *but* bytes. On the wire. email's Headers have applications other than putting bytes on the wire. > If you're proposing converting those bytes into characters, that's all > very well and good, but what's your strategy for dealing with the > inevitable wrongly-formatted headers? Whatever you want it to be. There are a number of such strategies, some of which should be among the batteries we include. Header.__str__() will need to know how to find out which is in effect, of course. > If the header can't be correctly decoded into text, there still > needs to be a way to get to the raw bytes. Sure. That's what Header.__bytes__() will do. Specifically, if you have a Header that was parsed out of a message received over the wire, it will return a verbatim copy of the header as received, folding whitespace, CRLFs, and all. If the Header was constructed (including editing a received header), then __bytes__ will construct the wire format, and optionally cache it as if it were a received header. (But this has some gotchas, see below.) > > bytes(message['Subject']) > > > > gives wire format. Yow! I think I'm just joking. Right? > > Er, I'm not sure. Are you joking? I hope not, because it is important to > be able to get to the raw, unmodified bytes that the MTA sees, without > all the fancy processing you suggest.
Er, I'm not suggesting any processing in particular. I'm suggesting an API in which str(header) produces a text/plain rendering of the field contents, with no folding, MIME words, or other wire format detritus, suitable for human viewing, more or less (specifically, it might be a rather long line). bytes(header) produces the wire format, either verbatim as received or as constructed based on client input. Note that an issue here is that a received header may be bogus, in which case you *don't* want bytes(header) to simply return the original and then spew over the wire. Should it raise an Exception or "fix up" the bytes? I don't know, and thus I wonder if this proposed API might just be a joke, not something you can dare use in a production application. Of course, str() and bytes() as proposed here are not necessarily what you want. So there will need to be ways to access the internal representation of Header directly (or via further specialized formatter functions if string or bytes format is preferred to structured objects). > Again, correct me if I'm wrong, but *all* valid mail headers must fit in > ASCII. Of course, that's true on the wire. I've assumed that everybody here is assuming STD 11 (currently RFC 822 according to rfc-editor.org) folding of long header lines and RFC 2047 encoding of characters outside of the restricted-ASCII repertoire (RFC 5322 at least doesn't permit all the ASCII control characters) before putting it on the wire. This is basically a solved problem, though, so I didn't bother mentioning it. Sorry for the confusion. But what we're talking about *here* are email APIs that may or may not be directly connected to a display or wire. There is no reason why headers *must* be represented as bytes, strings, or anything else in a Header, and no reason why the bytes or str format *must* be RFC compatible. 
I think it's quite sensible to specify "bytes(header) will be RFC 5322-conforming", but we need to specify how to handle bogus headers that we have received and not edited. Should we ever raise an Exception, and if so, in what contexts? Should we "fix up" the bogosity somehow? Should we delete the offensive header? Should we pass it on verbatim, and leave it to a higher level to verify(!) and decide what to do about it? Do the RFCs say anything about all this (eg, with broken trace headers I think it's implied that we pass them on verbatim)? From stephen at xemacs.org Tue Apr 14 09:00:59 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Tue, 14 Apr 2009 16:00:59 +0900 Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json In-Reply-To: <49E3CA6E.1070501@canterbury.ac.nz> References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <49E3CA6E.1070501@canterbury.ac.nz> Message-ID: <87ocuzu3bo.fsf@xemacs.org> Warning: Reply-To set to email-sig. Greg Ewing writes: > Only for headers known to be unstructured, I think. > Completely unknown headers should be available only > as bytes. Why do I get the feeling that you guys are feeling up an elephant? There are four things you might want to do with a header: (1) Put it on the wire, which must be bytes (in fact, ASCII). (2) Show it to a user (such as a rootin-tootin spam-fightin mail admin), which for consistency with well-behaved, implemented headers (ie, you might want to *gasp* *concatenate* your unknown header with a string), will sooner or later be string (ie, Unicode). (3) (Try to) parse it, in which case an internal representation with some other structure may or may not be appropriate for storing the parsed data. (4) Munge it, in which case an internal representation with some other structure may or may not be appropriate. I see no particular reason for restricting these basic API classes for any header. 
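The first two uses in the list above (wire bytes and a string for display) can be illustrated with the existing email.header module. This is only a sketch: the helper names to_wire and to_display are hypothetical, and it does not cover uses (3) and (4), parsing and munging.

```python
from email.header import Header, decode_header, make_header


def to_wire(text):
    """(1) Wire form: RFC 2047 encoded words, returned as ASCII-only bytes."""
    return Header(text, charset="utf-8").encode().encode("ascii")


def to_display(wire_bytes):
    """(2) Display form: decode any RFC 2047 words back into a str."""
    return str(make_header(decode_header(wire_bytes.decode("ascii"))))


subject = "Gr\u00fc\u00dfe"   # non-ASCII text, written with escapes
wire = to_wire(subject)       # safe to put on the wire
shown = to_display(wire)      # what a person would expect to read
```

The round trip recovers the original text, while the wire form stays within ASCII.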
From turnbull at sk.tsukuba.ac.jp Tue Apr 14 11:11:53 2009 From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull) Date: Tue, 14 Apr 2009 18:11:53 +0900 Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com> <49DF8956.5050501@g.nevcal.com> <7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org> <87y6u4tn4s.fsf@xemacs.org> Message-ID: <87vdp761ly.fsf@xemacs.org> Tony Nelson writes: > Assuming that by "Destination" you mean a class of Address header fields, > as there is no Destionation: header field, such header fields contain > addresses, which can be considered to contain (as the email package does) a > list of (name, email address) pairs, or, at a lower level, to also have > Comments, there is indeed only one correct choice, which is the one the > email package currently provides the diligent user. I wish it to be the > one obvious choice, so that less study is needed to properly use the email > package. As you point out above, display names and comments are different. It's *not* obvious to me that they should be confounded by default. In any case, it would certainly be possible to implement both the indexing feature, so that msg['To'][0] returns a (display, mailbox) tuple, and a converter so that list(msg['to']) returns a list of such tuples (in both cases, assuming that most users prefer not to distinguish comments from display names). 
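A rough sketch of the indexing-plus-list() idea described above. AddressHeader here is a hypothetical class written for illustration, not something the email package provides; it leans on the existing email.utils.getaddresses() and email.utils.formataddr(), and the addresses are placeholders.

```python
from email.utils import formataddr, getaddresses


class AddressHeader:
    """Hypothetical address header: a sequence of (display, mailbox) pairs."""

    def __init__(self, raw):
        self.raw = raw                       # original field body, kept verbatim
        self._pairs = getaddresses([raw])    # parsed (display name, mailbox) pairs

    def __getitem__(self, index):
        return self._pairs[index]

    def __len__(self):
        return len(self._pairs)

    def __str__(self):
        # Unfolded, comma-separated rendering for presentation.
        return ", ".join(formataddr(pair) for pair in self._pairs)


to = AddressHeader('"A. Person" <a.person@example.org>, other@example.com')
first = to[0]          # one (display name, mailbox) tuple
everyone = list(to)    # plain list of such tuples
```

Because the object is still a header rather than a bare list, str(to) can continue to give a presentation string while indexing and list() expose the pairs.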
From tonynelson at georgeanelson.com Wed Apr 15 03:26:19 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Tue, 14 Apr 2009 21:26:19 -0400 Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: <87vdp761ly.fsf@xemacs.org> References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com> <49DF8956.5050501@g.nevcal.com> <7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org> <87y6u4tn4s.fsf@xemacs.org> <87vdp761ly.fsf@xemacs.org> Message-ID: At 18:11 +0900 04/14/2009, Stephen J. Turnbull wrote: >Tony Nelson writes: > > > Assuming that by "Destination" you mean a class of Address header fields, > > as there is no Destionation: header field, such header fields contain > > addresses, which can be considered to contain (as the email package does) a > > list of (name, email address) pairs, or, at a lower level, to also have > > Comments, there is indeed only one correct choice, which is the one the > > email package currently provides the diligent user. I wish it to be the > > one obvious choice, so that less study is needed to properly use the email > > package. > >As you point out above, display names and comments are different. >It's *not* obvious to me that they should be confounded by default. The examples in the RFC seem to use one or the other for the friendly name. The problem comes when there are both. Actually, I haven't seen comments used, so I don't have any experience there. >In any case, it would certainly be possible to implement both the >indexing feature, so that msg['To'][0] returns a (display, mailbox) >tuple, and a converter so that list(msg['to']) returns a list of such >tuples (in both cases, assuming that most users prefer not to >distinguish comments from display names). 
Well, msg['To'] would return a list (or tuple) of addresses (which are tuples), so msg['To'][0] would return the first such address, if any. No converter required. -- ____________________________________________________________________ TonyN.:' ' From stephen at xemacs.org Wed Apr 15 10:47:07 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 15 Apr 2009 17:47:07 +0900 Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com> <49DF8956.5050501@g.nevcal.com> <7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org> <87y6u4tn4s.fsf@xemacs.org> <87vdp761ly.fsf@xemacs.org> Message-ID: <87k55m5mno.fsf@xemacs.org> Tony Nelson writes: > Well, msg['To'] would return a list (or tuple) of addresses (which > are tuples), so msg['To'][0] would return the first such address, > if any. No converter required. How do you propose to spell msg['To'].split_addresses()[0] where the split_addresses method returns a list of addresses in their original form? And is it really worth losing the consistency that str(msg[tag]) and bytes(msg[tag]) (especially the latter) do something more or less useful regardless of whether 'tag' names a structured field or a text field? As I wrote elsewhere, I don't *know* that such features will be useful or practically implementable, but I do think what you're suggesting is premature and overly restrictive. Especially since we are pretty sure (due to the desire for idempotency) that internally msg['To'] will *not* be a sequence of addresses parsed into display name and mailbox. 
From tonynelson at georgeanelson.com Thu Apr 16 02:39:52 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Wed, 15 Apr 2009 20:39:52 -0400 Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: <87k55m5mno.fsf@xemacs.org> References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com> <49DF8956.5050501@g.nevcal.com> <7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org> <87y6u4tn4s.fsf@xemacs.org> <87vdp761ly.fsf@xemacs.org> <87k55m5mno.fsf@xemacs.org> Message-ID: At 17:47 +0900 04/15/2009, Stephen J. Turnbull wrote: >Tony Nelson writes: > > > Well, msg['To'] would return a list (or tuple) of addresses (which > > are tuples), so msg['To'][0] would return the first such address, > > if any. No converter required. > >How do you propose to spell > > msg['To'].split_addresses()[0] > >where the split_addresses method returns a list of addresses in their >original form? And is it really worth losing the consistency that >str(msg[tag]) and bytes(msg[tag]) (especially the latter) do something >more or less useful regardless of whether 'tag' names a structured >field or a text field? I don't want there to be any "str(msg['tag'])" or "bytes(msg['tag'])" at all, so there would be no loss of consistency. Messages need flattening to bytes, but there is no use for converting individual header fields into bytes or strings, outside of a message. Some header field data /is/ strings, some is lists of address pairs, and so on. If the data for a header field is not properly a string, a means to get it as one is wrong. I can't imagine that .split_addresses() would provide anything in its original form. I'd certainly want it to split something into a list or tuple. 
As individual addresses in an Address header field are accessed from the list returned by "msg['To']" (or other Address header field name), there is no need to "split" them any more. >As I wrote elsewhere, I don't *know* that such features will be useful >or practically implementable, but I do think what you're suggesting is >premature and overly restrictive. Especially since we are pretty sure >(due to the desire for idempotency) that internally msg['To'] will >*not* be a sequence of addresses parsed into display name and mailbox. All the grotty internals of Header objects would be accessible by fetching the Header object with "msg.get_header('name')". "msg[...]" is an abbreviation for convenience which should not mislead users or be complex or magical in action. I want to be able to get and put the proper type of data for a particular header field, and to be told when I did it wrong, rather than just get a corrupt message. Internally, the Header whose .useful attribute is returned by "msg['foo']" will contain parsed data, referring to parsed tokens. Flattening those parsed tokens will produce the original data. Not a problem at all, simple to implement, in the most direct way. -- ____________________________________________________________________ TonyN.:' ' From rdmurray at bitdance.com Thu Apr 16 03:19:44 2009 From: rdmurray at bitdance.com (R.
David Murray) Date: Wed, 15 Apr 2009 21:19:44 -0400 (EDT) Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com> <49DF8956.5050501@g.nevcal.com> <7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org> <87y6u4tn4s.fsf@xemacs.org> <87vdp761ly.fsf@xemacs.org> <87k55m5mno.fsf@xemacs.org> Message-ID: On Wed, 15 Apr 2009 at 20:39, Tony Nelson wrote: > Internally, the Header whose .useful attribute is returned by "msg['foo']" > will contain parsed data, referring to parsed tokens. Flattening those > parsed tokens will produce the original data. Not a problem at all, simple > to implement, in the most direct way. The first part of that is too magical and inconsistent for my tastes. I want message['fooheader'] to return a Header object. Which yes, should contain the parsed token structure and be able to regenerate the original bytes on demand (or vice versa, or keeping both the original bytes and the parse tree if the parse tree is lossy). For a header involving a list of addresses, I'd expect to get back a Header subclass that I could iterate over to get individual Address objects. For other structured headers, I'd expect to get a subclass with useful methods and attributes for accessing the structure. And when I str the Header (for example, when presenting one or more selected headers to a user), I would expect to get a string that a user would expect to read, which is to say a fully-decoded-to-unicode user-oriented representation of the structured data as one long string (I'll do any folding formatting for presentation as needed). Going the other way I have fewer opinions about, as I haven't written any code to do that yet :) --David From stephen at xemacs.org Thu Apr 16 08:24:47 2009 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Thu, 16 Apr 2009 15:24:47 +0900 Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com> <49DF8956.5050501@g.nevcal.com> <7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org> <87y6u4tn4s.fsf@xemacs.org> <87vdp761ly.fsf@xemacs.org> <87k55m5mno.fsf@xemacs.org> Message-ID: <8763h55d5c.fsf@xemacs.org> Tony Nelson writes: > strings, some is lists of address pairs, and so on. If the data > for a header field is not properly a string, a means to get it as > one is wrong. Er, but the data for an address field is not "properly" a list of pairs, either. So I guess you would agree that a means to get it as one is wrong, then? > All the grotty internals of Heaer objects would be accessible by > fetching the Header object with "msg.get_header('name')". > "msg[...]" is an abbreviation for convenience which should not > mislead users or be complex or magical in action. A message or so back you made the point that an address header is a rather complex object that is *not* easy to parse. For example (this is a trick question), in your opinion, what should msg['To'][0] return if the original header was To: Stephen J. Turnbull ? > Internally, the Header whose .useful attribute is returned by > "msg['foo']" will contain parsed data, referring to parsed tokens. > Flattening those parsed tokens will produce the original data. Not > a problem at all, simple to implement, in the most direct way. And horrid to use, if you mean that the internal representation will be a full parse tree according to the augmented BNF in RFCs 822, 2822, 5322, 2045-2049, etc etc., and that the only other way to access that data is via an arbitrarily defined .useful attribute (which, BTW, is quite unpythonic if you intend for it to be available as msg['foo'] as well: TOOWTDI). 
From steve at pearwood.info Thu Apr 16 15:02:13 2009 From: steve at pearwood.info (Steven D'Aprano) Date: Thu, 16 Apr 2009 23:02:13 +1000 Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: References: <87k55m5mno.fsf@xemacs.org> Message-ID: <200904162302.14641.steve@pearwood.info> On Thu, 16 Apr 2009 10:39:52 am Tony Nelson wrote: > I don't want there to be any "str(msg['tag'])" or "bytes(msg['tag'])" > at all, so there would be no loss of consistency. That's ... different. > Messages need > flattening to bytes, but there is no use for converting individual > header fields into bytes or strings, outside of a message. Of course there is. You create each header individually, so you should be able to extract each header individually. Here, for example, is a use-case: I want to send postmaster a copy of the X-Spam-Evidence header so she can see why a particular piece of ham got wrongly flagged as spam, or vice versa: X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; '(which': 0.03; 'attribute': 0.04; 'objects': 0.04; 'returns': 0.05; 'split': 0.05; ... I need to be able to extract just that one header, and while some applications (mail client?) may choose to give me the entire message as text and expect me to manually hunt for the relevant line and copy-and-paste it, other applications may wish to automatically extract the appropriate header and email it to postmaster at localhost. Or write it to a log file, or whatever. Whatever they do, they probably need it as a string (of characters or bytes), not a binary blob. > Some > header field data /is/ strings, some is lists of address pairs, and > so on. But "lists of address pairs" themselves are strings. > If the data for a header field is not properly a string, But it always is. Even badly formatted emails with corrupt headers containing binary characters are strings -- they're just byte (non-Unicode) strings containing binary characters.
Your mail server might not accept it as part of a valid header, but it's a valid byte string. > a > means to get it as one is wrong. Email *is* text. It's built on top of a restricted range of ASCII bytes, which we can legitimately call "text" because it is a subset of Unicode text. Even if a particular header contains binary data, it must be encoded as ASCII text before it can be placed into the header. X-Some-Header: \0\0\01\0\xff3G\04 (where \0 means byte('\0') etc) is not a valid email header -- the binary data must be encoded as ASCII text first. So any valid header must have a bytes form and a Unicode form (since the restricted range of allowed bytes are always valid Unicode as well). Corrupted headers may not have a valid Unicode form, but they will always have a byte form -- after all, the header eventually must be written to disk in some mail box somewhere, and it can only do so as bytes. So for any header, there is always a way of writing it in bytes, and nearly always a way of writing it as characters. Furthermore, in general for arbitrary headers, we can't tell what the header means *except* as a text string: X-Some-Header: AB34F8702D6 We have no way of telling whether the payload "AB34F8702D6" is a string of characters meaningful to some application just as they are, or whether it is a string encoded from binary data. We might *guess* that the encoding *could be* some known encoding (quoted-printable, base64, etc) but we can't tell unless it is a known standard header. > I want to be able to get and put > the proper type of data for a particular header field, and to be told > when I did it wrong, rather than just get a corrupt message. But in general, you can't know what the "proper type of data" is for arbitrary headers. What are valid data for X-policyd-weight headers? What about X-Some-Random-Header-I-Just-Made-Up?
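For what it's worth, the X-Spam-Evidence use case above can be sketched with the email package as it exists today, where msg[name] on a message parsed by email.message_from_string() gives the field body as a string. The message text and addresses below are made up for illustration.

```python
from email import message_from_string

# A made-up message; only the header of interest matters here.
raw = (
    "From: someone@example.com\n"
    "To: postmaster@example.com\n"
    "X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; 'split': 0.05\n"
    "Subject: ham or spam?\n"
    "\n"
    "body text\n"
)

msg = message_from_string(raw)

# Pull out just the one header, as text, ready to log or forward.
evidence = msg["X-Spam-Evidence"]
log_line = "X-Spam-Evidence: " + evidence

# A header that is absent comes back as None rather than raising.
missing = msg["X-Some-Random-Header-I-Just-Made-Up"]
```

Nothing here needs to know what the payload means; the header is handled purely as a string.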
-- Steven D'Aprano From tonynelson at georgeanelson.com Thu Apr 16 20:08:59 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Thu, 16 Apr 2009 14:08:59 -0400 Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: <8763h55d5c.fsf@xemacs.org> References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com> <49DF8956.5050501@g.nevcal.com> <7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org> <87y6u4tn4s.fsf@xemacs.org> <87vdp761ly.fsf@xemacs.org> <87k55m5mno.fsf@xemacs.org> <8763h55d5c.fsf@xemacs.org> Message-ID: At 15:24 +0900 04/16/2009, Stephen J. Turnbull wrote: >Tony Nelson writes: > > > strings, some is lists of address pairs, and so on. If the data > > for a header field is not properly a string, a means to get it as > > one is wrong. > >Er, but the data for an address field is not "properly" a list of >pairs, either. So I guess you would agree that a means to get it as >one is wrong, then? No. The useful data for an address field is *properly* a list of pairs of friendly name, address -- you should read RFC 5322 section 3.4. You need to understand this about email in order to continue this discussion, though your confusion does bring up the important point that people have poor understanding of email, and need guidance in how to use and compose it. This makes it very important that the easy way of doing things be the correct way. With Address fields, that way is a sequence of pairs of friendly name and address. Though the address could be parsed further, there is seldom any need to do so (outside of the Header parser itself). > > All the grotty internals of Heaer objects would be accessible by > > fetching the Header object with "msg.get_header('name')". > > "msg[...]" is an abbreviation for convenience which should not > > mislead users or be complex or magical in action. 
> >A message or so back you made the point that an address header is a >rather complex object that is *not* easy to parse. Which is exactly why the email package already has an address parser, though it also needs a more general parser for the other header field types. >For example (this >is a trick question), in your opinion, what should > > msg['To'][0] > >return if the original header was > >To: Stephen J. Turnbull > >? ('Stephen J. Turnbull', 'stephen at xemacs.org') You must be very confused to think this is a trick question. Try it with the current email package's email.utils.parseaddr(). Again, see RFC 5322 section 3.4. > > Internally, the Header whose .useful attribute is returned by > > "msg['foo']" will contain parsed data, referring to parsed tokens. > > Flattening those parsed tokens will produce the original data. Not > > a problem at all, simple to implement, in the most direct way. > >And horrid to use, if you mean that the internal representation will >be a full parse tree according to the augmented BNF in RFCs 822, 2822, >5322, 2045-2049, etc etc., and that the only other way to access that >data is via an arbitrarily defined .useful attribute (which, BTW, is >quite unpythonic if you intend for it to be available as msg['foo'] as >well: TOOWTDI). You put words in my mouth. Why assume that I am incompetent, or a fool? Of course the internal representation would include the full parse tree. Of course the external interface would provide read and write access to the relevant data. The .useful attribute (need a better name) is the way to read the useful part of the data extracted from the parse tree, whatever type of data that is, which depends on the header field type, determined by its name. Each Header subclass would have its own other attributes. The .useful attribute guides users and is used by .__getitem__() to return that data.
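[Editorial sketch] The "list of (friendly name, address) pairs" view discussed above can be tried with the helpers that already exist in email.utils. This is only an illustration: the address below is a placeholder (the archive stripped the real one from the quoted header), and the exact display name returned for a phrase with an unquoted period depends on how lenient the installed parser is.

```python
from email.utils import getaddresses, parseaddr

# Placeholder address standing in for the one elided from the example header.
to_value = "Stephen J. Turnbull <stephen@example.org>"

# parseaddr() handles a single address and returns a (name, address) pair.
name, addr = parseaddr(to_value)
print(repr(name), repr(addr))

# getaddresses() takes a list of header values and returns a list of pairs.
pairs = getaddresses(["Alice <alice@example.org>, bob@example.org"])
print(pairs)
```

Note that both helpers take and return strings, so any decision about bytes versus text has already been made by the caller.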
-- ____________________________________________________________________ TonyN.:' ' From tonynelson at georgeanelson.com Thu Apr 16 20:08:57 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Thu, 16 Apr 2009 14:08:57 -0400 Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: <200904162302.14641.steve@pearwood.info> References: <87k55m5mno.fsf@xemacs.org> <200904162302.14641.steve@pearwood.info> Message-ID: At 23:02 +1000 04/16/2009, Steven D'Aprano wrote: >On Thu, 16 Apr 2009 10:39:52 am Tony Nelson wrote: > >> I don't want there to be any "str(msg['tag'])" or "bytes(msg['tag'])" >> at all, so there would be no loss of consistency. > >That's ... different. > > >> Messages need >> flattening to bytes, but there is no use for converting individual >> header fields into bytes or strings, outside of a message. > >Of course there is. You create each header individually, so you should >be able to extract each header individually. Here, for example, is a >use-case: I want to send postmaster a copy of the X-Spam-Evidence >header so she can see why a particular piece of ham got wrongly flagged >as spam, or visa versa: > >X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; '(which': 0.03; > 'attribute': 0.04; 'objects': 0.04; 'returns': 0.05; 'split': > 0.05; ... > >I need to be able to extract just that one header, and while some >applications (mail client?) may choose to give me the entire message as >text and expect me to manually hunt for the relevant line and >copy-and-paste it, other applications may wish to automatically extract >the appropriate header and email it to postmaster at localhost. Or write >it to a log file, or whatever. Whatever they do, they probably need it >as a string (of characters or bytes), not a binary blob. This example seems tortured and contrived. Custom code to extract a single header one time to send to someone? Just hit "reply" and trim it yourself. 
If you must, you can use .get_header('X-Spam-Evidence').flatten(). I doubt that anyone would actually do that, outside of a debugging session. Any automatic process for sending reflected spam should include more of the message, using the relevant MIME type message/partial (or message/rfc822). >> Some >> header field data /is/ strings, some is lists of address pairs, and >> so on. > >But "lists of address pairs" themselves are strings. Wrong! They are *lists* (or at least sequences) of address pairs of friendly name, email address. Just as bytes are not strings, and dicts are not strings, and JPEG images are not strings, lists are not strings. For better understanding of what an Address is, see RFC 5322 (the current incarnation of RFC x822), section 3.4, which describes both the best way and current or obsolete practice. >> If the data for a header field is not properly a string, > >But it always is. No. This is important, and you will not understand RFC x822 email until you understand this: email messages are not character strings. They are byte sequences. This confusion pervades the email package only because in Python before 3.x, bytes were represented as strings. >Even badly formatted emails with corrupt headers containing binary >characters are strings -- they're just byte (non-Unicode) strings >containing binary characters. Your mail server might not accept it as >part of a valid header, but it's a valid byte string. Strings are not bytes. Sequences of bytes are not strings. Converting between them demands an encoding. Sometimes the encoding exists, sometimes it mostly exists, and sometimes there is no such encoding, as for a JPEG image, which is a structured byte sequence. >> a means to get it as one is wrong. > >Email *is* text. It's built on top of a restricted range of ASCII bytes, >which we can legitimately call "text" because it is a subset of Unicode >text.
Even if a particular header contains binary data, it must be >encoded as ASCII text before it can be placed into the header. ... No, email is not text. Email message bodies and some header fields may represent text. An email message is a byte sequence. One really needs to understand this in order to work with email at a low level. When one does not understand, then the email package should lead the user in the right direction. -- ____________________________________________________________________ TonyN.:' ' From rdmurray at bitdance.com Thu Apr 16 21:42:07 2009 From: rdmurray at bitdance.com (R. David Murray) Date: Thu, 16 Apr 2009 15:42:07 -0400 (EDT) Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: References: <87k55m5mno.fsf@xemacs.org> <200904162302.14641.steve@pearwood.info> Message-ID: On Thu, 16 Apr 2009 at 14:08, Tony Nelson wrote: > At 23:02 +1000 04/16/2009, Steven D'Aprano wrote: >> On Thu, 16 Apr 2009 10:39:52 am Tony Nelson wrote: >> >>> I don't want there to be any "str(msg['tag'])" or "bytes(msg['tag'])" >>> at all, so there would be no loss of consistency. >> >> That's ... different. Indeed. >>> Messages need >>> flattening to bytes, but there is no use for converting individual >>> header fields into bytes or strings, outside of a message. >> >> Of course there is. You create each header individually, so you should >> be able to extract each header individually. Here, for example, is a >> use-case: I want to send postmaster a copy of the X-Spam-Evidence >> header so she can see why a particular piece of ham got wrongly flagged >> as spam, or visa versa: >> >> X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; '(which': 0.03; >> 'attribute': 0.04; 'objects': 0.04; 'returns': 0.05; 'split': >> 0.05; ... >> >> I need to be able to extract just that one header, and while some >> applications (mail client?) 
may choose to give me the entire message as >> text and expect me to manually hunt for the relevant line and >> copy-and-paste it, other applications may wish to automatically extract >> the appropriate header and email it to postmaster at localhost. Or write >> it to a log file, or whatever. Whatever they do, they probably need it >> as a string (of characters or bytes), not a binary blob. > > This example seems tortured and contrived. Custom code to extract a single > header one time to send to someone? Just hit "reply" and trim it yourself. > If you must, you can use .get_header('X-Spam-Evidence').flatten(). I doubt > that anyone would actually do that, outside of a debugging session. > > Any automatic process for sending reflected spam should include more of the > message, using the relevent MIME type message/partial (or message/rfc822). Have you written a user interface using the email package? I have. In that user interface, I most definitely want to turn individual headers into strings. Specifically, this is a usenet news reader, and when presenting messages I want to display _only_ the Date and From headers. You will note that 'From' is an address header, and in this particular use case I want to use "str(message['From'])", and I don't care two hoots that the thing is properly a list of friendly-name address pairs. That is not a contrived example, that's _production code_ that I use every day. Nor is the quoted example all that contrived...after reading it I was considering if it would be useful to run a program over my incoming mail to extract the X-Spam-Evidence headers and a couple other headers and email them to me in a report daily. It's not useful enough that I'll write the code, I've too many other priorities, but it's potentially useful enough (for tuning my spam filters) that I don't consider it a contrived use case. And if the spam gets worse I may just come back to that idea. 
>>> Some >>> header field data /is/ strings, some is lists of address pairs, and >>> so on. >> >> But "lists of address pairs" themselves are strings. > > Wrong! They are *lists* (or at least sequences) of address pairs of > friendly name, email address. Just as bytes are not strings, and dicts are > not strings, and JPEC images, lists are not strings. For better > understanding of what an Address is, see RFC 5322 (the current incarnation > of RFC x822), section 3.4, which describes both the best way and current or > obsolete practice. I suspect that most or all of us do understand the RFC. When Steve says 'but lists of address pairs are themselves strings' I hear him saying that each element of the pair is a string. I think you would have to agree with that. Unless you want them to remain as byte strings? Or, as I would prefer, make them into Address objects with appropriate methods and an appropriate str. But even then, the friendly name and address data elements of the Address should be unicode strings. >>> If the data for a header field is not properly a string, >> >> But it always is. > > No. This is important, and you will not understand RFC x822 email until > you understand this: email messages are not character strings. They are > byte sequences. This confusion pervades the email package only because in > Python before 3.x, bytes were represented as strings. A header always has a string representation, though. It's the one a dumb-text UI would present to the user. IMO the email package needs to support building such UIs. The string representation is also useful for debugging (as is the bytes representation). I see no reason it should not be accessible through the normal Python 'str' method. Why obfuscate access to it? >> Even badly formatted emails with corrupt headers containing binary >> characters are strings -- they're just byte (non-Unicode) strings >> containing binary characters. 
Your mail server might not accept it as >> part of a valid header, but it's a valid byte string. > > Strings are not bytes. Sequences of bytes are not strings. Converting > between them demands an encoding. Sometimes the encoding exists, sometimes > it mostly exists, and sometimes there is no such encoding, as for a JPEG > image, which is a structured byte sequence. I agree with you that Unicode strings are not bytes, and that email is encoded as (ASCII) bytes. As for the JPEG, sure there's no encoding in the Unicode sense. There certainly is an encoding, though: JPEG wrapped up in the appropriate mime type encoding. >>> a means to get it as one is wrong. IMO it is always appropriate to be able to get a header body as a string. It may not be a meaningful format in which to _manipulate_ the header body information (which is why I think message's __getitem__ needs to return a Header object), but it is a legitimate representation for user consumption. >> Email *is* text. It's built on top of a restricted range of ASCII bytes, >> which we can legitimately call "text" because it is a subset of Unicode >> text. Even if a particular header contains binary data, it must be >> encoded as ASCII text before it can be placed into the header. > ... > > No, email is not text. Email message bodies and some header fields may > represent text. An email message is a byte sequence. One really needs to > understand this in order to work with email at a low level. When one does > not understand, then the email package should lead the user in the right > direction. You and Steve are defining terms differently here, I think, but other than that I suspect you are not that far apart on this particular point. What I want the email package to do is make it easy to pass text in and have the email package create the syntactically correct bytes representation to go out on the wire. 
I'm visualizing building the 'From' header, for example, something like this: message['From'] = AddressHeader(Address('John Smith', 'john at foo.com')) and have it default to UTF-8 encoding....or maybe the encoding gets specified when I say message.serialize('utf-8'). But as I said, I haven't actually written code that builds messages yet. Note that while I want to be able to do str(someHeader) to get a string representation of a header body, I'm not so enamored of being able to do message['From'] = 'John Smith ' and have it get turned into a Header or AddressHeader object. Frankly, that looks too magical to me. --David From v+python at g.nevcal.com Thu Apr 16 22:44:14 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Thu, 16 Apr 2009 13:44:14 -0700 Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: <200904162302.14641.steve@pearwood.info> References: <87k55m5mno.fsf@xemacs.org> <200904162302.14641.steve@pearwood.info> Message-ID: <49E7989E.60402@g.nevcal.com> On approximately 4/16/2009 6:02 AM, came the following characters from the keyboard of Steven D'Aprano: > On Thu, 16 Apr 2009 10:39:52 am Tony Nelson wrote: > >> I don't want there to be any "str(msg['tag'])" or "bytes(msg['tag'])" >> at all, so there would be no loss of consistency. >> > > That's ... different. > >> If the data for a header field is not properly a string, >> > But it always is. > > Even badly formatted emails with corrupt headers containing binary > characters are strings -- they're just byte (non-Unicode) strings > containing binary characters. Your mail server might not accept it as > part of a valid header, but it's a valid byte string. > Wire format email headers are composed of a subset of ASCII text. There should be a way to obtain them, either as bytes, or via the trivial str conversion of those bytes to Unicode. Even corrupt headers containing binary characters should be obtainable that way. 
There are no header encoding or decoding algorithms that cannot be reworked to function properly on either the raw_bytes or raw_str version of a header, since the numeric values and sequence of all binary octets would be preserved via both raw_bytes and raw_str. *The key is to know what is in hand.* For both raw_bytes and raw_str, all characters would be in the range 0 - 0xFF. This is simple transliteration, not interpretation or parsing. A non-corrupt header would have a smaller range, 0x20 - 0x7F. Any header should be obtainable or settable in this form, using either bytes or str parameters/results. Yes, it should be possible to create corrupt headers in this manner. Useful mostly for testing, or for idempotency (which I also call GIGO). However, obtaining headers in that way should be "hard", but only in the sense of having to type more because it is part of a lower level interface, not the primary APIs... like msg['tag'].raw_bytes or msg['tag'].raw_str... because it is actually the easiest way (implementation-wise) to obtain a copy of the data... but that copy may not be as useful as one might like. str(msg['tag']) or msg['tag'].str (or some such spelling[s]) should always produce a displayable form of the header. If it is a known, standardized header that may contain data that was encoded for transmission, such encodings should be reversed, and Unicode characters outside the range of U+0020 - U+007F may be included. Remember the goal here is "displayable". So if the encoding is bad for a standard header, or a standard header is corrupt, or a non-standard header contains what is apparently binary gibberish, and non-displayable Unicode control characters are generated, they should be escaped as 7 ASCII characters representing a Unicode code point "\U+0017". All such display strings must always have "\" converted to "\\" so that there is no ambiguity when interpreting strings that may contain text that looks like one of the escape strings.
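[Editorial sketch] The two ideas in the preceding paragraphs, octet-preserving transliteration and an unambiguous display escape, could be sketched as below. The function names raw_str and display_escape are hypothetical and are not part of the email package. The sketch assumes that "simple transliteration" means a Latin-1 decode, and that "non-displayable" means any character in a Unicode "C" (control, format, unassigned, etc.) category; both are interpretations, not something the post specifies.

```python
import unicodedata


def raw_str(raw_bytes: bytes) -> str:
    """Map octets 0-0xFF one-to-one onto code points U+0000-U+00FF."""
    return raw_bytes.decode("latin-1")


def display_escape(text: str) -> str:
    """Escape backslashes and non-displayable characters for display."""
    out = []
    for ch in text:
        if ch == "\\":
            # Double every backslash so escapes cannot be confused with data.
            out.append("\\\\")
        elif unicodedata.category(ch).startswith("C"):
            # Control-like character: show it as \U+XXXX instead.
            out.append("\\U+%04X" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)


corrupt = b"X-Test: a\x17b\\c\xff"
as_text = raw_str(corrupt)

# The transliteration loses nothing: encoding back gives the same octets.
assert as_text.encode("latin-1") == corrupt

print(display_escape(as_text))
```

Because every literal backslash is doubled, a reader of the display string can always tell an escape sequence from header text that merely looks like one.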
Known standard headers should have additional APIs (these already exist for the most useful ones) to obtain the interesting subcomponents (encodings, names, addresses, MIME types, etc.). These should have str parameters and results interfaces only, and specification of an encoding can be optional, defaulting to UTF-8 (or possibly defaulting to a Message-level encoding specification, which in turn may default to UTF-8), overridable in some of the APIs via optional parameters (some, because overloaded assignment APIs may not have room for such overrides, not having optional parameters). -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From stephen at xemacs.org Fri Apr 17 12:04:39 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Fri, 17 Apr 2009 19:04:39 +0900 Subject: [Email-SIG] API for Header objects In-Reply-To: References: <87k55m5mno.fsf@xemacs.org> <200904162302.14641.steve@pearwood.info> Message-ID: <87iql34mvc.fsf@xemacs.org> Tony Nelson writes: > This example seems tortured and contrived. Not at all. I currently use grep, not the email package, but in fact I extract several headers for use in mailing list moderation. It's getting to the point where my gradually accreting shell script doesn't cut it (more because I'm recruiting additional moderators than because I'm not happy with it), and if I'm going to do this in Python I definitely want an obvious and elegant way to produce a displayable string (ie, Unicode) because not all of the messages I get in Chinese and Korean are spam. > Custom code to extract a single header one time to send to someone? That is precisely why we want a simple readable short elegant API. Like str(msg['To']). This also suggests the sequence interface of msg['To'] should not contain tuples of strings, but rather NameAddr objects (taken from the RFC 5322 grammar). 
Then to flatten a NameAddr, use str or bytes as appropriate. So to present a list of addressees in a moderation interface, you could use

    recips = list(msg['To']) + list(msg['Cc'])
    # We have a utf-8 codec on stdout, between us and the wire.
    print("<ul>\n")
    for recip in recips:
        print("<li>")
        print(htmlesc(str(recip)))
        print("</li>\n")
    print("</ul>\n")

Of course for wire protocol, you just use "bytes" instead of "str". Hey! that's not bad, even if I do say so myself. > Just hit "reply" and trim it yourself. That won't work, for several reasons. > If you must, you can use .get_header('X-Spam-Evidence').flatten(). > I doubt that anyone would actually do that, outside of a debugging > session. I do it. > No. This is important, and you will not understand RFC x822 email > until you understand this: email messages are not character > strings. They are byte sequences. This confusion pervades the > email package only because in Python before 3.x, bytes were > represented as strings. That's a bit generous and ungenerous at the same time. The people who worked on email were trying to come up with a reasonable interface that on the one side treated wire format as bytes (Python 1.x, 2.x str) and display format as text (Python 1.x str, oops, Python 2.x unicode). They failed, unfortunately, but not really because the tools were unavailable. They just treated the difficulties with insufficient respect. On the other hand, these difficulties are inherent in the medium. People (by which I mean nobody participating in this thread) think of email as text. MTAs think of email as octet sequences. Developers (especially Americans) have been sloppy about that distinction for *five* decades, and because until 2000 at least email was the sine qua non of networking, backward compatibility has long demanded incorporating all those mistakes in current practice. And now you're doing the same thing. Email messages have at *least* four ways of manifesting in our world that email-sig needs to worry about: as byte sequences on the wire, as (mostly, anyway, and certainly the headers) texts in our MUAs, as whatever-they-really-are, and as the internal representation of the email package.
So depending on which side of the argument you feel like taking, you insist (inconsistently) that "an email is a byte string" or "a header is not a string at all, it's a structured thingie". But it's not that easy. What we need to do is come up with an API that respects all of those aspects *simultaneously*, and allows us to elegantly but accurately change the perspective we use to view this "whatever-it-really-is". > No, email is not text. Email message bodies and some header fields > may represent text. An email message is a byte sequence. One > really needs to understand this in order to work with email at a > low level. Hm. And here I was hoping that the email package would *implement* the low level, leaving me free to think about high-level things. > When one does not understand, then the email package should lead > the user in the right direction. No, thank you. Python is a double-opt-in language. We're all consenting adults here. Programmers who don't understand the RFCs are likely to be surprised in many places, but they asked for it, they got it. From stephen at xemacs.org Fri Apr 17 12:09:42 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Fri, 17 Apr 2009 19:09:42 +0900 Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com> <49DF8956.5050501@g.nevcal.com> <7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org> <87y6u4tn4s.fsf@xemacs.org> <87vdp761ly.fsf@xemacs.org> <87k55m5mno.fsf@xemacs.org> <8763h55d5c.fsf@xemacs.org> Message-ID: <87hc0n4mmx.fsf@xemacs.org> Tony Nelson writes: > No. The useful data for an address field is *properly* a list of > pairs of friendly name, address -- you should read RFC 5322 section > 3.4. The fact that you think I didn't suggests there's really no point in continuing to talk to you. 
But I'll give it another try. The issues we are dealing with at this point really have very little to do with accurate implementation of the RFCs. We all know that's necessary, but ... it's a Simple Matter Of Programming. At least, that's why Postel, Crocker, et al put so much effort into writing the RFCs, so it would be a SMOP. I think they did a pretty good job. I agree with you that we should make it relatively difficult to put things that *don't* conform to the RFCs on the wire. But that should be the responsibility of the middleware that talks to the file system and to the MTA. I see no reason *at this stage* to burden MUA (in the general sense) developers with all the RFC rules, and MDA/MTA writers "should" only need to worry about it for error handling (__bytes__() should normally do the job for them). (For values of "should" equivalent to "in my dreams", I do fear.) > This makes it very important that the easy way of doing things be > the correct way. With Address fields, that way is Nonsense. You are ignoring the fact that *people* (ie, nobody participating in this thread) read an address field *as text*, and they type in addresses *as text*. We do not extract and inject this information as pickles of Header objects via Firewire sockets implanted in their skulls. There is *no /unique/ correct way* here. > >For example (this is a trick question), in your opinion, what > >should > > > > msg['To'][0] > > > >return if the original header was > > > >To: Stephen J. Turnbull > > > >? > > ('Stephen J. Turnbull', 'stephen at xemacs.org') > > You must be very confused to think this is a trick question. > Try it with the current email package's email.utils.parseaddr(). > Again, see RFC5322 section 3.4. But section 3.4 is not relevant to the trickiness, and parseaddr is not strictly conforming. See the definitions of name-addr, display-name, phrase, word, atom, and atext in sections 3.2.3, 3.2.5, and 3.4 of the RFC you cite. Also see the definition of special. 
Finally, I commend to your attention the definition of obs-phrase in section 4.1, and the *very* special nature of this particular gotcha as described there. The point is that by parsing that and claiming it's an RFC 5322 section 3.4 name-addr, you have invoked the rather magical Postel Principle. You either have to say "for my purpose I want magic in the API" (which you previously denied), or you have to admit that this is harder than it looks. It is true that section 4.1 says that the obsolete ("interpreting") syntax must be accepted *off the wire*. So there certainly is a justification for having a short obvious elegant spelling for "make an address Header into a sequence". But IMHO that spelling should be "list(msg['To'])", not "msg['To']". The rationale is that---assuming it can be implemented---several of us would like to be able to spell "wire format" as "bytes(msg['To'])" and "display format" as "str(msg['To'])". I bet there are other uses that would be well-served by such indirection. And I would be disappointed if we can't do way better than "msg.get_header('To').flatten()" to get bytes---or should that be string?---out. > > > Internally, the Header whose .useful attribute is returned by > > > "msg['foo']" will contain parsed data, referring to parsed tokens. > > > Flattening those parsed tokens will produce the original data. Not > > > a problem at all, simple to implement, in the most direct way. > > > >And horrid to use, if you mean that the internal representation will > >be a full parse tree according to the augmented BNF in RFCs 822, 2822, > >5322, 2045-2049, etc etc., and that the only other way to access that > >data is via an arbitrarily defined .useful attribute (which, BTW, is > >quite unpythonic if you intend for it to be available as msg['foo'] as > >well: TOOWTDI). > > You put words in my mouth. Of course I don't put words in your mouth. 
The phrase "if you mean that" clearly indicates that what follows is *my* understanding of the implications of what you wrote. I think that interpretation is quite justifiable based on your insistence that the OOWTDI be your "sequence of (address, display-name) pairs." > Why assume that I am incompetent, or a fool? I don't assume any such thing. But I become less and less trustful of your goodwill toward requirements other than your own. > Of course the internal representation would include the full parse tree. > Of course the external interface would provide read and write access to the > relevant data. Note that I didn't say it wouldn't. I said it *would*. But I think it's justified, by what you have written so far, to expect that it would be an inconvenient interface (maybe even "horridly" so). > The .useful attribute (need a better name) I like __getitem__(), __str__(), and __bytes__(), for starters. I think we do *need* multiple names, because different presentations are "useful" in different contexts. > is the way to read the useful part of the data extracted from the > parse tree, whatever type of data that is, which depends on the > header field type, determined by its name. Each Header subclass Please remember that Barry says he doesn't like subclassing to deal with issues of variation in header semantics, based on his experience with it in past versions of the email package. I'm not sure how he plans to avoid it (I suspect he'll be forced to give it up because what he comes up with will be horrid), but at this stage we really shouldn't assume that we can freely subclass Header. > would have its own other attributes. The .useful attribute guides > users and is used by .__getitem__() to return that data. As I said before, I agree with RDM (not to mention pretty much everybody but you that has posted on this topic) that there should be one more level of indirection here. Ie, __getitem__ should return a Header object.
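[Editorial sketch] The "one more level of indirection" argued for above, where the same header object offers list(), str() and bytes() views, could look something like the following. SketchAddressHeader is a hypothetical class invented for this illustration; it is not the email package's API and not a design anyone in the thread proposed in this form. It leans on the existing email.utils.formataddr() helper for RFC 2047 encoding of non-ASCII display names, and all addresses are placeholders.

```python
from email.utils import formataddr


class SketchAddressHeader:
    """Hypothetical address header offering three views of the same data."""

    def __init__(self, name, pairs):
        self.name = name          # header field name, e.g. "To"
        self.pairs = list(pairs)  # (display name, address) tuples

    def __iter__(self):
        # list(header) gives the sequence-of-pairs view.
        return iter(self.pairs)

    def __str__(self):
        # Display format: Unicode text, nothing encoded.
        return ", ".join(
            "%s <%s>" % (n, a) if n else a for n, a in self.pairs
        )

    def __bytes__(self):
        # Wire format: ASCII only, non-ASCII names RFC 2047 encoded.
        wire = ", ".join(formataddr(p, charset="utf-8") for p in self.pairs)
        return wire.encode("ascii")


to = SketchAddressHeader(
    "To", [("José", "jose@example.org"), ("", "list@example.org")]
)
print(list(to))
print(str(to))
print(bytes(to))
```

The point of the sketch is only that no single ".useful" value has to be chosen: each caller picks the presentation that suits its context.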
From stephen at xemacs.org Fri Apr 17 12:20:30 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Fri, 17 Apr 2009 19:20:30 +0900 Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: References: <87k55m5mno.fsf@xemacs.org> <200904162302.14641.steve@pearwood.info> Message-ID: <87fxg74m4x.fsf@xemacs.org> R. David Murray writes: > Note that while I want to be able to do str(someHeader) to get a > string representation of a header body, I'm not so enamored of being > able to do > > message['From'] = 'John Smith ' > > and have it get turned into a Header or AddressHeader object. > Frankly, that looks too magical to me. +1 Well, that would make it easy to write scripts that parse lists of addresses and do things with them. Eg, a mailing list manager's "mass subscribe" interface. That would be nice ... but on reflection it's clear that we would want that to be parsed *strictly*. So it raises exceptions, which must be caught and handled, etc etc. In other words, it's actually not so easy to write scripts, no matter what you do, and you also want to be able to specify what kind of magical fixups (the ever-popular "display-name with unquoted period" immediately comes to mind as one example) are acceptable, and which are not, not to mention encoding for non-ASCII text. How about unstructured header bodies, like "Subject"? Should we allow it, for convenience, or not, for consistency? How about unknown fields, eg "X-Are-We-Not-Structured-No-We-Are-Devo"? I think, in the first draft, we should be *consistent* in both cases. From rdmurray at bitdance.com Fri Apr 17 13:19:31 2009 From: rdmurray at bitdance.com (R. 
David Murray) Date: Fri, 17 Apr 2009 07:19:31 -0400 (EDT) Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: <87fxg74m4x.fsf@xemacs.org> References: <87k55m5mno.fsf@xemacs.org> <200904162302.14641.steve@pearwood.info> <87fxg74m4x.fsf@xemacs.org> Message-ID: On Fri, 17 Apr 2009 at 19:20, Stephen J. Turnbull wrote: > R. David Murray writes: > > > Note that while I want to be able to do str(someHeader) to get a > > string representation of a header body, I'm not so enamored of being > > able to do > > > > message['From'] = 'John Smith ' > > > > and have it get turned into a Header or AddressHeader object. > > Frankly, that looks too magical to me. > > +1 > > Well, that would make it easy to write scripts that parse lists of > addresses and do things with them. Eg, a mailing list manager's "mass > subscribe" interface. That would be nice ... but on reflection it's > clear that we would want that to be parsed *strictly*. So it raises > exceptions, which must be caught and handled, etc etc. In other > words, it's actually not so easy to write scripts, no matter what you > do, and you also want to be able to specify what kind of magical > fixups (the ever-popular "display-name with unquoted period" > immediately comes to mind as one example) are acceptable, and which > are not, not to mention encoding for non-ASCII text. > > How about unstructured header bodies, like "Subject"? Should we allow > it, for convenience, or not, for consistency? > > How about unknown fields, eg "X-Are-We-Not-Structured-No-We-Are-Devo"? > > I think, in the first draft, we should be *consistent* in both cases. Yes, I think consistency is good. Since I'm visualizing a message as being a container for headers (as well as the body...but we aren't talking about that right now), I would expect that I'd have to put Header objects into it. 
I don't think the "overhead" of having to do message['Subject'] = Header('subject string') is very large, and the code feels better to me that way (if I get Headers out, I should be putting Headers in..."explicit is better than implicit"). As for parsing a list of addresses in a manager interface...presumably we want to provide address-parsing tools to make that job easier. Folding address parsing in to a header setting operation would make that scripting task _harder_. Header creation should provide an easy way to pass in user address input as Unicode strings, but underlying that should be a more atomic address parsing interface. --David From stephen at xemacs.org Fri Apr 17 17:13:06 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sat, 18 Apr 2009 00:13:06 +0900 Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: References: <87k55m5mno.fsf@xemacs.org> <200904162302.14641.steve@pearwood.info> <87fxg74m4x.fsf@xemacs.org> Message-ID: <877i1j48l9.fsf@xemacs.org> R. David Murray writes: > put Header objects into it. I don't think the "overhead" of > having to do > > message['Subject'] = Header('subject string') Hm. Should a Header know which header it is? Ie, should that be message['Subject'] = Header('subject', 'subject string') ? (I assume you would be less than in love with having the assignment magically stuffing "Subject" into the Header as it gets assigned.) From rdmurray at bitdance.com Fri Apr 17 17:21:34 2009 From: rdmurray at bitdance.com (R. David Murray) Date: Fri, 17 Apr 2009 11:21:34 -0400 (EDT) Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: <877i1j48l9.fsf@xemacs.org> References: <87k55m5mno.fsf@xemacs.org> <200904162302.14641.steve@pearwood.info> <87fxg74m4x.fsf@xemacs.org> <877i1j48l9.fsf@xemacs.org> Message-ID: On Sat, 18 Apr 2009 at 00:13, Stephen J. Turnbull wrote: > R. David Murray writes: > > > put Header objects into it. 
I don't think the "overhead" of > > having to do > > > > message['Subject'] = Header('subject string') > > Hm. Should a Header know which header it is? Ie, should that be > > message['Subject'] = Header('subject', 'subject string') > > ? (I assume you would be less than in love with having the assignment > magically stuffing "Subject" into the Header as it gets assigned.) Hmm. Probably. But: message.addHeader(Header('subject', 'subject string')) would seem sensible. That loses the nice collections interface...but if a Header knows its keyword then it makes sense. However, I'm not convinced a Header should know its keyword. After all, the only difference between a From: header and a To: header is the keyword, and one can easily imagine wanting to do something like: replymessage['to'] = frommessage['from'] --David From barry at python.org Fri Apr 17 18:38:20 2009 From: barry at python.org (Barry Warsaw) Date: Fri, 17 Apr 2009 12:38:20 -0400 Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: References: <87k55m5mno.fsf@xemacs.org> <200904162302.14641.steve@pearwood.info> <87fxg74m4x.fsf@xemacs.org> <877i1j48l9.fsf@xemacs.org> Message-ID: <14BC0DB2-1F8B-4872-BD55-E9824D9D43BA@python.org> Folks, just a quick followup explanation on why I've been radio silent on this sig. Taxes, work, and personal stuff just caught up to me this week. I'm hoping to find some time this weekend to review and respond to the blizzard of messages on this thread. -Barry
From tonynelson at georgeanelson.com Fri Apr 17 19:25:59 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Fri, 17 Apr 2009 13:25:59 -0400 Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: <14BC0DB2-1F8B-4872-BD55-E9824D9D43BA@python.org> References: <87k55m5mno.fsf@xemacs.org> <200904162302.14641.steve@pearwood.info> <87fxg74m4x.fsf@xemacs.org> <877i1j48l9.fsf@xemacs.org> <14BC0DB2-1F8B-4872-BD55-E9824D9D43BA@python.org> Message-ID: At 12:38 -0400 04/17/2009, Barry Warsaw wrote: >Folks, just a quick followup explanation on why I've been radio silent >on this sig. Taxes, work, and personal stuff just caught up to me >this week. I'm hoping to find some time this weekend to review and >respond to the blizzard of messages on this thread. Nothing else? A Python release? Your adult supervision will be welcome. I, for one, did not learn everything I needed from kindergarten. Also, we need some new threads. We've been keeping everything in this JSON thread. -- ____________________________________________________________________ TonyN.:' ' From tonynelson at georgeanelson.com Fri Apr 17 19:26:16 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Fri, 17 Apr 2009 13:26:16 -0400 Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: <877i1j48l9.fsf@xemacs.org> References: <87k55m5mno.fsf@xemacs.org> <200904162302.14641.steve@pearwood.info> <87fxg74m4x.fsf@xemacs.org> <877i1j48l9.fsf@xemacs.org> Message-ID: At 00:13 +0900 04/18/2009, Stephen J. Turnbull wrote: >R. David Murray writes: > > > put Header objects into it. I don't think the "overhead" of > > having to do > > > > message['Subject'] = Header('subject string') > >Hm. Should a Header know which header it is? Ie, should that be > > message['Subject'] = Header('subject', 'subject string') ...
How about: message['Subject'] = 'subject string' message['To'] = ('joe', 'joe123 at foo.com') Since the Header does indeed know what it is, a Subject: Header could expect a string for input, and an Address header could expect an address tuple or list of them (and cope with possibly needing to coerce the addr-spec into bytes with the ASCII codec). Internally, Message.__setitem__() would look up the name, making and assigning the proper Header subclass if missing, and pass that object the data. The Header subclass knows what type of data it expects and raises (ValueError?) if it gets something inappropriate. -- ____________________________________________________________________ TonyN.:' ' From rdmurray at bitdance.com Fri Apr 17 19:32:11 2009 From: rdmurray at bitdance.com (R. David Murray) Date: Fri, 17 Apr 2009 13:32:11 -0400 (EDT) Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: References: <87k55m5mno.fsf@xemacs.org> <200904162302.14641.steve@pearwood.info> <87fxg74m4x.fsf@xemacs.org> <877i1j48l9.fsf@xemacs.org> Message-ID: On Fri, 17 Apr 2009 at 13:26, Tony Nelson wrote: > At 00:13 +0900 04/18/2009, Stephen J. Turnbull wrote: >> R. David Murray writes: >> >>> put Header objects into it. I don't think the "overhead" of >>> having to do >>> >>> message['Subject'] = Header('subject string') >> >> Hm. Should a Header know which header it is? Ie, should that be >> >> message['Subject'] = Header('subject', 'subject string') > ... > > How about: > > message['Subject'] = 'subject string' > message['To'] = ('joe', 'joe123 at foo.com') Like I said, personally I find that too magical for my tastes.
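For reference, a minimal sketch of the dispatching __setitem__ under discussion: the message looks up a header class by field name and lets that class validate the value. Everything here (SubjectHeader, AddressHeader, HEADER_TYPES, DispatchingMessage, the example.com address) is hypothetical and exists only to make the "magic" concrete; it is not how the email package behaves.

```python
class SubjectHeader:
    """Hypothetical header type that accepts only a text string."""

    def __init__(self, value):
        if not isinstance(value, str):
            raise ValueError("Subject expects a string")
        self.value = value


class AddressHeader:
    """Hypothetical header type: one (name, addr) pair or a list of pairs."""

    def __init__(self, value):
        # Normalize a single pair to a one-element list.
        pairs = [value] if isinstance(value, tuple) else list(value)
        for pair in pairs:
            if not (isinstance(pair, tuple) and len(pair) == 2):
                raise ValueError("address headers expect (name, addr) pairs")
        self.pairs = pairs


# Registry keyed by lowercased field name (an assumption for this sketch).
HEADER_TYPES = {"subject": SubjectHeader, "to": AddressHeader, "cc": AddressHeader}


class DispatchingMessage:
    """Hypothetical message whose __setitem__ builds the Header itself."""

    def __init__(self):
        self._headers = {}

    def __setitem__(self, name, value):
        # Pick the header class from the field name, then let it validate.
        header_type = HEADER_TYPES[name.lower()]
        self._headers[name.lower()] = header_type(value)

    def __getitem__(self, name):
        return self._headers[name.lower()]


message = DispatchingMessage()
message["Subject"] = "subject string"
message["To"] = ("joe", "joe123@example.com")
```

Note that the message class has to know the registry, and that the single-pair shortcut in AddressHeader is exactly the kind of input guessing the explicit Header(...) form avoids.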
--David From tonynelson at georgeanelson.com Fri Apr 17 19:37:17 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Fri, 17 Apr 2009 13:37:17 -0400 Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: <87hc0n4mmx.fsf@xemacs.org> References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com> <49DF8956.5050501@g.nevcal.com> <7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org> <87y6u4tn4s.fsf@xemacs.org> <87vdp761ly.fsf@xemacs.org> <87k55m5mno.fsf@xemacs.org> <8763h55d5c.fsf@xemacs.org> <87hc0n4mmx.fsf@xemacs.org> Message-ID: At 19:09 +0900 04/17/2009, Stephen J. Turnbull wrote: >Tony Nelson writes: > > > No. The useful data for an address field is *properly* a list of > > pairs of friendly name, address -- you should read RFC 5322 section > > 3.4. > >The fact that you think I didn't suggests there's really no point in >continuing to talk to you. But I'll give it another try. > >The issues we are dealing with at this point really have very little >to do with accurate implementation of the RFCs. We all know that's >necessary, but ... it's a Simple Matter Of Programming. At least, >that's why Postel, Crocker, et al put so much effort into writing the >RFCs, so it would be a SMOP. I think they did a pretty good job. > >I agree with you that we should make it relatively difficult to put >things that *don't* conform to the RFCs on the wire. But that should >be the responsibility of the middleware that talks to the file system >and to the MTA. I see no reason *at this stage* to burden MUA (in the >general sense) developers with all the RFC rules, and MDA/MTA writers >"should" only need to worry about it for error handling (__bytes__() >should normally do the job for them). (For values of "should" >equivalent to "in my dreams", I do fear.) You are insisting on is so burdening them. 
I propose lifting that burden. > > This makes it very important that the easy way of doing things be > > the correct way. With Address fields, that way is > >Nonsense. You are ignoring the fact that *people* (ie, nobody >participating in this thread) read an address field *as text*, >and they type in addresses *as text*. We do not extract and inject >this information as pickles of Header objects via Firewire sockets >implanted in their skulls. There is *no /unique/ correct way* here. If only "People" did that in a way that survived transport. > > > >For example (this is a trick question), in your opinion, what > > >should > > > > > > msg['To'][0] > > > > > >return if the original header was > > > > > >To: Stephen J. Turnbull > > > > > >? > > > > ('Stephen J. Turnbull', 'stephen at xemacs.org') > > > > You must be very confused to think this is a trick question. > > Try it with the current email package's email.utils.parseaddr(). > > Again, see RFC5322 section 3.4. > >But section 3.4 is not relevant to the trickiness, and parseaddr is >not strictly conforming. See the definitions of name-addr, >display-name, phrase, word, atom, and atext in sections 3.2.3, 3.2.5, >and 3.4 of the RFC you cite. Also see the definition of special. >Finally, I commend to your attention the definition of obs-phrase in >section 4.1, and the *very* special nature of this particular gotcha >as described there. What parseaddr() doesn't support is groups. I haven't seen groups used, though. It does support Comments when a name-addr is not present. I still don't see any trick. "Stephen J. Turnbull" has always been accepted as a display-name, RFC 822 notwithstanding. Any useful implementation must take such things into account, even if conformance would have required the display-name (or at least the ".") to have been quoted. >The point is that by parsing that and claiming it's an RFC 5322 >section 3.4 name-addr, you have invoked the rather magical Postel >Principle.
You either have to say "for my purpose I want magic in the >API" (which you previously denied), or you have to admit that this is >harder than it looks. ... No. You want to make it hard for the user of the email package. I want to make it easy for the user of the email package. How hard it is for the programmer is not an issue, but thank you for your concern. -- ____________________________________________________________________ TonyN.:' ' From tonynelson at georgeanelson.com Fri Apr 17 19:37:43 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Fri, 17 Apr 2009 13:37:43 -0400 Subject: [Email-SIG] API for Header objects In-Reply-To: <87iql34mvc.fsf@xemacs.org> References: <87k55m5mno.fsf@xemacs.org> <200904162302.14641.steve@pearwood.info> <87iql34mvc.fsf@xemacs.org> Message-ID: At 19:04 +0900 04/17/2009, Stephen J. Turnbull wrote: >Tony Nelson writes: > > > This example seems tortured and contrived. > >Not at all. I currently use grep, not the email package, but in fact >I extract several headers for use in mailing list moderation. It's >getting to the point where my gradually accreting shell script doesn't >cut it (more because I'm recruiting additional moderators than because >I'm not happy with it), and if I'm going to do this in Python I >definitely want an obvious and elegant way to produce a displayable >string (ie, Unicode) because not all of the messages I get in Chinese >and Korean are spam. Now /that/ is a use case. Spam headers are a poor one in any case, as there are so many different ones. > > Custom code to extract a single header one time to send to someone? > >That is precisely why we want a simple readable short elegant API. > >Like str(msg['To']). Would that return the display-name (friendly name) for the listed mailboxes in one string, presumably separated by commas? How would you get the addr-specs? How would you get both? Use bytes() to flatten all the data, or just the addr-specs?
>This also suggests the sequence interface of msg['To'] should not >contain tuples of strings, but rather NameAddr objects (taken from the >RFC 5322 grammar). Then to flatten a NameAddr, use str or bytes as >appropriate. So to present a list of addressees in a moderation >interface, you could use I was a bit sloppy. The tuples would be character string, byte string: in 2.x, unicode and string; in 3.x, string and bytes. Flattening to bytes (2.x: string) for export would be ._flatten(). In practice, the display-names and addr-specs may have had "defects" when parsing the message. Addr-specs are supposed to be ASCII, but the local-part sometimes isn't. Display-names don't always RFC 2047 decode properly, or may have non-ASCII characters in them.
>    recips = list(msg['To']) + list(msg['Cc'])
>
>    # We have a utf-8 codec on stdout, between us and the wire.
>    print("<ul>\n")
>    for recip in recips:
>        print("<li>")
>        print(htmlesc(str(recip)))
>        print("</li>\n")
>    print("</ul>\n")
>
>Of course for wire protocol, you just use "bytes" instead of "str". >Hey! that's not bad, even if I do say so myself. You wouldn't like for name, addr in msg['To'] + msg['Cc'] + msg['Bcc']: instead? str(addr) should work (IIUC Py3K) if addr is ASCII, as it should be. >...People (by which I mean nobody participating >in this thread) think of email as text. ... No, they don't. You have to ask them the right questions. Sure they'll say text, but they really expect styled fancy structured colored text with pictures and links and attached documents. Roughly, they think of email as web pages (archived HTML, if they knew the word). Only the most sophisticated or old and stubborn think of it otherwise. >MTAs think of email as octet sequences. Well, disagree with the MTA, and the MTA wins. Messages on the wire /are/ bytes. >Developers (especially Americans) have been sloppy about >that distinction for *five* decades, and because until 2000 at least >email was the sine qua non of networking, backward compatibility has >long demanded incorporating all those mistakes in current practice. > >And now you're doing the same thing. Email messages have at *least* >four ways of manifesting in our world that email-sig needs to worry >about: as byte sequences on the wire, as (mostly, anyway, and >certainly the headers) texts in our MUAs, as whatever-they-really-are, >and as the internal representation of the email package. So depending >on which side of the argument you feel like taking, you insist >(inconsistently) that "an email is a byte string" or "a header is not >a string at all, it's a structured thingie". But it's not that easy. > >What we need to do is come up with an API that respects all of those >aspects *simultaneously*, and allows us to elegantly but accurately >change the perspective we use to view this "whatever-it-really-is". That's why my proposal is so good, as it does this. > > No, email is not text.
Email message bodies and some header fields > > may represent text. An email message is a byte sequence. One > > really needs to understand this in order to work with email at a > > low level. > >Hm. And here I was hoping that the email package would *implement* >the low level, leaving me free to think about high-level things. You have that now, and it is terribly hard to use. > > When one does not understand, then the email package should lead > > the user in the right direction. > >No, thank you. Python is a double-opt-in language. We're all >consenting adults here. Programmers who don't understand the RFCs are >likely to be surprised in many places, but they asked for it, they got >it. Battery materials included! Build your own batteries if you can learn how! Some have done it in as little as two years. There are other languages competing with Python, and users can choose to use them instead. Python's email package needs to stop requiring years of study to use correctly. -- ____________________________________________________________________ TonyN.:' ' From rdmurray at bitdance.com Fri Apr 17 22:25:42 2009 From: rdmurray at bitdance.com (R. David Murray) Date: Fri, 17 Apr 2009 16:25:42 -0400 (EDT) Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com> <49DF8956.5050501@g.nevcal.com> <7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org> <87y6u4tn4s.fsf@xemacs.org> <87vdp761ly.fsf@xemacs.org> <87k55m5mno.fsf@xemacs.org> <8763h55d5c.fsf@xemacs.org> <87hc0n4mmx.fsf@xemacs.org> Message-ID: On Fri, 17 Apr 2009 at 13:37, Tony Nelson wrote: > At 19:09 +0900 04/17/2009, Stephen J. Turnbull wrote: >> Tony Nelson writes: >> >> I agree with you that we should make it relatively difficult to put >> things that *don't* conform to the RFCs on the wire. 
But that should >> be the responsibility of the middleware that talks to the file system >> and to the MTA. I see no reason *at this stage* to burden MUA (in the >> general sense) developers with all the RFC rules, and MDA/MTA writers >> "should" only need to worry about it for error handling (__bytes__() >> should normally do the job for them). (For values of "should" >> equivalent to "in my dreams", I do fear.) > > You are insisting on is so burdening them. I propose lifting that burden. I don't see how Stephen and my proposals burden the developer more than yours. In fact, I'm pretty sure it's the opposite way around. >>> This makes it very important that the easy way of doing things be >>> the correct way. With Address fields, that way is >> >> Nonsense. You are ignoring the fact that *people* (ie, nobody >> participating in this thread) read an address field *as text*, >> and they type in addresses *as text*. We do not extract and inject >> this information as pickles of Header objects via Firewire sockets >> implanted in their skulls. There is *no /unique/ correct way* here. > > If only "People" did that in a way that survived transport. I don't understand that comment. It's the email package's job to provide a way for the programmer (the user of the email package's API) to allow the text entered by the user (the person actually sending and receiving messages) to survive transport. --David From stephen at xemacs.org Sat Apr 18 09:02:28 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sat, 18 Apr 2009 16:02:28 +0900 Subject: [Email-SIG] API for Header objects In-Reply-To: References: <87k55m5mno.fsf@xemacs.org> <200904162302.14641.steve@pearwood.info> <87iql34mvc.fsf@xemacs.org> Message-ID: <87y6ty30mz.fsf@xemacs.org> Tony Nelson writes: > > > Custom code to extract a single header one time to send to someone? > > > >That is precisely why we want a simple readable short elegant API. > > > >Like str(msg['To']). 
> > Would that return the display-name (friendly name) for the listed mailboxes > in one string, presumbably separated by commas? How would you get the > addr-specs? How would you get both? Use bytes() to flatten all the data, > or just the addr-specs? Who knows? Who cares? AFAICS it's a SMOP. Are we not hackers? We'll design the internal representation to be lossless, then flatten it out in a simple, straightforward way, as a newline-less, comma-separated string. The question at issue here is "how does an email client request flattening to bytes? to str?" Not "how does email do those things?" > I was a bit sloppy. The tuples would be character string, byte string: in > 2.x, unicode and string; in 3.x, string and bytes. Flattening to bytes > (2.x: string) for export would be ._flatten(). You're still being sloppy. In this part of the thread we were talking about a *text* representation ("extract a single header ... to send to someone", where the header in question is intended to be human-readable). Why are we suddenly using bytes? > In practice, the display-names and addr-specs may have had "defects" when > parsing the message. So what? That is, yes, we all already know that, and you don't need to repeat it unless you're going to tell us something new. Like, how are we going to represent the choice of how to deal with them in the API? What choices are we going to offer? > You wouldn't like > > for name, addr in msg['To'] + msg['Cc'] + msg['Bcc']: > > instead? Not necessarily, because that requires me to special-case situations where name is None, at least, and maybe cases where addr is None (that depends on what the parser does with a header like To: me at home.com, , you at earth.li of course.) > >...People (by which I mean nobody participating > >in this thread) think of email as text. ... > > No, they don't. Will you please cut this out? Everything you say is true. 
The problem is that you are inconsistently choosing half-truths for the purpose of winning a debate, rather than trying to design a coherent API. The latter is the purpose of this SIG, not the former. I do not have a lot of confidence that I'm *right*. However, the idea that bytes(object) gives wire format, str(object) gives a simple text presentation, and more complex presentation requires either massaging str(object) or direct access to the internal representation of object is a unifying theme in my (so far partial) proposal. You have no unity, just confidence that your API for Headers that are structured as address lists is "right". If you continue with that approach on a Header type by Header type basis, experience suggests that you *will* end up with a horrid API. > >What we need to do is come up with an API that respects all of those > >aspects *simultaneously*, and allows us to elegantly but accurately > >change the perspective we use to view this "whatever-it-really-is". > > That's why my proposal is so good, as it does this. But only for the To: header. There's no generality to it, you will propose a different representation for the "useful data" of other headers, and you still don't deal with the fact that what's useful to you may not serve the needs of others. > >Hm. And here I was hoping that the email package would *implement* > >the low level, leaving me free to think about high-level things. > > You have that now, and it is terribly hard to use. So let's fix the implementation to be easy to use. I see no proof whatsoever > > > > > When one does not understand, then the email package should lead > > > the user in the right direction. > > > >No, thank you. Python is a double-opt-in language. We're all > >consenting adults here. Programmers who don't understand the RFCs are > >likely to be surprised in many places, but they asked for it, they got > >it. > > Battery materials included! Build your own batteries if you can learn how! 
> Some have done it in as little as two years. > > There are other languages competing with Python, and users can choose to > use them instead. Python's email package needs to stop requiring years of > study to use correctly. The API you propose will require such study, I'm pretty sure. If you want to convince me otherwise (and I believe I'm representative in this), you need to show how your approach will lead to regularity and coherence in the API for *all* headers, not just the example-du-jour. From stephen at xemacs.org Sat Apr 18 11:14:29 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sat, 18 Apr 2009 18:14:29 +0900 Subject: [Email-SIG] API for Header objects In-Reply-To: References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com> <49DF8956.5050501@g.nevcal.com> <7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org> <87y6u4tn4s.fsf@xemacs.org> <87vdp761ly.fsf@xemacs.org> <87k55m5mno.fsf@xemacs.org> <8763h55d5c.fsf@xemacs.org> <87hc0n4mmx.fsf@xemacs.org> Message-ID: <87ws9i2uiy.fsf@xemacs.org> Tony Nelson writes: > You are insisting on is so burdening them. I propose lifting that burden. I disagree. You have made no concrete arguments that what I propose is a burden, except in the long-since discredited sense of a few extra keystrokes for a single use-case. However, you *could* in principle do so, because I've proposed (the principles for) a fairly generic API. 
Specifically, to assess the burden of understanding, so far I've implicitly accepted the existing API for Messages: to get one instance of a header: msg[tag] to get the payload: msg.payload which is clearly flawed (see footnote [1] and the random thoughts, below) but not fatally so (and it's not obvious what would be better), and proposed a *generic* API for all types in email: to get the wire format (validated) of obj: bytes(obj) to get a text display (unformatted) of obj: str(obj) to get access to all attributes of obj: obj Guess what? We already have an API nearly sufficient[1] for reading and generating Unicode text/plain messages! It requires four (count them, *four*) identifiers not in Python itself: the classes Message, Header, and Payload, and the Message attribute .payload. Is that burdensome? Very well then, it is burdensome. I *joyfully* impose that burden on you, and of course accept it myself. (I'm cheating a little bit, because I've ignored the issue of how to get valid data into structured Headers when generating a new message. But you haven't addressed that issue for the case of "msg['To'] shall return a list of (display-name, mailbox) tuples" yet, either, and I can use whatever method you define so there's no additional burden.) I've also suggested for many object types where structuring as a sequence makes sense: to get a sequence of subobjects of obj: list(obj) With that addition, I think we're almost ready to write a mailing list manager.<0.5 wink> [[ Some random thoughts apropos this outline ]] It may make sense to apply the list API to msg['Received'], returning the list of values of 'Received' headers. I think it does *not* make sense to apply it to msg['Resent-To'], as resent headers generally come in blocks, and the API should reflect that, I think. 
That being so, I wonder if it *really* makes sense for msg[tag] to return the list of all instances of the tag field instead of a more or less arbitrary individual, even in the case of a header defined to be unique by the RFCs. Then we'd need an API for accessing blocks (maybe for parsed incoming only, rather than something mutable for setting on outgoing messages). Something like to get the list of blocks of resent headers: msg.blocks['Resent'] where you'd have to define each type of block to the parser. Each block would be a dictionary of the related headers, so you could get the most recent Resent-To field with msg.blocks['Resent'][0]['To']. [[ end random thoughts ]] > What parseaddr() doesn't support is groups. I haven't seen groups used, > though. It does support Comments when a name-addr is not present. > > I still don't see any trick. "Stephen J. Turnbull" has always been > accepted as a display-name, RFC 822 notwithstanding. Not when *validating* a header generator or user input! Note that what you're implying is that your standard of correct is not the RFCs, it's "what has always been done." That's problematic. In fact, the RFC process is carefully designed to account for "what has always been done." And in this case, the RFC authors have had *four* chances to accept 'Stephen J. Turnbull' as valid syntax, and they have refused every single time. > You want to make it hard for the user of the email package. The first time I let it slide. Now that you've repeated it, I think you owe me an apology. Footnotes: [1] What's missing is a way to handle multiple instances of a given field. This is a defect in the Message class, not the Header class, and we haven't really discussed Message at all, so I beg the reader's indulgence. From stephen at xemacs.org Sat Apr 18 11:45:43 2009 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Sat, 18 Apr 2009 18:45:43 +0900 Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json] In-Reply-To: References: <87k55m5mno.fsf@xemacs.org> <200904162302.14641.steve@pearwood.info> <87fxg74m4x.fsf@xemacs.org> <877i1j48l9.fsf@xemacs.org> Message-ID: <87vdp22t2w.fsf@xemacs.org> Tony Nelson writes: > How about: > > message['Subject'] = 'subject string' > message['To'] = ('joe', 'joe123 at foo.com') > > Since the Header does indeed know what it is How? Since there's no explicit constructor, in fact the Header doesn't know what it is until the Message tells it what it is. That means that the registry of Header types must be known to the Message class. That may not be a burden on the clients of email, but I can't see it as a warm fuzzy for the maintainers of email. > Internally, Message.__setitem__() would look up the name, making > and assigning the proper Header subclass if missing, and pass that > object the data. The Header subclass knows what type of data it > expects and raises (ValueError?) if it gets something > inappropriate. I don't think the FLUFL will accept that. First of all, "Mama don' 'low no raisin's round heya." Second, the duck-typing on the 'To' example is a little hairy. Since in your model message['To'] *produces* a sequence of pairs when evaluated, I would expect it to require a sequence of pairs on input. But you seem to be suggesting that if it's a sequence but not a sequence of sequences, it should handle that by assuming it's a (name,addr) pair. Heck, maybe we should do something reasonable if it's not a sequence. I think this is *way* too magical for the basic API of email. I'm also bothered by the complexity of having Message accept responsibility for the validity of data to be input to Header, then delegate that responsibility back to Header. Users don't care, I suspect, but maintainers will.
There will be an extra frame for Message in any trace caused by Header raising ValueError (or whatever), which is annoying since Message should be just passing the argument on ... but it needs to be remembered or looked up, and there will always be the temptation for developers to add a little smarts to Message's handling of Header's arguments (especially in derived classes, eg, a ListPost class that optionally prepends "[listname] " to the front of a post). Finally, I know that if I'm away from the email package for more than 24 hours, I'll forget which order the display name and address come in. From barry at python.org Wed Apr 22 19:09:19 2009 From: barry at python.org (Barry Warsaw) Date: Wed, 22 Apr 2009 13:09:19 -0400 Subject: [Email-SIG] [issue1078919] Email.Header encodes non-ASCII content incorrectly In-Reply-To: <1240416199.6.0.32712734439.issue1078919@psf.upfronthosting.co.za> References: <1240416199.6.0.32712734439.issue1078919@psf.upfronthosting.co.za> Message-ID: <1DFFDE3A-29F5-45D3-A228-9B779AE076C1@python.org> On Apr 22, 2009, at 12:03 PM, Daniel Diniz wrote: > > Changes by Daniel Diniz : > > > ---------- > keywords: +easy I say nothing in email is easy. :-O -Barry From janssen at parc.com Fri Apr 24 22:31:09 2009 From: janssen at parc.com (Bill Janssen) Date: Fri, 24 Apr 2009 13:31:09 PDT Subject: [Email-SIG] Message instances compare as False Message-ID: <79012.1240605069@parc.com> I spent the morning finding and fixing a problem in my IMAP server, which was caused by the fact that certain message instances evaluate as "False", if they have no headers. But that's a valid thing to have as part of a multipart body, I believe. So I had some code that looked like this: foo = msg1 or msg2 and I was getting msg2 even though I thought I should be getting msg1.
Might make sense to explicitly bind __nonzero__ in this class. Bill
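The truthiness behaviour described above can be reproduced without the email package at all. The following is only a minimal sketch under stated assumptions: HeaderBag, TruthyHeaderBag and first_available are hypothetical stand-ins invented for illustration, not the package's Message class or its actual fix. It relies on the general rule that an object defining __len__ but no truth hook is false when its length is zero (the hook is spelled __nonzero__ in Python 2 and __bool__ in Python 3).

```python
class HeaderBag:
    """Hypothetical stand-in for a message-like class: defines __len__ only."""

    def __init__(self, headers=()):
        self._headers = list(headers)

    def __len__(self):
        # Number of headers; a length of 0 makes the instance falsy.
        return len(self._headers)


class TruthyHeaderBag(HeaderBag):
    """Same container, but always true regardless of header count."""

    def __bool__(self):
        return True

    __nonzero__ = __bool__  # Python 2 spelling of the same hook


def first_available(a, b):
    # Hypothetical helper: test against None instead of relying on truthiness.
    return a if a is not None else b


empty = HeaderBag()                         # no headers, so falsy
fallback = HeaderBag([("Subject", "hi")])

picked_by_or = empty or fallback            # falls through to fallback
picked_by_none_check = first_available(empty, fallback)

fixed = TruthyHeaderBag()                   # no headers, but truthy
picked_fixed = fixed or fallback
```

Until a class binds the truth hook explicitly, callers can avoid the surprise by writing the "is not None" test rather than "a or b", as first_available does.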