From esj at harvee.org Mon Jan 4 03:25:56 2010 From: esj at harvee.org (Eric S. Johansson) Date: Sun, 03 Jan 2010 21:25:56 -0500 Subject: [Email-SIG] Design Thoughts Summary In-Reply-To: <4B696451-1885-4670-AB7C-C283132F7638@python.org> References: <1258239230.74.7080@mint-julep.mondoinfo.com> <4B696451-1885-4670-AB7C-C283132F7638@python.org> Message-ID: <4B4151B4.9090406@harvee.org> On 11/15/2009 1:01 PM, Barry Warsaw wrote: > On Nov 14, 2009, at 5:12 PM, Matthew Dixon Cowles wrote: > >> Thank you. I am virtually 100% in agreement that this document >> represents what people have agreed on and that it represents what is >> sensible to do. > > As am I. Fantastic work in pulling this all together David. > > I'm a bit slammed right now, but a quick comment... > >>> * The API needs to at a minimum have hooks available for an >>> application to store data on disk rather than holding everything in >>> memory. >> >> I remain unconvinced that this is worth the trouble. Yes, the Twisted >> folks say that they can't use the email module because they may be >> receiving hundreds of messages at once. But can anyone do anything >> with hundreds of messages at once other than write them to disk? >> >> And would anything actually be improved by reading hundreds of files >> at once, in small chunks, looking for MIME separators? > > Mailman has a similar problem. Even if we get just a few big messages, > they can crush the system. You could argue that the MTA should just > block messages with 50MB bodies if the underlying Mailman code can't > handle it, but I still think we can do better. > > I think we're fine if all the headers and MIME structure were kept in > memory it would be fine. But I do think we just want to be able to never > store the raw body content in memory (perhaps unless needed, on demand). > Mailman for example rarely cares about the bytes of say an image/jpeg body. 
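Barry's suggestion (keep headers and MIME structure in memory, but never hold raw body content there) could be sketched roughly as follows. This is only an illustration, not a proposed email-package API: `DiskBackedBody`, `shed_bodies`, and the `body_ref` attribute are invented names.

```python
import email.parser
import os
import tempfile

class DiskBackedBody:
    """Hold a MIME part's decoded payload in a temp file instead of RAM."""
    def __init__(self, data):
        fd, self._path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            f.write(data)

    def read(self):
        with open(self._path, "rb") as f:
            return f.read()

def shed_bodies(msg):
    """Keep headers and MIME structure in memory; move leaf payloads to disk."""
    for part in msg.walk():
        if not part.is_multipart():
            data = part.get_payload(decode=True) or b""
            part.body_ref = DiskBackedBody(data)  # hypothetical attribute name
            part.set_payload("")                  # free the in-memory bytes

raw = b"From: esj@example.org\nContent-Type: text/plain\n\nhello world\n"
msg = email.parser.BytesParser().parsebytes(raw)
shed_bodies(msg)
# Headers remain queryable in memory; the body now lives on disk.
```

A real implementation would have to decide when (and whether) to pull a body back into memory on demand, which is the "perhaps unless needed" part of Barry's comment.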
For what it's worth, I've also experienced the same "crushing blow" caused by large messages in memory. In my case, I immediately dumped all messages to a database (unfortunately, SQL), extracted the essential metadata I needed for my application, and kept it in the record so that I could index and search on it. I also stored the raw message and the processed message in the database. The reason was that I wanted to be able to analyze the raw message if something failed (usually a Unicode failure), and to be able to retrieve the e-mail object from its JSON container for quicker processing than I would get by parsing the raw message again (and again). This experience makes me a supporter of an e-mail module that has a storage container object that can be searched by any number of metadata fields. These metadata fields would consist of internal (to the message) data sources and external data sources. I believe it would be necessary to specify what searchable fields you want before creating the storage container. I hope that it would be possible to make the storage container backend independent of the storage technology, so that people like me, who will detest SQL until the heat death of the universe, can use something else to store mail messages. I would also recommend not depending on the file system, because in my experience performance declined dramatically around 500 messages (ext3 and jfs); even an SQL database as simple as SQLite was significantly faster. Thanks to all who are working on this project. I wish I could participate more, but life has other plans for me.
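The container Eric describes (searchable fields declared before the store is created, backend left pluggable) might be sketched like this. Every name here (`MessageStore`, `DictStore`, the field names) is invented for illustration; a real backend could be dbm, files, SQLite, or anything else behind the same interface.

```python
from abc import ABC, abstractmethod

class MessageStore(ABC):
    """Backend-independent container; searchable fields are fixed up front."""
    @abstractmethod
    def add(self, key, raw_bytes, metadata): ...
    @abstractmethod
    def search(self, **criteria): ...

class DictStore(MessageStore):
    """Minimal in-memory backend, just to show the shape of the API."""
    def __init__(self, indexed_fields):
        self._indexed = set(indexed_fields)
        self._messages = {}  # key -> (raw bytes, metadata dict)
        # One inverted index per declared field: value -> set of keys.
        self._index = {f: {} for f in indexed_fields}

    def add(self, key, raw_bytes, metadata):
        self._messages[key] = (raw_bytes, metadata)
        for field in self._indexed & metadata.keys():
            self._index[field].setdefault(metadata[field], set()).add(key)

    def search(self, **criteria):
        hits = None
        for field, value in criteria.items():
            keys = self._index[field].get(value, set())
            hits = keys if hits is None else hits & keys
        return sorted(hits or set())
```

Declaring the indexed fields at construction time is what lets a backend build whatever index structure suits it, without the caller ever writing a query language.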
From barry at python.org Mon Jan 4 15:57:55 2010 From: barry at python.org (Barry Warsaw) Date: Mon, 4 Jan 2010 09:57:55 -0500 Subject: [Email-SIG] Design Thoughts Summary In-Reply-To: <4B4151B4.9090406@harvee.org> References: <1258239230.74.7080@mint-julep.mondoinfo.com> <4B696451-1885-4670-AB7C-C283132F7638@python.org> <4B4151B4.9090406@harvee.org> Message-ID: <691D78A9-1931-4276-9AA0-BD8061536104@python.org> On Jan 3, 2010, at 9:25 PM, Eric S. Johansson wrote: > I hope that it would be possible to make the storage container backend Storage Technology independent so that people like me who will detest SQL until the heat death of the universe can use something else to store mail messages. I would also recommend not depending on the file system because in my experience, performance declined dramatically around 500 messages (ext3 and jfs). Even though I was using an SQL database (SQLite), it was significantly faster using the database. The standard library should probably have an API and one or two reference implementations. I think you can make a file-system based implementation moderately non-horrible. :) -Barry From janssen at parc.com Mon Jan 4 18:45:43 2010 From: janssen at parc.com (Bill Janssen) Date: Mon, 4 Jan 2010 09:45:43 PST Subject: [Email-SIG] Design Thoughts Summary In-Reply-To: <691D78A9-1931-4276-9AA0-BD8061536104@python.org> References: <1258239230.74.7080@mint-julep.mondoinfo.com> <4B696451-1885-4670-AB7C-C283132F7638@python.org> <4B4151B4.9090406@harvee.org> <691D78A9-1931-4276-9AA0-BD8061536104@python.org> Message-ID: <97249.1262627143@parc.com> Barry Warsaw wrote: > On Jan 3, 2010, at 9:25 PM, Eric S. Johansson wrote: > > > I hope that it would be possible to make the storage container backend Storage Technology independent so that people like me who will detest SQL until the heat death of the universe can use something else to store mail messages.
I would also recommend not depending on the file system because in my experience, performance declined dramatically around 500 messages (ext3 and jfs). Even though I was using an SQL database (SQLite), it was significantly faster using the database. > > The standard library should probably have an API and one or two reference implementations. I think you can make a file-system based implementation moderately non-horrible. :) > > -Barry Considering all the IMAP implementations using file system containers, I think you're right :-). Bill From Sypniewski at rowan.edu Mon Jan 11 17:54:58 2010 From: Sypniewski at rowan.edu (Sypniewski, Bernard Paul) Date: Mon, 11 Jan 2010 11:54:58 -0500 Subject: [Email-SIG] smtp question Message-ID: <56938804844D4D488FFE514C657AC3E1022F3B2D@EX2K3-2.rowanads.rowan.edu> Dear SIG members: I am writing a reading comprehension program along with a teacher of Developmental Reading here at Rowan. She wants results of exercises emailed to her. Here is the problem that we have encountered. The usual Python email modules require, as is only sensible, certain email configuration information. Because of the audience for which we are writing the program, not only is it unlikely that students will have the required information, but they may also not be literate enough to understand the instructions about what information to get and how to get it. So, here I am writing to you asking whether any of you know a way that we can get the required SMTP and POP3 (we will distribute the program to others) information through code so that we do not have to ask the students for information that they may have significant difficulty understanding and obtaining. We are working exclusively on Windows platforms. Bernard Sypniewski Department of Computer Science Rowan University - Camden Campus Broadway and Cooper Street Camden, NJ 08102 USA -------------- next part -------------- An HTML attachment was scrubbed...
URL: From matt at mondoinfo.com Mon Jan 11 19:22:08 2010 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Mon, 11 Jan 2010 12:22:08 -0600 (CST) Subject: [Email-SIG] smtp question Message-ID: <1263234060.75.28452@mint-julep.mondoinfo.com> [Sent direct reply separately] Dear Bernard, > know a way that we can get the required SMTP and POP3 (we will > distribute the program to others) information through code so that > we do not have to ask the students Someone who knows more about Windows than I do may correct me if I'm mistaken, but I doubt that there's a simple and reliable way to do that. If you know what email program they're using, you might be able to dig the information out of its configuration. But even if you went to that trouble for several email programs, there would be someone who was using a different one. You could do DNS lookups to (try to) get the user's domain name and then try the hosts smtp, pop, and mail. But POP requires a username and password and SMTP usually does these days. So even if that were successful, you'd still have to ask for those. If you were willing to spend the time and money, you could set up your own SMTP server that all instances of your program would be hard-coded to use. You'd want to be very careful about its configuration of course. Many networks block port 25, so you'd probably want to use authenticated submission on port 587 with STARTTLS (or SMTP over SSL on port 465). I think that some networks even block those ports, so you'd probably also want to have your server listening on some unusual port. If you chose to go that route, I'd suggest that you have that machine configured and run by a sysadmin who knows exactly what they're doing.
Regards, Matt From mark at msapiro.net Mon Jan 11 19:55:07 2010 From: mark at msapiro.net (Mark Sapiro) Date: Mon, 11 Jan 2010 10:55:07 -0800 Subject: [Email-SIG] smtp question In-Reply-To: <1263234060.75.28452@mint-julep.mondoinfo.com> References: <1263234060.75.28452@mint-julep.mondoinfo.com> Message-ID: <4B4B740B.1060207@msapiro.net> On 1/11/2010 10:22 AM, Matthew Dixon Cowles wrote: > >> know a way that we can get the required SMTP and POP3 (we will >> distribute the program to others) information through code so that >> we do not have to ask the students > > Someone who knows more about Windows than I do may correct me if I'm > mistaken, but I doubt that there's a simple and reliable way to do > that. You might look at what Thunderbird 3 does when you set up a new POP3 account. It asks for the email address first, then guesses at possible POP3 and SMTP server names based on the address's domain, trying names like mail, pop3 and smtp with the various standard ports until it finds ones that work. Sometimes it succeeds admirably and sometimes it fails at one or the other or both. It may also have additional knowledge of popular domains. You probably can't do better than that. -- Mark Sapiro The highway is for gamblers, San Francisco Bay Area, California better use your sense - B.
Dylan From avbidder at fortytwo.ch Mon Jan 11 21:54:36 2010 From: avbidder at fortytwo.ch (Adrian von Bidder) Date: Mon, 11 Jan 2010 21:54:36 +0100 Subject: [Email-SIG] smtp question In-Reply-To: <56938804844D4D488FFE514C657AC3E1022F3B2D@EX2K3-2.rowanads.rowan.edu> References: <56938804844D4D488FFE514C657AC3E1022F3B2D@EX2K3-2.rowanads.rowan.edu> Message-ID: <201001112154.36796@fortytwo.ch> Hi, On Monday 11 January 2010 17:54:58 Sypniewski, Bernard Paul wrote: > So, here I am writing to you asking whether any of you know a way that we > can get the required SMTP and POP3 (we will distribute the program to > others) information through code so that we do not have to ask the > students for information that they may have significant difficulty > understanding and obtaining. We are working exclusively on WINDOWS > platforms. For the "guessing" solution see the other emails. As an engineer I get goosepimples when I read about such solutions... As far as I can see the only "serious" thing you could do is, since you're on pure Windows(tm) anyway, use some kind of Windows specific system wide email interface to send the email (MAPI? I'm not a Windows person at all but I *think* I remember some trojans are/were using this to send their spam.) Of course that might depend on the people having Outlook (Express?) configured, and I don't know if a Python wrapper for this interface exists. cheers -- vbi -- > I have confiscated his Commodore 64 and acoustic coupler. You mean he couples acoustically with Commodore 64?! That explains a lot. Brings a new meaning to the word "audiophile", at least. -- Dolphin in news.admin.net-abuse.email -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 389 bytes Desc: This is a digitally signed message part. 
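The Thunderbird-style guessing discussed in the preceding messages could be sketched roughly like this. The host names and port ordering below are conventional guesses only, and even a successful probe still leaves the username/password problem Matthew raised:

```python
import socket

def candidate_servers(address):
    """List (host, port, protocol) guesses for an address's domain."""
    domain = address.rpartition("@")[2]
    guesses = []
    for host in ("pop." + domain, "pop3." + domain, "mail." + domain, domain):
        for port in (995, 110):        # POP3 over SSL, then plain POP3
            guesses.append((host, port, "pop3"))
    for host in ("smtp." + domain, "mail." + domain, domain):
        for port in (587, 465, 25):    # submission, SMTPS, classic SMTP
            guesses.append((host, port, "smtp"))
    return guesses

def first_reachable(guesses, timeout=3.0):
    """Probe each guess with a TCP connect; return the first that answers."""
    for host, port, proto in guesses:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host, port, proto
        except OSError:
            continue
    return None
```

A TCP connect only proves something is listening; a fuller probe would also read the server greeting and attempt authentication, which is roughly what Thunderbird's wizard does before declaring success.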
URL: From stef.mientki at gmail.com Fri Jan 15 00:54:39 2010 From: stef.mientki at gmail.com (Stef Mientki) Date: Fri, 15 Jan 2010 00:54:39 +0100 Subject: [Email-SIG] smtp question In-Reply-To: <201001112154.36796@fortytwo.ch> References: <56938804844D4D488FFE514C657AC3E1022F3B2D@EX2K3-2.rowanads.rowan.edu> <201001112154.36796@fortytwo.ch> Message-ID: <4B4FAEBF.5010008@gmail.com> On 11-01-2010 21:54, Adrian von Bidder wrote: > Hi, > > On Monday 11 January 2010 17:54:58 Sypniewski, Bernard Paul wrote: > >> So, here I am writing to you asking whether any of you know a way that we >> can get the required SMTP and POP3 (we will distribute the program to >> others) information through code so that we do not have to ask the >> students for information that they may have significant difficulty >> understanding and obtaining. We are working exclusively on WINDOWS >> platforms. >> > For the "guessing" solution see the other emails. As an engineer I get > goosepimples when I read about such solutions... > > As far as I can see the only "serious" thing you could do is, since you're on > pure Windows(tm) anyway, use some kind of Windows specific system wide email > interface to send the email (MAPI? I'm not a Windows person at all but I > *think* I remember some trojans are/were using this to send their spam.) > > Of course that might depend on the people having Outlook (Express?) > configured, and I don't know if a Python wrapper for this interface exists. > > These days, more and more people (on Windows) don't have an email client installed; they use a web interface for mail (and other programs), so the only reliable way seems to be to set up your own mail server. cheers, Stef > cheers > -- vbi > > > > > _______________________________________________ > Email-SIG mailing list > Email-SIG at python.org > Your options: http://mail.python.org/mailman/options/email-sig/stef.mientki%40gmail.com -------------- next part -------------- An HTML attachment was scrubbed...
URL: From rdmurray at bitdance.com Mon Jan 18 21:44:51 2010 From: rdmurray at bitdance.com (R. David Murray) Date: Mon, 18 Jan 2010 15:44:51 -0500 Subject: [Email-SIG] Kick starting email 6.0 development Message-ID: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> With Barry's encouragement, and based on the summaries I prepared for the Email Wiki and discussion with Barry, I submitted a proposal to the PSF to fund me to do development work on the email 6 module. As you will read in the proposal[1], this works for me since I do contract programming work as part of my income, and with funding I can devote that portion of my time (and more, I hope) to this project. The PSF did not fully fund the proposal[2]. However, they did provide seed funding covering the first two months, and will be helping with fundraising for the rest in the form of "fiscal sponsorship" (from what I understand we'll learn more about that at PyCon). In addition, for every four dollars raised from other sources they'll put in another dollar. So far I've done the first item on the task list in the budget projection, the review of all outstanding issues in the tracker. The results are posted[3] on my web site, which I have linked from the wiki. (I put them on my website rather than the wiki because I wrote a little Sphinx plugin to query the tracker for issue titles and status, so that links are generated automatically and the page will get updated as bugs are closed or the titles changed.) If you know of any bug reports I've missed, or disagree with my analysis of any of the bugs, please let me know. From here the next steps are to start refactoring and adding to the tests, and to move the discussion of the new API into the territory of concrete proposals. I'll be posting about both of those as the week goes along. My thought is that all the work should be done using a DVCS, to give maximum opportunity for others to contribute as their time allows.
I welcome thoughts about how best to set this up to provide maximum access for this community of interest. I'm also interested to know who will be at PyCon and interested in BOF and/or sprint activities involving the email package. --David [1] http://www.bitdance.com/test/projects/email6/psfproposal/ [2] http://www.python.org/psf/records/board/minutes/2009-12-14/#funding-for-python-3-email-module [3] http://www.bitdance.com/test/projects/email6/issues/ From barry at python.org Mon Jan 18 23:24:20 2010 From: barry at python.org (Barry Warsaw) Date: Mon, 18 Jan 2010 17:24:20 -0500 Subject: [Email-SIG] Kick starting email 6.0 development In-Reply-To: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> References: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> Message-ID: <20100118172420.029cdbf7@freewill> On Jan 18, 2010, at 03:44 PM, R. David Murray wrote: >With Barry's encouragement, and based on the summaries I prepared for >the Email Wiki and discussion with Barry, I submitted a proposal to the >PSF to fund me to do development work on the email 6 module. As you >will read in the proposal[1], this works for me since I do contract >programming work as part of my income, and with funding I can devote >that portion of my time (and more, I hope) to this project. This is awesome David! >My thought is that all the work should be done using a DVCS, to give >maximum opportunity for others to contribute as their time allows. >I welcome thoughts about how best to set this up to provide maximum >access for this community of interest. Since Python itself has no DVCS still, I might propose using Bazaar and Launchpad to track the work. I already have three branches that may or may not have anything useful in them (they represent my previous attempts at this): * lp:~barry/+junk/email-ng * lp:~barry/python/30email * lp:~barry/python/email6 I'm in the process of re-establishing code imports of the various Python branches on Launchpad. 
They had been using the bzr mirrors on code.python.org, but those haven't been updated in a very long time. I'm going to blow those away and re-import from the Subversion branches. After that's working it should allow you to bzr branch any active Python branch and hack on things from there. That might make the most sense since some of the bugs you've identified affect more than just the email package. >I'm also interested to know who will be at PyCon and interested in BOF >and/or sprint activities involving the email package. I'll be at PyCon though I don't yet know exactly what I'll be sprinting on. email package is a possibility. The sprint sign up page is here: http://us.pycon.org/2010/sprints/signup/ >[3] http://www.bitdance.com/test/projects/email6/issues/ That one is *scary* :). -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 835 bytes Desc: not available URL: From orsenthil at gmail.com Tue Jan 19 02:32:26 2010 From: orsenthil at gmail.com (Senthil Kumaran) Date: Tue, 19 Jan 2010 07:02:26 +0530 Subject: [Email-SIG] Kick starting email 6.0 development In-Reply-To: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> References: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> Message-ID: <20100119013226.GA5255@ubuntu.ubuntu-domain> On Mon, Jan 18, 2010 at 03:44:51PM -0500, R. David Murray wrote: > From here the next steps are to start refactoring and adding to the tests, > and to move the discussion of the new API into the territory of concrete > proposals. I'll be posting about both of those as the week goes along. I have been following the discussions actively. I am willing to get involved and shall pitch in with bug fixes/tests and other things. > My thought is that all the work should be done using a DVCS, to give > maximum opportunity for others to contribute as their time allows.
> I welcome thoughts about how best to set this up to provide maximum > access for this community of interest. > > I'm also interested to know who will be at PyCon and interested in BOF > and/or sprint activities involving the email package. Yes, I would be there. I have identified urllib related issues/enhancements, but I see email module work is something I am interested in too. Looked at the list and I see a couple of interesting ones. So, here is a +1. Thank you, Senthil -- we just switched to Sprint. From rdmurray at bitdance.com Thu Jan 21 20:45:25 2010 From: rdmurray at bitdance.com (R. David Murray) Date: Thu, 21 Jan 2010 14:45:25 -0500 Subject: [Email-SIG] Kick starting email 6.0 development In-Reply-To: <20100118172420.029cdbf7@freewill> References: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> <20100118172420.029cdbf7@freewill> Message-ID: <20100121194525.74C6C18F90C@kimball.webabinitio.net> On Mon, 18 Jan 2010 17:24:20 -0500, Barry Warsaw wrote: > On Jan 18, 2010, at 03:44 PM, R. David Murray wrote: > Since Python itself has no DVCS still, I might propose using Bazaar and > Launchpad to track the work. I already have three branches that may or may > not have anything useful in them (they represent my previous attempts at > this): > > * lp:~barry/+junk/email-ng > * lp:~barry/python/30email > * lp:~barry/python/email6 I've looked briefly at your email6 branch, and will take a look at the others. But unless you think there are specifically relevant bits to look at, I probably won't look too closely until we have the new API design roughed out. Did you start doing any test refactoring in any of the branches?
After that's working it should allow you to bzr branch any > active Python branch and hack on things from there. That might make > the most sense since some of the bugs you've identified affect more > than just the email package. I think since you are the email czar and you are deeply involved with bzr and launchpad, that this makes sense :) I'm not sure how best to integrate this with Launchpad to make it public, though. I created a project (python-email6), pulled py3k according to your instructions on python-dev, pushed it to launchpad, and linked it to the project as trunk. Was that the right thing to do? Or should I request membership in the Python team and create an 'email6' branch there instead? > >I'm also interested to know who will be at PyCon and interested in BOF > >and/or sprint activities involving the email package. > > I'll be at PyCon though I don't yet know exactly what I'll be sprinting on. > email package is a possibility. The sprint sign up page is here: > > http://us.pycon.org/2010/sprints/signup/ The sprint page doesn't list the core sprint yet. Any idea who is organizing it this year? --David From barry at python.org Fri Jan 22 15:30:15 2010 From: barry at python.org (Barry Warsaw) Date: Fri, 22 Jan 2010 09:30:15 -0500 Subject: [Email-SIG] Kick starting email 6.0 development In-Reply-To: <20100121194525.74C6C18F90C@kimball.webabinitio.net> References: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> <20100118172420.029cdbf7@freewill> <20100121194525.74C6C18F90C@kimball.webabinitio.net> Message-ID: <20100122093015.1c1bcc3f@freewill> On Jan 21, 2010, at 02:45 PM, R. David Murray wrote: >I've looked briefly at your email6 branch, and will take a look at >the others. But unless you think there are specifically relevant >bits to look at, I probably won't look too closely until we have >the new API design roughed out. > >Did you start doing any test refactoring in any of the branches?
I did, but honestly it's been so long since I hacked on these branches, I don't remember what's in them. :/ email6 has, I think, the latest changes aligned to my thinking about API improvements, and probably also includes some test refactoring. email-ng is just the email package and contains some header module refactoring and a start on doctests for headers. The nice thing about using more doctests is that it can serve as the basis for improved documentation. 30email has some additional refactoring (that might be similar to what's in email6) along with the results of work done at last year's pycon. I'm sorry that it's so disorganized, but you seem to be pretty good at untangling nasty knots. :) Probably best to start with email6 and then just review the top few revisions in the other branches to see if there's anything useful in them. >> I'm in the process of re-establishing code imports of the various >> Python branches on Launchpad. They had been using the bzr mirrors on >> code.python.org, but those haven't been updated in a very long time. >> I'm going to blow those away and re-import from the Subversion >> branches. After that's working it should allow you to bzr branch any >> active Python branch and hack on things from there. That might make >> the most sense since some of the bugs you've identified affect more >> than just the email package. > >I think since you are the email czar and you are deeply involved with >bzr and launchpad, that this makes sense :) Yay! :) >I'm not sure how best to integrate this with Launchpad to make it public, >though. I created a project (python-email6), pulled py3k according to >your instructions on python-dev, pushed it to launchpad, and linked it >to the project as trunk. Was that the right thing to do? Or should >I request membership in the Python team and create an 'email6' branch >there instead? What you did isn't too bad actually.
Whether it was the right thing to do depends on how we want to manage commit access to the email branch. I see no problem adding you to the ~python-dev team on Launchpad and creating the branch in the python project. That just means everyone in ~python-dev could commit to the branch (by default). I'd have no problem with that. Alternatively, we'd need to create a team for the python-email6 project so that folks other than just you have commit access to the branch. It's only a little more work that way, just because it's more things to set up, but it's not that big of a deal. So the question is: how locked down do you want to make this branch, and are there folks who would like to commit to this branch that shouldn't be added to ~python-dev? Either way, we have to remember to occasionally merge the py3k branch back into our branch so as to keep up on changes there. >> >I'm also interested to know who will be at PyCon and interested in BOF >> >and/or sprint activities involving the email package. >> >> I'll be at PyCon though I don't yet know exactly what I'll be sprinting on. >> email package is a possibility. The sprint sign up page is here: >> >> http://us.pycon.org/2010/sprints/signup/ > >The sprint page doesn't list the core sprint yet. Any idea who is >organizing it this year? It might be me, but I'm not sure :). I think Brett is not going to attend the sprints so I'm slated to give the intro-to-sprinting talk. I'm not sure if that also means I've "volunteered" to organize the core sprint or not. I'll try to figure that out. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 835 bytes Desc: not available URL: From rdmurray at bitdance.com Fri Jan 22 17:36:59 2010 From: rdmurray at bitdance.com (R. 
David Murray) Date: Fri, 22 Jan 2010 11:36:59 -0500 Subject: [Email-SIG] Kick starting email 6.0 development In-Reply-To: <20100122093015.1c1bcc3f@freewill> References: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> <20100118172420.029cdbf7@freewill> <20100121194525.74C6C18F90C@kimball.webabinitio.net> <20100122093015.1c1bcc3f@freewill> Message-ID: <20100122163659.8DB641BC378@kimball.webabinitio.net> On Fri, 22 Jan 2010 09:30:15 -0500, Barry Warsaw wrote: > On Jan 21, 2010, at 02:45 PM, R. David Murray wrote: > email-ng is just the email package and contains some header module refactoring > and a start on doctests for headers. The nice thing about using more doctests > is that it can serve as the basis for improved documentation. I will take a look at that, since I'm actively working on Header right now. As for doctests...I agree that they are good for helping to document the API, but IMO we are going to need more than that to get a really good set of validation tests. I have some thoughts about that that I'm experimenting with and will report back when I'm satisfied that my idea is at least workable. (Whether it's a *good* idea remains to be seen :) > I'm sorry that it's so disorganized, but you seem to be pretty good at > untangling nasty knots. :) Probably best to start with email6 and then just > review the top few revisions in the other branches to see if there's anything > useful in them. No problem; at least you got started on it :) > What you did isn't too bad actually. Whether it was the right thing to do > depends on how we want to manage commit access to the email branch. I see no > problem adding you to the ~python-dev team on Launchpad and creating the > branch in the python project. That just means everyone in ~python-dev could > commit to the branch (by default). I'd have no problem with that. > > Alternatively, we'd need to create a team for the python-email6 project so > that folks other than just you have commit access to the branch. 
It's only a > little more work that way, just because it's more things to set up, but it's > not that big of a deal. So the question is: how locked down do you want to > make this branch, and are there folks who would like to commit to this branch > that shouldn't be added to ~python-dev? My goal in using the DVCS is to make it easy for anyone to submit patches for review, which I believe launchpad facilitates. ("Propose for merge", right?) I don't want the branch locked too tightly, I'd rather facilitate active contribution. So possibly making an email6 team is better, but since I don't know what the consequences of adding someone to ~python-dev are, I don't know what would make it a bad thing for someone to be added to it :). > Either way, we have to remember to occasionally merge the py3k branch back > into our branch so as to keep up on changes there. Yes. That will probably be my job. > >The sprint page doesn't list the core sprint yet. Any idea who is > >organizing it this year? > > It might be me, but I'm not sure :). I think Brett is not going to attend the > sprints so I'm slated to give the intro-to-sprinting talk. I'm not sure if > that also means I've "volunteered" to organize the core sprint or not. I'll > try to figure that out. Well if you are we could try to hijack the whole core sprint to work on email :) Seriously, though, if I can be of assistance, let me know. --David From barry at python.org Fri Jan 22 17:45:59 2010 From: barry at python.org (Barry Warsaw) Date: Fri, 22 Jan 2010 11:45:59 -0500 Subject: [Email-SIG] Kick starting email 6.0 development In-Reply-To: <20100122163659.8DB641BC378@kimball.webabinitio.net> References: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> <20100118172420.029cdbf7@freewill> <20100121194525.74C6C18F90C@kimball.webabinitio.net> <20100122093015.1c1bcc3f@freewill> <20100122163659.8DB641BC378@kimball.webabinitio.net> Message-ID: <20100122114559.4bb73054@freewill> On Jan 22, 2010, at 11:36 AM, R. 
David Murray wrote: >I will take a look at that, since I'm actively working on Header >right now. As for doctests...I agree that they are good for helping to >document the API, but IMO we are going to need more than that to get a >really good set of validation tests. I have some thoughts about that >that I'm experimenting with and will report back when I'm satisfied that >my idea is at least workable. (Whether it's a *good* idea remains to >be seen :) I definitely agree we can't use doctests exclusively. Nobody in their right mind would ever want to read those things! A good mix of (separate file) doctest and unittests would probably work, but I'm eager to hear how your experiment turns out. :) >My goal in using the DVCS is to make it easy for anyone to submit patches >for review, which I believe launchpad facilitates. ("Propose for merge", >right?) Yep. It's a great way to go. I also suggest that as things stabilize we move to a model where branches proposed for merging are always linked to a bug. But that might not always be feasible while there's lots of churn. >I don't want the branch locked too tightly, I'd rather facilitate active >contribution. So possibly making an email6 team is better, but since I don't >know what the consequences of adding someone to ~python-dev are, I don't know >what would make it a bad thing for someone to be added to it :). I'm thinking it does make sense to make an email6 team and keep this branch in a separate package. I've just created the team: https://edge.launchpad.net/~email6 and made you a co-admin. I think it's up to you to make the team the owner of the python-email project. >Well if you are we could try to hijack the whole core sprint to work on >email :) That's like the logical extension of Zawinski's law. :) >Seriously, though, if I can be of assistance, let me know. Thanks! -Barry -------------- next part -------------- A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 835 bytes Desc: not available URL: From rdmurray at bitdance.com Fri Jan 22 19:04:33 2010 From: rdmurray at bitdance.com (R. David Murray) Date: Fri, 22 Jan 2010 13:04:33 -0500 Subject: [Email-SIG] Kick starting email 6.0 development In-Reply-To: <20100122114559.4bb73054@freewill> References: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> <20100118172420.029cdbf7@freewill> <20100121194525.74C6C18F90C@kimball.webabinitio.net> <20100122093015.1c1bcc3f@freewill> <20100122163659.8DB641BC378@kimball.webabinitio.net> <20100122114559.4bb73054@freewill> Message-ID: <20100122180433.E7B351BC3AE@kimball.webabinitio.net> On Fri, 22 Jan 2010 11:45:59 -0500, Barry Warsaw wrote: > On Jan 22, 2010, at 11:36 AM, R. David Murray wrote: > I'm thinking it does make sense to make an email6 team and keep this branch in > a separate package. I've just created the team: > > https://edge.launchpad.net/~email6 > > and made you a co-admin. I think it's up to you to make the team the owner of > the python-email project. OK, done. (It took me a while to figure out how...I assumed that 'change details' would include everything about the project that I could change, but it turns out that changing the maintainer is a separate page.) It's not obvious from the interface whether or not this gives the team permission to update the branch, and I don't see a way to change the ownership of the branch to the team. 
--David From barry at python.org Fri Jan 22 19:10:12 2010 From: barry at python.org (Barry Warsaw) Date: Fri, 22 Jan 2010 13:10:12 -0500 Subject: [Email-SIG] Kick starting email 6.0 development In-Reply-To: <20100122180433.E7B351BC3AE@kimball.webabinitio.net> References: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> <20100118172420.029cdbf7@freewill> <20100121194525.74C6C18F90C@kimball.webabinitio.net> <20100122093015.1c1bcc3f@freewill> <20100122163659.8DB641BC378@kimball.webabinitio.net> <20100122114559.4bb73054@freewill> <20100122180433.E7B351BC3AE@kimball.webabinitio.net> Message-ID: <20100122131012.4ba93eaf@freewill> On Jan 22, 2010, at 01:04 PM, R. David Murray wrote: >OK, done. (It took me a while to figure out how...I assumed that 'change >details' would include everything about the project that I could change, >but it turns out that changing the maintainer is a separate page.) Yeah, sometimes things aren't where you expect them to be, like... >It's not obvious from the interface whether or not this gives the >team permission to update the branch, and I don't see a way to change >the ownership of the branch to the team. Click on the branch, then click on "Change details". Set the owner to the email6 team and then OK. That should be enough to allow anyone in the team to push changes to it. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 835 bytes Desc: not available URL: From rdmurray at bitdance.com Fri Jan 22 19:14:11 2010 From: rdmurray at bitdance.com (R. 
David Murray) Date: Fri, 22 Jan 2010 13:14:11 -0500 Subject: [Email-SIG] Kick starting email 6.0 development In-Reply-To: <20100122131012.4ba93eaf@freewill> References: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> <20100118172420.029cdbf7@freewill> <20100121194525.74C6C18F90C@kimball.webabinitio.net> <20100122093015.1c1bcc3f@freewill> <20100122163659.8DB641BC378@kimball.webabinitio.net> <20100122114559.4bb73054@freewill> <20100122180433.E7B351BC3AE@kimball.webabinitio.net> <20100122131012.4ba93eaf@freewill> Message-ID: <20100122181411.CB2131BC3E7@kimball.webabinitio.net> On Fri, 22 Jan 2010 13:10:12 -0500, Barry Warsaw wrote: > On Jan 22, 2010, at 01:04 PM, R. David Murray wrote: > Yeah, sometimes things aren't where you expect them to be, like... > > >It's not obvious from the interface whether or not this gives the > >team permission to update the branch, and I don't see a way to change > >the ownership of the branch to the team. > > Click on the branch, then click on "Change details". Set the owner to the > email6 team and then OK. That should be enough to allow anyone in the team > to push changes to it. Done. I didn't even see the change details link because it was down below the recent commits list. --David From barry at python.org Fri Jan 22 19:35:29 2010 From: barry at python.org (Barry Warsaw) Date: Fri, 22 Jan 2010 13:35:29 -0500 Subject: [Email-SIG] Kick starting email 6.0 development In-Reply-To: <20100122181411.CB2131BC3E7@kimball.webabinitio.net> References: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> <20100118172420.029cdbf7@freewill> <20100121194525.74C6C18F90C@kimball.webabinitio.net> <20100122093015.1c1bcc3f@freewill> <20100122163659.8DB641BC378@kimball.webabinitio.net> <20100122114559.4bb73054@freewill> <20100122180433.E7B351BC3AE@kimball.webabinitio.net> <20100122131012.4ba93eaf@freewill> <20100122181411.CB2131BC3E7@kimball.webabinitio.net> Message-ID: <20100122133529.0854a8fa@freewill> On Jan 22, 2010, at 01:14 PM, R. 
David Murray wrote: >I didn't even see the change details link because it was down below the >recent commits list. Looks good to me. That was the easy part! :) -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 835 bytes Desc: not available URL: From rdmurray at bitdance.com Mon Jan 25 21:10:34 2010 From: rdmurray at bitdance.com (R. David Murray) Date: Mon, 25 Jan 2010 15:10:34 -0500 Subject: [Email-SIG] Thoughts on the general API, and the Header API. Message-ID: <20100125201034.190CC1BC4B4@kimball.webabinitio.net> OK, so we've agreed that we need to handle bytes and text at pretty much all API levels, and that the "original data" that informs the data structure can be either bytes or text. We want to be able to recover that original data, especially in the bytes case, but arguably also in the text case. Then there's also the issue of transforming a message once we have it in a data structure, and the consequent issue of what it means to serialize the resulting modified message. (This last comes up in a very specific way in issues 968430 and 1670765, which are about preserving the *exact* byte representation of a multipart/signed message). We've also agreed that whatever we decide to do with the __str__ and __bytes__ magic methods, they will be implemented in terms of other parts of the API. So I'll ignore those for now. 
I think we want to decide on a general API structure that is implemented at all levels and objects where it makes sense, this being the API for creating and accessing the following information about any part of the model:

* create object from bytes
* create object from text
* obtain the defect list resulting from creating the object
* serialize object to bytes
* serialize object to text
* obtain original input data
* discover the type of the original input data

At the moment I see no reason to change the API for defects (a .defects attribute on the object holding a list of defects), so I'm going to ignore that for now as well. I spent a bunch of time trying to define an API for Headers that provided methods for all of the above. As I was writing the descriptions for the various methods, and especially trying to specify the "correct" behavior for both the raw-data-is-bytes and raw-data-is-text cases (especially for the methods that serialize the data), the whole thing began to give off a bad code smell. After setting it aside for a bit, I had what I think is a little epiphany: our need is to deal with messages (and parts of messages) that could be in either bytes form or text form. The things we need to do with them are similar regardless of their form, and so we have been talking about a "dual API": one method for bytes and a parallel method for text. What if we recognize that we have two different data types, bytes messages and text messages? Then the "dual API" becomes a more uniform, almost single, API, but with two possible underlying data types.
In the context specifically of the proposed new Header object, I propose that we have a StringHeader and a BytesHeader, and an API that looks something like this:

StringHeader

properties:
    raw_header (None unless from_full_header was used)
    raw_name
    raw_value
    name
    value

__init__(name, value)
from_full_header(header)
serialize(max_line_len=78,
          newline='\n',
          use_raw_data_if_possible=False)
encode(charset='utf-8')

BytesHeader would be exactly the same, with the exception of the signature for serialize and the fact that it has a 'decode' method rather than an 'encode' method. Serialize would be different only in the fact that it would have an additional keyword parameter, must_be_7bit=True. The magic of this approach is in those encode/decode methods. Encoding a StringHeader would yield a BytesHeader containing the same data, but encoded per RFC2047 using the specified charset. Decoding a BytesHeader would yield a StringHeader with the same data, but decoded to unicode per RFC2047, with any 8bit parts decoded (in the unicode sense, not the RFC2047 sense) using the specified charset (which would default to ASCII, meaning bare 8bit bytes in headers would throw an error). (What to do with RFC2047 charsets like unknown-8bit is an open question...probably throw an error). (Encoding or decoding a Message would cause the Message to recursively encode or decode its subparts. This means you are making a complete new copy of the Message in memory. If you don't want to do that you can walk the Message and convert it piece by piece (we could provide a generator that does this).) raw_header would be the data passed in to the constructor if from_full_header is used, and None otherwise. If encode/decode call the regular constructor, then this attribute would also act as a flag as to whether or not the header was constructed from raw input data or via program.
raw_name and raw_value would be the fieldname and fieldbody, either what was passed in to the __init__ constructor, or the result of splitting what was passed to the from_full_header constructor on the first ':'. (These are convenience attributes and are not essential to the proposed API). name would be the fieldname stripped of trailing whitespace. value would be the *unfolded* fieldbody stripped of leading and trailing whitespace (but with internal whitespace intact). As for serialize, my thought here is that every object in the tree has a serialize method with the same signature, and serialization is a matter of recursively passing the specified parameters downward. max_line_len is obvious, and defaults to the RFC recommended max. (If you want the unfolded header, use its .value attribute). newline resolves issue 1349106, allowing an email package client to generate completely wire-format messages if it needs to. use_raw_data_if_possible would mean to emit the original raw data if it exists (modulo changing the flavor of newline if needed, for those object types (such as headers) where that makes sense). The serialize method of specific sub-types can do specialized things (eg: multipart/signed can make use_raw_data_if_possible default to True). For Bytes types, the extra 'must_be_7bit' flag would cause any 8bit data to be transport encoded to be 7bit clean. (For headers, this would mean raw 8bit data would get the charset 'unknown-8bit', and we might want to provide more control over that in some way: an error and way to provide an error handler, or some other way to specify a charset to use for such encodings.) use_raw_data_if_possible would cause this flag to be ignored when raw data was available for the object. (If you want the text version of the transport-encoded message for some reason, you can serialize the Bytes form using must_be_7bit and decode the result as ASCII.) 
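A rough sketch of how these two classes and a shared base might fit together. Only the names (StringHeader, BytesHeader, from_full_header, serialize, encode/decode, and the raw_* attributes) come from the proposal above; the method bodies are illustrative assumptions. Folding, defect recording, and must_be_7bit handling are elided, and the RFC 2047 work is delegated to the existing email.header module purely for demonstration:

```python
# Sketch of the proposed StringHeader/BytesHeader API (assumptions noted
# in the surrounding text; not the actual proposed implementation).
import email.header


class _BaseHeader:
    def __init__(self, name, value):
        self.raw_header = None        # set only by from_full_header
        self.raw_name = name
        self.raw_value = value

    @classmethod
    def from_full_header(cls, header):
        # Split on the first ':'; never raises -- a real implementation
        # would record defects on the object instead.
        name, _, value = header.partition(cls._colon)
        obj = cls(name, value)
        obj.raw_header = header
        return obj

    @property
    def name(self):
        # Fieldname stripped of trailing whitespace.
        return self.raw_name.rstrip()

    @property
    def value(self):
        # Unfolded fieldbody, stripped of leading/trailing whitespace.
        return self._unfold(self.raw_value).strip()


class StringHeader(_BaseHeader):
    _colon = ':'

    @staticmethod
    def _unfold(value):
        return value.replace('\r\n', '').replace('\n', '')

    def serialize(self, max_line_len=78, newline='\n',
                  use_raw_data_if_possible=False):
        if use_raw_data_if_possible and self.raw_header is not None:
            return self.raw_header
        return '%s: %s%s' % (self.name, self.value, newline)

    def encode(self, charset='utf-8'):
        # Yield a BytesHeader with non-ASCII words RFC 2047-encoded.
        wire = email.header.Header(self.value, charset=charset).encode()
        return BytesHeader(self.name.encode('ascii'), wire.encode('ascii'))


class BytesHeader(_BaseHeader):
    _colon = b':'

    @staticmethod
    def _unfold(value):
        return value.replace(b'\r\n', b'').replace(b'\n', b'')

    def serialize(self, max_line_len=78, newline=b'\n',
                  use_raw_data_if_possible=False, must_be_7bit=True):
        # must_be_7bit handling is elided in this sketch.
        if use_raw_data_if_possible and self.raw_header is not None:
            return self.raw_header
        return self.name + b': ' + self.value + newline

    def decode(self, charset='ascii'):
        # Yield a StringHeader; bare bytes are decoded with `charset`
        # (ASCII by default, so stray 8bit data raises an error).
        text = self.value.decode(charset)
        parts = email.header.decode_header(text)
        return StringHeader(self.name.decode('ascii'),
                            str(email.header.make_header(parts)))
```

Note how the "dual API" collapses: both types share from_full_header, name, and value, and only the encode/decode pair crosses between them.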
Subclasses of these classes for structured headers would have additional methods that would return either specialized object types (datetimes, address objects) or bytes/strings, and these may or may not exist in both Bytes and String forms (that depends on the use cases, I think). I also think that the Bytes and Strings versions of objects that have them can share large portions of their implementation through a base class. I think that makes this approach both easier to code than a single-type-dual-API approach, and more robust in the face of changes. So, those are my thoughts, and I'm sure I haven't thought of all the corner cases. The biggest question is, does it seem like this general scheme is worth pursuing? --David From v+python at g.nevcal.com Tue Jan 26 01:55:15 2010 From: v+python at g.nevcal.com (Glenn Linderman) Date: Mon, 25 Jan 2010 16:55:15 -0800 Subject: [Email-SIG] Thoughts on the general API, and the Header API. In-Reply-To: <20100125201034.190CC1BC4B4@kimball.webabinitio.net> References: <20100125201034.190CC1BC4B4@kimball.webabinitio.net> Message-ID: <4B5E3D73.1070900@g.nevcal.com> On approximately 1/25/2010 12:10 PM, came the following characters from the keyboard of R. David Murray: > So, those are my thoughts, and I'm sure I haven't thought of all the > corner cases. The biggest question is, does it seem like this general > scheme is worth pursuing? Moving your last question to the front, yes. And of course, we do need to think through most of the corner cases before absolutely committing to this approach. But it sounds viable, and avoids an awful lot of duplicate APIs, and would allow simple email clients to be written primarily or even fully in bytes or primarily or even fully in strings. A simple email client that is written fully in strings would "simply" reject/bounce messages that cannot be decoded to strings. 
This is simple; it works for 100% properly encoded messages; in an environment where a client is coded to process messages from some generator, once they are both debugged to the extent of generating messages that can be consumed, then all is well, and no messages would be rejected. This would not be an appropriate model for a general email server; while I'd like to see a popular mailing list submission client that would bounce messages that are improperly formed -- forcing contributors to use RFC conformant clients, and thus encouraging the of those clients that are not RFC conformant, but I'm not going to hold my breath. I think there can be enough power in an API designed in this manner to allow the full nitty-gritty access as required. I have some questions and concerns; I haven't thought through all of them; perhaps some of them are corner cases, if so, they are corner cases that are particularly interesting to me. > OK, so we've agreed that we need to handle bytes and text at pretty > much all API levels, and that the "original data" that informs the data > structure can be either bytes or text. We want to be able to recover > that original data, especially in the bytes case, but arguably also in > the text case. > > Then there's also the issue of transforming a message once we have it in > a data structure, and the consequent issue of what it means to serialize > the resulting modified message. (This last comes up in a very specific > way in issues 968430 and 1670765, which are about preserving the *exact* > byte representation of a multipart/signed message). > > We've also agreed that whatever we decide to do with the __str__ and > __bytes__ magic methods, they will be implemented in terms of other > parts of the API. So I'll ignore those for now. 
> > I think we want to decide on a general API structure that is implemented > at all levels and objects where it makes sense, this being the API > for creating and accessing the following information about any part of > the model: > > * create object from bytes > * create object from text > * obtain the defect list resulting from creating the object > * serialize object to bytes > * serialize object to text > * obtain original input data > * discover the type of the original input data > > At the moment I see no reason to change the API for defects (a .defects > attribute on the object holding a list of defects), so I'm going to > ignore that for now as well. > > I spent a bunch of time trying to define an API for Headers that provided > methods for all of the above. As I was writing the descriptions for > the various methods, and especially trying to specify the "correct" > behavior for both the raw-data-is-bytes and raw-data-is-text cases > (especially for the methods that serialize the data), the whole thing > began to give off a bad code smell. > > After setting it aside for a bit, I had what I think is a little epiphany: > our need is to deal with messages (and parts of messages) that could be > in either bytes form or text form. The things we need to do with them > are similar regardless of their form, and so we have been talking about a > "dual API": one method for bytes and a parallel method for text. > > What if we recognize that we have two different data types, bytes messages > and text messages? Then the "dual API" becomes a more uniform, almost > single, API, but with two possible underlying data types. 
> > In the context specifically of the proposed new Header object, I propose > that we have a StringHeader and a BytesHeader, and an API that looks > something like this: > > StringHeader > > properties: > raw_header (None unless from_full_header was used) > raw_name > raw_value > name > value > > __init__(name, value) > from_full_header(header) > serialize(max_line_len=78, > newline='\n', > use_raw_data_if_possible=False) > encode(charset='utf-8') > If it was stated, I missed it: is from_full_header a way of producing an object from a raw data value? Whereas __init__ would obviously be used to produce one from string or bytes values. If so, then it would be a requirement that this from_full_header API would never produce an exception? Rather it would produce an object with or without defects? Are there any other *Header APIs that would be required not to produce exceptions? I don't yet perceive any. The "charset" parameter... is that not mostly needed for data parts? Headers are either ASCII, or contain self-describing charset info. I guess I could see an intermediate decode from string to some charset, before serialization, as a hint that when generating headers, that all the characters in the header that are not ASCII are in the specified charset... and that that charset is the one to be used in the self-describing serialized ASCII stream? The full generality of the RFCs, however, allows pieces of headers to be encoded using different charsets... with this API, it would seem that that could only be created containing one charset... the serialization primitives were made available, so that piecewise construction of a header value could be done with different charsets, and then the from_full_header API used to create the complex value. I don't see this as a severe limitation, I just want to understand your intention, and document the limitation, or my misunderstanding. 
> BytesHeader would be exactly the same, with the exception of the signature > for serialize and the fact that it has a 'decode' method rather than an > 'encode' method. Serialize would be different only in the fact that > it would have an additional keyword parameter, must_be_7bit=True. > I am not clear on why StringHeader's serialize would not need the must_be_7bit parameter... or do I misunderstand that StringHeader.serialize produces wire-format data? > The magic of this approach is in those encode/decode methods. > > Encoding a StringHeader would yield a BytesHeader containing the same > data, but encoded per RFC2047 using the specified charset. Decoding a > BytesHeader would yield a StringHeader with the same data, but decoded to > unicode per RFC2047, with any 8bit parts decoded (in the unicode sense, > not the RFC2047 sense) using the specified charset (which would default to > ASCII, meaning bare 8bit bytes in headers would throw an error). (What to > with RFC2047 charsets like unknown-8bit is an open question...probably > throw an error). > Would the encoding to/from StringHeader/BytesHeader preserve the from_full_header state and value? > (Encoding or decoding a Message would cause the Message to recursively > encode or decode its subparts. This means you are making a complete > new copy of the Message in memory. If you don't want to do that you > can walk the Message and convert it piece by piece (we could provide a > generator that does this).) > Walking it piece by piece would allow the old pieces to be discarded, to save total memory consumption, where that is appropriate. Perhaps one generator that would be commonly used, would be to convert headers only, and leave MIME data parts alone, accessing and converting them only with the registered methods? This would mean that a "complete copy" wouldn't generally be very big, if the data parts were excluded from implicit conversion. 
Perhaps the "external storage protocol" might also only be defined for MIME data parts, and walking the tree with this generator would not need to reference the MIME data parts, nor bring them in from "external storage". > raw_header would be the data passed in to the constructor if > from_full_header is used, and None otherwise. If encode/decode call > the regular constructor, then this attribute would also act as a flag > as to whether or not the header was constructed from raw input data > or via program. > This _implies_ that from_full_header always accepts raw data bytes... even for the StringHeader. And that implies the need for an implicit decode, and therefore, perhaps a charset parameter? No, not a charset parameter, since they are explicitly contained in the header values. Decode for header values may not need a charset value at all! No comments for the rest. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From rdmurray at bitdance.com Tue Jan 26 03:51:46 2010 From: rdmurray at bitdance.com (R. David Murray) Date: Mon, 25 Jan 2010 21:51:46 -0500 Subject: [Email-SIG] Thoughts on the general API, and the Header API. In-Reply-To: <4B5E3D73.1070900@g.nevcal.com> References: <20100125201034.190CC1BC4B4@kimball.webabinitio.net> <4B5E3D73.1070900@g.nevcal.com> Message-ID: <20100126025146.CF0AC1BC2FF@kimball.webabinitio.net> On Mon, 25 Jan 2010 16:55:15 -0800, Glenn Linderman wrote: > On approximately 1/25/2010 12:10 PM, came the following characters from > the keyboard of R. David Murray: > > So, those are my thoughts, and I'm sure I haven't thought of all the > > corner cases. The biggest question is, does it seem like this general > > scheme is worth pursuing? > > If it was stated, I missed it: is from_full_header a way of producing > an object from a raw data value? Whereas __init__ would obviously be Yes. 
> used to produce one from string or bytes values. If so, then it would Well, StringHeader.from_full_header would take a string as input, while BytesHeader.from_full_header would take bytes as input. __init__ would be used to construct a header in your program: StringHeader('MyHeader', 'my value') BytesHeader(b'MyHeader', b'my value'). > be a requirement that this from_full_header API would never produce an > exception? Rather it would produce an object with or without defects? Yes. > Are there any other *Header APIs that would be required not to produce > exceptions? I don't yet perceive any. I don't think so. from_full_header is the only one involved in parsing raw data. Whether __init__ throws errors or records defects is an open question, but I lean toward it throwing errors. The reason there is an open question is because an email manipulating application may want to convert to text to process an incoming message, and there are things that a BytesHeader can hold that would cause errors when encoded to a StringHeader (specifically, 8 bit bytes that aren't transfer encoded). So it may be that decode, at least, should not throw errors but instead record additional defects in the resulting StringHeader. I think that even in that case __init__ should still throw errors, though; decode could deal with the defects before calling StringHeader.__init__, or (more likely) catch the errors thrown by __init__, fix/record the defects, and call it again. Note, by the way, that by 'raw data' I mean what you are feeding in. Raw data fed to a BytesHeader would be bytes, but raw data fed to a StringHeader would be text (eg: if read from a file in text mode). > The "charset" parameter... is that not mostly needed for data parts? No, if you start with a unicode string in a StringHeader, you need to know what charset to encode the unicode to and therefore to specify as the charset in the RFC 2047 encoded words. > Headers are either ASCII, or contain self-describing charset info.
That's true for BytesHeaders, but not for StringHeaders. So as I said above charset for StringHeader says which charset to put into the encoded words when converting to BytesHeader form. I specified a charset parameter for 'decode' only to handle the case of raw bytes data that contains 8 bit data that is not in encoded words (ie: is not RFC compliant). I am visualizing this as satisfying a use case where you have non-email (non RFC compliant) data where you allow 8 bit data in the header bodies because it's an internal app and you know the encoding. You can then use decode(charset) to decode those BytesHeaders into StringHeaders. > I guess I could see an intermediate decode from string to some charset, > before serialization, as a hint that when generating headers, that all > the characters in the header that are not ASCII are in the specified > charset... and that that charset is the one to be used in the > self-describing serialized ASCII stream? The full generality of the Exactly. > RFCs, however, > allows pieces of headers to be encoded using different charsets... with > this API, it would seem that that could only be created containing one > charset... the serialization primitives were made available, so that > piecewise construction of a header value could be done with different > charsets, and then the from_full_header API used to create the complex > value. I don't see this as a severe limitation, I just want to > understand your intention, and document the limitation, or my > misunderstanding. Right. I'm visualizing the "normal case" being encoding a StringHeader using the default utf-8 charset or another specified charset, turning the words containing non-ASCII characters into encoded words using that charset.
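The RFC 2047 transformation being described here, unicode words becoming charset-tagged encoded words on the wire and back, can be illustrated with the stdlib's existing email.header utilities (these are real, current calls, independent of the proposed new API):

```python
from email.header import Header, decode_header, make_header

# String -> bytes direction: non-ASCII text becomes an RFC 2047
# encoded word in the chosen charset.
wire = Header('Björn', charset='utf-8').encode()
print(wire)  # =?utf-8?b?QmrDtnJu?=

# Bytes -> string direction: decode_header parses encoded words into
# (bytes, charset) chunks; make_header reassembles the unicode value.
print(str(make_header(decode_header(wire))))  # Björn
```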
The utility methods that turn unicode into encoded words would be exposed, and an application that needs to create a header with mixed charsets can use those utilities to build RFC compliant bytes data and pass that to one of the BytesHeader constructors. (Make the common case easy, and the complicated cases possible.) > > BytesHeader would be exactly the same, with the exception of the signature > > for serialize and the fact that it has a 'decode' method rather than an > > 'encode' method. Serialize would be different only in the fact that > > it would have an additional keyword parameter, must_be_7bit=True. > > I am not clear on why StringHeader's serialize would not need the > must_be_7bit parameter... or do I misunderstand that > StringHeader.serialize produces wire-format data? The latter. StringHeader serialize does not produce wire-format data, it produces text (for example, for display to the user). If you want wire format, you encode the StringHeader and use the resulting BytesHeader serialize. > > The magic of this approach is in those encode/decode methods. > > > > Encoding a StringHeader would yield a BytesHeader containing the same > > data, but encoded per RFC2047 using the specified charset. Decoding a > > BytesHeader would yield a StringHeader with the same data, but decoded to > > unicode per RFC2047, with any 8bit parts decoded (in the unicode sense, > > not the RFC2047 sense) using the specified charset (which would default to > > ASCII, meaning bare 8bit bytes in headers would throw an error). (What to > > with RFC2047 charsets like unknown-8bit is an open question...probably > > throw an error). > > > > Would the encoding to/from StringHeader/BytesHeader preserve the > from_full_header state and value? My thought is no. Once you encode/decode the header, your program has transformed it, and I think it is better to treat the original raw data as gone. 
The motivation for this is that the 'raw data' of a StringHeader is the *text* string used to create it. Keeping a bytes string 'raw data' around as well would get us back into the mess that I developed this approach to avoid, where we'd need to specify carefully the difference between handing a header whose 'original' raw data was bytes vs string, for each of the BytesHeader and StringHeader cases. Better, I think, to put the (small) burden on the application programmer: if you want to preserve the original input data, do so by keeping the original object around. Once you mutate the object model, the original raw data for the mutated piece is gone. There are some use-case questions here, though, with regards to preservation of as much original information/format as possible, and how valuable that is. I think we'll have to figure that out by examining concrete use cases in detail. (It is not something that the current email package supports very well, by the way...headers currently get modified significantly in the parse/generate cycle, even without bytes-to-string transformations happening.) > > (Encoding or decoding a Message would cause the Message to recursively > > encode or decode its subparts. This means you are making a complete > > new copy of the Message in memory. If you don't want to do that you > > can walk the Message and convert it piece by piece (we could provide a > > generator that does this).) > > Walking it piece by piece would allow the old pieces to be discarded, to > save total memory consumption, where that is appropriate. > > Perhaps one generator that would be commonly used, would be to convert > headers only, and leave MIME data parts alone, accessing and converting > them only with the registered methods? This would mean that a "complete > copy" wouldn't generally be very big, if the data parts were excluded > from implicit conversion. 
Perhaps the "external storage protocol" might also only be defined for MIME data parts, and walking the tree with this generator would not need to reference the MIME data parts, nor bring them in from "external storage". That's true. The Bytes and String versions of binary MIME parts, which are likely to be the large ones, will probably have a common representation for the payload, and could potentially point to the same object. That breaking of the expectation that 'encode' and 'decode' return new objects (in analogy to how encode and decode of strings/bytes works) might not be a good thing, though. In any case, text MIME parts have the same bytes vs string issues as headers do, and should, IMO, be converted from one to the other on encode/decode. Another possible approach would be some sort of 'encode/decode on demand' system, although that would need to retain a pointer to the original object, which might get us into suboptimal reference cycle difficulties. These bits are implementation details, though, and don't affect the API design. > > raw_header would be the data passed in to the constructor if > > from_full_header is used, and None otherwise. If encode/decode call > > the regular constructor, then this attribute would also act as a flag > > as to whether or not the header was constructed from raw input data > > or via program. > > > > This _implies_ that from_full_header always accepts raw data bytes... > even for the StringHeader. And that implies the need for an implicit > decode, and therefore, perhaps a charset parameter? No, not a charset > parameter, since they are explicitly contained in the header values. Your confusion was my confusing use of the term 'raw data' to mean whatever was input to the from_full_header constructor, which is bytes for a BytesHeader and text for a StringHeader. > Decode for header values may not need a charset value at all! Normally it would not. charset would be useful in decode only for non-RFC compliant headers.
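The non-RFC-compliant case being discussed can be shown with plain bytes (the X-Note header and latin-1 choice are hypothetical, and this uses bare bytes.decode rather than the proposed BytesHeader.decode): a header body carrying bare 8-bit bytes fails under the proposed ASCII default, but decodes cleanly once the application supplies the charset it knows was used.

```python
# A bare-8bit header body, as might come from a non-RFC-compliant
# internal app (hypothetical example, not the proposed API itself).
raw = b'X-Note: caf\xe9 report'

name, _, body = raw.partition(b':')

# Under the proposed default, decode(charset='ascii') would fail:
try:
    body.decode('ascii')
except UnicodeDecodeError:
    print('bare 8bit bytes: error under the default ASCII charset')

# decode(charset='latin-1') succeeds because the app knows the encoding:
print(body.decode('latin-1').strip())  # café report
```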
> No comments for the rest. Thanks for your feedback. --David

From v+python at g.nevcal.com Tue Jan 26 05:10:47 2010 From: v+python at g.nevcal.com (Glenn Linderman) Date: Mon, 25 Jan 2010 20:10:47 -0800 Subject: [Email-SIG] Thoughts on the general API, and the Header API. In-Reply-To: <20100126025146.CF0AC1BC2FF@kimball.webabinitio.net> References: <20100125201034.190CC1BC4B4@kimball.webabinitio.net> <4B5E3D73.1070900@g.nevcal.com> <20100126025146.CF0AC1BC2FF@kimball.webabinitio.net> Message-ID: <4B5E6B47.9090307@g.nevcal.com>

On approximately 1/25/2010 6:51 PM, came the following characters from the keyboard of R. David Murray: > On Mon, 25 Jan 2010 16:55:15 -0800, Glenn Linderman wrote: > >> Are there any other *Header APIs that would be required not to produce >> exceptions? I don't yet perceive any. >> > I don't think so. from_full_header is the only one involved in parsing > raw data. Whether __init__ throws errors or records defects is an open > question, but I lean toward it throwing errors. The reason there is an > open question is because an email manipulating application may want to > convert to text to process an incoming message, and there are things > that a BytesHeader can hold that would cause errors when encoded to a > StringHeader (specifically, 8 bit bytes that aren't transfer encoded). > So it may be that decode, at least, should not throw errors but instead > record additional defects in the resulting StringHeader. I think that > even in that case __init__ should still throw errors, though; decode > could deal with the defects before calling StringHeader.__init__, or > (more likely) catch the errors thrown by __init__, fix/record the defects, > and call it again. > > Note, by the way, that by 'raw data' I mean what you are feeding in. > Raw data fed to a BytesHeader would be bytes, but raw data fed to > a StringHeader would be text (eg: if read from a file in text mode).
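The "decode records defects rather than raising" behavior David describes could look roughly like this (the function name and defect string are made up for illustration):

```python
def decode_header_value(raw_bytes, charset="ascii"):
    """Sketch: decode header bytes to text; instead of raising on
    non-transfer-encoded 8-bit bytes, fix them up and record a defect
    (names here are illustrative, not the actual proposal)."""
    defects = []
    try:
        text = raw_bytes.decode("ascii")
    except UnicodeDecodeError:
        # Fix the problem, record it as a defect, and carry on.
        defects.append("non-ASCII bytes in header")
        text = raw_bytes.decode(charset, errors="replace")
    return text, defects

# A header containing bare 8-bit bytes decodes anyway, with a defect noted:
text, defects = decode_header_value(b"caf\xc3\xa9", charset="utf-8")
assert text == "caf\u00e9" and defects == ["non-ASCII bytes in header"]
```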
> Glad you clarified that; it wasn't obvious, without typed parameters to the APIs. I had assumed that serialize and from_full_header would produce/consume bytes, and I think that showed up in my comments, and you've probably addressed that below. Of course, the reason that I assumed that, is that there are no RFCs to describe a string format email message, either on the wire, in memory, or, particularly, stored in a file. So it is really up to the application to define that, if it wants that. Now since py3 has a natural string format manipulation capability, and since the emaillib wants to provide the interface between them, I suppose it is a somewhat obvious thing that you might want to store a whole email message in string format... I say somewhat obvious, because you thought of it, but I didn't, until you clarified the above. Perhaps the reason I didn't think of it, is simply that all the currently used email message storage containers of which I am aware use wire format. So using string format for that purpose would require inventing a new storage container (perhaps a trivial extension of an existing one, but new, nonetheless). I sort of expected email clients would, given the capabilities of the emaillib, simply continue to save/read in wire format. In fact, it may be the only choice of format that can completely preserve raw format messages for later processing, in the presence of defects. >> The "charset" parameter... is that not mostly needed for data parts? >> > No, if you start with a unicode string in a StringHeader, you need to > know what charset to encode the unicode to and therefore to specify as > the charset in the RFC 2047 encoded words. > > >> Headers are either ASCII, or contain self-describing charset info. >> > That's true for BytesHeaders, but not for StringHeaders. So as I > said above charset for StringHeader says which charset to put into > the encoded words when converting to BytesHeader form. 
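A minimal sketch of the kind of encoded-word utility being discussed, with the charset chosen at encode time (the helper name is hypothetical, and real RFC 2047 handling has more rules around line length and token boundaries):

```python
import base64

def encode_word(text, charset="utf-8"):
    # Build an RFC 2047 'B' (base64) encoded word for a non-ASCII token.
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?b?{payload}?="

# Non-ASCII words get encoded with the specified charset;
# plain ASCII words can pass through untouched.
parts = [w if w.isascii() else encode_word(w) for w in "hello héllo".split()]
assert " ".join(parts) == "hello =?utf-8?b?aMOpbGxv?="
```

An application needing mixed charsets could call such a utility per-word with different charsets and hand the assembled bytes to a BytesHeader constructor, as described above.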
> > I specified a charset parameter for 'decode' only to handle the case > of raw bytes data that contains 8 bit data that is not in encoded words > (ie: is not RFC compliant). I am visualizing this as satisfying a use > case where you have non-email (non RFC compliant) data where you allow > 8 bit data in the header bodies because it's an internal app and you > know the encoding. You can then use decode(charset) to decode those > BytesHeaders into StringHeaders. > > >> I guess I could see an intermediate decode from string to some charset, >> before serialization, as a hint that when generating headers, that all >> the characters in the header that are not ASCII are in the specified >> charset... and that that charset is the one to be used in the >> self-describing serialized ASCII stream? The full generality of the >> > Exactly. > OK, I'm with you now on the charset parameter, for encoding and decoding. >> RFCs, however, >> allows pieces of headers to be encoded using different charsets... with >> this API, it would seem that that could only be created containing one >> charset... the serialization primitives were made available, so that >> piecewise construction of a header value could be done with different >> charsets, and then the from_full_header API used to create the complex >> value. I don't see this as a severe limitation, I just want to >> understand your intention, and document the limitation, or my >> misunderstanding. >> > Right. I'm visualizing the "normal case" being encoding a StringHeader > using the default utf-8 charset or another specified charset, turning > the words containing non-ASCII characters into encoded words using that > charset. The utility methods that turn unicode into encoded words would > be exposed, and an application that needs to create a header with mixed > charsets can use those utilities to build RFC compliant bytes data and > pass that to one of the BytesHeader constructors.
(Make the common case > easy, and the complicated cases possible.) > Thanks for this clarification also. >>> BytesHeader would be exactly the same, with the exception of the signature >>> for serialize and the fact that it has a 'decode' method rather than an >>> 'encode' method. Serialize would be different only in the fact that >>> it would have an additional keyword parameter, must_be_7bit=True. >>> >> I am not clear on why StringHeader's serialize would not need the >> must_be_7bit parameter... or do I misunderstand that >> StringHeader.serialize produces wire-format data? >> > The latter. StringHeader serialize does not produce wire-format data, > it produces text (for example, for display to the user). If you want > wire format, you encode the StringHeader and use the resulting BytesHeader > serialize. > OK, I'm with you here now too. So it may be nice to have a recursive operation that would convert String format stuff to Bytes and then to wire format, in one go, discarding the intermediate Bytes format stuff along the way to avoid three copies of the data, for simple email clients that only use the String format interfaces. >>> The magic of this approach is in those encode/decode methods. >>> >>> Encoding a StringHeader would yield a BytesHeader containing the same >>> data, but encoded per RFC2047 using the specified charset. Decoding a >>> BytesHeader would yield a StringHeader with the same data, but decoded to >>> unicode per RFC2047, with any 8bit parts decoded (in the unicode sense, >>> not the RFC2047 sense) using the specified charset (which would default to >>> ASCII, meaning bare 8bit bytes in headers would throw an error). (What to >>> do with RFC2047 charsets like unknown-8bit is an open question...probably >>> throw an error). >>> >> Would the encoding to/from StringHeader/BytesHeader preserve the >> from_full_header state and value? >> > My thought is no.
Once you encode/decode the header, your program has > transformed it, and I think it is better to treat the original raw data > as gone. The motivation for this is that the 'raw data' of a StringHeader > is the *text* string used to create it. Keeping a bytes string 'raw data' > around as well would get us back into the mess that I developed this > approach to avoid, where we'd need to specify carefully the difference > between handling a header whose 'original' raw data was bytes vs string, > for each of the BytesHeader and StringHeader cases. Better, I think, > to put the (small) burden on the application programmer: if you want to > preserve the original input data, do so by keeping the original object > around. Once you mutate the object model, the original raw data for > the mutated piece is gone. > > There are some use-case questions here, though, with regards to > preservation of as much original information/format as possible, and how > valuable that is. I think we'll have to figure that out by examining > concrete use cases in detail. (It is not something that the current email > package supports very well, by the way...headers currently get modified > significantly in the parse/generate cycle, even without bytes-to-string > transformations happening.) > Not every transformation is intended to be a change. Until there is a change, it would be nice to be able to retain the original byte stream, for invertibility, without requiring that a simple email client deal with bytes interfaces for RFC conformant messages. I hear you regarding the mess... here's a brainstorming idea, tossed out mostly to get your creative juices flowing in this direction, not because I think it is "definitely the way to go". The decode API could, in addition to your description, have an option to preserve itself and the decode charset, within the String object...
If encode "discovers" a preserved Bytes object, and the same charset is provided, it would return the preserved Bytes object, rather than creating a new one. There may be no need to drop the Bytes object explicitly; as it seems the only API for making changes to a Header object is to create a new one, and substitute the new one for the old one. Or maybe from_full_header does a modify. Or maybe the properties are assignable (that is not explicitly stated, by the way). So if there are modify operations, they should drop the Bytes object. >>> (Encoding or decoding a Message would cause the Message to recursively >>> encode or decode its subparts. This means you are making a complete >>> new copy of the Message in memory. If you don't want to do that you >>> can walk the Message and convert it piece by piece (we could provide a >>> generator that does this).) >>> >> Walking it piece by piece would allow the old pieces to be discarded, to >> save total memory consumption, where that is appropriate. >> >> Perhaps one generator that would be commonly used, would be to convert >> headers only, and leave MIME data parts alone, accessing and converting >> them only with the registered methods? This would mean that a "complete >> copy" wouldn't generally be very big, if the data parts were excluded >> from implicit conversion. Perhaps the "external storage protocol" might >> also only be defined for MIME data parts, and walking the tree with this >> generator would not need to reference the MIME data parts, nor bring >> them in from "external storage". >> > That's true. The Bytes and String versions of binary MIME parts, > which are likely to be the large ones, will probably have a common > representation for the payload, and could potentially point to the same > object. That breaking of the expectation that 'encode' and 'decode' > return new objects (in analogy to how encode and decode of strings/bytes > works) might not be a good thing, though.
> Well, one generator could provide the expectation that everything is new; another could provide different expectations. The differences between them, and the tradeoffs would be documented, of course, were both provided. I'm not convinced that treating headers and data exactly the same at all times is a good thing... a convenient option at times, perhaps, but I can see it as a serious inefficiency in many use cases involving large data. This deserves a bit more thought/analysis/discussion, perhaps. More than I have time for tonight, but I may reply again, perhaps after others have responded, if they do. > In any case, text MIME parts have the same bytes vs string issues as > headers do, and should, IMO, be converted from one to the other on > encode/decode. > To me, your first phrase implies that they should share common encode/decode routines, but not the other. I can clearly see a use case where your opinion is the right approach, but I think there are use cases where it might not be... while text MIME parts are generally smaller than binary MIME parts, that is neither a requirement, nor always true (think about transferring an XML format database... could be huge... and is text of sorts -- human decipherable, more easily than hex dumps, but not what I would call "human readable"). > Another possible approach would be some sort of 'encode/decode on demand' > system, although that would need to retain a pointer to the original > object, which might get us into suboptimal reference cycle difficulties. > Hmm. Brainstorming again. decode could minimally create the String format object, with only the Bytes format object and charset parameter set (from the above brainstorming idea). Then the real decoding could be done if the properties are accessed. If the properties are not accessed (because the client/application makes its decisions based on access to other components of the email), the decoding need never be done for some objects. 
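The decode-on-demand brainstorm could be sketched like this (names are invented; a real version would also have to worry about the reference-cycle issue mentioned above):

```python
class LazyStringHeader:
    """Sketch: hold the bytes form and charset at construction time,
    and only decode when the text value is first accessed."""

    def __init__(self, raw_bytes, charset="ascii"):
        self._raw = raw_bytes
        self._charset = charset
        self._text = None  # decoded lazily, then cached

    @property
    def value(self):
        if self._text is None:
            self._text = self._raw.decode(self._charset)
        return self._text

h = LazyStringHeader(b"Subject: caf\xc3\xa9", charset="utf-8")
# No decoding has happened yet:
assert h._text is None
# First property access triggers (and caches) the decode:
assert h.value == "Subject: caf\u00e9"
assert h._text is not None
```

If the application never touches `value`, the decode never runs, which is exactly the saving being discussed for large MIME data parts.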
Perhaps this would also neatly deal with my desire to delay the decode of MIME data parts as well? > These bits are implementation details, though, and don't affect the API > design. > Well, one impact of the above brainstorming would be an interface to create the StringHeader containing the BytesHeader and charset parameters. Or maybe that would be a private interface, not considered to be part of the API? >>> raw_header would be the data passed in to the constructor if >>> from_full_header is used, and None otherwise. If encode/decode call >>> the regular constructor, then this attribute would also act as a flag >>> as to whether or not the header was constructed from raw input data >>> or via program. >>> >>> >> This _implies_ that from_full_header always accepts raw data bytes... >> even for the StringHeader. And that implies the need for an implicit >> decode, and therefore, perhaps a charset parameter? No, not a charset >> parameter, since they are explicitly contained in the header values. >> > Your confusion was my confusing use of the term 'raw data' to mean > whatever was input to the from_full_header constructor, which is > bytes for a BytesHeader and text for a StringHeader. > If we are going to invent a new "string format raw data" element, maybe we should invent a term to describe it, also... maybe "raw data" should be split into "raw bytes" and "raw string", and "raw data" become a synonym for "raw bytes", as that is what it was historically?

-- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

From v+python at g.nevcal.com Fri Jan 29 03:20:24 2010 From: v+python at g.nevcal.com (Glenn Linderman) Date: Thu, 28 Jan 2010 18:20:24 -0800 Subject: [Email-SIG] Thoughts on the general API, and the Header API.
In-Reply-To: <4B5E6B47.9090307@g.nevcal.com> References: <20100125201034.190CC1BC4B4@kimball.webabinitio.net> <4B5E3D73.1070900@g.nevcal.com> <20100126025146.CF0AC1BC2FF@kimball.webabinitio.net> <4B5E6B47.9090307@g.nevcal.com> Message-ID: <4B6245E8.3060402@g.nevcal.com> On approximately 1/25/2010 8:10 PM, came the following characters from the keyboard of Glenn Linderman: >> That's true. The Bytes and String versions of binary MIME parts, >> which are likely to be the large ones, will probably have a common >> representation for the payload, and could potentially point to the same >> object. That breaking of of the expectation that 'encode' and 'decode' >> return new objects (in analogy to how encode and decode of strings/bytes >> works) might not be a good thing, though. > > Well, one generator could provide the expectation that everything is > new; another could provide different expectations. The differences > between them, and the tradeoffs would be documented, of course, were > both provided. I'm not convinced that treating headers and data > exactly the same at all times is a good thing... a convenient option > at times, perhaps, but I can see it as a serious inefficiency in many > use cases involving large data. > > This deserves a bit more thought/analysis/discussion, perhaps. More > than I have time for tonight, but I may reply again, perhaps after > others have responded, if they do. I guess no one else is responding here at the moment. Read the ideas below, and then afterward, consider building the APIs you've suggested on top of them. And then, with the full knowledge that the messages may be either in fast or slow storage, I think that you'll agree that converting the whole tree in one swoop isn't always appropriate... all headers, probably could be. Data, because of its size, should probably be done on demand. 
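The "convert headers now, data on demand" idea could be sketched as a generator along these lines (the dict-based message structure is purely illustrative, standing in for whatever tree the library would actually build):

```python
def walk_decoding_headers(part):
    """Sketch of a tree-walking generator that converts header bytes to
    text in place, while leaving (possibly large) payloads untouched."""
    part["headers"] = {k: v.decode("ascii") for k, v in part["headers"].items()}
    yield part
    for sub in part.get("subparts", []):
        yield from walk_decoding_headers(sub)

msg = {
    "headers": {"Subject": b"hi"},
    "payload": b"\x00" * 10,  # stays bytes: never converted
    "subparts": [
        {"headers": {"Content-Type": b"image/jpeg"}, "payload": b"..."},
    ],
}
converted = list(walk_decoding_headers(msg))
assert converted[0]["headers"]["Subject"] == "hi"
assert isinstance(msg["payload"], bytes)  # payload left alone
```

Because only headers are touched, the "complete copy" stays small even for messages with huge data parts.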
In earlier discussions about the registry, there was the idea of having a registry for transport encoding handling, and a registry for MIME encoding handling. There were also vague comments about doing an external storage protocol "somehow", but it was a vague concept to be defined later, or at least I don't recall any definitions. Given a raw bytes representation of an incoming email, mail servers need to choose how to handle it... this may need to be a dynamic choice based on current server load, as well as the obvious static server resources, as well as configured limits. Unfortunately, the SMTP protocol does not require predeclaration of the size of the incoming DATA part, so servers cannot enforce size limits until they are exceeded. So as the data streams in, a dynamic adjustment to the handling strategy might be appropriate. Gateways may choose to route messages, and stall the input until the output channel is ready to receive it, and basically "pass through" the data, with limited need to buffer messages on disk... unless the output channel doesn't respond... then they might reject the message. An SMTP server should be willing to act as a store-and-forward server, and also must do individual delivery of messages to each RCPT (or at least one per destination domain), so must have a way of dealing with large messages, probably via disk buffering. The case of disk buffering and retrying generally means that the whole message, not just the large data parts, must be stored on disk, so the external storage protocol should be able to deal with that case. The minimal external storage format capability is to store the received bytestream to disk, associate it with the envelope information, and be able to retrieve it in whole later. This would require having the whole thing in RAM at those two points in time, however, and doesn't solve the real problem. Incremental writing and reading to the external storage would be much more useful. 
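The incremental write/read idea above can be pictured as a minimal external-storage object. This is only a sketch of the concept being discussed, not a proposed API; the class and method names (DiskMessageStore, write_chunk, chunks) are hypothetical, and a real server would also handle spool cleanup, errors, and the dynamic in-memory/on-disk choice mentioned above.

```python
import tempfile

class DiskMessageStore:
    """Hypothetical incremental external storage for one raw message.

    Bytes are appended as they arrive from the SMTP DATA stream and read
    back later in fixed-size chunks, so the whole message never has to
    sit in RAM at once, and the envelope stays associated with it.
    """

    def __init__(self, envelope):
        self.envelope = envelope            # e.g. the received RCPT texts
        self._spool = tempfile.TemporaryFile()

    def write_chunk(self, data):
        # Called for each network buffer as it streams in.
        self._spool.write(data)

    def chunks(self, size=64 * 1024):
        # Sequential buffer-at-a-time read-back; rewinds first, so the
        # message can be replayed for each delivery attempt.
        self._spool.seek(0)
        while True:
            buf = self._spool.read(size)
            if not buf:
                break
            yield buf
```

A store-and-forward server could then retry delivery per RCPT by replaying `store.chunks()` without ever materializing the full message in memory.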
Even more useful, would be "partially parsed" seek points. An external storage system that provides "partially parsed" information could include: 1) envelope information. This section is useful to SMTP servers, but not other email tools, so should be optional. This could be a copy of the received RCPT command texts, complete with CRLF endings. 2) header information. This would be everything between DATA and the first CRLF CRLF sequence. 3) data. Pre-MIME this would simply be the rest of the message, but post-MIME it would be usefully more complex. If MIME headers can be observed and parsed as the data passes through, then additional metadata could be saved that could enhance performance of the later processing steps. Such additional metadata could include the beginning of each MIME part, the end of the headers for that part, and the end of the data for that part. The result of saving that information would mean that minimal data (just headers) would need to be read in to create a tree representing the email, the rest could be left in external storage until it is accessed... and then obtained directly from there when needed, and converted to the form required by the request... either the whole part, or some piece in a buffer. So there could be a variety of external storage systems... one that stores in memory, one that stores on disk per the ideas above, and a variety that retain some amount of cached information about the email, even though they store it all on disk. Sounds like this could be a plug-in, or an attribute of a message object creation. But to me, it sounds like the foundation upon which the whole email lib should be built, not something that is shoveled in later. A further note about access to data parts... clearly "data for the whole MIME part" could be provided, but even for a single part that could be large. So access to smaller chunks might be desired. The data access/conversion functions, therefore, should support a buffer-at-a-time access interface. 
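The seek-point metadata described above (start of each MIME part, end of its headers, end of its data) can be sketched as a simple offset scan. This is an illustration only, assuming a flat (non-nested) multipart body, CRLF line endings, and an already-known boundary; a real implementation would handle nesting, preamble/epilogue, and streaming input.

```python
def find_part_offsets(raw, boundary):
    """Sketch: record (part_start, headers_end, part_end) byte offsets
    for each MIME part, so later processing can read just the headers
    and leave the bodies in external storage until accessed.
    """
    marker = b"--" + boundary
    offsets = []
    pos = raw.find(marker)
    while pos != -1:
        nxt = raw.find(marker, pos + len(marker))
        if nxt == -1:
            break  # we hit the final --boundary-- terminator
        # Part content begins after the CRLF that ends the boundary line.
        part_start = raw.find(b"\r\n", pos) + 2
        # Part headers end at the first blank line (CRLF CRLF).
        headers_end = raw.find(b"\r\n\r\n", part_start) + 4
        offsets.append((part_start, headers_end, nxt))
        pos = nxt
    return offsets

raw = b"--XYZ\r\nContent-Type: text/plain\r\n\r\nhello\r\n--XYZ--\r\n"
parts = find_part_offsets(raw, b"XYZ")
```

With such offsets on disk alongside the message, building the header-only tree touches a few kilobytes instead of the whole payload.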
Base64 supports random access easily, unless it contains characters outside the 64-character alphabet that are to be ignored; those could throw off the size calculations. So maybe providing sequential buffer-at-a-time access with rewind is the best that can be done -- quoted-printable doesn't support random access very well, and neither would some sort of compression or encryption technique -- they usually like to start from the beginning -- and those are the sorts of things that I would consider likely to be standardized in the future, to reduce the size of the payload, and to increase the security of the payload. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
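The base64 random-access point above rests on simple arithmetic: every 3 decoded bytes correspond to 4 encoded characters, so a decoded offset maps directly to an encoded offset. A sketch, assuming a "clean" encoding with no line breaks or other ignorable characters (exactly the caveat raised above, since ignorable characters would throw off the arithmetic):

```python
import base64

def read_decoded_range(encoded, start, length):
    """Sketch: random access into a base64 payload without decoding it all.

    Assumes no ignorable characters (e.g. newlines) in `encoded`.
    """
    # Round the decoded start down to a 3-byte group boundary, which
    # corresponds to a 4-character boundary in the encoded stream.
    group = start // 3
    enc_start = group * 4
    # Enough 4-character groups to cover start + length decoded bytes
    # (ceiling division written with negatives).
    enc_end = -(-(start + length) // 3) * 4
    chunk = base64.b64decode(encoded[enc_start:enc_end])
    skip = start - group * 3
    return chunk[skip:skip + length]

payload = base64.b64encode(bytes(range(60)))
assert read_decoded_range(payload, 10, 7) == bytes(range(10, 17))
```

Quoted-printable offers no such mapping, since its expansion per byte varies, which is why sequential access with rewind may indeed be the common denominator.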