From esj at harvee.org Mon Jan 4 03:25:56 2010 From: esj at harvee.org (Eric S. Johansson) Date: Sun, 03 Jan 2010 21:25:56 -0500 Subject: [Email-SIG] Design Thoughts Summary In-Reply-To: <4B696451-1885-4670-AB7C-C283132F7638@python.org> References: <1258239230.74.7080@mint-julep.mondoinfo.com> <4B696451-1885-4670-AB7C-C283132F7638@python.org> Message-ID: <4B4151B4.9090406@harvee.org> On 11/15/2009 1:01 PM, Barry Warsaw wrote: > On Nov 14, 2009, at 5:12 PM, Matthew Dixon Cowles wrote: > >> Thank you. I am virtually 100% in agreement that this document >> represents what people have agreed on and that it represents what is >> sensible to do. > > As am I. Fantastic work in pulling this all together David. > > I'm a bit slammed right now, but a quick comment... > >>> * The API needs to at a minimum have hooks available for an >>> application to store data on disk rather than holding everything in >>> memory. >> >> I remain unconvinced that this is worth the trouble. Yes, the Twisted >> folks say that they can't use the email module because they may be >> receiving hundreds of messages at once. But can anyone do anything >> with hundreds of messages at once other than write them to disk? >> >> And would anything actually be improved by reading hundreds of files >> at once, in small chunks, looking for MIME separators? > > Mailman has a similar problem. Even if we get just a few big messages, > they can crush the system. You could argue that the MTA should just > block messages with 50MB bodies if the underlying Mailman code can't > handle it, but I still think we can do better. > > I think we're fine if all the headers and MIME structure were kept in > memory it would be fine. But I do think we just want to be able to never > store the raw body content in memory (perhaps unless needed, on demand). > Mailman for example rarely cares about the bytes of say an image/jpeg body. 
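Barry's suggestion (keep headers and MIME structure in memory, but never hold raw body content there) could be sketched roughly as follows. This is only an illustration, not a proposed email-package API: `DiskBackedBody`, `shed_bodies`, and the `body_ref` attribute are invented names.

```python
import email.parser
import os
import tempfile

class DiskBackedBody:
    """Hold a MIME part's decoded payload in a temp file instead of RAM."""
    def __init__(self, data):
        fd, self._path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            f.write(data)

    def read(self):
        with open(self._path, "rb") as f:
            return f.read()

def shed_bodies(msg):
    """Keep headers and MIME structure in memory; move leaf payloads to disk."""
    for part in msg.walk():
        if not part.is_multipart():
            data = part.get_payload(decode=True) or b""
            part.body_ref = DiskBackedBody(data)  # hypothetical attribute name
            part.set_payload("")                  # free the in-memory bytes

raw = b"From: esj@example.org\nContent-Type: text/plain\n\nhello world\n"
msg = email.parser.BytesParser().parsebytes(raw)
shed_bodies(msg)
# Headers remain queryable in memory; the body now lives on disk.
```

A real implementation would have to decide when (and whether) to pull a body back into memory on demand, which is the "perhaps unless needed" part of Barry's comment.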
For what it's worth, I've also experienced the same "crushing blow" caused by large messages in memory. In my case, I immediately dumped all messages to a database (unfortunately, SQL), extracted the essential metadata I needed for my application, and kept it in the record so that I could index and search on it. I also stored the raw message and the processed message in the database. The reason was that I wanted to be able to analyze the raw message if something failed (usually a Unicode failure), and to be able to retrieve the e-mail object from its JSON container for quicker processing than I would get by parsing the raw message again (and again). This experience makes me a supporter of an e-mail module that has a storage container object that can be searched by any number of metadata fields. These metadata fields would consist of internal (to the message) data sources and external data sources. I believe it would be necessary to specify what searchable fields you want before creating the storage container. I hope that it would be possible to make the storage container backend independent of the storage technology, so that people like me, who will detest SQL until the heat death of the universe, can use something else to store mail messages. I would also recommend not depending on the file system, because in my experience performance declined dramatically around 500 messages (ext3 and jfs); even an SQL database as simple as SQLite was significantly faster. Thanks to all who are working on this project. I wish I could participate more, but life has other plans for me.
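The container Eric describes (searchable fields declared before the store is created, backend left pluggable) might be sketched like this. Every name here (`MessageStore`, `DictStore`, the field names) is invented for illustration; a real backend could be dbm, files, SQLite, or anything else behind the same interface.

```python
from abc import ABC, abstractmethod

class MessageStore(ABC):
    """Backend-independent container; searchable fields are fixed up front."""
    @abstractmethod
    def add(self, key, raw_bytes, metadata): ...
    @abstractmethod
    def search(self, **criteria): ...

class DictStore(MessageStore):
    """Minimal in-memory backend, just to show the shape of the API."""
    def __init__(self, indexed_fields):
        self._indexed = set(indexed_fields)
        self._messages = {}  # key -> (raw bytes, metadata dict)
        # One inverted index per declared field: value -> set of keys.
        self._index = {f: {} for f in indexed_fields}

    def add(self, key, raw_bytes, metadata):
        self._messages[key] = (raw_bytes, metadata)
        for field in self._indexed & metadata.keys():
            self._index[field].setdefault(metadata[field], set()).add(key)

    def search(self, **criteria):
        hits = None
        for field, value in criteria.items():
            keys = self._index[field].get(value, set())
            hits = keys if hits is None else hits & keys
        return sorted(hits or set())
```

Declaring the indexed fields at construction time is what lets a backend build whatever index structure suits it, without the caller ever writing a query language.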
From barry at python.org Mon Jan 4 15:57:55 2010 From: barry at python.org (Barry Warsaw) Date: Mon, 4 Jan 2010 09:57:55 -0500 Subject: [Email-SIG] Design Thoughts Summary In-Reply-To: <4B4151B4.9090406@harvee.org> References: <1258239230.74.7080@mint-julep.mondoinfo.com> <4B696451-1885-4670-AB7C-C283132F7638@python.org> <4B4151B4.9090406@harvee.org> Message-ID: <691D78A9-1931-4276-9AA0-BD8061536104@python.org> On Jan 3, 2010, at 9:25 PM, Eric S. Johansson wrote: > I hope that it would be possible to make the storage container backend Storage Technology independent so that people like me who will detest SQL until the heat death of the universe can use something else to store mail messages. I would also recommend not depending on the file system because in my experience, performance declined dramatically around 500 messages (ext3 and jfs). Even though I was using an SQL database (SQLite), it was significantly faster using the database. The standard library should probably have an API and one or two reference implementations. I think you can make a file-system based implementation moderately non-horrible. :) -Barry From janssen at parc.com Mon Jan 4 18:45:43 2010 From: janssen at parc.com (Bill Janssen) Date: Mon, 4 Jan 2010 09:45:43 PST Subject: [Email-SIG] Design Thoughts Summary In-Reply-To: <691D78A9-1931-4276-9AA0-BD8061536104@python.org> References: <1258239230.74.7080@mint-julep.mondoinfo.com> <4B696451-1885-4670-AB7C-C283132F7638@python.org> <4B4151B4.9090406@harvee.org> <691D78A9-1931-4276-9AA0-BD8061536104@python.org> Message-ID: <97249.1262627143@parc.com> Barry Warsaw wrote: > On Jan 3, 2010, at 9:25 PM, Eric S. Johansson wrote: > > > I hope that it would be possible to make the storage container backend Storage Technology independent so that people like me who will detest SQL until the heat death of the universe can use something else to store mail messages.
I would also recommend not depending on the file system because in my experience, performance declined dramatically around 500 messages (ext3 and jfs). Even though I was using an SQL database (SQLite), it was significantly faster using the database. > > The standard library should probably have an API and one or two reference implementations. I think you can make a file-system based implementation moderately non-horrible. :) > > -Barry Considering all the IMAP implementations using file system containers, I think you're right :-). Bill From Sypniewski at rowan.edu Mon Jan 11 17:54:58 2010 From: Sypniewski at rowan.edu (Sypniewski, Bernard Paul) Date: Mon, 11 Jan 2010 11:54:58 -0500 Subject: [Email-SIG] smtp question Message-ID: <56938804844D4D488FFE514C657AC3E1022F3B2D@EX2K3-2.rowanads.rowan.edu> Dear SIG members: I am writing a reading comprehension program along with a teacher of Developmental Reading here at Rowan. She wants results of exercises emailed to her. Here is the problem that we have encountered. The usual Python email modules require, as is only sensible, certain email configuration information. Because of the audience for which we are writing the program, not only is it unlikely that students will have the required information, but they may also not be literate enough to understand the instructions about what information to get and how to get it. So, here I am writing to you asking whether any of you know a way that we can get the required SMTP and POP3 (we will distribute the program to others) information through code so that we do not have to ask the students for information that they may have significant difficulty understanding and obtaining. We are working exclusively on Windows platforms. Bernard Sypniewski Department of Computer Science Rowan University - Camden Campus Broadway and Cooper Street Camden, NJ 08102 USA -------------- next part -------------- An HTML attachment was scrubbed...
URL: From matt at mondoinfo.com Mon Jan 11 19:22:08 2010 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Mon, 11 Jan 2010 12:22:08 -0600 (CST) Subject: [Email-SIG] smtp question Message-ID: <1263234060.75.28452@mint-julep.mondoinfo.com> [Sent direct reply separately] Dear Bernard, > know a way that we can get the required SMTP and POP3 (we will > distribute the program to others) information through code so that > we do not have to ask the students Someone who knows more about Windows than I do may correct me if I'm mistaken, but I doubt that there's a simple and reliable way to do that. If you know what email program they're using, you might be able to dig the information out of its configuration. But even if you went to that trouble for several email programs, there would be someone who was using a different one. You could do DNS lookups to (try to) get the user's domain name and then try the hosts smtp, pop, and mail. But POP requires a username and password and SMTP usually does these days. So even if that were successful, you'd still have to ask for those. If you were willing to spend the time and money, you could set up your own SMTP server that all instances of your program would be hard-coded to use. You'd want to be very careful about its configuration of course. Many networks block port 25, so you'd probably want to use authenticated submission on port 587 with STARTTLS (or SMTP over SSL on port 465). I think that some networks even block those ports, so you'd probably also want to have your server listening on some unusual port. If you chose to go that route, I'd suggest that you have that machine configured and run by a sysadmin who knows exactly what they're doing.
Regards, Matt From mark at msapiro.net Mon Jan 11 19:55:07 2010 From: mark at msapiro.net (Mark Sapiro) Date: Mon, 11 Jan 2010 10:55:07 -0800 Subject: [Email-SIG] smtp question In-Reply-To: <1263234060.75.28452@mint-julep.mondoinfo.com> References: <1263234060.75.28452@mint-julep.mondoinfo.com> Message-ID: <4B4B740B.1060207@msapiro.net> On 1/11/2010 10:22 AM, Matthew Dixon Cowles wrote: > >> know a way that we can get the required SMTP and POP3 (we will >> distribute the program to others) information through code so that >> we do not have to ask the students > > Someone who knows more about Windows than I do may correct me if I'm > mistaken, but I doubt that there's a simple and reliable way to do > that. You might look at what Thunderbird 3 does when you set up a new POP3 account. It asks for the email address first, then guesses at possible POP3 and SMTP server names based on the address's domain, trying names like mail, pop3 and smtp with the various standard ports until it finds ones that work. Sometimes it succeeds admirably and sometimes it fails at one or the other or both. It may also have additional knowledge of popular domains. You probably can't do better than that. -- Mark Sapiro The highway is for gamblers, San Francisco Bay Area, California better use your sense - B.
Dylan From avbidder at fortytwo.ch Mon Jan 11 21:54:36 2010 From: avbidder at fortytwo.ch (Adrian von Bidder) Date: Mon, 11 Jan 2010 21:54:36 +0100 Subject: [Email-SIG] smtp question In-Reply-To: <56938804844D4D488FFE514C657AC3E1022F3B2D@EX2K3-2.rowanads.rowan.edu> References: <56938804844D4D488FFE514C657AC3E1022F3B2D@EX2K3-2.rowanads.rowan.edu> Message-ID: <201001112154.36796@fortytwo.ch> Hi, On Monday 11 January 2010 17:54:58 Sypniewski, Bernard Paul wrote: > So, here I am writing to you asking whether any of you know a way that we > can get the required SMTP and POP3 (we will distribute the program to > others) information through code so that we do not have to ask the > students for information that they may have significant difficulty > understanding and obtaining. We are working exclusively on WINDOWS > platforms. For the "guessing" solution see the other emails. As an engineer I get goosepimples when I read about such solutions... As far as I can see the only "serious" thing you could do is, since you're on pure Windows(tm) anyway, use some kind of Windows specific system wide email interface to send the email (MAPI? I'm not a Windows person at all but I *think* I remember some trojans are/were using this to send their spam.) Of course that might depend on the people having Outlook (Express?) configured, and I don't know if a Python wrapper for this interface exists. cheers -- vbi -- > I have confiscated his Commodore 64 and acoustic coupler. You mean he couples acoustically with Commodore 64?! That explains a lot. Brings a new meaning to the word "audiophile", at least. -- Dolphin in news.admin.net-abuse.email -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 389 bytes Desc: This is a digitally signed message part. 
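The Thunderbird-style guessing discussed in the preceding messages could be sketched roughly like this. The host names and port ordering below are conventional guesses only, and even a successful probe still leaves the username/password problem Matthew raised:

```python
import socket

def candidate_servers(address):
    """List (host, port, protocol) guesses for an address's domain."""
    domain = address.rpartition("@")[2]
    guesses = []
    for host in ("pop." + domain, "pop3." + domain, "mail." + domain, domain):
        for port in (995, 110):        # POP3 over SSL, then plain POP3
            guesses.append((host, port, "pop3"))
    for host in ("smtp." + domain, "mail." + domain, domain):
        for port in (587, 465, 25):    # submission, SMTPS, classic SMTP
            guesses.append((host, port, "smtp"))
    return guesses

def first_reachable(guesses, timeout=3.0):
    """Probe each guess with a TCP connect; return the first that answers."""
    for host, port, proto in guesses:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host, port, proto
        except OSError:
            continue
    return None
```

A TCP connect only proves something is listening; a fuller probe would also read the server greeting and attempt authentication, which is roughly what Thunderbird's wizard does before declaring success.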
URL: From stef.mientki at gmail.com Fri Jan 15 00:54:39 2010 From: stef.mientki at gmail.com (Stef Mientki) Date: Fri, 15 Jan 2010 00:54:39 +0100 Subject: [Email-SIG] smtp question In-Reply-To: <201001112154.36796@fortytwo.ch> References: <56938804844D4D488FFE514C657AC3E1022F3B2D@EX2K3-2.rowanads.rowan.edu> <201001112154.36796@fortytwo.ch> Message-ID: <4B4FAEBF.5010008@gmail.com> On 11-01-2010 21:54, Adrian von Bidder wrote: > Hi, > > On Monday 11 January 2010 17:54:58 Sypniewski, Bernard Paul wrote: > >> So, here I am writing to you asking whether any of you know a way that we >> can get the required SMTP and POP3 (we will distribute the program to >> others) information through code so that we do not have to ask the >> students for information that they may have significant difficulty >> understanding and obtaining. We are working exclusively on WINDOWS >> platforms. >> > For the "guessing" solution see the other emails. As an engineer I get > goosepimples when I read about such solutions... > > As far as I can see the only "serious" thing you could do is, since you're on > pure Windows(tm) anyway, use some kind of Windows specific system wide email > interface to send the email (MAPI? I'm not a Windows person at all but I > *think* I remember some trojans are/were using this to send their spam.) > > Of course that might depend on the people having Outlook (Express?) > configured, and I don't know if a Python wrapper for this interface exists. > > These days, more and more people (on Windows) don't have an email client installed; they use a web interface for mail (and other programs), so the only reliable way seems to be to set up your own mail server. cheers, Stef > cheers > -- vbi > > > > > _______________________________________________ > Email-SIG mailing list > Email-SIG at python.org > Your options: http://mail.python.org/mailman/options/email-sig/stef.mientki%40gmail.com -------------- next part -------------- An HTML attachment was scrubbed...
URL: From rdmurray at bitdance.com Mon Jan 18 21:44:51 2010 From: rdmurray at bitdance.com (R. David Murray) Date: Mon, 18 Jan 2010 15:44:51 -0500 Subject: [Email-SIG] Kick starting email 6.0 development Message-ID: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> With Barry's encouragement, and based on the summaries I prepared for the Email Wiki and discussion with Barry, I submitted a proposal to the PSF to fund me to do development work on the email 6 module. As you will read in the proposal[1], this works for me since I do contract programming work as part of my income, and with funding I can devote that portion of my time (and more, I hope) to this project. The PSF did not fully fund the proposal[2]. However, they did provide seed funding covering the first two months, and will be helping with fundraising for the rest in the form of "fiscal sponsorship" (from what I understand we'll learn more about that at PyCon). In addition, for every four dollars raised from other sources they'll put in another dollar. So far I've done the first item on the task list in the budget projection, the review of all outstanding issues in the tracker. The results are posted[3] on my web site, which I have linked from the wiki. (I put them on my website rather than the wiki because I wrote a little Sphinx plugin to query the tracker for issue titles and status, so that links are generated automatically and the page will get updated as bugs are closed or the titles changed.) If you know of any bug reports I've missed, or disagree with my analysis of any of the bugs, please let me know. From here the next steps are to start refactoring and adding to the tests, and to move the discussion of the new API into the territory of concrete proposals. I'll be posting about both of those as the week goes along. My thought is that all the work should be done using a DVCS, to give maximum opportunity for others to contribute as their time allows.
I welcome thoughts about how best to set this up to provide maximum access for this community of interest. I'm also interested to know who will be at PyCon and interested in BOF and/or sprint activities involving the email package. --David [1] http://www.bitdance.com/test/projects/email6/psfproposal/ [2] http://www.python.org/psf/records/board/minutes/2009-12-14/#funding-for-python-3-email-module [3] http://www.bitdance.com/test/projects/email6/issues/ From barry at python.org Mon Jan 18 23:24:20 2010 From: barry at python.org (Barry Warsaw) Date: Mon, 18 Jan 2010 17:24:20 -0500 Subject: [Email-SIG] Kick starting email 6.0 development In-Reply-To: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> References: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> Message-ID: <20100118172420.029cdbf7@freewill> On Jan 18, 2010, at 03:44 PM, R. David Murray wrote: >With Barry's encouragement, and based on the summaries I prepared for >the Email Wiki and discussion with Barry, I submitted a proposal to the >PSF to fund me to do development work on the email 6 module. As you >will read in the proposal[1], this works for me since I do contract >programming work as part of my income, and with funding I can devote >that portion of my time (and more, I hope) to this project. This is awesome David! >My thought is that all the work should be done using a DVCS, to give >maximum opportunity for others to contribute as their time allows. >I welcome thoughts about how best to set this up to provide maximum >access for this community of interest. Since Python itself has no DVCS still, I might propose using Bazaar and Launchpad to track the work. I already have three branches that may or may not have anything useful in them (they represent my previous attempts at this): * lp:~barry/+junk/email-ng * lp:~barry/python/30email * lp:~barry/python/email6 I'm in the process of re-establishing code imports of the various Python branches on Launchpad. 
They had been using the bzr mirrors on code.python.org, but those haven't been updated in a very long time. I'm going to blow those away and re-import from the Subversion branches. After that's working it should allow you to bzr branch any active Python branch and hack on things from there. That might make the most sense since some of the bugs you've identified affect more than just the email package. >I'm also interested to know who will be at PyCon and interested in BOF >and/or sprint activities involving the email package. I'll be at PyCon though I don't yet know exactly what I'll be sprinting on. email package is a possibility. The sprint sign up page is here: http://us.pycon.org/2010/sprints/signup/ >[3] http://www.bitdance.com/test/projects/email6/issues/ That one is *scary* :). -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 835 bytes Desc: not available URL: From orsenthil at gmail.com Tue Jan 19 02:32:26 2010 From: orsenthil at gmail.com (Senthil Kumaran) Date: Tue, 19 Jan 2010 07:02:26 +0530 Subject: [Email-SIG] Kick starting email 6.0 development In-Reply-To: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> References: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> Message-ID: <20100119013226.GA5255@ubuntu.ubuntu-domain> On Mon, Jan 18, 2010 at 03:44:51PM -0500, R. David Murray wrote: > From here the next steps are to start refactoring and adding to the tests, > and to move the discussion of the new API into the territory of concrete > proposals. I'll be posting about both of those as the week goes along. I have been following the discussions actively. I am willing to get involved and shall pitch in with bug fixes/tests and other things. > My thought is that all the work should be done using a DVCS, to give > maximum opportunity for others to contribute as their time allows.
> I welcome thoughts about how best to set this up to provide maximum > access for this community of interest. > > I'm also interested to know who will be at PyCon and interested in BOF > and/or sprint activities involving the email package. Yes, I would be there. I have identified urllib related issues/enhancements, but I see email module work is something I am interested in too. Looked at the list and I see a couple of interesting ones. So, here is a +1. Thank you, Senthil -- we just switched to Sprint. From rdmurray at bitdance.com Thu Jan 21 20:45:25 2010 From: rdmurray at bitdance.com (R. David Murray) Date: Thu, 21 Jan 2010 14:45:25 -0500 Subject: [Email-SIG] Kick starting email 6.0 development In-Reply-To: <20100118172420.029cdbf7@freewill> References: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> <20100118172420.029cdbf7@freewill> Message-ID: <20100121194525.74C6C18F90C@kimball.webabinitio.net> On Mon, 18 Jan 2010 17:24:20 -0500, Barry Warsaw wrote: > On Jan 18, 2010, at 03:44 PM, R. David Murray wrote: > Since Python itself has no DVCS still, I might propose using Bazaar and > Launchpad to track the work. I already have three branches that may or may > not have anything useful in them (they represent my previous attempts at > this): > > * lp:~barry/+junk/email-ng > * lp:~barry/python/30email > * lp:~barry/python/email6 I've looked briefly at your email6 branch, and will take a look at the others. But unless you think there are specifically relevant bits to look at, I probably won't look too closely until we have the new API design roughed out. Did you start doing any test refactoring in any of the branches?
After that's working it should allow you to bzr branch any > active Python branch and hack on things from there. That might make > the most sense since some of the bugs you've identified affect more > than just the email package. I think since you are the email czar and you are deeply involved with bzr and launchpad, that this makes sense :) I'm not sure how best to integrate this with Launchpad to make it public, though. I created a project (python-email6), pulled py3k according to your instructions on python-dev, pushed it to launchpad, and linked it to the project as trunk. Was that the right thing to do? Or should I request membership in the Python team and create an 'email6' branch there instead? > >I'm also interested to know who will be at PyCon and interested in BOF > >and/or sprint activities involving the email package. > > I'll be at PyCon though I don't yet know exactly what I'll be sprinting on. > email package is a possibility. The sprint sign up page is here: > > http://us.pycon.org/2010/sprints/signup/ The sprint page doesn't list the core sprint yet. Any idea who is organizing it this year? --David From barry at python.org Fri Jan 22 15:30:15 2010 From: barry at python.org (Barry Warsaw) Date: Fri, 22 Jan 2010 09:30:15 -0500 Subject: [Email-SIG] Kick starting email 6.0 development In-Reply-To: <20100121194525.74C6C18F90C@kimball.webabinitio.net> References: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> <20100118172420.029cdbf7@freewill> <20100121194525.74C6C18F90C@kimball.webabinitio.net> Message-ID: <20100122093015.1c1bcc3f@freewill> On Jan 21, 2010, at 02:45 PM, R. David Murray wrote: >I've looked briefly at your email6 branch, and will take a look at >the others. But unless you think there are specifically relevant >bits to look at, I probably won't look too closely until we have >the new API design roughed out. > >Did you start doing any test refactoring in any of the branches?
I did, but honestly it's been so long since I hacked on these branches, I don't remember what's in them. :/ email6 has, I think, the latest changes aligned to my thinking about API improvements, and probably also includes some test refactoring. email-ng is just the email package and contains some header module refactoring and a start on doctests for headers. The nice thing about using more doctests is that it can serve as the basis for improved documentation. 30email has some additional refactoring (that might be similar to what's in email6) along with the results of work done at last year's pycon. I'm sorry that it's so disorganized, but you seem to be pretty good at untangling nasty knots. :) Probably best to start with email6 and then just review the top few revisions in the other branches to see if there's anything useful in them. >> I'm in the process of re-establishing code imports of the various >> Python branches on Launchpad. They had been using the bzr mirrors on >> code.python.org, but those haven't been updated in a very long time. >> I'm going to blow those away and re-import from the Subversion >> branches. After that's working it should allow you to bzr branch any >> active Python branch and hack on things from there. That might make >> the most sense since some of the bugs you've identified affect more >> than just the email package. > >I think since you are the email czar and you are deeply involved with >bzr and launchpad, that this makes sense :) Yay! :) >I'm not sure how best to integrate this with Launchpad to make it public, >though. I created a project (python-email6), pulled py3k according to >your instructions on python-dev, pushed it to launchpad, and linked it >to the project as trunk. Was that the right thing to do? Or should >I request membership in the Python team and create an 'email6' branch >there instead? What you did isn't too bad actually.
Whether it was the right thing to do depends on how we want to manage commit access to the email branch. I see no problem adding you to the ~python-dev team on Launchpad and creating the branch in the python project. That just means everyone in ~python-dev could commit to the branch (by default). I'd have no problem with that. Alternatively, we'd need to create a team for the python-email6 project so that folks other than just you have commit access to the branch. It's only a little more work that way, just because it's more things to set up, but it's not that big of a deal. So the question is: how locked down do you want to make this branch, and are there folks who would like to commit to this branch that shouldn't be added to ~python-dev? Either way, we have to remember to occasionally merge the py3k branch back into our branch so as to keep up on changes there. >> >I'm also interested to know who will be at PyCon and interested in BOF >> >and/or sprint activities involving the email package. >> >> I'll be at PyCon though I don't yet know exactly what I'll be sprinting on. >> email package is a possibility. The sprint sign up page is here: >> >> http://us.pycon.org/2010/sprints/signup/ > >The sprint page doesn't list the core sprint yet. Any idea who is >organizing it this year? It might be me, but I'm not sure :). I think Brett is not going to attend the sprints so I'm slated to give the intro-to-sprinting talk. I'm not sure if that also means I've "volunteered" to organize the core sprint or not. I'll try to figure that out. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 835 bytes Desc: not available URL: From rdmurray at bitdance.com Fri Jan 22 17:36:59 2010 From: rdmurray at bitdance.com (R. 
David Murray) Date: Fri, 22 Jan 2010 11:36:59 -0500 Subject: [Email-SIG] Kick starting email 6.0 development In-Reply-To: <20100122093015.1c1bcc3f@freewill> References: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> <20100118172420.029cdbf7@freewill> <20100121194525.74C6C18F90C@kimball.webabinitio.net> <20100122093015.1c1bcc3f@freewill> Message-ID: <20100122163659.8DB641BC378@kimball.webabinitio.net> On Fri, 22 Jan 2010 09:30:15 -0500, Barry Warsaw wrote: > On Jan 21, 2010, at 02:45 PM, R. David Murray wrote: > email-ng is just the email package and contains some header module refactoring > and a start on doctests for headers. The nice thing about using more doctests > is that it can serve as the basis for improved documentation. I will take a look at that, since I'm actively working on Header right now. As for doctests...I agree that they are good for helping to document the API, but IMO we are going to need more than that to get a really good set of validation tests. I have some thoughts about that that I'm experimenting with and will report back when I'm satisfied that my idea is at least workable. (Whether it's a *good* idea remains to be seen :) > I'm sorry that it's so disorganized, but you seem to be pretty good at > untangling nasty knots. :) Probably best to start with email6 and then just > review the top few revisions in the other branches to see if there's anything > useful in them. No problem; at least you got started on it :) > What you did isn't too bad actually. Whether it was the right thing to do > depends on how we want to manage commit access to the email branch. I see no > problem adding you to the ~python-dev team on Launchpad and creating the > branch in the python project. That just means everyone in ~python-dev could > commit to the branch (by default). I'd have no problem with that. > > Alternatively, we'd need to create a team for the python-email6 project so > that folks other than just you have commit access to the branch. 
It's only a > little more work that way, just because it's more things to set up, but it's > not that big of a deal. So the question is: how locked down do you want to > make this branch, and are there folks who would like to commit to this branch > that shouldn't be added to ~python-dev? My goal in using the DVCS is to make it easy for anyone to submit patches for review, which I believe launchpad facilitates. ("Propose for merge", right?) I don't want the branch locked too tightly, I'd rather facilitate active contribution. So possibly making an email6 team is better, but since I don't know what the consequences of adding someone to ~python-dev are, I don't know what would make it a bad thing for someone to be added to it :). > Either way, we have to remember to occasionally merge the py3k branch back > into our branch so as to keep up on changes there. Yes. That will probably be my job. > >The sprint page doesn't list the core sprint yet. Any idea who is > >organizing it this year? > > It might be me, but I'm not sure :). I think Brett is not going to attend the > sprints so I'm slated to give the intro-to-sprinting talk. I'm not sure if > that also means I've "volunteered" to organize the core sprint or not. I'll > try to figure that out. Well if you are we could try to hijack the whole core sprint to work on email :) Seriously, though, if I can be of assistance, let me know. --David From barry at python.org Fri Jan 22 17:45:59 2010 From: barry at python.org (Barry Warsaw) Date: Fri, 22 Jan 2010 11:45:59 -0500 Subject: [Email-SIG] Kick starting email 6.0 development In-Reply-To: <20100122163659.8DB641BC378@kimball.webabinitio.net> References: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> <20100118172420.029cdbf7@freewill> <20100121194525.74C6C18F90C@kimball.webabinitio.net> <20100122093015.1c1bcc3f@freewill> <20100122163659.8DB641BC378@kimball.webabinitio.net> Message-ID: <20100122114559.4bb73054@freewill> On Jan 22, 2010, at 11:36 AM, R. 
David Murray wrote: >I will take a look at that, since I'm actively working on Header >right now. As for doctests...I agree that they are good for helping to >document the API, but IMO we are going to need more than that to get a >really good set of validation tests. I have some thoughts about that >that I'm experimenting with and will report back when I'm satisfied that >my idea is at least workable. (Whether it's a *good* idea remains to >be seen :) I definitely agree we can't use doctests exclusively. Nobody in their right mind would ever want to read those things! A good mix of (separate file) doctest and unittests would probably work, but I'm eager to hear how your experiment turns out. :) >My goal in using the DVCS is to make it easy for anyone to submit patches >for review, which I believe launchpad facilitates. ("Propose for merge", >right?) Yep. It's a great way to go. I also suggest that as things stabilize we move to a model where branches proposed for merging are always linked to a bug. But that might not always be feasible while there's lots of churn. >I don't want the branch locked too tightly, I'd rather facilitate active >contribution. So possibly making an email6 team is better, but since I don't >know what the consequences of adding someone to ~python-dev are, I don't know >what would make it a bad thing for someone to be added to it :). I'm thinking it does make sense to make an email6 team and keep this branch in a separate package. I've just created the team: https://edge.launchpad.net/~email6 and made you a co-admin. I think it's up to you to make the team the owner of the python-email project. >Well if you are we could try to hijack the whole core sprint to work on >email :) That's like the logical extension of Zawinski's law. :) >Seriously, though, if I can be of assistance, let me know. Thanks! -Barry -------------- next part -------------- A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 835 bytes Desc: not available URL: From rdmurray at bitdance.com Fri Jan 22 19:04:33 2010 From: rdmurray at bitdance.com (R. David Murray) Date: Fri, 22 Jan 2010 13:04:33 -0500 Subject: [Email-SIG] Kick starting email 6.0 development In-Reply-To: <20100122114559.4bb73054@freewill> References: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> <20100118172420.029cdbf7@freewill> <20100121194525.74C6C18F90C@kimball.webabinitio.net> <20100122093015.1c1bcc3f@freewill> <20100122163659.8DB641BC378@kimball.webabinitio.net> <20100122114559.4bb73054@freewill> Message-ID: <20100122180433.E7B351BC3AE@kimball.webabinitio.net> On Fri, 22 Jan 2010 11:45:59 -0500, Barry Warsaw wrote: > On Jan 22, 2010, at 11:36 AM, R. David Murray wrote: > I'm thinking it does make sense to make an email6 team and keep this branch in > a separate package. I've just created the team: > > https://edge.launchpad.net/~email6 > > and made you a co-admin. I think it's up to you to make the team the owner of > the python-email project. OK, done. (It took me a while to figure out how...I assumed that 'change details' would include everything about the project that I could change, but it turns out that changing the maintainer is a separate page.) It's not obvious from the interface whether or not this gives the team permission to update the branch, and I don't see a way to change the ownership of the branch to the team. 
--David From barry at python.org Fri Jan 22 19:10:12 2010 From: barry at python.org (Barry Warsaw) Date: Fri, 22 Jan 2010 13:10:12 -0500 Subject: [Email-SIG] Kick starting email 6.0 development In-Reply-To: <20100122180433.E7B351BC3AE@kimball.webabinitio.net> References: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> <20100118172420.029cdbf7@freewill> <20100121194525.74C6C18F90C@kimball.webabinitio.net> <20100122093015.1c1bcc3f@freewill> <20100122163659.8DB641BC378@kimball.webabinitio.net> <20100122114559.4bb73054@freewill> <20100122180433.E7B351BC3AE@kimball.webabinitio.net> Message-ID: <20100122131012.4ba93eaf@freewill> On Jan 22, 2010, at 01:04 PM, R. David Murray wrote: >OK, done. (It took me a while to figure out how...I assumed that 'change >details' would include everything about the project that I could change, >but it turns out that changing the maintainer is a separate page.) Yeah, sometimes things aren't where you expect them to be, like... >It's not obvious from the interface whether or not this gives the >team permission to update the branch, and I don't see a way to change >the ownership of the branch to the team. Click on the branch, then click on "Change details". Set the owner to the email6 team and then OK. That should be enough to allow anyone in the team to push changes to it. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 835 bytes Desc: not available URL: From rdmurray at bitdance.com Fri Jan 22 19:14:11 2010 From: rdmurray at bitdance.com (R. 
David Murray) Date: Fri, 22 Jan 2010 13:14:11 -0500 Subject: [Email-SIG] Kick starting email 6.0 development In-Reply-To: <20100122131012.4ba93eaf@freewill> References: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> <20100118172420.029cdbf7@freewill> <20100121194525.74C6C18F90C@kimball.webabinitio.net> <20100122093015.1c1bcc3f@freewill> <20100122163659.8DB641BC378@kimball.webabinitio.net> <20100122114559.4bb73054@freewill> <20100122180433.E7B351BC3AE@kimball.webabinitio.net> <20100122131012.4ba93eaf@freewill> Message-ID: <20100122181411.CB2131BC3E7@kimball.webabinitio.net> On Fri, 22 Jan 2010 13:10:12 -0500, Barry Warsaw wrote: > On Jan 22, 2010, at 01:04 PM, R. David Murray wrote: > Yeah, sometimes things aren't where you expect them to be, like... > > >It's not obvious from the interface whether or not this gives the > >team permission to update the branch, and I don't see a way to change > >the ownership of the branch to the team. > > Click on the branch, then click on "Change details". Set the owner to the > email6 team and then OK. That should be enough to allow anyone in the team > to push changes to it. Done. I didn't even see the change details link because it was down below the recent commits list. --David From barry at python.org Fri Jan 22 19:35:29 2010 From: barry at python.org (Barry Warsaw) Date: Fri, 22 Jan 2010 13:35:29 -0500 Subject: [Email-SIG] Kick starting email 6.0 development In-Reply-To: <20100122181411.CB2131BC3E7@kimball.webabinitio.net> References: <20100118204451.B5E0B1FBB0E@kimball.webabinitio.net> <20100118172420.029cdbf7@freewill> <20100121194525.74C6C18F90C@kimball.webabinitio.net> <20100122093015.1c1bcc3f@freewill> <20100122163659.8DB641BC378@kimball.webabinitio.net> <20100122114559.4bb73054@freewill> <20100122180433.E7B351BC3AE@kimball.webabinitio.net> <20100122131012.4ba93eaf@freewill> <20100122181411.CB2131BC3E7@kimball.webabinitio.net> Message-ID: <20100122133529.0854a8fa@freewill> On Jan 22, 2010, at 01:14 PM, R. 
David Murray wrote: >I didn't even see the change details link because it was down below the >recent commits list. Looks good to me. That was the easy part! :) -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 835 bytes Desc: not available URL: From rdmurray at bitdance.com Mon Jan 25 21:10:34 2010 From: rdmurray at bitdance.com (R. David Murray) Date: Mon, 25 Jan 2010 15:10:34 -0500 Subject: [Email-SIG] Thoughts on the general API, and the Header API. Message-ID: <20100125201034.190CC1BC4B4@kimball.webabinitio.net> OK, so we've agreed that we need to handle bytes and text at pretty much all API levels, and that the "original data" that informs the data structure can be either bytes or text. We want to be able to recover that original data, especially in the bytes case, but arguably also in the text case. Then there's also the issue of transforming a message once we have it in a data structure, and the consequent issue of what it means to serialize the resulting modified message. (This last comes up in a very specific way in issues 968430 and 1670765, which are about preserving the *exact* byte representation of a multipart/signed message). We've also agreed that whatever we decide to do with the __str__ and __bytes__ magic methods, they will be implemented in terms of other parts of the API. So I'll ignore those for now. 
I think we want to decide on a general API structure that is implemented at all levels and objects where it makes sense, this being the API for creating and accessing the following information about any part of the model:

* create object from bytes
* create object from text
* obtain the defect list resulting from creating the object
* serialize object to bytes
* serialize object to text
* obtain original input data
* discover the type of the original input data

At the moment I see no reason to change the API for defects (a .defects attribute on the object holding a list of defects), so I'm going to ignore that for now as well. I spent a bunch of time trying to define an API for Headers that provided methods for all of the above. As I was writing the descriptions for the various methods, and especially trying to specify the "correct" behavior for both the raw-data-is-bytes and raw-data-is-text cases (especially for the methods that serialize the data), the whole thing began to give off a bad code smell. After setting it aside for a bit, I had what I think is a little epiphany: our need is to deal with messages (and parts of messages) that could be in either bytes form or text form. The things we need to do with them are similar regardless of their form, and so we have been talking about a "dual API": one method for bytes and a parallel method for text. What if we recognize that we have two different data types, bytes messages and text messages? Then the "dual API" becomes a more uniform, almost single, API, but with two possible underlying data types.
In the context specifically of the proposed new Header object, I propose that we have a StringHeader and a BytesHeader, and an API that looks something like this:

StringHeader

properties:
    raw_header (None unless from_full_header was used)
    raw_name
    raw_value
    name
    value

__init__(name, value)
from_full_header(header)
serialize(max_line_len=78,
          newline='\n',
          use_raw_data_if_possible=False)
encode(charset='utf-8')

BytesHeader would be exactly the same, with the exception of the signature for serialize and the fact that it has a 'decode' method rather than an 'encode' method. Serialize would be different only in the fact that it would have an additional keyword parameter, must_be_7bit=True. The magic of this approach is in those encode/decode methods. Encoding a StringHeader would yield a BytesHeader containing the same data, but encoded per RFC2047 using the specified charset. Decoding a BytesHeader would yield a StringHeader with the same data, but decoded to unicode per RFC2047, with any 8bit parts decoded (in the unicode sense, not the RFC2047 sense) using the specified charset (which would default to ASCII, meaning bare 8bit bytes in headers would throw an error). (What to do with RFC2047 charsets like unknown-8bit is an open question...probably throw an error). (Encoding or decoding a Message would cause the Message to recursively encode or decode its subparts. This means you are making a complete new copy of the Message in memory. If you don't want to do that you can walk the Message and convert it piece by piece (we could provide a generator that does this).) raw_header would be the data passed in to the constructor if from_full_header is used, and None otherwise. If encode/decode call the regular constructor, then this attribute would also act as a flag as to whether or not the header was constructed from raw input data or via program.
raw_name and raw_value would be the fieldname and fieldbody, either what was passed in to the __init__ constructor, or the result of splitting what was passed to the from_full_header constructor on the first ':'. (These are convenience attributes and are not essential to the proposed API). name would be the fieldname stripped of trailing whitespace. value would be the *unfolded* fieldbody stripped of leading and trailing whitespace (but with internal whitespace intact). As for serialize, my thought here is that every object in the tree has a serialize method with the same signature, and serialization is a matter of recursively passing the specified parameters downward. max_line_len is obvious, and defaults to the RFC recommended max. (If you want the unfolded header, use its .value attribute). newline resolves issue 1349106, allowing an email package client to generate completely wire-format messages if it needs to. use_raw_data_if_possible would mean to emit the original raw data if it exists (modulo changing the flavor of newline if needed, for those object types (such as headers) where that makes sense). The serialize method of specific sub-types can do specialized things (eg: multipart/signed can make use_raw_data_if_possible default to True). For Bytes types, the extra 'must_be_7bit' flag would cause any 8bit data to be transport encoded to be 7bit clean. (For headers, this would mean raw 8bit data would get the charset 'unknown-8bit', and we might want to provide more control over that in some way: an error and way to provide an error handler, or some other way to specify a charset to use for such encodings.) use_raw_data_if_possible would cause this flag to be ignored when raw data was available for the object. (If you want the text version of the transport-encoded message for some reason, you can serialize the Bytes form using must_be_7bit and decode the result as ASCII.) 
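A rough sketch of how these two classes and a shared base might fit together. Only the names (StringHeader, BytesHeader, from_full_header, serialize, encode/decode, and the raw_* attributes) come from the proposal above; the method bodies are illustrative assumptions. Folding, defect recording, and must_be_7bit handling are elided, and the RFC 2047 work is delegated to the existing email.header module purely for demonstration:

```python
# Sketch of the proposed StringHeader/BytesHeader API (assumptions noted
# in the surrounding text; not the actual proposed implementation).
import email.header


class _BaseHeader:
    def __init__(self, name, value):
        self.raw_header = None        # set only by from_full_header
        self.raw_name = name
        self.raw_value = value

    @classmethod
    def from_full_header(cls, header):
        # Split on the first ':'; never raises -- a real implementation
        # would record defects on the object instead.
        name, _, value = header.partition(cls._colon)
        obj = cls(name, value)
        obj.raw_header = header
        return obj

    @property
    def name(self):
        # Fieldname stripped of trailing whitespace.
        return self.raw_name.rstrip()

    @property
    def value(self):
        # Unfolded fieldbody, stripped of leading/trailing whitespace.
        return self._unfold(self.raw_value).strip()


class StringHeader(_BaseHeader):
    _colon = ':'

    @staticmethod
    def _unfold(value):
        return value.replace('\r\n', '').replace('\n', '')

    def serialize(self, max_line_len=78, newline='\n',
                  use_raw_data_if_possible=False):
        if use_raw_data_if_possible and self.raw_header is not None:
            return self.raw_header
        return '%s: %s%s' % (self.name, self.value, newline)

    def encode(self, charset='utf-8'):
        # Yield a BytesHeader with non-ASCII words RFC 2047-encoded.
        wire = email.header.Header(self.value, charset=charset).encode()
        return BytesHeader(self.name.encode('ascii'), wire.encode('ascii'))


class BytesHeader(_BaseHeader):
    _colon = b':'

    @staticmethod
    def _unfold(value):
        return value.replace(b'\r\n', b'').replace(b'\n', b'')

    def serialize(self, max_line_len=78, newline=b'\n',
                  use_raw_data_if_possible=False, must_be_7bit=True):
        # must_be_7bit handling is elided in this sketch.
        if use_raw_data_if_possible and self.raw_header is not None:
            return self.raw_header
        return self.name + b': ' + self.value + newline

    def decode(self, charset='ascii'):
        # Yield a StringHeader; bare bytes are decoded with `charset`
        # (ASCII by default, so stray 8bit data raises an error).
        text = self.value.decode(charset)
        parts = email.header.decode_header(text)
        return StringHeader(self.name.decode('ascii'),
                            str(email.header.make_header(parts)))
```

Note how the "dual API" collapses: both types share from_full_header, name, and value, and only the encode/decode pair crosses between them.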
Subclasses of these classes for structured headers would have additional methods that would return either specialized object types (datetimes, address objects) or bytes/strings, and these may or may not exist in both Bytes and String forms (that depends on the use cases, I think). I also think that the Bytes and Strings versions of objects that have them can share large portions of their implementation through a base class. I think that makes this approach both easier to code than a single-type-dual-API approach, and more robust in the face of changes. So, those are my thoughts, and I'm sure I haven't thought of all the corner cases. The biggest question is, does it seem like this general scheme is worth pursuing? --David From v+python at g.nevcal.com Tue Jan 26 01:55:15 2010 From: v+python at g.nevcal.com (Glenn Linderman) Date: Mon, 25 Jan 2010 16:55:15 -0800 Subject: [Email-SIG] Thoughts on the general API, and the Header API. In-Reply-To: <20100125201034.190CC1BC4B4@kimball.webabinitio.net> References: <20100125201034.190CC1BC4B4@kimball.webabinitio.net> Message-ID: <4B5E3D73.1070900@g.nevcal.com> On approximately 1/25/2010 12:10 PM, came the following characters from the keyboard of R. David Murray: > So, those are my thoughts, and I'm sure I haven't thought of all the > corner cases. The biggest question is, does it seem like this general > scheme is worth pursuing? Moving your last question to the front, yes. And of course, we do need to think through most of the corner cases before absolutely committing to this approach. But it sounds viable, and avoids an awful lot of duplicate APIs, and would allow simple email clients to be written primarily or even fully in bytes or primarily or even fully in strings. A simple email client that is written fully in strings would "simply" reject/bounce messages that cannot be decoded to strings. 
This is simple; it works for 100% properly encoded messages; in an environment where a client is coded to process messages from some generator, once they are both debugged to the extent of generating messages that can be consumed, then all is well, and no messages would be rejected. This would not be an appropriate model for a general email server; while I'd like to see a popular mailing list submission client that would bounce messages that are improperly formed -- forcing contributors to use RFC conformant clients, and thus encouraging the of those clients that are not RFC conformant, but I'm not going to hold my breath. I think there can be enough power in an API designed in this manner to allow the full nitty-gritty access as required. I have some questions and concerns; I haven't thought through all of them; perhaps some of them are corner cases, if so, they are corner cases that are particularly interesting to me. > OK, so we've agreed that we need to handle bytes and text at pretty > much all API levels, and that the "original data" that informs the data > structure can be either bytes or text. We want to be able to recover > that original data, especially in the bytes case, but arguably also in > the text case. > > Then there's also the issue of transforming a message once we have it in > a data structure, and the consequent issue of what it means to serialize > the resulting modified message. (This last comes up in a very specific > way in issues 968430 and 1670765, which are about preserving the *exact* > byte representation of a multipart/signed message). > > We've also agreed that whatever we decide to do with the __str__ and > __bytes__ magic methods, they will be implemented in terms of other > parts of the API. So I'll ignore those for now. 
> > I think we want to decide on a general API structure that is implemented > at all levels and objects where it makes sense, this being the API > for creating and accessing the following information about any part of > the model: > > * create object from bytes > * create object from text > * obtain the defect list resulting from creating the object > * serialize object to bytes > * serialize object to text > * obtain original input data > * discover the type of the original input data > > At the moment I see no reason to change the API for defects (a .defects > attribute on the object holding a list of defects), so I'm going to > ignore that for now as well. > > I spent a bunch of time trying to define an API for Headers that provided > methods for all of the above. As I was writing the descriptions for > the various methods, and especially trying to specify the "correct" > behavior for both the raw-data-is-bytes and raw-data-is-text cases > (especially for the methods that serialize the data), the whole thing > began to give off a bad code smell. > > After setting it aside for a bit, I had what I think is a little epiphany: > our need is to deal with messages (and parts of messages) that could be > in either bytes form or text form. The things we need to do with them > are similar regardless of their form, and so we have been talking about a > "dual API": one method for bytes and a parallel method for text. > > What if we recognize that we have two different data types, bytes messages > and text messages? Then the "dual API" becomes a more uniform, almost > single, API, but with two possible underlying data types. 
> > In the context specifically of the proposed new Header object, I propose > that we have a StringHeader and a BytesHeader, and an API that looks > something like this: > > StringHeader > > properties: > raw_header (None unless from_full_header was used) > raw_name > raw_value > name > value > > __init__(name, value) > from_full_header(header) > serialize(max_line_len=78, > newline='\n', > use_raw_data_if_possible=False) > encode(charset='utf-8') > If it was stated, I missed it: is from_full_header a way of producing an object from a raw data value? Whereas __init__ would obviously be used to produce one from string or bytes values. If so, then it would be a requirement that this from_full_header API would never produce an exception? Rather it would produce an object with or without defects? Are there any other *Header APIs that would be required not to produce exceptions? I don't yet perceive any. The "charset" parameter... is that not mostly needed for data parts? Headers are either ASCII, or contain self-describing charset info. I guess I could see an intermediate decode from string to some charset, before serialization, as a hint that when generating headers, that all the characters in the header that are not ASCII are in the specified charset... and that that charset is the one to be used in the self-describing serialized ASCII stream? The full generality of the RFCs, however, allows pieces of headers to be encoded using different charsets... with this API, it would seem that that could only be created containing one charset... the serialization primitives were made available, so that piecewise construction of a header value could be done with different charsets, and then the from_full_header API used to create the complex value. I don't see this as a severe limitation, I just want to understand your intention, and document the limitation, or my misunderstanding. 
> BytesHeader would be exactly the same, with the exception of the signature > for serialize and the fact that it has a 'decode' method rather than an > 'encode' method. Serialize would be different only in the fact that > it would have an additional keyword parameter, must_be_7bit=True. > I am not clear on why StringHeader's serialize would not need the must_be_7bit parameter... or do I misunderstand that StringHeader.serialize produces wire-format data? > The magic of this approach is in those encode/decode methods. > > Encoding a StringHeader would yield a BytesHeader containing the same > data, but encoded per RFC2047 using the specified charset. Decoding a > BytesHeader would yield a StringHeader with the same data, but decoded to > unicode per RFC2047, with any 8bit parts decoded (in the unicode sense, > not the RFC2047 sense) using the specified charset (which would default to > ASCII, meaning bare 8bit bytes in headers would throw an error). (What to > with RFC2047 charsets like unknown-8bit is an open question...probably > throw an error). > Would the encoding to/from StringHeader/BytesHeader preserve the from_full_header state and value? > (Encoding or decoding a Message would cause the Message to recursively > encode or decode its subparts. This means you are making a complete > new copy of the Message in memory. If you don't want to do that you > can walk the Message and convert it piece by piece (we could provide a > generator that does this).) > Walking it piece by piece would allow the old pieces to be discarded, to save total memory consumption, where that is appropriate. Perhaps one generator that would be commonly used, would be to convert headers only, and leave MIME data parts alone, accessing and converting them only with the registered methods? This would mean that a "complete copy" wouldn't generally be very big, if the data parts were excluded from implicit conversion. 
Perhaps the "external storage protocol" might also only be defined for MIME data parts, and walking the tree with this generator would not need to reference the MIME data parts, nor bring them in from "external storage". > raw_header would be the data passed in to the constructor if > from_full_header is used, and None otherwise. If encode/decode call > the regular constructor, then this attribute would also act as a flag > as to whether or not the header was constructed from raw input data > or via program. > This _implies_ that from_full_header always accepts raw data bytes... even for the StringHeader. And that implies the need for an implicit decode, and therefore, perhaps a charset parameter? No, not a charset parameter, since they are explicitly contained in the header values. Decode for header values may not need a charset value at all! No comments for the rest. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From rdmurray at bitdance.com Tue Jan 26 03:51:46 2010 From: rdmurray at bitdance.com (R. David Murray) Date: Mon, 25 Jan 2010 21:51:46 -0500 Subject: [Email-SIG] Thoughts on the general API, and the Header API. In-Reply-To: <4B5E3D73.1070900@g.nevcal.com> References: <20100125201034.190CC1BC4B4@kimball.webabinitio.net> <4B5E3D73.1070900@g.nevcal.com> Message-ID: <20100126025146.CF0AC1BC2FF@kimball.webabinitio.net> On Mon, 25 Jan 2010 16:55:15 -0800, Glenn Linderman wrote: > On approximately 1/25/2010 12:10 PM, came the following characters from > the keyboard of R. David Murray: > > So, those are my thoughts, and I'm sure I haven't thought of all the > > corner cases. The biggest question is, does it seem like this general > > scheme is worth pursuing? > > If it was stated, I missed it: is from_full_header a way of producing > an object from a raw data value? Whereas __init__ would obviously be Yes. 
> used to produce one from string or bytes values. If so, then it would Well, StringHeader.from_full_header would take a string as input, while BytesHeader.from_full_header would take bytes as input. __init__ would be used to construct a header in your program: StringHeader('MyHeader', 'my value') BytesHeader(b'MyHeader', b'my value'). > be a requirement that this from_full_header API would never produce an > exception? Rather it would produce an object with or without defects? Yes. > Are there any other *Header APIs that would be required not to produce > exceptions? I don't yet perceive any. I don't think so. from_full_header is the only one involved in parsing raw data. Whether __init__ throws errors or records defects is an open question, but I lean toward it throwing errors. The reason there is an open question is because an email manipulating application may want to convert to text to process an incoming message, and there are things that a BytesHeader can hold that would cause errors when encoded to a StringHeader (specifically, 8 bit bytes that aren't transfer encoded). So it may be that decode, at least, should not throw errors but instead record additional defects in the resulting StringHeader. I think that even in that case __init__ should still throw errors, though; decode could deal with the defects before calling StringHeader.__init__, or (more likely) catch the errors thrown by __init__, fix/record the defects, and call it again. Note, by the way, that by 'raw data' I mean what you are feeding in. Raw data fed to a BytesHeader would be bytes, but raw data fed to a StringHeader would be text (eg: if read from a file in text mode). > The "charset" parameter... is that not mostly needed for data parts? No, if you start with a unicode string in a StringHeader, you need to know what charset to encode the unicode to and therefore to specify as the charset in the RFC 2047 encoded words. > Headers are either ASCII, or contain self-describing charset info.
That's true for BytesHeaders, but not for StringHeaders. So as I said above charset for StringHeader says which charset to put into the encoded words when converting to BytesHeader form. I specified a charset parameter for 'decode' only to handle the case of raw bytes data that contains 8 bit data that is not in encoded words (ie: is not RFC compliant). I am visualizing this as satisfying a use case where you have non-email (non RFC compliant) data where you allow 8 bit data in the header bodies because it's an internal app and you know the encoding. You can then use decode(charset) to decode those BytesHeaders into StringHeaders. > I guess I could see an intermediate decode from string to some charset, > before serialization, as a hint that when generating headers, that all > the characters in the header that are not ASCII are in the specified > charset... and that that charset is the one to be used in the > self-describing serialized ASCII stream? The full generality of the Exactly. > RFCs, however, > allows pieces of headers to be encoded using different charsets... with > this API, it would seem that that could only be created containing one > charset... the serialization primitives were made available, so that > piecewise construction of a header value could be done with different > charsets, and then the from_full_header API used to create the complex > value. I don't see this as a severe limitation, I just want to > understand your intention, and document the limitation, or my > misunderstanding. Right. I'm visualizing the "normal case" being encoding a StringHeader using the default utf-8 charset or another specified charset, turning the words containing non-ASCII characters into encoded words using that charset.
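The RFC 2047 transformation being described here, unicode words becoming charset-tagged encoded words on the wire and back, can be illustrated with the stdlib's existing email.header utilities (these are real, current calls, independent of the proposed new API):

```python
from email.header import Header, decode_header, make_header

# String -> bytes direction: non-ASCII text becomes an RFC 2047
# encoded word in the chosen charset.
wire = Header('Björn', charset='utf-8').encode()
print(wire)  # =?utf-8?b?QmrDtnJu?=

# Bytes -> string direction: decode_header parses encoded words into
# (bytes, charset) chunks; make_header reassembles the unicode value.
print(str(make_header(decode_header(wire))))  # Björn
```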
The utility methods that turn unicode into encoded words would be exposed, and an application that needs to create a header with mixed charsets can use those utilities to build RFC compliant bytes data and pass that to one of the BytesHeader constructors. (Make the common case easy, and the complicated cases possible.) > > BytesHeader would be exactly the same, with the exception of the signature > > for serialize and the fact that it has a 'decode' method rather than an > > 'encode' method. Serialize would be different only in the fact that > > it would have an additional keyword parameter, must_be_7bit=True. > > I am not clear on why StringHeader's serialize would not need the > must_be_7bit parameter... or do I misunderstand that > StringHeader.serialize produces wire-format data? The latter. StringHeader serialize does not produce wire-format data, it produces text (for example, for display to the user). If you want wire format, you encode the StringHeader and use the resulting BytesHeader serialize. > > The magic of this approach is in those encode/decode methods. > > > > Encoding a StringHeader would yield a BytesHeader containing the same > > data, but encoded per RFC2047 using the specified charset. Decoding a > > BytesHeader would yield a StringHeader with the same data, but decoded to > > unicode per RFC2047, with any 8bit parts decoded (in the unicode sense, > > not the RFC2047 sense) using the specified charset (which would default to > > ASCII, meaning bare 8bit bytes in headers would throw an error). (What to > > with RFC2047 charsets like unknown-8bit is an open question...probably > > throw an error). > > > > Would the encoding to/from StringHeader/BytesHeader preserve the > from_full_header state and value? My thought is no. Once you encode/decode the header, your program has transformed it, and I think it is better to treat the original raw data as gone. 
The motivation for this is that the 'raw data' of a StringHeader is the *text* string used to create it. Keeping a bytes string 'raw data' around as well would get us back into the mess that I developed this approach to avoid, where we'd need to specify carefully the difference between handing a header whose 'original' raw data was bytes vs string, for each of the BytesHeader and StringHeader cases. Better, I think, to put the (small) burden on the application programmer: if you want to preserve the original input data, do so by keeping the original object around. Once you mutate the object model, the original raw data for the mutated piece is gone. There are some use-case questions here, though, with regards to preservation of as much original information/format as possible, and how valuable that is. I think we'll have to figure that out by examining concrete use cases in detail. (It is not something that the current email package supports very well, by the way...headers currently get modified significantly in the parse/generate cycle, even without bytes-to-string transformations happening.) > > (Encoding or decoding a Message would cause the Message to recursively > > encode or decode its subparts. This means you are making a complete > > new copy of the Message in memory. If you don't want to do that you > > can walk the Message and convert it piece by piece (we could provide a > > generator that does this).) > > Walking it piece by piece would allow the old pieces to be discarded, to > save total memory consumption, where that is appropriate. > > Perhaps one generator that would be commonly used, would be to convert > headers only, and leave MIME data parts alone, accessing and converting > them only with the registered methods? This would mean that a "complete > copy" wouldn't generally be very big, if the data parts were excluded > from implicit conversion. 
Perhaps the "external storage protocol" might also only be defined for MIME data parts, and walking the tree with this generator would not need to reference the MIME data parts, nor bring them in from "external storage". That's true. The Bytes and String versions of binary MIME parts, which are likely to be the large ones, will probably have a common representation for the payload, and could potentially point to the same object. That breaking of the expectation that 'encode' and 'decode' return new objects (in analogy to how encode and decode of strings/bytes works) might not be a good thing, though. In any case, text MIME parts have the same bytes vs string issues as headers do, and should, IMO, be converted from one to the other on encode/decode. Another possible approach would be some sort of 'encode/decode on demand' system, although that would need to retain a pointer to the original object, which might get us into suboptimal reference cycle difficulties. These bits are implementation details, though, and don't affect the API design. > > raw_header would be the data passed in to the constructor if > > from_full_header is used, and None otherwise. If encode/decode call > > the regular constructor, then this attribute would also act as a flag > > as to whether or not the header was constructed from raw input data > > or via program. > > > > This _implies_ that from_full_header always accepts raw data bytes... > even for the StringHeader. And that implies the need for an implicit > decode, and therefore, perhaps a charset parameter? No, not a charset > parameter, since they are explicitly contained in the header values. Your confusion was my confusing use of the term 'raw data' to mean whatever was input to the from_full_header constructor, which is bytes for a BytesHeader and text for a StringHeader. > Decode for header values may not need a charset value at all! Normally it would not. charset would be useful in decode only for non-RFC compliant headers.
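The non-RFC-compliant case being discussed can be shown with plain bytes (the X-Note header and latin-1 choice are hypothetical, and this uses bare bytes.decode rather than the proposed BytesHeader.decode): a header body carrying bare 8-bit bytes fails under the proposed ASCII default, but decodes cleanly once the application supplies the charset it knows was used.

```python
# A bare-8bit header body, as might come from a non-RFC-compliant
# internal app (hypothetical example, not the proposed API itself).
raw = b'X-Note: caf\xe9 report'

name, _, body = raw.partition(b':')

# Under the proposed default, decode(charset='ascii') would fail:
try:
    body.decode('ascii')
except UnicodeDecodeError:
    print('bare 8bit bytes: error under the default ASCII charset')

# decode(charset='latin-1') succeeds because the app knows the encoding:
print(body.decode('latin-1').strip())  # café report
```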
> No comments for the rest. Thanks for your feedback. --David

From v+python at g.nevcal.com Tue Jan 26 05:10:47 2010 From: v+python at g.nevcal.com (Glenn Linderman) Date: Mon, 25 Jan 2010 20:10:47 -0800 Subject: [Email-SIG] Thoughts on the general API, and the Header API. In-Reply-To: <20100126025146.CF0AC1BC2FF@kimball.webabinitio.net> References: <20100125201034.190CC1BC4B4@kimball.webabinitio.net> <4B5E3D73.1070900@g.nevcal.com> <20100126025146.CF0AC1BC2FF@kimball.webabinitio.net> Message-ID: <4B5E6B47.9090307@g.nevcal.com>

On approximately 1/25/2010 6:51 PM, came the following characters from the keyboard of R. David Murray: > On Mon, 25 Jan 2010 16:55:15 -0800, Glenn Linderman wrote: > >> Are there any other *Header APIs that would be required not to produce >> exceptions? I don't yet perceive any. >> > I don't think so. from_full_header is the only one involved in parsing > raw data. Whether __init__ throws errors or records defects is an open > question, but I lean toward it throwing errors. The reason there is an > open question is because an email manipulating application may want to > convert to text to process an incoming message, and there are things > that a BytesHeader can hold that would cause errors when encoded to a > StringHeader (specifically, 8 bit bytes that aren't transfer encoded). > So it may be that decode, at least, should not throw errors but instead > record additional defects in the resulting StringHeader. I think that > even in that case __init__ should still throw errors, though; decode > could deal with the defects before calling StringHeader.__init__, or > (more likely) catch the errors thrown by __init__, fix/record the defects, > and call it again. > > Note, by the way, that by 'raw data' I mean what you are feeding in. > Raw data fed to a BytesHeader would be bytes, but raw data fed to > a StringHeader would be text (eg: if read from a file in text mode).
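The "decode records defects rather than raising" behavior David describes could look roughly like this (the function name and defect string are made up for illustration):

```python
def decode_header_value(raw_bytes, charset="ascii"):
    """Sketch: decode header bytes to text; instead of raising on
    non-transfer-encoded 8-bit bytes, fix them up and record a defect
    (names here are illustrative, not the actual proposal)."""
    defects = []
    try:
        text = raw_bytes.decode("ascii")
    except UnicodeDecodeError:
        # Fix the problem, record it as a defect, and carry on.
        defects.append("non-ASCII bytes in header")
        text = raw_bytes.decode(charset, errors="replace")
    return text, defects

# A header containing bare 8-bit bytes decodes anyway, with a defect noted:
text, defects = decode_header_value(b"caf\xc3\xa9", charset="utf-8")
assert text == "caf\u00e9" and defects == ["non-ASCII bytes in header"]
```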
> Glad you clarified that; it wasn't obvious, without typed parameters to the APIs. I had assumed that serialize and from_full_header would produce/consume bytes, and I think that showed up in my comments, and you've probably addressed that below. Of course, the reason that I assumed that, is that there are no RFCs to describe a string format email message, either on the wire, in memory, or, particularly, stored in a file. So it is really up to the application to define that, if it wants that. Now since py3 has a natural string format manipulation capability, and since the emaillib wants to provide the interface between them, I suppose it is a somewhat obvious thing that you might want to store a whole email message in string format... I say somewhat obvious, because you thought of it, but I didn't, until you clarified the above. Perhaps the reason I didn't think of it, is simply that all the currently used email message storage containers of which I am aware use wire format. So using string format for that purpose would require inventing a new storage container (perhaps a trivial extension of an existing one, but new, nonetheless). I sort of expected email clients would, given the capabilities of the emaillib, simply continue to save/read in wire format. In fact, it may be the only choice of format that can completely preserve raw format messages for later processing, in the presence of defects. >> The "charset" parameter... is that not mostly needed for data parts? >> > No, if you start with a unicode string in a StringHeader, you need to > know what charset to encode the unicode to and therefore to specify as > the charset in the RFC 2047 encoded words. > > >> Headers are either ASCII, or contain self-describing charset info. >> > That's true for BytesHeaders, but not for StringHeaders. So as I > said above charset for StringHeader says which charset to put into > the encoded words when converting to BytesHeader form. 
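A minimal sketch of the kind of encoded-word utility being discussed, with the charset chosen at encode time (the helper name is hypothetical, and real RFC 2047 handling has more rules around line length and token boundaries):

```python
import base64

def encode_word(text, charset="utf-8"):
    # Build an RFC 2047 'B' (base64) encoded word for a non-ASCII token.
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?b?{payload}?="

# Non-ASCII words get encoded with the specified charset;
# plain ASCII words can pass through untouched.
parts = [w if w.isascii() else encode_word(w) for w in "hello héllo".split()]
assert " ".join(parts) == "hello =?utf-8?b?aMOpbGxv?="
```

An application needing mixed charsets could call such a utility per-word with different charsets and hand the assembled bytes to a BytesHeader constructor, as described above.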
> > I specified a charset parameter for 'decode' only to handle the case > of raw bytes data that contains 8 bit data that is not in encoded words > (ie: is not RFC compliant). I am visualizing this as satisfying a use > case where you have non-email (non RFC compliant) data where you allow > 8 bit data in the header bodies because it's an internal app and you > know the encoding. You can then use decode(charset) to decode those > BytesHeaders into StringHeaders. > > >> I guess I could see an intermediate decode from string to some charset, >> before serialization, as a hint that when generating headers, that all >> the characters in the header that are not ASCII are in the specified >> charset... and that that charset is the one to be used in the >> self-describing serialized ASCII stream? The full generality of the >> > Exactly. > OK, I'm with you now on the charset parameter, for encoding and decoding. >> RFCs, however, >> allows pieces of headers to be encoded using different charsets... with >> this API, it would seem that that could only be created containing one >> charset... the serialization primitives were made available, so that >> piecewise construction of a header value could be done with different >> charsets, and then the from_full_header API used to create the complex >> value. I don't see this as a severe limitation, I just want to >> understand your intention, and document the limitation, or my >> misunderstanding. >> > Right. I'm visualizing the "normal case" being encoding a StringHeader > using the default utf-8 charset or another specified charset, turning > the words containing non-ASCII characters into encoded words using that > charset. The utility methods that turn unicode into encoded words would > be exposed, and an application that needs to create a header with mixed > charsets can use those utilities to build RFC compliant bytes data and > pass that to one of the BytesHeader constructors.
(Make the common case > easy, and the complicated cases possible.) > Thanks for this clarification also. >>> BytesHeader would be exactly the same, with the exception of the signature >>> for serialize and the fact that it has a 'decode' method rather than an >>> 'encode' method. Serialize would be different only in the fact that >>> it would have an additional keyword parameter, must_be_7bit=True. >>> >> I am not clear on why StringHeader's serialize would not need the >> must_be_7bit parameter... or do I misunderstand that >> StringHeader.serialize produces wire-format data? >> > The latter. StringHeader serialize does not produce wire-format data, > it produces text (for example, for display to the user). If you want > wire format, you encode the StringHeader and use the resulting BytesHeader > serialize. > OK, I'm with you here now too. So it may be nice to have a recursive operation that would convert String format stuff to Bytes and then to wire format, in one go, discarding the intermediate Bytes format stuff along the way to avoid three copies of the data, for simple email clients that only use the String format interfaces. >>> The magic of this approach is in those encode/decode methods. >>> >>> Encoding a StringHeader would yield a BytesHeader containing the same >>> data, but encoded per RFC2047 using the specified charset. Decoding a >>> BytesHeader would yield a StringHeader with the same data, but decoded to >>> unicode per RFC2047, with any 8bit parts decoded (in the unicode sense, >>> not the RFC2047 sense) using the specified charset (which would default to >>> ASCII, meaning bare 8bit bytes in headers would throw an error). (What to >>> do with RFC2047 charsets like unknown-8bit is an open question...probably >>> throw an error). >>> >> Would the encoding to/from StringHeader/BytesHeader preserve the >> from_full_header state and value? >> > My thought is no.
Once you encode/decode the header, your program has > transformed it, and I think it is better to treat the original raw data > as gone. The motivation for this is that the 'raw data' of a StringHeader > is the *text* string used to create it. Keeping a bytes string 'raw data' > around as well would get us back into the mess that I developed this > approach to avoid, where we'd need to specify carefully the difference > between handling a header whose 'original' raw data was bytes vs string, > for each of the BytesHeader and StringHeader cases. Better, I think, > to put the (small) burden on the application programmer: if you want to > preserve the original input data, do so by keeping the original object > around. Once you mutate the object model, the original raw data for > the mutated piece is gone. > > There are some use-case questions here, though, with regards to > preservation of as much original information/format as possible, and how > valuable that is. I think we'll have to figure that out by examining > concrete use cases in detail. (It is not something that the current email > package supports very well, by the way...headers currently get modified > significantly in the parse/generate cycle, even without bytes-to-string > transformations happening.) > Not every transformation is intended to be a change. Until there is a change, it would be nice to be able to retain the original byte stream, for invertibility, without requiring that a simple email client deal with bytes interfaces for RFC conformant messages. I hear you regarding the mess... here's a brainstorming idea, tossed out mostly to get your creative juices flowing in this direction, not because I think it is "definitely the way to go". The decode API could, in addition to your description, have an option to preserve itself and the decode charset, within the String object...
If encode "discovers" a preserved Bytes object, and the same charset is provided, it would return the preserved Bytes object, rather than creating a new one. There may be no need to drop the Bytes object explicitly; as it seems the only API for making changes to a Header object is to create a new one, and substitute the new one for the old one. Or maybe from_full_header does a modify. Or maybe the properties are assignable (that is not explicitly stated, by the way). So if there are modify operations, they should drop the Bytes object. >>> (Encoding or decoding a Message would cause the Message to recursively >>> encode or decode its subparts. This means you are making a complete >>> new copy of the Message in memory. If you don't want to do that you >>> can walk the Message and convert it piece by piece (we could provide a >>> generator that does this).) >>> >> Walking it piece by piece would allow the old pieces to be discarded, to >> save total memory consumption, where that is appropriate. >> >> Perhaps one generator that would be commonly used, would be to convert >> headers only, and leave MIME data parts alone, accessing and converting >> them only with the registered methods? This would mean that a "complete >> copy" wouldn't generally be very big, if the data parts were excluded >> from implicit conversion. Perhaps the "external storage protocol" might >> also only be defined for MIME data parts, and walking the tree with this >> generator would not need to reference the MIME data parts, nor bring >> them in from "external storage". >> > That's true. The Bytes and String versions of binary MIME parts, > which are likely to be the large ones, will probably have a common > representation for the payload, and could potentially point to the same > object. That breaking of the expectation that 'encode' and 'decode' > return new objects (in analogy to how encode and decode of strings/bytes > works) might not be a good thing, though.
> Well, one generator could provide the expectation that everything is new; another could provide different expectations. The differences between them, and the tradeoffs would be documented, of course, were both provided. I'm not convinced that treating headers and data exactly the same at all times is a good thing... a convenient option at times, perhaps, but I can see it as a serious inefficiency in many use cases involving large data. This deserves a bit more thought/analysis/discussion, perhaps. More than I have time for tonight, but I may reply again, perhaps after others have responded, if they do. > In any case, text MIME parts have the same bytes vs string issues as > headers do, and should, IMO, be converted from one to the other on > encode/decode. > To me, your first phrase implies that they should share common encode/decode routines, but not the other. I can clearly see a use case where your opinion is the right approach, but I think there are use cases where it might not be... while text MIME parts are generally smaller than binary MIME parts, that is neither a requirement, nor always true (think about transferring an XML format database... could be huge... and is text of sorts -- human decipherable, more easily than hex dumps, but not what I would call "human readable"). > Another possible approach would be some sort of 'encode/decode on demand' > system, although that would need to retain a pointer to the original > object, which might get us into suboptimal reference cycle difficulties. > Hmm. Brainstorming again. decode could minimally create the String format object, with only the Bytes format object and charset parameter set (from the above brainstorming idea). Then the real decoding could be done if the properties are accessed. If the properties are not accessed (because the client/application makes its decisions based on access to other components of the email), the decoding need never be done for some objects. 
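The decode-on-demand brainstorm could be sketched like this (names are invented; a real version would also have to worry about the reference-cycle issue mentioned above):

```python
class LazyStringHeader:
    """Sketch: hold the bytes form and charset at construction time,
    and only decode when the text value is first accessed."""

    def __init__(self, raw_bytes, charset="ascii"):
        self._raw = raw_bytes
        self._charset = charset
        self._text = None  # decoded lazily, then cached

    @property
    def value(self):
        if self._text is None:
            self._text = self._raw.decode(self._charset)
        return self._text

h = LazyStringHeader(b"Subject: caf\xc3\xa9", charset="utf-8")
# No decoding has happened yet:
assert h._text is None
# First property access triggers (and caches) the decode:
assert h.value == "Subject: caf\u00e9"
assert h._text is not None
```

If the application never touches `value`, the decode never runs, which is exactly the saving being discussed for large MIME data parts.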
Perhaps this would also neatly deal with my desire to delay the decode of MIME data parts as well? > These bits are implementation details, though, and don't affect the API > design. > Well, one impact of the above brainstorming would be an interface to create the StringHeader containing the BytesHeader and charset parameters. Or maybe that would be a private interface, not considered to be part of the API? >>> raw_header would be the data passed in to the constructor if >>> from_full_header is used, and None otherwise. If encode/decode call >>> the regular constructor, then this attribute would also act as a flag >>> as to whether or not the header was constructed from raw input data >>> or via program. >>> >>> >> This _implies_ that from_full_header always accepts raw data bytes... >> even for the StringHeader. And that implies the need for an implicit >> decode, and therefore, perhaps a charset parameter? No, not a charset >> parameter, since they are explicitly contained in the header values. >> > Your confusion was my confusing use of the term 'raw data' to mean > whatever was input to the from_full_header constructor, which is > bytes for a BytesHeader and text for a StringHeader. > If we are going to invent a new "string format raw data" element, maybe we should invent a term to describe it, also... maybe "raw data" should be split into "raw bytes" and "raw string", and "raw data" become a synonym for "raw bytes", as that is what it was historically?

-- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

From v+python at g.nevcal.com Fri Jan 29 03:20:24 2010 From: v+python at g.nevcal.com (Glenn Linderman) Date: Thu, 28 Jan 2010 18:20:24 -0800 Subject: [Email-SIG] Thoughts on the general API, and the Header API.
In-Reply-To: <4B5E6B47.9090307@g.nevcal.com> References: <20100125201034.190CC1BC4B4@kimball.webabinitio.net> <4B5E3D73.1070900@g.nevcal.com> <20100126025146.CF0AC1BC2FF@kimball.webabinitio.net> <4B5E6B47.9090307@g.nevcal.com> Message-ID: <4B6245E8.3060402@g.nevcal.com> On approximately 1/25/2010 8:10 PM, came the following characters from the keyboard of Glenn Linderman: >> That's true. The Bytes and String versions of binary MIME parts, >> which are likely to be the large ones, will probably have a common >> representation for the payload, and could potentially point to the same >> object. That breaking of of the expectation that 'encode' and 'decode' >> return new objects (in analogy to how encode and decode of strings/bytes >> works) might not be a good thing, though. > > Well, one generator could provide the expectation that everything is > new; another could provide different expectations. The differences > between them, and the tradeoffs would be documented, of course, were > both provided. I'm not convinced that treating headers and data > exactly the same at all times is a good thing... a convenient option > at times, perhaps, but I can see it as a serious inefficiency in many > use cases involving large data. > > This deserves a bit more thought/analysis/discussion, perhaps. More > than I have time for tonight, but I may reply again, perhaps after > others have responded, if they do. I guess no one else is responding here at the moment. Read the ideas below, and then afterward, consider building the APIs you've suggested on top of them. And then, with the full knowledge that the messages may be either in fast or slow storage, I think that you'll agree that converting the whole tree in one swoop isn't always appropriate... all headers, probably could be. Data, because of its size, should probably be done on demand. 
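The "convert headers now, data on demand" idea could be sketched as a generator along these lines (the dict-based message structure is purely illustrative, standing in for whatever tree the library would actually build):

```python
def walk_decoding_headers(part):
    """Sketch of a tree-walking generator that converts header bytes to
    text in place, while leaving (possibly large) payloads untouched."""
    part["headers"] = {k: v.decode("ascii") for k, v in part["headers"].items()}
    yield part
    for sub in part.get("subparts", []):
        yield from walk_decoding_headers(sub)

msg = {
    "headers": {"Subject": b"hi"},
    "payload": b"\x00" * 10,  # stays bytes: never converted
    "subparts": [
        {"headers": {"Content-Type": b"image/jpeg"}, "payload": b"..."},
    ],
}
converted = list(walk_decoding_headers(msg))
assert converted[0]["headers"]["Subject"] == "hi"
assert isinstance(msg["payload"], bytes)  # payload left alone
```

Because only headers are touched, the "complete copy" stays small even for messages with huge data parts.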
In earlier discussions about the registry, there was the idea of having a registry for transport encoding handling, and a registry for MIME encoding handling. There were also vague comments about doing an external storage protocol "somehow", but it was a vague concept to be defined later, or at least I don't recall any definitions. Given a raw bytes representation of an incoming email, mail servers need to choose how to handle it... this may need to be a dynamic choice based on current server load, as well as the obvious static server resources, as well as configured limits. Unfortunately, the SMTP protocol does not require predeclaration of the size of the incoming DATA part, so servers cannot enforce size limits until they are exceeded. So as the data streams in, a dynamic adjustment to the handling strategy might be appropriate. Gateways may choose to route messages, and stall the input until the output channel is ready to receive it, and basically "pass through" the data, with limited need to buffer messages on disk... unless the output channel doesn't respond... then they might reject the message. An SMTP server should be willing to act as a store-and-forward server, and also must do individual delivery of messages to each RCPT (or at least one per destination domain), so must have a way of dealing with large messages, probably via disk buffering. The case of disk buffering and retrying generally means that the whole message, not just the large data parts, must be stored on disk, so the external storage protocol should be able to deal with that case. The minimal external storage format capability is to store the received bytestream to disk, associate it with the envelope information, and be able to retrieve it in whole later. This would require having the whole thing in RAM at those two points in time, however, and doesn't solve the real problem. Incremental writing and reading to the external storage would be much more useful. 
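The incremental write/read idea above can be pictured as a minimal external-storage object. This is only a sketch of the concept being discussed, not a proposed API; the class and method names (DiskMessageStore, write_chunk, chunks) are hypothetical, and a real server would also handle spool cleanup, errors, and the dynamic in-memory/on-disk choice mentioned above.

```python
import tempfile

class DiskMessageStore:
    """Hypothetical incremental external storage for one raw message.

    Bytes are appended as they arrive from the SMTP DATA stream and read
    back later in fixed-size chunks, so the whole message never has to
    sit in RAM at once, and the envelope stays associated with it.
    """

    def __init__(self, envelope):
        self.envelope = envelope            # e.g. the received RCPT texts
        self._spool = tempfile.TemporaryFile()

    def write_chunk(self, data):
        # Called for each network buffer as it streams in.
        self._spool.write(data)

    def chunks(self, size=64 * 1024):
        # Sequential buffer-at-a-time read-back; rewinds first, so the
        # message can be replayed for each delivery attempt.
        self._spool.seek(0)
        while True:
            buf = self._spool.read(size)
            if not buf:
                break
            yield buf
```

A store-and-forward server could then retry delivery per RCPT by replaying `store.chunks()` without ever materializing the full message in memory.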
Even more useful, would be "partially parsed" seek points. An external storage system that provides "partially parsed" information could include: 1) envelope information. This section is useful to SMTP servers, but not other email tools, so should be optional. This could be a copy of the received RCPT command texts, complete with CRLF endings. 2) header information. This would be everything between DATA and the first CRLF CRLF sequence. 3) data. Pre-MIME this would simply be the rest of the message, but post-MIME it would be usefully more complex. If MIME headers can be observed and parsed as the data passes through, then additional metadata could be saved that could enhance performance of the later processing steps. Such additional metadata could include the beginning of each MIME part, the end of the headers for that part, and the end of the data for that part. The result of saving that information would mean that minimal data (just headers) would need to be read in to create a tree representing the email, the rest could be left in external storage until it is accessed... and then obtained directly from there when needed, and converted to the form required by the request... either the whole part, or some piece in a buffer. So there could be a variety of external storage systems... one that stores in memory, one that stores on disk per the ideas above, and a variety that retain some amount of cached information about the email, even though they store it all on disk. Sounds like this could be a plug-in, or an attribute of a message object creation. But to me, it sounds like the foundation upon which the whole email lib should be built, not something that is shoveled in later. A further note about access to data parts... clearly "data for the whole MIME part" could be provided, but even for a single part that could be large. So access to smaller chunks might be desired. The data access/conversion functions, therefore, should support a buffer-at-a-time access interface. 
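The seek-point metadata described above (start of each MIME part, end of its headers, end of its data) can be sketched as a simple offset scan. This is an illustration only, assuming a flat (non-nested) multipart body, CRLF line endings, and an already-known boundary; a real implementation would handle nesting, preamble/epilogue, and streaming input.

```python
def find_part_offsets(raw, boundary):
    """Sketch: record (part_start, headers_end, part_end) byte offsets
    for each MIME part, so later processing can read just the headers
    and leave the bodies in external storage until accessed.
    """
    marker = b"--" + boundary
    offsets = []
    pos = raw.find(marker)
    while pos != -1:
        nxt = raw.find(marker, pos + len(marker))
        if nxt == -1:
            break  # we hit the final --boundary-- terminator
        # Part content begins after the CRLF that ends the boundary line.
        part_start = raw.find(b"\r\n", pos) + 2
        # Part headers end at the first blank line (CRLF CRLF).
        headers_end = raw.find(b"\r\n\r\n", part_start) + 4
        offsets.append((part_start, headers_end, nxt))
        pos = nxt
    return offsets

raw = b"--XYZ\r\nContent-Type: text/plain\r\n\r\nhello\r\n--XYZ--\r\n"
parts = find_part_offsets(raw, b"XYZ")
```

With such offsets on disk alongside the message, building the header-only tree touches a few kilobytes instead of the whole payload.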
Base64 supports random access easily, unless it contains characters outside the 64-character alphabet that are to be ignored; those could throw off the size calculations. So maybe providing sequential buffer-at-a-time access with rewind is the best that can be done -- quoted-printable doesn't support random access very well, and neither would some sort of compression or encryption technique -- they usually like to start from the beginning -- and those are the sorts of things that I would consider likely to be standardized in the future, to reduce the size of the payload, and to increase the security of the payload. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
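The base64 random-access point above rests on simple arithmetic: every 3 decoded bytes correspond to 4 encoded characters, so a decoded offset maps directly to an encoded offset. A sketch, assuming a "clean" encoding with no line breaks or other ignorable characters (exactly the caveat raised above, since ignorable characters would throw off the arithmetic):

```python
import base64

def read_decoded_range(encoded, start, length):
    """Sketch: random access into a base64 payload without decoding it all.

    Assumes no ignorable characters (e.g. newlines) in `encoded`.
    """
    # Round the decoded start down to a 3-byte group boundary, which
    # corresponds to a 4-character boundary in the encoded stream.
    group = start // 3
    enc_start = group * 4
    # Enough 4-character groups to cover start + length decoded bytes
    # (ceiling division written with negatives).
    enc_end = -(-(start + length) // 3) * 4
    chunk = base64.b64decode(encoded[enc_start:enc_end])
    skip = start - group * 3
    return chunk[skip:skip + length]

payload = base64.b64encode(bytes(range(60)))
assert read_decoded_range(payload, 10, 7) == bytes(range(10, 17))
```

Quoted-printable offers no such mapping, since its expansion per byte varies, which is why sequential access with rewind may indeed be the common denominator.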