From rdmurray at bitdance.com  Tue Mar  1 21:40:57 2011
From: rdmurray at bitdance.com (R. David Murray)
Date: Tue, 01 Mar 2011 15:40:57 -0500
Subject: [Email-SIG] API thoughts
Message-ID: <20110301204058.54C96249A9D@kimball.webabinitio.net>

This is a long email, for which my apologies.  I hope you all will
manage to find some time to read it and provide feedback, as it speaks
to fundamental design issues.

My subconscious seems to have been very busy last night, since in the
shower this morning it presented me with a whole bunch of thoughts about
the email API.  This was triggered, I think, by Barry's question about
__version__, my response that we might want an 'api version' declaration,
and some comments made during the email 5.1 discussion by Steven D'Aprano
(I think) about how Message is really the idealized representation of
an email message.

Let me start by saying that I think we can all agree that the fundamental
design of the email package is excellent:  we have a Parser which handles
taking input from the outside world and turning it into a Message, and
we have a Generator which handles taking a Message and turning it into
something the outside world can handle.  In the focus of the original
development the "outside world" was, sensibly, RFC 822/2822 encoded
byte streams.

The idealized message consists of some meta information (sender,
recipient, date, etc, etc) and a body.  The body, the content, can be
arbitrarily complex.  The purpose of the message is to convey some of
that meta information and all of the arbitrarily complex body content
from the sender to the recipient.

Everything else is an implementation detail :)

So, if we are writing a program and we want to compose such a message, it
makes sense that we can build up this idealized message from its component
pieces by attaching objects representing those pieces to the Message.
At that stage we care nothing about how it needs to be transformed to
get from point A to point B.

If we want to look at a message, we again don't care about how it was
transformed to get from point A to point B, we just want to be able to
access the content in its original form.

In today's "outside world" we have more to worry about than just
RFC822/2822/5322.  The "outside world" could be an http transmission
medium.  It could (if we re-design things right:) be a SIP session.
It could be a disk-based data store, where an RFC822-like message format
is being used to store data.  I'm sure there are other contexts as well.

So keeping the external representation concerns separate from the
idealized message model makes sense.

The email4/5 API doesn't do this as successfully as it could, especially
in a Python3 context.  The application program dealing with the idealized
message doesn't really care what character set any given piece of a header
is encoded in, it really just wants to deal with complete unicode strings.
The application program also really doesn't care about the MIME type of a
piece of content, it just wants to manage an object that has methods that
allow it to manipulate that image, or that audio file, or what have you.
Of course, it also needs to know what type of object it is handling in an
incoming message, but the mime type is only one piece of the information
that determines that (albeit usually the most important one).

(Yes, some applications *do* care about internal details...but those
are special cases and we can provide additional APIs that allow access
at that level for those applications that need it, as we have discussed
previously.)

We propose to create a new API to make all of this easier for
the application programmer.  What doesn't change is the fundamental
structure of the package:  a message in some transmission format is
fed to a Parser, which produces a Message object.  A Message object
can be fed to a Generator, which produces a transmission format object.
Now, I lost sight of this a bit while I was working on the email6 header
classes, as Barry at least will remember, but I do think it is important,
and I want to keep it in the forefront of my mind as I work on adding
the proposed policy framework.

So, and here is the point of this email, how does the policy framework
integrate into this design?

I said that the policy pulls together the tunable bits of the email
package's algorithms.  What does this mean?  What are the tunable
bits?  Here are some candidates:

    maximum header line length on serialization
    line ending character on serialization
    whether or not to raise an exception if a defect is encountered
        during parsing
    how much transformation of untouched original data is permissible
        when re-serializing a message
    can the serialized form contain any non-ASCII data?
    what classes to use to represent various MIME types.
   
These are all decisions that can be made one way or another by an
application program using the current package.  Often, however, modifying
the default is not easy or convenient.  Note that the last one can only
be decided by an application program when constructing a message, not
when parsing one.

Here are some other things that it might be useful to be able to
control:

    what string to use as the continuation whitespace when needing
        to add some
    what classes to use to represent various structured headers
    what exactly counts as a defect
    should headers be RFC2047 encoded on serialization, or
        should another encoding be used?[*]

[*] There are current real-world use cases for this:  there are nntp
    servers that use utf-8 for headers, and the http protocol uses
    latin-1 (or sometimes, I think, utf-8)

This list breaks down into items that affect the Parser, ones that affect
the Generator, and ones that affect both the Parser and the Message.
(Well, the "how much transformation" affects all three in the sense that
the data has to be preserved by both the Parser and the Message in order
for the Generator to be able to implement it, but I think we can take
it as a given that we are going to preserve that data.)

The pieces that are shared between the Parser and the Message are really
about the Message:  how are the sub-objects represented?  How are the
structured headers represented?  So we could consider that the Parser
is a *consumer* of those pieces of policy, but that they are defined on
the Message, not on the Parser.

What this means is that the policies controlling each of the major
components (parser, message, generator) are in principle independent.

The design of the policy framework envisions having, for example, an
'HTTP' policy that would, say, expect and generate latin-1 encoded
headers, and generate headers without line breaks, using CRLF for the
line termination.  Initially I thought one would declare a policy
and that the Message object would remember that policy, but that you
could override it when, say, calling the generator.
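
A minimal sketch of what such a bundle of I/O knobs could look like,
assuming a frozen dataclass.  The names IOPolicy, header_charset and
clone are illustrative assumptions, not an API settled in this thread;
the 'HTTP' values simply mirror the description above (latin-1 headers,
no line breaks, CRLF line termination).

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class IOPolicy:
    """Hypothetical bundle of parser/generator knobs (names are assumptions)."""
    max_line_length: Optional[int] = 78   # None means "never wrap headers"
    linesep: str = "\n"                   # line ending used on serialization
    raise_on_defect: bool = False         # raise instead of recording defects
    header_charset: str = "ascii"         # charset expected/produced for headers

    def clone(self, **changes):
        """Return a copy with some knobs changed; the original is untouched."""
        return replace(self, **changes)


DEFAULT = IOPolicy()

# An 'HTTP'-flavoured policy as described in the text.
HTTP = DEFAULT.clone(max_line_length=None, linesep="\r\n",
                     header_charset="latin-1")
```

Because the object is immutable, a caller can derive a one-off variant
(say, for a single Generator call) without affecting anything else that
shares the original policy.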

Re-thinking it now, though, I think there are actually two distinct
components here: the I/O policy(s), and the Message construction policy.
That is, the things that the HTTP policy cares about are all Parser or
Generator controls.  The only thing the Message (should) care about is
how to represent its components.  The Message is thus independent of any
policy *except* the header/mime classes, while the Parser and Generator
can be consumers of the header/mime class policy used to construct the
Message.  It nevertheless makes sense to group the parser and generator
policy controls together, since that is how we conceptually think of them
('HTTP' implies a coherent set of input and output policies).

So, I think the "policy framework" is actually two things:  the
header/mime-types registry, and the Parser/Generator policies.  Let's have
'policy' refer to only the I/O policy, and call the other the email
class registry.

This narrower definition of policy is a straightforward enhancement
of the current API.  It makes these "knobs" more easily controlled,
and makes it easier to add new knobs without complicating the API.
I propose that I write up this policy API as a distinct proposal/patch
(with the work I've already done, this is more than half completed).
This would add policy keywords to the Parser and Generator classes,
and probably to the as_string method of Message.
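
For readers on a current Python 3, the email.policy module that later
shipped in the standard library exposes this kind of keyword.  The
sketch below uses only that shipped API (Parser, Generator and
as_string all accept policy=); it shows where the design ended up, not
what had been decided at the time of this message.

```python
from email import policy
from email.generator import Generator
from email.parser import Parser
from io import StringIO

raw = "Subject: hello\nTo: a@example.com\n\nbody\n"

# The parser takes a policy keyword.
msg = Parser(policy=policy.default).parsestr(raw)

# So does the generator; the SMTP policy serializes with CRLF line endings.
out = StringIO()
Generator(out, policy=policy.SMTP).flatten(msg)
wire = out.getvalue()

# as_string can override the policy for a single call, here with a
# derived policy that never wraps header lines.
unwrapped = msg.as_string(policy=policy.default.clone(max_line_length=None))
```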

The real meat of email6, then, is the header/mime-types registry, and
the changes in the API of the resulting Message objects.  The parser
currently accepts a _factory argument that specifies the object to be used
in creating the Message.   I propose that we deprecate this argument,
but that any code using it gets the old behavior of the parser (using
_factory to create the class for any new sub-objects).  Then we introduce
a new argument, 'factory'.  This new argument would expect a callable
that takes a mime-type as its argument, and returns an appropriate class.
The parser would be re-written so that it could use this factory, and
the backward compatibility case would be trivial to implement.
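
A rough sketch of the kind of callable the proposed 'factory' argument
might accept: it takes a MIME type string and returns a class.  The
registry layout and the TextPart/ImagePart classes are hypothetical,
chosen only to illustrate the lookup.

```python
from email.message import Message


class TextPart(Message):
    """Hypothetical class for text/* parts."""


class ImagePart(Message):
    """Hypothetical class for image/* parts."""


# Hypothetical registry: an exact MIME type wins, then the main type.
_REGISTRY = {
    "text": TextPart,
    "image": ImagePart,
}


def factory(mime_type):
    """Return the class to instantiate for a part of the given MIME type."""
    mime_type = mime_type.lower()
    maintype = mime_type.partition("/")[0]
    return _REGISTRY.get(mime_type) or _REGISTRY.get(maintype, Message)
```

An application such as an MUA could add its own entries (for example an
exact "image/png" key) to get custom presentation classes while
everything else falls back to the stock Message.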

In theory the classes returned by the registry/factory are arbitrary,
but in practice we will need to define the minimal API that they
should provide.  By specifying the API separately from the concrete
implementation in email6, we will allow third parties to write classes
that can play well with programs expecting to operate on email6 Messages.
This will allow, for example, an MUA to provide custom classes to enhance
presentation, while still allowing the message to be submitted to smtplib
for transmission.

I guess I'm proposing, then, that there be an API version definition,
with two values as of Python3.3: email5 API, and email6 API.  We'll
figure out how we name and interrogate these formally later.

The Header registry in this vision is accessed through the Message class.
I have various thoughts about how this will work, but I'm going to leave
those for later, since this email is already long enough.  I also have
some additional thoughts about backward compatibility, but it is going
to require some experimentation to see if they are realistic.

--David

From v+python at g.nevcal.com  Tue Mar  1 22:58:50 2011
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Tue, 01 Mar 2011 13:58:50 -0800
Subject: [Email-SIG] API thoughts
In-Reply-To: <20110301204058.54C96249A9D@kimball.webabinitio.net>
References: <20110301204058.54C96249A9D@kimball.webabinitio.net>
Message-ID: <4D6D6C1A.2070200@g.nevcal.com>

On 3/1/2011 12:40 PM, R. David Murray wrote:
> This is a long email, for which my apologies.  I hope you all will
> manage to find some time to read it and provide feedback, as it speaks
> to fundamental design issues.

Indeed.  Good to discuss before designing with ready-mix.

> Everything else is an implementation detail :)

Agreed.

> We propose to create a new API to make all of this easier for
> the application programmer.

YES!!

> [*] There are current real-world use cases for this:  there are nntp
>      servers that use utf-8 for headers, and the http protocol uses
>      latin-1 (or sometimes, I think, utf-8)

All the tunables listed are relevant.  The HTTP protocol standard claims 
to use Latin-1 + RFC 2047 encoding for non-Latin-1 characters; in 
practice, the browser implementations apparently use nearly _any_ 
encoding for headers!!!  For <form> responses, when there is actually 
user-specified data involved, they use the encoding defined for the page 
containing the form, as the encoding of the MIME headers sent back.  The 
"standard headers" seem to be ASCII, and somewhat immune to choice of 
encoding, except perhaps for those few encodings that are not ASCII 
supersets. (I have no clue how such are handled, if they are.  Anyone 
want to write an EBCDIC page containing a <form> for testing?)

This is useful, as it reduces the amount of character escaping likely to 
be required: the designer of the page chooses a character set that can 
represent the page, and is likely in the language of the intended 
recipient, who is likely to fill out the form using the same language.

It would be more useful, if the browsers included a(n ASCII) header that 
specified the encoding of subsequent headers: they do not.  Therefore, 
the server that receives the headers must somehow "know" the proper 
encoding.  For the situation where the CGI (or equivalent) script both 
generates the page containing the <form> and receives the form data, 
this is simple.  For the situation where the same web application 
designer creates the page containing the <form> and the CGI receiving 
the form data, and explicitly or implicitly declares the same encoding 
for both, this is functional, but there is the danger of someone 
changing the static pages to conform to a new standard encoding without 
realizing the consequences on the associated CGI scripts.  It is also 
rather hard to create "form filling" applications that can send form 
data to a server bypassing the access of the form itself... such 
applications must also "know" the proper encoding, and such applications 
are much more likely to be generated outside the realm of the original 
development environment, and much less likely to be involved in any 
planning to change encodings inside the application <form>s and CGIs.

To support reading byte-stream HTTP headers, therefore, it is critical 
that the email API accept an encoding from the application which "knows" 
the encoding; presently cgi.py has to pre-decode incoming headers 
because email does not have such a parameter.  On the other hand, maybe 
cgi.py shouldn't use email header parsing at all... since browsers don't 
use RFC 2047 encoding in practice, the parsing of headers without such 
is straightforward.

Further, HTTP data streams can be extremely large, and thus 
time-consuming to obtain over the wire.  CGI applications cannot afford 
to keep large blocks of data in RAM during receipt, thus if email wishes 
to support CGI, it needs features for placing large blocks of data on 
disk instead of in RAM during the parsing phase; cgi.py presently has to 
preparse headers, to separate them from the data streams, which it then 
handles on its own, because of this issue.

Hence, cgi.py does sufficient preparsing and private handling of HTTP 
data streams, that it seems that the only real benefit it gains from 
using email at all, is the handling of the complex RFC 2047 decoding... 
which in practice isn't used in HTTP data streams!

In any case, if email wants to promulgate itself as the "one true way" 
to process HTTP data streams, as well as SMTP and NNTP data streams, 
then it needs to address the issues above.

There is, by the way, room for improvement in the cgi.py handler for 
HTTP data streams; presently all large MIME objects are written to disk 
(but small ones are kept as string or byte streams), but it isn't 
necessarily the right disk, and the data must then be again copied, byte 
by byte, to its final file system location.  I see that as abhorrent 
overhead.  There is presently no provision for hooks that ask the CGI 
application what to do with the data being received, while it is being 
received, nor for policies to assist with better heuristics, with the 
goal in mind that a properly and completely received MIME object could 
then be renamed to its final location rather than copied.

> I guess I'm proposing, then, that there be an API version definition,
> with two values as of Python3.3: email5 API, and email6 API.  We'll
> figure out how we name and interrogate these formally later.

Question: While it is pretty clear that enhanced behaviors are required 
to benefit new applications that use email, and while some new APIs may 
be incompatible with some existing APIs, might it be possible to design 
the new API, and then build a compatibility layer that looks like the 
old API on top?  Such that there would be policies for the new APIs that 
would work like the old APIs to ease the implementation of such a 
layer?  I'm not sure I fully understand the use of _factory or factory 
parameters, but for APIs that have _factory and grow a factory, could 
not the presence of which parameter imply any variant functionality?

(OK, this question comes after not looking at the email API during all 
the GSOC and your implementation efforts since the last big round of 
discussion, but your proposals here seem to sound like it would be more 
possible with your current thinking than with your previous thinking.)

> The Header registry in this vision is accessed through the Message class.
> I have various thoughts about how this will work, but I'm going to leave
> those for later, since this email is already long enough.  I also have
> some additional thoughts about backward compatibility, but it is going
> to require some experimentation to see if they are realistic.

Consider me an interested observer; I'll enjoy reading, thinking, and 
commenting about these ideas too, but sadly am unlikely to implement an 
email client this year :(  But I have aspirations to do so, because none 
of the existing email clients exactly suit my preferences... (everyone 
should write an editor and an email client, no?  I've done the former 
several times... what I want, though, is emacs-python, instead of 
emacs-lisp).

Glenn

From rdmurray at bitdance.com  Tue Mar  1 23:59:10 2011
From: rdmurray at bitdance.com (R. David Murray)
Date: Tue, 01 Mar 2011 17:59:10 -0500
Subject: [Email-SIG] API thoughts
In-Reply-To: <4D6D6C1A.2070200@g.nevcal.com>
References: <20110301204058.54C96249A9D@kimball.webabinitio.net>
	<4D6D6C1A.2070200@g.nevcal.com>
Message-ID: <20110301225910.72D79249A6C@kimball.webabinitio.net>

On Tue, 01 Mar 2011 13:58:50 -0800, Glenn Linderman <v+python at g.nevcal.com> wrote:
> To support reading byte-stream HTTP headers, therefore, it is critical 
> that the email API accept an encoding from the application which "knows" 
> the encoding; presently cgi.py has to pre-decode incoming headers 
> because email does not have such a parameter.  On the other hand, maybe 
> cgi.py shouldn't use email header parsing at all... since browsers don't 
> use RFC 2047 encoding in practice, the parsing of headers without such 
> is straightforward.

I think it could make sense for the default input character set to be
a policy parameter for the parser.  Maybe not in the first version,
though :)
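
As an illustration of what a "default input character set" knob would
do, here is a small sketch that decodes one raw header line using a
charset supplied by the application.  The function name and signature
are assumptions made for this example only.

```python
def decode_header_line(raw: bytes, charset: str = "ascii"):
    """Split b'Name: value' and decode the value with the caller's charset."""
    name, _, value = raw.partition(b":")
    # Header names are ASCII; only the value needs the application's charset.
    return name.decode("ascii"), value.strip().decode(charset)
```

A CGI script that served its form as latin-1 would pass
charset="latin-1" here instead of pre-decoding the headers itself.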

Yes, it is simple(r) to parse headers if you don't have to worry about
RFC2047, but why duplicate code if you don't need to?  This assumes,
of course, that email6 does what cgi.py and similar programs need,
but I'll try to keep my eye on that.

> Further, HTTP data streams can be extremely large, and thus 
> time-consuming to obtain over the wire.  CGI applications cannot afford 
> to keep large blocks of data in RAM during receipt, thus if email wishes 
> to support CGI, it needs features for placing large blocks of data on 
> disk instead of in RAM during the parsing phase; cgi.py presently has to 
> preparse headers, to separate them from the data streams, which it then 
> handles on its own, because of this issue.

It is already in the plan to add disk caching support to the base email
API, so this will get addressed.  You may even be the one who suggested
designing the API as a general "storage" API so that different back-ends
can be hooked up.  In any case, that's what I've got in mind.

> There is, by the way, room for improvement in the cgi.py handler for 
> HTTP data streams; presently all large MIME objects are written to disk 
> (but small ones are kept as string or byte streams), but it isn't 
> necessarily the right disk, and the data must then be again copied, byte 
> by byte, to its final file system location.  I see that as abhorrent 
> overhead.  There is presently no provision for hooks that ask the CGI 
> application what to do with the data being received, while it is being 
> received, nor for policies to assist with better heuristics, with the 
> goal in mind that a properly and completely received MIME object could 
> then be renamed to its final location rather than copied.

I think the hookable storage back end addresses this, but the concrete
implementation (eventually) provided by email ought to support it as well.
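
One possible shape for such a hookable storage back end, sketched under
the assumption of a tiny write/finish interface.  All names here
(MemoryStorage, DiskStorage, choose_storage, the 64 KiB threshold) are
hypothetical.  The disk variant writes its temporary file in the
destination directory so that the final step is a rename rather than a
byte-by-byte copy.

```python
import os
import tempfile


class MemoryStorage:
    """Keeps a part's payload in RAM; suitable for small parts."""

    def __init__(self):
        self._chunks = []

    def write(self, data: bytes) -> None:
        self._chunks.append(data)

    def finish(self) -> bytes:
        return b"".join(self._chunks)


class DiskStorage:
    """Streams a payload to a temp file next to its final location."""

    def __init__(self, final_path: str):
        self._final_path = final_path
        directory = os.path.dirname(os.path.abspath(final_path))
        # Same directory means same filesystem, so finish() can rename.
        fd, self._tmp_path = tempfile.mkstemp(dir=directory)
        self._file = os.fdopen(fd, "wb")

    def write(self, data: bytes) -> None:
        self._file.write(data)

    def finish(self) -> str:
        self._file.close()
        os.replace(self._tmp_path, self._final_path)  # rename, not copy
        return self._final_path


def choose_storage(size_hint: int, final_path: str, threshold: int = 64 * 1024):
    """Hypothetical hook: the application decides where a part should go."""
    return DiskStorage(final_path) if size_hint > threshold else MemoryStorage()
```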

> > I guess I'm proposing, then, that there be an API version definition,
> > with two values as of Python3.3: email5 API, and email6 API.  We'll
> > figure out how we name and interrogate these formally later.
> 
> Question: While it is pretty clear that enhanced behaviors are required 
> to benefit new applications that use email, and while some new APIs may 
> be incompatible with some existing APIs, might it be possible to design 
> the new API, and then build a compatibility layer that looks like the 
> old API on top?  Such that there would be policies for the new APIs that 
> would work like the old APIs to ease the implementation of such a 

Yes, this is what was behind my comment that I had further ideas
about backward compatibility.  One way is what Barry and I already
discussed:  a wrapper to put around an email6 object that would support
the email5 API.  Another approach is to have the email6 message itself
support the legacy API.  I haven't looked at every method, but most
of them would be supportable.  The tricky bit is headers:  an email6
Message will return Header objects, whereas an email5 application will
generally expect to get strings.  (It shouldn't!  But many will.  Even the
email package itself expects to get strings when it accesses headers.)
My wild thought at this point is:  what if Header subclassed string?
With the exception of a few structured headers such as address headers,
this might actually work pretty well.  But experimentation with some
at least semi-real-world examples would be needed to prove out the
concept.
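
A minimal sketch of the "Header subclasses string" idea, assuming the
header simply carries its field name as an extra attribute.  The class
shown is hypothetical and much simpler than anything email6 would need.

```python
class Header(str):
    """Hypothetical header value that is also a plain string."""

    def __new__(cls, name, value):
        self = super().__new__(cls, value)
        self.name = name  # extra, email6-style information
        return self


subject = Header("Subject", "API thoughts")
```

Legacy code that compares, slices or searches the value keeps working
because it is a real str.  One limitation worth noting: str methods
such as upper() return plain str objects, so the extra attributes are
lost on any derived value.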

> layer?  I'm not sure I fully understand the use of _factory or factory 
> parameters, but for APIs that have _factory and grow a factory, could 
> not the presence of which parameter imply any variant functionality?

I'm not sure what you are asking here.  In what I outlined for the parser
API, you'd get an email5-API object if you used _factory or nothing,
and an email6 API object if you used factory, so yes, in that sense
the parameter determines the API.  But what about a library that is
accepting a Message object?  It needs a way to detect whether or not
it has been passed an email5 API message, or an email6 one.
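
One way a library could make that check, sketched under the assumption
that email6 messages carry a class attribute naming their API level.
The attribute name api_version and the Email6Message class are
hypothetical; how the version is named and interrogated was explicitly
left open above.

```python
from email.message import Message


class Email6Message(Message):
    """Hypothetical email6-style message advertising its API level."""
    api_version = 6


def api_version(msg, default=5):
    """Return the API level of msg; objects without the marker are email5."""
    return getattr(msg, "api_version", default)
```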

> (OK, this question comes after not looking at the email API during all 
> the GSOC and your implementation efforts since the last big round of 
> discussion, but your proposals here seem to sound like it would be more 
> possible with your current thinking than with your previous thinking.)

Well, in my previous thinking I was intending on doing much the same thing
as far as backward compatibility went (having a policy that provided an
email5 compatible object), I just hadn't talked about it much :)  The
biggest difference now is that email5 will be the default, at least in
the Python3.3 release.

> Consider me an interested observer; I'll enjoy reading, thinking, and 
> commenting about these ideas too, but sadly am unlikely to implement an 
> email client this year :(  But I have aspirations to do so, because none 
> of the existing email clients exactly suit my preferences... (everyone 
> should write an editor and an email client, no?  I've done the former 
> several times... what I want, though, is emacs-python, instead of 
> emacs-lisp).

Thanks for your attention and comments.  I haven't implemented an editor
yet (VIM + Python has been good enough so far), but I have implemented
parts of an email client, and intend to finish that project as part of
working on email6, as an API test bed.

--David

From rdmurray at bitdance.com  Wed Mar  2 01:52:51 2011
From: rdmurray at bitdance.com (R. David Murray)
Date: Tue, 01 Mar 2011 19:52:51 -0500
Subject: [Email-SIG] API thoughts
In-Reply-To: <20110301225910.72D79249A6C@kimball.webabinitio.net>
References: <20110301204058.54C96249A9D@kimball.webabinitio.net>
	<4D6D6C1A.2070200@g.nevcal.com>
	<20110301225910.72D79249A6C@kimball.webabinitio.net>
Message-ID: <20110302005251.85259249C7B@kimball.webabinitio.net>

On Tue, 01 Mar 2011 17:59:10 -0500, "R. David Murray" <rdmurray at bitdance.com> wrote:
> On Tue, 01 Mar 2011 13:58:50 -0800, Glenn Linderman <v+python at g.nevcal.com> wrote:
> > To support reading byte-stream HTTP headers, therefore, it is critical 
> > that the email API accept an encoding from the application which "knows" 
> > the encoding; presently cgi.py has to pre-decode incoming headers 
> > because email does not have such a parameter.  On the other hand, maybe 
> > cgi.py shouldn't use email header parsing at all... since browsers don't 
> > use RFC 2047 encoding in practice, the parsing of headers without such 
> > is straightforward.
> 
> I think it could make sense for the default input character set to be
> a policy parameter for the parser.  Maybe not in the first version,
> though :)

Just to clarify:  in the first version I check in.  I'd expect to decide
about that part of the API not too far into the development process,
and certainly well before 3.3.

--David

From rdmurray at bitdance.com  Wed Mar  2 02:45:46 2011
From: rdmurray at bitdance.com (R. David Murray)
Date: Tue, 01 Mar 2011 20:45:46 -0500
Subject: [Email-SIG] API thoughts
In-Reply-To: <4D6D959E.3000800@g.nevcal.com>
References: <20110301204058.54C96249A9D@kimball.webabinitio.net>
	<4D6D6C1A.2070200@g.nevcal.com>
	<20110301225910.72D79249A6C@kimball.webabinitio.net>
	<4D6D959E.3000800@g.nevcal.com>
Message-ID: <20110302014546.310242497E1@kimball.webabinitio.net>

On Tue, 01 Mar 2011 16:55:58 -0800, Glenn Linderman <v+python at g.nevcal.com> wrote:
> On 3/1/2011 2:59 PM, R. David Murray wrote:
> > On Tue, 01 Mar 2011 13:58:50 -0800, Glenn Linderman<v+python at g.nevcal.com>  wrote:
> Another reason is if the existing code handles many cases that are not 
> needed, and cannot be optimized for the case that is needed.  A "fast 
> path" reimplementation can eliminate the cases that are not needed, and 
> speed the result.  That, of course, depends on the internals of the 
> parsing of headers in the email package, and how much overhead RFC 2047 
> adds to that, which I haven't investigated and don't know.  Happily, 
> when uploading big files, headers are a  tiny fraction of time spent.  
> Sadly, when using large fill-in-the-blanks forms, header parsing can be 
> a significant fraction of the time spent.

I think the overhead if there are no encoded words in the header should
be minimal (probably a re scan, but possibly not even that, we'll see).
This could also be controlled by the policy (ie: the HTTP policy could
cause the header parser to skip the check-for-rfc2047-encoded-words
step).
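
A sketch of that fast path, assuming a cheap regular-expression scan
decides whether full RFC 2047 decoding is needed and a policy-style flag
can switch the check off entirely.  The function name and the
decode_encoded_words parameter are assumptions; decode_header and
make_header are the existing email.header functions.

```python
import re
from email.header import decode_header, make_header

# Cheap pre-check for something shaped like an RFC 2047 encoded word.
_ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def header_text(value, decode_encoded_words=True):
    """Return the header value as text, decoding encoded words if asked to."""
    if not decode_encoded_words or not _ENCODED_WORD.search(value):
        return value  # fast path: nothing to decode, or the policy says skip
    return str(make_header(decode_header(value)))
```

An HTTP-style policy would pass decode_encoded_words=False and never
pay for even the scan.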

> Presently, the cgi.py stream API only provides a open-file-like handle 
> to the data... so it can be read, written, and sought, but not assigned 
> to a specific filesystem, renamed, or moved using os facilities.  So a 
> broader API seems to be necessary for cgi.py; if that were available in 
> email, that would be helpful for cgi.py.

Yeah, additions to the cgi API are probably required to support this
properly.

> Hmm.  And while it might be more complex to handle structured headers, 
> in fact they come in as character sequences, so a mapping to string is 
> not impossible.  The real issue is if those headers had another API in 
> email5 (I could look that up, I guess), but perhaps that API could also 
> be supported along with a subclass of string.

They don't.  The issue is that what we would like is for the email6 API
for the address header to be that it looks like a list of Address objects.
So msg['To'][0] would yield an address object.  But if we also want the
header to look like a string, that won't work, because as a string that
should yield the first character of the body of the header.

Now, a sensible application would process the list of addresses in a To
header by passing it to util.getaddresses, but you can bet that there
are applications that don't do that.

A compromise would be to have an 'addresses' method that returned the
list of addresses.  Perhaps this would even be sensible in the context of
email6 by itself:  it would mean that all headers had a uniform base API
(they act like strings) and all structured information is accessed via
special methods.

> OK, what I was asking boils down to if the Message object can support 
> both APIs, the application doesn't need to care.  New applications would 
> probably want to use the new APIs, of course.  But they could be passed 
> between old and new applications (or fragments thereof) if they support 
> both.  It certainly wouldn't hurt to introduce the concept of a version 
> for the object, although in itself, that would only be accessible via a 
> new API, so old applications wouldn't think to use it...

Yeah, that would be an ideal world.  Let's see how close we can get :)

--David

From rdmurray at bitdance.com  Wed Mar  2 17:23:27 2011
From: rdmurray at bitdance.com (R. David Murray)
Date: Wed, 02 Mar 2011 11:23:27 -0500
Subject: [Email-SIG] API thoughts
In-Reply-To: <4D6DAD3F.2090306@g.nevcal.com>
References: <20110301204058.54C96249A9D@kimball.webabinitio.net>
	<4D6D6C1A.2070200@g.nevcal.com>
	<20110301225910.72D79249A6C@kimball.webabinitio.net>
	<4D6D959E.3000800@g.nevcal.com>
	<20110302014546.310242497E1@kimball.webabinitio.net>
	<4D6DAD3F.2090306@g.nevcal.com>
Message-ID: <20110302162327.7714224153E@kimball.webabinitio.net>

On Tue, 01 Mar 2011 18:36:47 -0800, Glenn Linderman <v+python at g.nevcal.com> wrote:
> On 3/1/2011 5:45 PM, R. David Murray wrote:
> > On Tue, 01 Mar 2011 16:55:58 -0800, Glenn Linderman<v+python at g.nevcal.com>  wrote:
> >> On 3/1/2011 2:59 PM, R. David Murray wrote:
> >>> On Tue, 01 Mar 2011 13:58:50 -0800, Glenn Linderman<v+python at g.nevcal.com>   wrote:
> >> Hmm.  And while it might be more complex to handle structured headers,
> >> in fact they come in as character sequences, so a mapping to string is
> >> not impossible.  The real issue is if those headers had another API in
> >> email5 (I could look that up, I guess), but perhaps that API could also
> >> be supported along with a subclass of string.
> > They don't.  The issue is that what we would like is for the email6 API
> > for the address header to be that it looks like a list of Address objects.
> > So msg['To'][0] would yield an address object.  But if we also want the
> > header to look like a string, that won't work, because as a string that
> > should yield the first character of the body of the header.
> >
> > Now, a sensible application would process the list of addresses in a To
> > header by passing it to util.getaddresses, but you can bet that there
> > are applications that don't do that.
> >
> > A compromise would be to have an 'addresses' method that returned the
> > list of addresses.  Perhaps this would even be sensible in the context of
> > email6 by itself:  it would mean that all headers had a uniform base API
> > (they act like strings) and all structured information is accessed via
> > special methods.
> 
> While  msg['To']  producing a structured result might not be possible 
> when subclassing string, you mention one possible alternative, an 
> additional method... seems like you mean msg['To'].addresses()?  It 
> would also be possible to make  msg.p['To'] for parsed/structured 
> results.  I'm not sure which would be easier to implement, or more 
> flexible under the covers to do caching of parsed/structured results.  
> Of course there are several headers dealing with lists of addresses, as 
> you are well aware, so  msg.addresses() wouldn't work without some 
> specification of the header.

Yes, exactly msg['To'].addresses (might as well use a property).
I think I prefer this to a separate retrieval method, since not all
headers are structured headers, and it is not clear what the "parsed"
version of a non-structured header would be (a plain string?).
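
A rough sketch of that idea, under stated assumptions: the class name
AddressHeader is made up for illustration and is not an email6 API.  Only
email.utils.getaddresses is an existing function.  The header value stays
a str, and the structured data is reached through a property:

```python
from email.utils import getaddresses


class AddressHeader(str):
    """Hypothetical header value: behaves as a str, exposes parsed addresses."""

    @property
    def addresses(self):
        # getaddresses() takes a list of header values and returns
        # (display_name, addr_spec) pairs.
        return getaddresses([str(self)])


to = AddressHeader('Barry <barry@example.org>, david@example.com')
first_char = to[0]      # string indexing still yields a character
parsed = to.addresses   # structured access goes through the property
```

With this shape msg['To'][0] keeps its string meaning, and code that wants
addresses asks for them explicitly.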

--David

From barry at python.org  Wed Mar  2 21:46:24 2011
From: barry at python.org (Barry Warsaw)
Date: Wed, 2 Mar 2011 15:46:24 -0500
Subject: [Email-SIG] API thoughts
In-Reply-To: <20110301204058.54C96249A9D@kimball.webabinitio.net>
References: <20110301204058.54C96249A9D@kimball.webabinitio.net>
Message-ID: <20110302154624.5dea1bd7@limelight.wooz.org>

On Mar 01, 2011, at 03:40 PM, R. David Murray wrote:

>So, and here is the point of this email, how does the policy framework
>integrate into this design?
[...]
>This list breaks down into items that affect the Parser, ones that affect
>the Generator, and ones that affect both the Parser and the Message.
>(Well, the "how much transformation" affects all three in the sense that
>the data has to be preserved by both the Parser and the Message in order
>for the Generator to be able to implement it, but I think we can take
>it as a given that we are going to preserve that data.)
>
>The pieces that are shared between the Parser and the Message are really
>about the Message:  how are the sub-objects represented?  How are the
>structured headers represented?  So we could consider that the Parser
>is a *consumer* of those pieces of policy, but that they are defined on
>the Message, not on the Parser.
>
>What this means is that the policies controlling each of the major
>components (parser, message, generator) are in principle independent.
[...]
>Re-thinking it now, though, I think there are actually two distinct
>components here: the I/O policy(s), and the Message construction policy.
[...]
>So, I think the "policy framework" is actually two things:  the
>header/mime-types registry, and the Parser/Generator policies.  Let's have
>'policy' refer to only the I/O policy, and call the other the email
>class registry.

+1

This makes a lot of sense, and I'm glad you've been thinking about this more
deeply than I have since we last bandied it about.  At the time, I thought a
single policy hierarchy would probably be fine, but you've laid out a good
argument for keeping them separate, and in fact not even calling the latter a
'policy'.  Here's another distinction:

Policy objects should be composable.  This would allow for a standard library
of policies that could be mixed and matched for specific applications, and
might even include some higher level policies like 'CGI' or 'NNTP'.  E.g. my
applications might combine a standard 'don't-check-rfc-2047' policy with a
'use-only-CRNL' and 'die-on-defect'.

I wonder too, how sophisticated policy objects really need to be.  Are they
just bags of attributes with some defaults, properties for access, maybe some
validation, and composability?

As for the registry, I don't think you need anything near that.  You just need
to say "when you see this mime-type, create an object using this callable".
Multiple registrations might be useful, but I don't think composability is.

>The real meat of email6, then, is the header/mime-types registry, and
>the changes in the API of the resulting Message objects.  The parser
>currently accepts a _factory argument that specifies the object to be used
>in creating the Message.   I propose that we deprecate this argument,
>but that any code using it gets the old behavior of the parser (using
>_factory to create the class for any new sub-objects).  Then we introduce
>a new argument, 'factory'.  This new argument would expect a callable
>that takes a mime-type as its argument, and returns an appropriate class.
>The parser would be re-written so that it could use this factory, and
>the backward compatibility case would be trivial to implement.

+1.  The underscore name in _factory is a historical wart that's not needed
any more.  I'm not even sure it makes much sense any more in Message
subclasses.  It *does* still make sense in e.g. add_header() where there's a
potential name collision between the arguments and the **params.  We should
evaluate these more carefully given today's API and clean this up if possible
(modulo all b/c considerations).

>In theory the classes returned by the registry/factory are arbitrary,
>but in practice we will need to define the minimal API that they
>should provide.  By specifying the API separately from the concrete
>implementation in email6, we will allow third parties to write classes
>that can play well with programs expecting to operate on email6 Messages.
>This will allow, for example, an MUA to provide custom classes to enhance
>presentation, while still allowing the message to be submitted to smtplib
>for transmission.

+1

>I guess I'm proposing, then, that there be an API version definition,
>with two values as of Python3.3: email5 API, and email6 API.  We'll
>figure out how we name and interrogate these formally later.
>
>The Header registry in this vision is accessed through the Message class.
>I have various thoughts about how this will work, but I'm going to leave
>those for later, since this email is already long enough.  I also have
>some additional thoughts about backward compatibility, but it is going
>to require some experimentation to see if they are realistic.

Cool.  Really great stuff David.

-Barry

From barry at python.org  Wed Mar  2 21:52:52 2011
From: barry at python.org (Barry Warsaw)
Date: Wed, 2 Mar 2011 15:52:52 -0500
Subject: [Email-SIG] API thoughts
In-Reply-To: <4D6D6C1A.2070200@g.nevcal.com>
References: <20110301204058.54C96249A9D@kimball.webabinitio.net>
	<4D6D6C1A.2070200@g.nevcal.com>
Message-ID: <20110302155252.5b58619c@limelight.wooz.org>

On Mar 01, 2011, at 01:58 PM, Glenn Linderman wrote:

>(everyone should write an editor and an email client, no?

Is there really any difference?

http://www.catb.org/~esr/jargon/html/Z/Zawinskis-Law.html

That's also the proof that the email package is the most important one in
Python because it will eventually be used by every Python application ever
written.

-Barry

From sdaoden at googlemail.com  Wed Mar  2 11:19:25 2011
From: sdaoden at googlemail.com (Steffen Daode Nurpmeso)
Date: Wed, 2 Mar 2011 11:19:25 +0100
Subject: [Email-SIG] email6 and Python 3.3
In-Reply-To: <20110228213235.609A3239561@kimball.webabinitio.net>
References: <20110228201133.8EECE249BE5@kimball.webabinitio.net>
	<20110228154829.16c89a32@limelight.wooz.org>
	<20110228213235.609A3239561@kimball.webabinitio.net>
Message-ID: <20110302101925.GA64097@sherwood.local>

> On Mon, Feb 28, 2011 at 04:32:35PM -0500, R. David Murray wrote:
> Well, fortunately I've been enjoying it, and the increased recognition
> is certainly one of the rewards, so thank you.

> On Mon, 28 Feb 2011 15:48:29 -0500, Barry Warsaw <barry at python.org> wrote:
> Just wait 'til the hate mail starts.  Fortunately, most of that's got raw
> 8-bit in the headers, so you're in luck. :)

Increasing recognition with a non hate mail!
Thank you for making my thing possible - out of the box.

From sdaoden at googlemail.com  Wed Mar  2 21:40:39 2011
From: sdaoden at googlemail.com (Steffen Daode Nurpmeso)
Date: Wed, 2 Mar 2011 21:40:39 +0100
Subject: [Email-SIG] API thoughts
In-Reply-To: <20110301204058.54C96249A9D@kimball.webabinitio.net>
References: <20110301204058.54C96249A9D@kimball.webabinitio.net>
Message-ID: <20110302204039.GA43276@sherwood.local>

I've also read the updated EMAIL-SIG DesignThoughts.

But if "what goes in .defects[]" will be configurable i would hope 
for a generic is_malformed() and maybe is_processable() or the 
like, i.e. state versus (translatable?) user-info.
(The more i think about it the more i agree with David (i hope 
i don't lie about that) that it's a waste of time to try to 
convert malformed data to a compliant state, especially if the 
package is - by design - capable of spitting out the data the very 
same way it came in.  Someone will take care - and throw it away.)

I also go for lazy parsing when designing an email package. 
(Pluggable) File-based backend. 

Besides that all of this, and including the things David explained 
in the issue tracker, sounds like smoked tofu to me. ;-)

Unfortunately my non-hate mail seems to have been mistreated as 
spam 8-}, therefore i wrote all of the above just to thank David 
once again for making the email and mailbox packages usable 
already in Python 3.2.  Thanks.

From barry at python.org  Wed Mar  2 22:12:06 2011
From: barry at python.org (Barry Warsaw)
Date: Wed, 2 Mar 2011 16:12:06 -0500
Subject: [Email-SIG] API thoughts
In-Reply-To: <20110302014546.310242497E1@kimball.webabinitio.net>
References: <20110301204058.54C96249A9D@kimball.webabinitio.net>
	<4D6D6C1A.2070200@g.nevcal.com>
	<20110301225910.72D79249A6C@kimball.webabinitio.net>
	<4D6D959E.3000800@g.nevcal.com>
	<20110302014546.310242497E1@kimball.webabinitio.net>
Message-ID: <20110302161206.3a61d67a@limelight.wooz.org>

On Mar 01, 2011, at 08:45 PM, R. David Murray wrote:

>They don't.  The issue is that what we would like is for the email6 API
>for the address header to be that it looks like a list of Address objects.
>So msg['To'][0] would yield an address object.  But if we also want the
>header to look like a string, that won't work, because as a string that
>should yield the first character of the body of the header.

Here's where things get really interesting because you won't actually know
what msg[header][0] could return for any arbitrary value of 'header'.

For structured headers like To, msg['To'] can return an ordered sequence of
address objects, but what about msg['Received'] or msg['X-Happy-Fun-Ball']?
The same will go for anything like .addresses.

I'm not sure what the implications of this for the API are, but it's important
to keep in mind (I know RDM knows this) that structured headers need extra
parsing and will have more sophisticated objects representing them.

-Barry

From rdmurray at bitdance.com  Thu Mar  3 01:40:36 2011
From: rdmurray at bitdance.com (R. David Murray)
Date: Wed, 02 Mar 2011 19:40:36 -0500
Subject: [Email-SIG] bug report
In-Reply-To: <4CC9FB26.2020100@gmail.com>
References: <4CC9FB26.2020100@gmail.com>
Message-ID: <20110303004036.A713D239549@kimball.webabinitio.net>

On Fri, 29 Oct 2010 00:37:26 +0200, Tobias Koeck <tobias.koeck at gmail.com> wrote:
> 'ascii' codec can't encode character u'\xfc' in position 40: ordinal 
> not in range(128)
> Traceback (most recent call last):
>    File "/usr/lib/calibre/calibre/gui2/device.py", line 588, in 
> _send_mails
>      attachment_name = attachment_names[i])
>    File "/usr/lib/calibre/calibre/utils/smtp.py", line 179, in 
> compose_mail
>      attachment_name=attachment_name)
>    File "/usr/lib/calibre/calibre/utils/smtp.py", line 29, in create_mail
>      msg = MIMEText(text)
>    File "/usr/lib/python2.6/email/mime/text.py", line 30, in __init__
>      self.set_payload(_text, _charset)
>    File "/usr/lib/python2.6/email/message.py", line 224, in set_payload
>      self.set_charset(charset)
>    File "/usr/lib/python2.6/email/message.py", line 260, in set_charset
>      self._payload = self._payload.encode(charset.output_charset)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in 
> position 40: ordinal not in range(128)

Please submit a bug report at bugs.python.org with additional
details if you can (ie: what was the input to MIMEText that
triggered this error, and what version of python are you
using?)

--David

From rdmurray at bitdance.com  Thu Mar  3 01:50:20 2011
From: rdmurray at bitdance.com (R. David Murray)
Date: Wed, 02 Mar 2011 19:50:20 -0500
Subject: [Email-SIG] API thoughts
In-Reply-To: <20110302204039.GA43276@sherwood.local>
References: <20110301204058.54C96249A9D@kimball.webabinitio.net>
	<20110302204039.GA43276@sherwood.local>
Message-ID: <20110303005020.6135B2001CD@kimball.webabinitio.net>

On Wed, 02 Mar 2011 21:40:39 +0100, Steffen Daode Nurpmeso <sdaoden at googlemail.com> wrote:
> But if "what goes in .defects[]" will be configurable i would hope 
> for a generic is_malformed() and maybe is_processable() or the 
> like, i.e. state versus (translatable?) user-info.

I'm not sure what you are asking for here.  I think "if msg.is_malformed()"
is spelled "if msg.defects".  That is, if the defects list is non-empty,
the message is technically malformed.  Of course, that information by
itself isn't necessarily useful, which is why defects is a list
of defects.  "is_processable" lies in the eyes of the application.
What defects is it capable of dealing with?  The email package
can't know that.  So, again, that's why defects is a list.
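
A minimal sketch of how an application might spell both checks on top of
the defects list.  The TOLERATED tuple is an application-side assumption
made up for this example; the defect classes themselves live in
email.errors:

```python
import email
from email import errors

# A multipart message without a boundary parameter: the parser records
# defects for it rather than raising.
raw = "Content-Type: multipart/mixed\n\nbody text\n"
msg = email.message_from_string(raw)

# The application, not the email package, decides what it can cope with.
TOLERATED = (errors.NoBoundaryInMultipartDefect,
             errors.MultipartInvariantViolationDefect)

is_malformed = bool(msg.defects)
is_processable = all(isinstance(d, TOLERATED) for d in msg.defects)
```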

Let me clarify what I mean by the policy controlling "what, exactly, is
a defect".  The idea here is that when parsing an email, each deviance
from the RFCs counts as a defect (the current email package, by the way,
only detects a small number of such defects!).  But when parsing, say,
an http stream, non-ascii characters in headers are perfectly legal.
So it seems to make sense that the HTTP policy would change what counts
as a defect during the operation of the parser.

> (The more i think about it the more i agree with David (i hope 
> i don't lie about that) that it's a waste of time to try to 
> convert malformed data to a compliant state, especially if the 
> package is - by design - capable of spitting out the data the very 
> same way it came in.  Someone will take care - and throw it away.)

Well, I think we may provide some tools to do such "fixups" when it is
possible and the application wants it.  But they should be app-requested
transformations, not automatic ones.

> Unfortunately my non-hate mail seems to have been mistreated as 
> spam 8-}, therefore i wrote all of the above just to thank David 
> once again for making the email and mailbox packages usable 
> already in Python 3.2.  Thanks.

You are welcome :)

--David

From rdmurray at bitdance.com  Thu Mar  3 02:23:41 2011
From: rdmurray at bitdance.com (R. David Murray)
Date: Wed, 02 Mar 2011 20:23:41 -0500
Subject: [Email-SIG] API thoughts
In-Reply-To: <20110302154624.5dea1bd7@limelight.wooz.org>
References: <20110301204058.54C96249A9D@kimball.webabinitio.net>
	<20110302154624.5dea1bd7@limelight.wooz.org>
Message-ID: <20110303012341.DF05922B74A@kimball.webabinitio.net>

On Wed, 02 Mar 2011 15:46:24 -0500, Barry Warsaw <barry at python.org> wrote:
> On Mar 01, 2011, at 03:40 PM, R. David Murray wrote:
> >So, I think the "policy framework" is actually two things:  the
> >header/mime-types registry, and the Parser/Generator policies.  Let's have
> >'policy' refer to only the I/O policy, and call the other the email
> >class registry.
> 
> +1
> 
> This makes a lot of sense, and I'm glad you've been thinking about this more
> deeply than I have since we last bandied it about.  At the time, I thought a
> single policy hierarchy would probably be fine, but you've laid out a good
> argument for keeping them separate, and in fact not even calling the latter
> a 'policy'.  Here's another distinction:
> 
> Policy objects should be composable.  This would allow for a standard library
> of policies that could be mixed and matched for specific applications, and
> might even include some higher level policies like 'CGI' or 'NNTP'.  E.g. my
> applications might combine a standard 'don't-check-rfc-2047' policy with a
> 'use-only-CRNL' and 'die-on-defect'.

Yes, my current implementation of policy objects allows you to say
things like:

    policy = HTTP + Strict

where HTTP is the obvious and 'Strict' is a policy that sets the "raise
on defect" flag.
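
For comparison, the email.policy module that later shipped in the
standard library (Python 3.3 and newer) supports this kind of composition
with the + operator; there the strict policy is spelled in lower case.
A short sketch using that module:

```python
from email import policy

# policy.HTTP uses '\r\n' line endings and no line-length limit;
# policy.strict sets raise_on_defect to True.
combined = policy.HTTP + policy.strict

settings = (combined.linesep, combined.max_line_length, combined.raise_on_defect)
```

Non-default settings from the right-hand operand take precedence over
those from the left-hand one, so the order matters when two policies set
the same attribute.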

> I wonder too, how sophisticated policy objects really need to be.  Are they
> just bags of attributes with some defaults, properties for access, maybe some
> validation, and composability?

Pretty much.  I think they will also contain some callable methods,
to provide hooks where a policy subclass can implement a custom policy.
My current implementation has such a hook for registering defects, which
would allow a custom policy to, for example, log the defects in addition
to or instead of putting them into the defects list.
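
A sketch of such a hook, written against the email.policy module as it
later shipped.  That register_defect matches the hook in the
implementation described here is an assumption, and the module-level
list is only a stand-in for a real logger:

```python
import email
from email.policy import EmailPolicy

logged = []  # stand-in for a real logger, to keep the sketch self-contained


class LoggingPolicy(EmailPolicy):
    """Record each defect, then fall back to the default behaviour."""

    def register_defect(self, obj, defect):
        logged.append(type(defect).__name__)
        # Drop this call to log *instead of* filling obj.defects.
        super().register_defect(obj, defect)


# A multipart message with no boundary parameter triggers parser defects.
raw = "Content-Type: multipart/mixed\n\nbody text\n"
msg = email.message_from_string(raw, policy=LoggingPolicy())
```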

> As for the registry, I don't think you need anything near that.  You just need
> to say "when you see this mime-type, create an object using this callable".
> Multiple registrations might be useful, but I don't think composability is.

Well, I'm thinking that a minimal sort of composability *is* useful.
One of the annoying things about class hierarchies is that if you want to
add a feature to the base class, you have to make new subclasses for *all*
of the classes in the hierarchy (unless you monkey patch).  What I was
thinking of was to have the registry have a 'base class' slot that got
used as the base class for all the mime-type classes, composed on the fly
at instantiation time (and similarly for the headers).  That way if you
wanted to add features to all the classes in the hierarchy, you could
register your custom 'base class' and not need to touch anything else.
But since the API for the registry is now a callable, and especially if
we specify it as returning callables, then doing such composition could
be left to the application (perhaps with a recipe in the docs).
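
One possible shape for such a recipe.  Every name below (ClassRegistry,
TextPart, Tracing) is hypothetical; nothing here is an email6 API.  The
registry is itself the factory callable: MIME type in, class out, with
the shared base composed in on first use:

```python
class ClassRegistry:
    """Hypothetical registry mapping MIME types to classes."""

    def __init__(self, base=object):
        self.base = base       # the 'base class' slot
        self._classes = {}
        self._composed = {}    # cache of classes composed on the fly

    def register(self, mime_type, cls):
        self._classes[mime_type] = cls
        self._composed.pop(mime_type, None)

    def __call__(self, mime_type):
        """The factory interface: MIME type in, class out."""
        if mime_type not in self._composed:
            cls = self._classes[mime_type]
            # Compose the registered class with the shared base.
            self._composed[mime_type] = type(cls.__name__, (cls, self.base), {})
        return self._composed[mime_type]


class TextPart:
    def describe(self):
        return 'text part'


class Tracing:
    """Application feature added to every class without subclassing each one."""

    def trace(self):
        return 'tracing ' + self.describe()


factory = ClassRegistry(base=Tracing)
factory.register('text/plain', TextPart)
part = factory('text/plain')()
```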

Composing registries can thus also be left to the application.  email6
itself should have only one, I think, or if there are two the other will
be the email5 back-compat registry and there'd be no reason to compose
with it.

I'm not sure what you mean by multiple registrations.  Can you give
an example?

> >The real meat of email6, then, is the header/mime-types registry, and
> >the changes in the API of the resulting Message objects.  The parser
> >currently accepts a _factory argument that specifies the object to be used
> >in creating the Message.   I propose that we deprecate this argument,
> >but that any code using it gets the old behavior of the parser (using
> >_factory to create the class for any new sub-objects).  Then we introduce
> >a new argument, 'factory'.  This new argument would expect a callable
> >that takes a mime-type as its argument, and returns an appropriate class.
> >The parser would be re-written so that it could use this factory, and
> >the backward compatibility case would be trivial to implement.
> 
> +1.  The underscore name in _factory is a historical wart that's not needed
> any more.  I'm not even sure it makes much sense any more in Message
> subclasses.  It *does* still make sense in e.g. add_header() where there's a
> potential name collision between the arguments and the **params.  We should
> evaluate these more carefully given today's API and clean this up if possible
> (modulo all b/c considerations).

Ah, so *that's* what those underscores are for.  I always wondered.
Yeah, I think we can do a lot of cleanup here.

> Cool.  Really great stuff David.

Thanks.

--David

From rdmurray at bitdance.com  Thu Mar  3 02:41:12 2011
From: rdmurray at bitdance.com (R. David Murray)
Date: Wed, 02 Mar 2011 20:41:12 -0500
Subject: [Email-SIG] email6 funding
Message-ID: <20110303014112.A4F3E21AE4B@kimball.webabinitio.net>

So, now that I've cleared my reply backlog, on to the exciting news.

Some of you may have seen Jesse Noller's retweet of the tweet from
Paul Leroux of QNX.  This is big news for me (and for the email-sig :):
QNX wants to fund me to do the email6 development.

We are still working out the details, but I think you can expect to
see email6 development go into overdrive in the near future.  Like,
right after PyCon.  We're preparing things at my consulting firm to
allow me to spend a significant amount of my time working on email6.

I am *seriously* excited by this, and very grateful to QNX.

Anyone interested in an email6 BOF at PyCon or a brainstorming
session during the Sprints afterward, please let me know.

--David

From janssen at parc.com  Thu Mar  3 02:57:00 2011
From: janssen at parc.com (Bill Janssen)
Date: Wed, 2 Mar 2011 17:57:00 PST
Subject: [Email-SIG] email6 funding
In-Reply-To: <20110303014112.A4F3E21AE4B@kimball.webabinitio.net>
References: <20110303014112.A4F3E21AE4B@kimball.webabinitio.net>
Message-ID: <12386.1299117420@parc.com>

And RIM just bought QNX, so I'd expect to see interest in Outlook
compatibility.

Interesting.

Bill

From sdaoden at googlemail.com  Thu Mar  3 16:28:32 2011
From: sdaoden at googlemail.com (Steffen Daode Nurpmeso)
Date: Thu, 3 Mar 2011 16:28:32 +0100
Subject: [Email-SIG] API thoughts
In-Reply-To: <20110303005020.6135B2001CD@kimball.webabinitio.net>
References: <20110301204058.54C96249A9D@kimball.webabinitio.net>
	<20110302204039.GA43276@sherwood.local>
	<20110303005020.6135B2001CD@kimball.webabinitio.net>
Message-ID: <20110303152832.GA17870@sherwood.local>

On Wed, Mar 02, 2011 at 07:50:20PM -0500, R. David Murray wrote:
> That is, if the defects list is non-empty,
> the message is technically malformed.  Of course, that information by
> itself isn't necessarily useful, which is why defects is a list
> of defects.
> "is_processable" lies in the eyes of the application.
> What defects is it capable of dealing with?  The email package
> can't know that.  So, again, that's why defects is a list.
> 
> Let me clarify what I mean by the policy controlling "what, exactly, is
> a defect".  The idea here is that when parsing an email, each deviance
> from the RFCs counts as a defect (the current email package, by the way,
> only detects a small number of such defects!).  But when parsing, say,
> an http stream, non-ascii characters in headers are perfectly legal.
> So it seems to make sense that the HTTP policy would change what counts
> as a defect during the operation of the parser.

So i would hope for '.all_defects[]' and (policy-adjusted) 
'.defects[]'.  I would hope for 
'.had_header_defects(policy_only=True)', 
'.had_payload_defects(policy_only=True)'.

Doing so would fill the huge hole in between 'not len(defects)' 
and the detailed inspection of a defects list which consists of 
a highly differentiated tree of classes.

The parser has to parse, and does encounter, all of these anyway, 
and an application cannot re-collect this (dropped) information 
except with expensive effort, i.e. at least choosing a different, 
stricter policy followed by another parse of the bogus mail.

In the end it is my belief that a framework should bring light 
onto all aspects of a thing, such that no other framework is ever 
needed, but especially not on a lower level (except the framework 
is so designed that it allows replacement of its own low-level 
interface, say). 
And i don't think there can be a higher level interface than 
message_from_(bytes|string)().

From rdmurray at bitdance.com  Thu Mar  3 17:13:41 2011
From: rdmurray at bitdance.com (R. David Murray)
Date: Thu, 03 Mar 2011 11:13:41 -0500
Subject: [Email-SIG] API thoughts
In-Reply-To: <20110303152832.GA17870@sherwood.local>
References: <20110301204058.54C96249A9D@kimball.webabinitio.net>
	<20110302204039.GA43276@sherwood.local>
	<20110303005020.6135B2001CD@kimball.webabinitio.net>
	<20110303152832.GA17870@sherwood.local>
Message-ID: <20110303161341.F297D249F78@kimball.webabinitio.net>

On Thu, 03 Mar 2011 16:28:32 +0100, Steffen Daode Nurpmeso <sdaoden at googlemail.com> wrote:
> On Wed, Mar 02, 2011 at 07:50:20PM -0500, R. David Murray wrote:
> > That is, if the defects list is non-empty,
> > the message is technically malformed.  Of course, that information by
> > itself isn't necessarily useful, which is why defects is a list
> > of defects.
> > "is_processable" lies in the eyes of the application.
> > What defects is it capable of dealing with?  The email package
> > can't know that.  So, again, that's why defects is a list.
> > 
> > Let me clarify what I mean by the policy controlling "what, exactly, is
> > a defect".  The idea here is that when parsing an email, each deviance
> > from the RFCs counts as a defect (the current email package, by the way,
> > only detects a small number of such defects!).  But when parsing, say,
> > an http stream, non-ascii characters in headers are perfectly legal.
> > So it seems to make sense that the HTTP policy would change what counts
> > as a defect during the operation of the parser.
> 
> So i would hope for '.all_defects[]' and (policy-adjusted) 
> '.defects[]'.  I would hope for 
> '.had_header_defects(policy_only=True)', 
> '.had_payload_defects(policy_only=True)'.

Well, what is a defect for an HTTP parse is not the same as what is
a defect for an email parse, so I don't know what "all defects" would
consist of.  The recovery decisions the parser makes can also be affected
by the policy, so there can't, as far as I can see, be a single list of
"all defects" that applies to all parses.

Currently the email package does not report header defects.  When it does,
my plan is that each Header will have its own defect list, and likewise
each message body (using a recursive definition).  How the defects list
on the Message object interacts with this is an interesting API question
worthy of discussion.  Perhaps we do, after all, have some sort of
"has_defects" method that queries the constituent parts, and perhaps a
function that returns a list of parts with defects, possibly divided
between headers and body as you suggest.
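
A sketch of what such a query could look like.  The helper names
iter_defects and has_defects are made up for this example and are not
part of the email package.  It relies on the later email.policy module,
where header objects produced by EmailPolicy carry their own defects
list:

```python
import email
from email import policy


def iter_defects(msg):
    """Yield (part, header_name_or_None, defect) for a whole message tree."""
    for part in msg.walk():
        for defect in part.defects:
            yield part, None, defect
        for name, value in part.items():
            # Plain string header values have no defects attribute.
            for defect in getattr(value, 'defects', ()):
                yield part, name, defect


def has_defects(msg):
    return next(iter_defects(msg), None) is not None


raw = "Content-Type: multipart/mixed\n\nbody text\n"
msg = email.message_from_string(raw, policy=policy.default)
body_defects = [d for _, name, d in iter_defects(msg) if name is None]
header_defects = [d for _, name, d in iter_defects(msg) if name is not None]
```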

> Doing so would fill the huge hole in between 'not len(defects)' 
> and the detailed inspection of a defects list which consists of 
> a highly differentiated tree of classes.

Yeah, the number of different defect classes involved in this scheme
worries me a little bit.

> The parser has to parse, and does encounter, all of these anyway, 
> and an application cannot re-collect this (dropped) information 
> except with expensive effort, i.e. at least choosing a different, 
> stricter policy followed by another parse of the bogus mail.

Why recollect?  The list is there (and, as I indicated above, will be
associated with the part that contains the error).  The list of defects
will be *all* the defects detected by that policy: all RFC deviance
(well, perhaps not quite all...see below).  Defects don't normally raise
errors, so there's no reason not to look for all of the relevant ones
(and indeed, we are probably only detecting the ones that actually affect
the parsing).

That is, if you parse an HTTP stream, encountering a non-ASCII character
is *not* a defect.  It doesn't make any sense to me to report an
"if this were an email this would be a defect" defect.  And if the
header for some strange reason included an RFC2047 encoded word that
was invalidly formed...well, in an HTTP parse that would *technically*
violate the RFC, but in practice it really means that the data should
just be passed through as is.  That is, it's not a defect, and we
would be wasting time even *looking* for RFC2047 encoded words.
(Unless someone finds a browser or server that generates them!)

In other words, in the base package I don't think there are "strict"
and "less strict" parsing policies; rather there are *different* parsing
policies depending on the context.  As far as I can see, it makes no sense
to parse an HTTP stream, and then reparse it as if it were an email stream.
Now, it might be useful to design a "very_strict" policy that did extra
work looking for RFC defects that a normal parse wouldn't detect (I can't
think of any off the top of my head, but the email RFCs are so complex
that I'm sure there are some), but in that case if you parsed it with
the less-strict (normal) policy those defects would *not* be noticed
by the parser.  In any case, I think such a validating parser/policy is
out of scope for the current package.

--David

From barry at python.org  Fri Mar  4 03:52:31 2011
From: barry at python.org (Barry Warsaw)
Date: Thu, 3 Mar 2011 21:52:31 -0500
Subject: [Email-SIG] email6 funding
In-Reply-To: <20110303014112.A4F3E21AE4B@kimball.webabinitio.net>
References: <20110303014112.A4F3E21AE4B@kimball.webabinitio.net>
Message-ID: <20110303215231.41b55ddd@neurotica.wooz.org>

On Mar 02, 2011, at 08:41 PM, R. David Murray wrote:

>So, now that I've cleared my reply backlog, on to the exciting news.
>
>Some of you may have seen Jesse Noller's retweet of the tweet from
>Paul Leroux of QNX.  This is big news for me (and for the email-sig :):
>QNX wants to fund me to do the email6 development.
>
>We are still working out the details, but I think you can expect to
>see email6 development go into overdrive in the near future.  Like,
>right after PyCon.  We're preparing things at my consulting firm to
>allow me to spend a significant amount of my time working on email6.
>
>I am *seriously* excited by this, and very grateful to QNX.

What can I say other than: AWESOME!  Thanks QNX!

>Anyone interested in an email6 BOF at PyCon or a brainstorming
>session during the Sprints afterward, please let me know.

o/

I probably won't have time to sprint on email this year, but I would love to
have a BOF.

-Barry

From barry at python.org  Fri Mar  4 03:55:59 2011
From: barry at python.org (Barry Warsaw)
Date: Thu, 3 Mar 2011 21:55:59 -0500
Subject: [Email-SIG] API thoughts
In-Reply-To: <20110303012341.DF05922B74A@kimball.webabinitio.net>
References: <20110301204058.54C96249A9D@kimball.webabinitio.net>
	<20110302154624.5dea1bd7@limelight.wooz.org>
	<20110303012341.DF05922B74A@kimball.webabinitio.net>
Message-ID: <20110303215559.572fcede@neurotica.wooz.org>

On Mar 02, 2011, at 08:23 PM, R. David Murray wrote:

>Pretty much.  I think they will also contain some callable methods,
>to provide hooks where a policy subclass can implement a custom policy.
>My current implementation has such a hook for registering defects, which
>would allow a custom policy to, for example, log the defects in addition
>to or instead of putting them into the defects list.

Makes sense.

>I'm not sure what you mean by multiple registrations.  Can you give
>an example?

I really meant multiple registries, mostly thinking about how to avoid some
global state.  But Python already has some global registries, so maybe that's
not too bad in this case.

-Barry

From rdmurray at bitdance.com  Fri Mar  4 14:33:04 2011
From: rdmurray at bitdance.com (R. David Murray)
Date: Fri, 04 Mar 2011 08:33:04 -0500
Subject: [Email-SIG] API thoughts
In-Reply-To: <20110303215559.572fcede@neurotica.wooz.org>
References: <20110301204058.54C96249A9D@kimball.webabinitio.net>
	<20110302154624.5dea1bd7@limelight.wooz.org>
	<20110303012341.DF05922B74A@kimball.webabinitio.net>
	<20110303215559.572fcede@neurotica.wooz.org>
Message-ID: <20110304133304.861442499B0@kimball.webabinitio.net>

On Thu, 03 Mar 2011 21:55:59 -0500, Barry Warsaw <barry at python.org> wrote:
> On Mar 02, 2011, at 08:23 PM, R. David Murray wrote:
> >I'm not sure what you mean by multiple registrations.  Can you give
> >an example?
> 
> I really meant multiple registries, mostly thinking about how to avoid some
> global state.  But Python already has some global registries, so maybe that's
> not too bad in this case.

Ah, yes.  Well, so far my thought is that there is a global registry
for the email package itself, but since email package access to that
registry will be through the 'factory', there is nothing that says that
has to be the only registry used by an application.  The existence of
the email package global registry will allow the addition of classes
to the "default" registry by libraries (if we dare :) and applications,
while access through the factory means that an application is free
to manage a completely independent registry if it prefers.  Or perhaps
it is better to think about the default email package registry as
just that, the *default* registry, since I think its only specialness
will be that it is the registry that is used by default.
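
To make the default-registry-plus-factory idea concrete, a minimal
sketch follows.  Registry, Factory, DEFAULT_REGISTRY and the two header
classes are hypothetical names chosen for this example; nothing here is
the actual design.

```python
class Registry:
    """Hypothetical registry mapping a key (here a header name) to a class."""

    def __init__(self):
        self._classes = {}

    def register(self, key, cls):
        self._classes[key.lower()] = cls

    def lookup(self, key):
        return self._classes[key.lower()]


# The package-level registry: special only in that it is used by default.
DEFAULT_REGISTRY = Registry()


class Factory:
    """Hypothetical factory; all class lookups go through its registry."""

    def __init__(self, registry=None):
        self.registry = DEFAULT_REGISTRY if registry is None else registry

    def create(self, key, *args, **kwargs):
        return self.registry.lookup(key)(*args, **kwargs)


class TextHeader(str):
    """Placeholder class registered in the default registry."""


class ShoutingHeader(str):
    """Placeholder class used only by one application's private registry."""


DEFAULT_REGISTRY.register("Subject", TextHeader)

private = Registry()
private.register("Subject", ShoutingHeader)

default_header = Factory().create("subject", "foo")
private_header = Factory(private).create("subject", "foo")
```

The application-owned registry never touches DEFAULT_REGISTRY, which is
the independence described above.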

But that's just my current thought, if anyone can think of a better
design I'm all ears.

I should note that one design concern I have in all this is that it so
far looks like importing email will, under this registry design, end up
importing pretty much *all* of the email classes (and there will be more
of them than in the current package).  I'm so far ignoring that issue,
treating it as a premature optimization, but if anyone has any clever
ideas or other thoughts, let me know.

--David

From paull at qnx.com  Fri Mar  4 16:01:56 2011
From: paull at qnx.com (Paul Leroux)
Date: Fri, 4 Mar 2011 10:01:56 -0500
Subject: [Email-SIG] email6 funding
In-Reply-To: <20110303215231.41b55ddd@neurotica.wooz.org>
References: <20110303014112.A4F3E21AE4B@kimball.webabinitio.net>
	<20110303215231.41b55ddd@neurotica.wooz.org>
Message-ID: <1CF662C832BF6F4AADE933869AB1B701696E24@neptune.ott.qnx.com>

Thanks Barry. QNX will have a booth at Pycon, and Andy will be there.
Feel free to drop by and say hello to him.

- Paul


-----Original Message-----
From: Barry Warsaw [mailto:barry at python.org] 
Sent: March 3, 2011 9:53 PM
To: R. David Murray
Cc: email-sig at python.org; Paul Leroux; Andy Gryc
Subject: Re: [Email-SIG] email6 funding

On Mar 02, 2011, at 08:41 PM, R. David Murray wrote:

>So, now that I've cleared my reply backlog, on to the exciting news.
>
>Some of you may have seen Jesse Noller's retweet of the tweet from
>Paul Leroux of QNX.  This is big news for me (and for the email-sig :):
>QNX wants to fund me to do the email6 development.
>
>We are still working out the details, but I think you can expect to
>see email6 development go into overdrive in the near future.  Like,
>right after PyCon.  We're preparing things at my consulting firm to
>allow me to spend a significant amount of my time working on email6.
>
>I am *seriously* excited by this, and very grateful to QNX.

What can I say other than: AWESOME!  Thanks QNX!

>Anyone interested in an email6 BOF at PyCon or a brainstorming
>session during the Sprints afterward, please let me know.

o/

I probably won't have time to sprint on email this year, but I would
love to have a BOF.

-Barry

From barry at python.org  Fri Mar  4 16:16:00 2011
From: barry at python.org (Barry Warsaw)
Date: Fri, 4 Mar 2011 10:16:00 -0500
Subject: [Email-SIG] email6 funding
In-Reply-To: <1CF662C832BF6F4AADE933869AB1B701696E24@neptune.ott.qnx.com>
References: <20110303014112.A4F3E21AE4B@kimball.webabinitio.net>
	<20110303215231.41b55ddd@neurotica.wooz.org>
	<1CF662C832BF6F4AADE933869AB1B701696E24@neptune.ott.qnx.com>
Message-ID: <20110304101600.4407b12c@neurotica.wooz.org>

On Mar 04, 2011, at 10:01 AM, Paul Leroux wrote:

>Thanks Barry. QNX will have a booth at Pycon, and Andy will be there.
>Feel free to drop by and say hello to him.

I will!

Cheers,
-Barry

From barry at python.org  Fri Mar  4 17:02:28 2011
From: barry at python.org (Barry Warsaw)
Date: Fri, 4 Mar 2011 11:02:28 -0500
Subject: [Email-SIG] API thoughts
In-Reply-To: <20110304133304.861442499B0@kimball.webabinitio.net>
References: <20110301204058.54C96249A9D@kimball.webabinitio.net>
	<20110302154624.5dea1bd7@limelight.wooz.org>
	<20110303012341.DF05922B74A@kimball.webabinitio.net>
	<20110303215559.572fcede@neurotica.wooz.org>
	<20110304133304.861442499B0@kimball.webabinitio.net>
Message-ID: <20110304110228.206f870f@neurotica.wooz.org>

On Mar 04, 2011, at 08:33 AM, R. David Murray wrote:

>Ah, yes.  Well, so far my thought is that there is a global registry
>for the email package itself, but since email package access to that
>registry will be through the 'factory', there is nothing that says that
>has to be the only registry used by an application.  The existence of
>the email package global registry will allow the addition of classes
>to the "default" registry by libraries (if we dare :) and applications,
>while access through the factory means that an application is free
>to manage a completely independent registry if it prefers.  Or perhaps
>it is better to think about the default email package registry as
>just that, the *default* registry, since I think its only specialness
>will be that it is the registry that is used by default.

I think that's a great place to start.

>But that's just my current thought, if anyone can think of a better
>design I'm all ears.
>
>I should note that one design concern I have in all this is that it so
>far looks like importing email will, under this registry design, end up
>importing pretty much *all* of the email classes (and there will be more
>of them than in the current package).  I'm so far ignoring that issue,
>treating it as a premature optimization, but if anyone has any clever
>ideas or other thoughts, let me know.

Yeah, that's a problem.  Maybe we (the Python community) should invest in good
lazy importing support for Python 3.3?  I know that this has been reinvented
several times already.
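
Short of general lazy-import support, a registry can defer the imports
itself by storing dotted names and importing on first lookup.  The
following is only a sketch of that idea; LazyRegistry and its
"module:ClassName" string format are assumptions invented for this
example.

```python
import importlib


class LazyRegistry:
    """Hypothetical registry that imports a class only when first looked up."""

    def __init__(self):
        self._targets = {}  # key -> "package.module:ClassName"
        self._loaded = {}   # key -> class object, filled in on demand

    def register(self, key, target):
        self._targets[key] = target
        self._loaded.pop(key, None)

    def lookup(self, key):
        if key not in self._loaded:
            module_name, _, attr = self._targets[key].partition(":")
            module = importlib.import_module(module_name)
            self._loaded[key] = getattr(module, attr)
        return self._loaded[key]


registry = LazyRegistry()
registry.register("text/plain", "email.mime.text:MIMEText")
registry.register("image/png", "email.mime.image:MIMEImage")

# Only the module behind this entry is imported by the lookup.
text_class = registry.lookup("text/plain")
```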

-Barry

From sdaoden at googlemail.com  Mon Mar  7 21:06:08 2011
From: sdaoden at googlemail.com (Steffen Daode Nurpmeso)
Date: Mon, 7 Mar 2011 21:06:08 +0100
Subject: [Email-SIG] API thoughts
Message-ID: <20110307200608.GA31032@sherwood.local>

I was never involved in discussions, so that the topics i address 
may have been defined for EMAIL 6 already etc., 
but because i've not found anything in the archives of the list 
back in 2010 i add yet another feature request which really 
worries me.

I find the interface a bit inconsistent in respect to 
replace_header() (replaces the first header found), __delitem__() 
(drops them all), __setitem__() (appends) in any case. 
(I personally would throw these __accessor__ things away, they 
taste a bit strange when used to access email payload.)

And i would provide a series of functions which can be used 
to get/set/modify header fields and bodies: 
i would check whether the argument is a list and if so, it would mean 
"all bodies of a field".  This is of course very hard to implement 
if it's done gracefully, i.e. with modification-detection, 
order-preservation etc.

Another, easier to implement, idea would be (yet) an(other) 
iterator which supports in-place editing.  Perfect: it could yield 
a (to be invented) class which offers methods like .field(), 
.bodies() (all [bodies] - maybe even as sub-iterator), 
.remove_field() etc...
Doing it like this would offer the possibility to easily detect 
in-place editing of header bodies etc...

All of these are just suggestions and my very personal point of 
view, of course. 
But one thing is true, and that's that it is currently really hard 
to remove or replace just one body of a field, especially if there 
are multiple bodies for a field. 

-- Steffen Daode

From barry at python.org  Mon Mar  7 23:15:29 2011
From: barry at python.org (Barry Warsaw)
Date: Mon, 7 Mar 2011 17:15:29 -0500
Subject: [Email-SIG] API thoughts
In-Reply-To: <20110307200608.GA31032@sherwood.local>
References: <20110307200608.GA31032@sherwood.local>
Message-ID: <20110307171529.4dc9631c@neurotica.wooz.org>

On Mar 07, 2011, at 09:06 PM, Steffen Daode Nurpmeso wrote:

>I find the interface a bit inconsistent in respect to 
>replace_header() (replaces the first header found), __delitem__() 
>(drops them all), __setitem__() (appends) in any case. 
>(I personally would throw these __accessor__ things away, they 
>taste a bit strange when used to access email payload.)

I personally like this part of the API, and I think it's held up well under
years of use.  In general you don't care about header order, so using various
combinations of del, .get_all(), and __setitem__ works fine.  The semantics of
the message-as-dict API, header ordering, the various header methods, etc. were
thought out and discussed, and I don't have a problem with them.

>And i would provide a series of functions which can be used 
>to get/set/modify header fields and bodies: 
>i would check whether the argument is a list and if so, it would mean 
>"all bodies of a field".  This is of course very hard to implement 
>if it's done gracefully, i.e. with modification-detection, 
>order-preservation etc.
>
>Another, easier to implement, idea would be (yet) an(other) 
>iterator which supports in-place editing.  Perfect: it could yield 
>a (to be invented) class which offers methods like .field(), 
>.bodies() (all [bodies] - maybe even as sub-iterator), 
>.remove_field() etc...
>Doing it like this would offer the possibility to easily detect 
>in-place editing of header bodies etc...
>
>All of these are just suggestions and my very personal point of 
>view, of course. 
>But one thing is true, and that's that it is currently really hard 
>to remove or replace just one body of a field, especially if there 
>are multiple bodies for a field. 

Well, replacing one header while retaining the original order is a bit difficult, but I've
rarely had to do that.  Still, it would probably make sense to add such
functionality -- *if* it can be done without complicating the API or the
implementation.  I think it could too, by adding an index argument to
.replace_header(), and using .get_all() to get an ordered list of the headers
of interest.
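
One way the indexed replacement could behave, sketched as a standalone
helper built only on the existing public Message API.  The name
replace_nth_header and its signature are hypothetical; this is not a
proposed final form of .replace_header().

```python
from email import message_from_string


def replace_nth_header(msg, name, index, value):
    """Replace the index-th occurrence of header `name`, keeping header order.

    `index` counts matches in the same order that msg.get_all(name) returns.
    """
    seen = 0
    replaced = False
    rebuilt = []
    for header, old_value in msg.items():
        if header.lower() == name.lower():
            if seen == index:
                old_value = value
                replaced = True
            seen += 1
        rebuilt.append((header, old_value))
    if not replaced:
        raise IndexError("no occurrence %d of header %r" % (index, name))
    # Drop everything, then re-add in the original order (__setitem__ appends).
    for header in list(msg.keys()):
        del msg[header]
    for header, new_value in rebuilt:
        msg[header] = new_value


msg = message_from_string(
    "From: aperson@example.com\n"
    "X-Header: aardvark\n"
    "To: bperson@example.com\n"
    "X-Header: beaver\n"
    "\n"
)
replace_nth_header(msg, "X-Header", 1, "badger")
```

Deleting a single occurrence could be done the same way, by skipping the
matching pair instead of rewriting it.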

Cheers,
-Barry

From barry at python.org  Mon Mar  7 23:17:33 2011
From: barry at python.org (Barry Warsaw)
Date: Mon, 7 Mar 2011 17:17:33 -0500
Subject: [Email-SIG] unixfrom and __str__()
Message-ID: <20110307171733.79cc269f@neurotica.wooz.org>

One other thing I'm reminded of: we should definitely switch the parity of the
'unixfrom' value in __str__().  IOW, do *not* include the envelope header by
default in str(msg).

-Barry

From sdaoden at googlemail.com  Tue Mar  8 15:32:51 2011
From: sdaoden at googlemail.com (Steffen Daode Nurpmeso)
Date: Tue, 8 Mar 2011 15:32:51 +0100
Subject: [Email-SIG] API thoughts
In-Reply-To: <20110307171529.4dc9631c@neurotica.wooz.org>
References: <20110307171529.4dc9631c@neurotica.wooz.org>
Message-ID: <20110308143251.GA61190@sherwood.local>

Barry Warsaw wrote:
> I personally like this part of the API, and I think it's held up well under
> years of use.

:-) 
msg[f] is indeed and really an elegant and understand-at-a-glance 
way to access headers. 
(Possible restriction: it would be graceful if it would return and 
take a list.)

> Well, replace one header retaining original order is a bit difficult, but I've
> rarely had to do that.
[...]
> I think it could too, by adding an index argument to .replace_header(),
> and using .get_all() to get an ordered list of the headers of interest.

... and give me a way to also delete just one body of a field and 
i'll be lucky. 
Maybe simply 'Message._headers = {normalized_field = [bodies]}'? 
But, why not .delete_all_of(0, 2, 5), realized by a walk in equal 
spirit to .get_all().

(My thought was that a new Proxy class can be added very easily, 
requiring only one new method in Message and 
without affecting the remaining interface, 
whatever status David's local EMAIL 6 branch is currently in and 
whatever approach he will have chosen in the end.

Anyway, and unless i missed something, this is the current way:

    def _bewitch_msg(self):
        """Handle Python 3.2.0/3.3a0 issue 11401 email/message.py error"""
        if sys.hexversion > 0x030300A1 or sys.hexversion > 0x030200F1:
            return

        for f in self._msg:
            had_repl = False
            new_ab = []
            ab = self._msg.get_all(f)
            for b in ab:
                if not len(b):
                    had_repl = True
                    b = ' '
                new_ab.append(b)
            if had_repl:
                del self._msg[f]
                for b in new_ab:
                    self._msg[f] = b

At best the very same could be achieved (faster and with smaller 
memory footprint):

        for p in self._msg.proxy_iter():
            for (idx, body) in p:
                if not len(body):
                    p[idx] = ' '
)

From barry at python.org  Tue Mar  8 18:10:51 2011
From: barry at python.org (Barry Warsaw)
Date: Tue, 8 Mar 2011 12:10:51 -0500
Subject: [Email-SIG] API thoughts
In-Reply-To: <20110308143251.GA61190@sherwood.local>
References: <20110307171529.4dc9631c@neurotica.wooz.org>
	<20110308143251.GA61190@sherwood.local>
Message-ID: <20110308121051.35b81289@neurotica.wooz.org>

On Mar 08, 2011, at 03:32 PM, Steffen Daode Nurpmeso wrote:

>msg[f] is indeed and really an elegant and understand-at-a-glance 
>way to access headers. 
>(Possible restriction: it would be graceful if it would return and 
>take a list.)

Actually, I disagree. :)  From experience, look at .get_payload().  It tries
to manage both scalar payloads and list payloads (for multiparts), and it
sucks.  In hindsight (and email6) I hope that .get_payload() will be split
into separate API methods, one for simple payloads like image or audio data,
and another for multipart access.

So for headers, I think setitem/getitem/delitem should be reserved for simple
manipulation with well defined semantics (as it currently is <wink>), and new
API methods should be added for full access to headers when multiple ones are
present.

>... and give me a way to also delete just one body of a field and 
>i'll be lucky. 

That's a good idea too.

>Maybe simply 'Message._headers = {normalized_field = [bodies]}'? 

I'm not sure what that means, but yeah, you definitely don't want to be
messing with that private attribute.

>But, why not .delete_all_of(0, 2, 5), realized by a walk in equal 
>spirit to .get_all().
>
>(My thought was that a new Proxy class can be added very easily, 
>requiring only one new method in Message and 
>without affecting the remaining interface, 
>whatever status David's local EMAIL 6 branch is currently in and 
>whatever approach he will have chosen in the end.

It's an interesting idea.  Why don't you flesh that out and propose something
concrete, with a working implementation if possible?

Anyway, rewriting headers is not that hard:

#! /usr/bin/env python3
from email import message_from_string as mfs

msg = mfs("""\
From: aperson at example.com
X-Header: aardvark
To: bperson at example.com
X-Header: beaver
Subject: foo
X-Header: cougar
X-Header: dingo

""")

def yummy_toppings():
    for topping in ('duck', 'cheese', 'black olive', 'anchovy'):
        yield topping
toppings = yummy_toppings()


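# Collect (name, value) pairs in original order, with new X-Header values.
new_headers = []

for header, value in msg.items():
    if header.lower() == 'x-header':
        new_headers.append(('X-Header', next(toppings)))
    else:
        new_headers.append((header, value))


# Delete every header; iterate over a copy because del mutates the message.
for header in list(msg.keys()):
    del msg[header]

# Re-add the headers in order; __setitem__ appends.
for header, value in new_headers:
    msg[header] = value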

print(msg)

Cheers,
-Barry

From sdaoden at googlemail.com  Thu Mar 24 17:10:11 2011
From: sdaoden at googlemail.com (Steffen Daode Nurpmeso)
Date: Thu, 24 Mar 2011 17:10:11 +0100
Subject: [Email-SIG] I miss size() (and some latest frustration)
Message-ID: <20110324161010.GD69753@sherwood.local>

I'm stressing this list again, but i stumbled over a missing 
[message_]size(). 
http://wiki.python.org/moin/Email%20SIG/DesignThoughts makes it 
a prerequisite for the new EMail package that

    The API needs to at a minimum have hooks available for an 
    application to store data on disk rather than holding 
    everything in memory.

It would be great if the message (file) size would also be 
provided as a public method, so that code-flow decisions can be 
made dependent upon the plain size of a message. 
(The size is known without parsing for many real-life message 
objects anyway or can be detected *cheap*.  True, e.g., for 
all Message objects which are created by mailbox.py.)

It's also so unfortunate that 'headersonly' of Parser is in fact 
treated as "a backwards compatibility hack", effectively consuming 
the entire input nonetheless. 
And *DesignThoughts* treats lazy parsing/partial loading as an 
"interesting idea" only, though i can think about many cases where 
it is a good thing to parse a Message{Headers[/Part/Part/Part...]} 
sequentially.

E.g., why should a spam detector load an entire message if it only 
wants to check addresses against some white-/blacklists and simply 
throw away bad hits. 
Even more, why should a company's dispatcher read all the content 
if it's only about to rewrite addresses and dispatch the mail to 
some other internal server. 
(Of course - hey, it's you, you know *much* more about this stuff 
than i do.)

Waiting is an electric experience ...
Have fun.

--
Steffen Daode Nurpmeso <sdaoden at gmail.com>
:wq steffen


From barry at python.org  Thu Mar 24 22:41:49 2011
From: barry at python.org (Barry Warsaw)
Date: Thu, 24 Mar 2011 17:41:49 -0400
Subject: [Email-SIG] I miss size() (and some latest frustration)
In-Reply-To: <20110324161010.GD69753@sherwood.local>
References: <20110324161010.GD69753@sherwood.local>
Message-ID: <20110324174149.78391d3a@neurotica.wooz.org>

On Mar 24, 2011, at 05:10 PM, Steffen Daode Nurpmeso wrote:

>It would be great if the message (file) size would also be 
>provided as a public method, so that code-flow decisions can be 
>made dependent upon the plain size of a message. 
>(The size is known without parsing for many real-life message 
>objects anyway or can be detected *cheap*.  True, e.g., for 
>all Message objects which are created by mailbox.py.)

Certainly the normal FeedParser will see every byte of the message, even if it
does save parts of it on disk.  Mailman 3's LMTP server also sees every byte
and tucks the size away on an .original_size attribute of its Message
subclass.

But how would you handle it when you are creating the message yourself?  I
think there are too many places you'd have to hook to get an accurate reading,
or you'd have to essentially serialize it via a generator before you'd know,
so it's less than helpful.

It may indeed be possible to ask some external process what the size of the
message is, but it would likely be a hint you couldn't necessarily trust.
(I.e. the server might only have an approximate size.)

So, I'm not sure whether the email package can have a consistent notion of a
message's 'size'.  Perhaps though it ought to define an attribute for when the
message is created by a parser, but let it be writable so that e.g. your
application could get it from an IMAP server or whatever, and stick it in the
attribute.
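
For a message built in memory, the only reliable number is the one you
get by serializing.  A minimal sketch of that, using the existing
BytesGenerator; the helper name serialized_size is an assumption made up
for this example.

```python
import io
from email.generator import BytesGenerator
from email.message import Message


def serialized_size(msg):
    """Return the byte count of msg as written by BytesGenerator.

    This flattens the whole message, so it costs as much as sending it.
    """
    buf = io.BytesIO()
    BytesGenerator(buf, mangle_from_=False).flatten(msg)
    return len(buf.getvalue())


msg = Message()
msg["Subject"] = "foo"
msg.set_payload("hello\n")
size = serialized_size(msg)
```

The result is only valid until the message is next modified, which is
why a cached 'size' attribute is hard to keep consistent.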

>It's also so unfortunate that 'headersonly' of Parser is in fact treated as
>"a backwards compatibility hack", effectively consuming the entire input
>nonetheless.  And *DesignThoughts* treats lazy parsing/partial loading as an
>"interesting idea" only, though i can think about many cases where it is a
>good thing to parse a Message{Headers[/Part/Part/Part...]}  sequentially.
>
>E.g., why should a spam detector load an entire message if it only wants to
>check addresses against some white-/blacklists and simply throw away bad
>hits.  Even more, why should a company's dispatcher read all the content if
>it's only about to rewrite addresses and dispatch the mail to some other
>internal server.  (Of course - hey, it's you, you know *much* more about this
>stuff than i do.)

Do you have suggestions for how the email package can help with these use
cases?  Do you have specific API or implementation proposals?

Cheers,
-Barry

From v+python at g.nevcal.com  Thu Mar 24 23:54:48 2011
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Thu, 24 Mar 2011 15:54:48 -0700
Subject: [Email-SIG] I miss size() (and some latest frustration)
In-Reply-To: <20110324174149.78391d3a@neurotica.wooz.org>
References: <20110324161010.GD69753@sherwood.local>
	<20110324174149.78391d3a@neurotica.wooz.org>
Message-ID: <4D8BCBB8.7050300@g.nevcal.com>

On 3/24/2011 2:41 PM, Barry Warsaw wrote:
> On Mar 24, 2011, at 05:10 PM, Steffen Daode Nurpmeso wrote:
>
>> It would be great if the message (file) size would also be
>> provided as a public method, so that code-flow decisions can be
>> made dependent upon the plain size of a message.
>> (The size is known without parsing for many real-life message
>> objects anyway or can be detected *cheap*.  True, e.g., for
>> all Message objects which are created by mailbox.py.)
> Certainly the normal FeedParser will see every byte of the message, even if it
> does save parts of it on disk.  Mailman 3's LMTP server also sees every byte
> and tucks the size away on an .original_size attribute of its Message
> subclass.
>
> But how would you handle it when you are creating the message yourself?  I
> think there are too many places you'd have to hook to get an accurate reading,
> or you'd have to essentially serialize it via a generator before you'd know,
> so it's less than helpful.
>
> It may indeed be possible to ask some external process what the size of the
> message is, but it would likely be a hint you couldn't necessarily trust.
> (I.e. the server might only have an approximate size.)
>
> So, I'm not sure whether the email package can have a consistent notion of a
> message's 'size'.  Perhaps though it ought to define an attribute for when the
> message is created by a parser, but let it be writable so that e.g. your
> application could get it from an IMAP server or whatever, and stick it in the
> attribute.

When created by a parser, it could have the notion of size-seen-so-far, 
or bytes-fed.  Once the whole message has been processed, the size of 
the message would be known, as well as of each piece.

Incomplete messages, such as those from IMAP servers for which only 
partial requests have been made for pieces, could only get the concept 
of "total size" from the server, if it provides it.  Since POP servers 
do, I think IMAP would also, but I'm not an IMAP expert.

>> It's also so unfortunate that 'headersonly' of Parser is in fact treated as
>> "a backwards compatibility hack", effectively consuming the entire input
>> nonetheless.  And *DesignThoughts* treats lazy parsing/partial loading as an
>> "interesting idea" only, though i can think about many cases where it is a
>> good thing to parse a Message{Headers[/Part/Part/Part...]}  sequentially.
>>
>> E.g., why should a spam detector load an entire message if it only wants to
>> check addresses against some white-/blacklists and simply throw away bad
>> hits.  Even more, why should a company's dispatcher read all the content if
>> it's only about to rewrite addresses and dispatch the mail to some other
>> internal server.  (Of course - hey, it's you, you know *much* more about this
>> stuff than i do.)
> Do you have suggestions for how the email package can help with these use
> cases?  Do you have specific API or implementation proposals?

For message parsing, it seems like allowing registered callbacks for 
various pieces would be handy... "Call me when you parse this type of a 
header" (or body part, etc.).
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/email-sig/attachments/20110324/6b84db9d/attachment.html>

From sdaoden at googlemail.com  Fri Mar 25 20:15:17 2011
From: sdaoden at googlemail.com (Steffen Daode Nurpmeso)
Date: Fri, 25 Mar 2011 20:15:17 +0100
Subject: [Email-SIG] I miss size() (and some latest frustration)
In-Reply-To: <20110324174149.78391d3a@neurotica.wooz.org>
References: <20110324174149.78391d3a@neurotica.wooz.org>
Message-ID: <20110325191517.GE86511@sherwood.local>

On Thu, Mar 24, 2011 at 05:41:49PM -0400, Barry Warsaw wrote:
> On Mar 24, 2011, at 05:10 PM, Steffen Daode Nurpmeso wrote:
> So, I'm not sure whether the email package can have a consistent notion of a
> message's 'size'.

> Do you have suggestions for how the email package can help with these use
> cases?  Do you have specific API or implementation proposals?

An incremental package must of course have a notion of a "current 
state of a message", so that all methods of an object must first 
check whether they're applicable - anyway!? 
Methods which can be used in multiple states need to document how 
they react in each of those anyway (if behaviour changes). 

So that there may be .current_parse_state() returning 
a to-be-defined enum. 
Or size() may return a tuple (Bool_is_final_size, current_size) 
(but that's really ugly). 

Beside size(), the most simple way would be to extend the 
FeedParser so that it could stop in a defined way at all 
boundaries of a message (i.e. Headers,Part,Part...). 
That would be a state(). 
It would need to be restartable, i.e., .close() may remain and 
return an entire message, but .last_part() or so/etc. must be 
added.  .feed() must return something useful, too.  E.g.:

    dataf = SOMERAWDATA.get_fileobject()
    while 1:
        l = dataf.readline()
        ..
        parser_state = fp.feed(l)
        if parser_state == fp.BOUNDARY_SEEN:
            ..
            break
        ..
    # This is a header object
    # (Or, simply: Message without payload)
    headerobject = fp.get_headers()
    rewrite_headers(headerobject)
    datachunk = prepare_as_sendfile_header_object(headerobject)
    call_sendfile_with_headers_and_unchanged_rest_of_dataf

Interestingly FeedParser has almost all capabilities which are 
required to do all that internally, but it does not offer it to 
the outside.  8-)
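
With the FeedParser as it exists today, the caller can approximate the
"stop after the headers" case by deciding itself when to stop feeding.
The sketch below assumes a text-mode file object; the helper name
parse_header_block is made up for this example, and it handles only the
top-level header block, not per-part boundaries.

```python
import io
from email.feedparser import FeedParser


def parse_header_block(fileobj):
    """Feed lines up to the blank separator line, then stop reading.

    Returns a Message that carries the headers and an empty payload.
    The file object is left positioned at the start of the body.
    """
    parser = FeedParser()
    while True:
        line = fileobj.readline()
        if line in ("", "\n", "\r\n"):
            break  # EOF, or the blank line that ends the header block
        parser.feed(line)
    return parser.close()


raw = io.StringIO(
    "From: aperson@example.com\n"
    "To: bperson@example.com\n"
    "Subject: foo\n"
    "\n"
    "body line 1\n"
    "body line 2\n"
)
headers = parse_header_block(raw)
rest = raw.read()  # the untouched body, ready to be passed on as-is
```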

Anyway, EMail is capable of many things, but it does not expose 
them to the outside, so that one gets stuck soon if a special task 
is to be performed.  email.message_from_xy() is a fantastic 
abstraction of a complex set of RFC's and real-life potholes. 
On the other hand a programming package is not a shelter - you 
can mess up any package which goes beyond some message_from_xy(). 
So i really think that it is acceptable to offer an interface 
which gives you access to partially constructed objects as long as 
it is well-defined in some manner.

--
Steffen Daode Nurpmeso <sdaoden at gmail.com>
:wq steffen


From sdaoden at googlemail.com  Fri Mar 25 20:19:21 2011
From: sdaoden at googlemail.com (Steffen Daode Nurpmeso)
Date: Fri, 25 Mar 2011 20:19:21 +0100
Subject: [Email-SIG] I miss size() (and some latest frustration)
In-Reply-To: <4D8BCBB8.7050300@g.nevcal.com>
References: <4D8BCBB8.7050300@g.nevcal.com>
Message-ID: <20110325191921.GA29700@sherwood.local>

On Thu, Mar 24, 2011 at 03:54:48PM -0700, Glenn Linderman wrote:
> For message parsing, it seems like allowing registered callbacks 
> for various pieces would be handy... "Call me when you parse this 
> type of a header" (or body part, etc.).

A completely different idea, but i also like it. 
I remember that DOM did not even rock a bit until SAX came up.

From barry at python.org  Fri Mar 25 21:10:03 2011
From: barry at python.org (Barry Warsaw)
Date: Fri, 25 Mar 2011 16:10:03 -0400
Subject: [Email-SIG] I miss size() (and some latest frustration)
In-Reply-To: <4D8BCBB8.7050300@g.nevcal.com>
References: <20110324161010.GD69753@sherwood.local>
	<20110324174149.78391d3a@neurotica.wooz.org>
	<4D8BCBB8.7050300@g.nevcal.com>
Message-ID: <20110325161003.496e418d@neurotica.wooz.org>

On Mar 24, 2011, at 03:54 PM, Glenn Linderman wrote:

>When created by a parser, it could have the notion of size-seen-so-far, or
>bytes-fed.  Once the whole message has been processed, the size of the
>message would be known, as well as of each piece.

It makes sense to record this in the Message objects, but I'd want to be very
careful about what that attribute is called.  Using just 'size' could be
misleading, either because parsing has not completed, or because users might
think that it's an exact count of the serialized size.  Something like
'parsed_byte_count' might be okay though.

>Incomplete messages, such as those from IMAP servers for which only partial
>requests have been made for pieces, could only get the concept of "total
>size" from the server, if it provides it.  Since POP servers do, I think IMAP
>would also, but I'm not an IMAP expert.

In a case like that, an attribute such as 'server_reported_size' or some such
would be okay.

>For message parsing, it seems like allowing registered callbacks for various
>pieces would be handy... "Call me when you parse this type of a header" (or
>body part, etc.).

I think David's design documents allow for extensions and callbacks based
on the content-types of things seen.

Cheers,
-Barry

From glenn at nevcal.com  Fri Mar 25 22:19:00 2011
From: glenn at nevcal.com (Glenn Linderman)
Date: Fri, 25 Mar 2011 14:19:00 -0700
Subject: [Email-SIG] I miss size() (and some latest frustration)
In-Reply-To: <20110325161003.496e418d@neurotica.wooz.org>
References: <20110324161010.GD69753@sherwood.local>	<20110324174149.78391d3a@neurotica.wooz.org>	<4D8BCBB8.7050300@g.nevcal.com>
	<20110325161003.496e418d@neurotica.wooz.org>
Message-ID: <4D8D06C4.9010008@nevcal.com>

On 3/25/2011 1:10 PM, Barry Warsaw wrote:
>> For message parsing, it seems like allowing registered callbacks for various
>> pieces would be handy... "Call me when you parse this type of a header" (or
>> body part, etc.).
> I think David's design documents allow for extensions and callbacks based
> on the content-types of things seen.

I recall registration of handlers for various MIME types.  I don't 
recall callbacks (registered handlers) being available for header 
parsing, but no time to find and reread at the moment.  Would be a good 
idea, though.  Also, callbacks should have the capability to stop the 
parse.  That technique could be used to implement "only parse headers" 
also, but it might be nicer to implement that as a flag when parsing starts.

Along this line, if parsing is stopped, it would be nice to be able to 
retrieve the unparsed data for alternate use.  Some is likely to have 
been already retrieved from whatever data stream and passed as a 
"chunk" to the parser; an early-out would leave a "partial chunk" that 
hasn't been processed, but may want to be processed by some other 
entity, even if only for logging or error reporting.
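
(Illustration only: a minimal sketch of the early-out idea, assuming a
chunk-fed parser that stops at the end of the headers.  The class and
attribute names are hypothetical and are not part of the email package.)

```python
class HeaderOnlyFeeder:
    """Toy chunk-fed parser that stops at the blank line ending the headers."""

    SEPARATOR = b"\r\n\r\n"

    def __init__(self):
        self._buffer = b""
        self.header_block = None  # raw header bytes, set once parsing stops
        self.unparsed = b""       # everything fed but never parsed

    def feed(self, chunk):
        """Return True while more data is wanted, False once stopped."""
        if self.header_block is not None:
            # Already stopped: keep late chunks available for other uses.
            self.unparsed += chunk
            return False
        self._buffer += chunk
        # Searching the whole buffer copes with a separator split over chunks.
        head, sep, rest = self._buffer.partition(self.SEPARATOR)
        if not sep:
            return True
        self.header_block = head
        self.unparsed = rest  # the "partial chunk" left over by the early-out
        self._buffer = b""
        return False
```

After feed() returns False, the caller can hand ``unparsed`` to some other
consumer, or just log it.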

-- 
Glenn
------------------------------------------------------------------------
Experience is that marvelous thing that enables you to recognize a
mistake when you make it again. -- Franklin Jones

From rdmurray at bitdance.com  Fri Mar 25 22:25:20 2011
From: rdmurray at bitdance.com (R. David Murray)
Date: Fri, 25 Mar 2011 17:25:20 -0400
Subject: [Email-SIG] I miss size() (and some latest frustration)
In-Reply-To: <20110325161003.496e418d@neurotica.wooz.org>
References: <20110324161010.GD69753@sherwood.local>
	<20110324174149.78391d3a@neurotica.wooz.org>
	<4D8BCBB8.7050300@g.nevcal.com>
	<20110325161003.496e418d@neurotica.wooz.org>
Message-ID: <20110325212521.3FBFB1454A2@kimball.webabinitio.net>

On Fri, 25 Mar 2011 16:10:03 -0400, Barry Warsaw <barry at python.org> wrote:
> >For message parsing, it seems like allowing registered callbacks for various
> >pieces would be handy... "Call me when you parse this type of a header" (or
> >body part, etc.).
> 
> I think David's design documents to allow for extensions and callbacks based
> on the content-types of things seen.

Effectively, yes.  The idea is that there is a factory that gets called
whenever a mime content type or a header is instantiated, so that
factory can do whatever magic it would like.  The standard factories will
have a lookup table for the factories for individual types, so you can
alternately use a copy of the standard factory with just the headers or
mime types you are interested in hooked.
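
(Illustration only: one way the lookup-table arrangement could look.
This is a sketch under assumptions; ``HeaderFactory`` and ``copy_with``
are made-up names, not the design that will be implemented.)

```python
class HeaderFactory:
    """Toy factory: use a per-name factory if registered, else a default."""

    def __init__(self, table=None, default=str):
        self._table = dict(table or {})
        self._default = default

    def copy_with(self, hooks):
        """Return a copy with the factories for some header names replaced."""
        table = dict(self._table)
        table.update({name.lower(): hook for name, hook in hooks.items()})
        return HeaderFactory(table, self._default)

    def __call__(self, name, value):
        # Header names are case-insensitive, so look them up lowercased.
        factory = self._table.get(name.lower(), self._default)
        return factory(value)


def shout(value):
    """Example hook: any callable taking the raw value will do."""
    return value.upper()


standard = HeaderFactory()
hooked = standard.copy_with({"Subject": shout})
```

Here ``hooked`` treats Subject specially and handles every other header
the way ``standard`` does, while ``standard`` itself is left untouched.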

We'll want to refine the design when I get near to actually implementing
it.

--
R. David Murray           http://www.bitdance.com

From sdaoden at googlemail.com  Sat Mar 26 16:56:53 2011
From: sdaoden at googlemail.com (Steffen Daode Nurpmeso)
Date: Sat, 26 Mar 2011 16:56:53 +0100
Subject: [Email-SIG] I miss size() (and some latest frustration)
In-Reply-To: <20110324174149.78391d3a@neurotica.wooz.org>
References: <20110324174149.78391d3a@neurotica.wooz.org>
Message-ID: <20110326155653.GA44697@sherwood.local>

    First of all i have to say that i am sooo proud of myself
    that this mail manages to get addressed correctly right away!
    Wow!  (Or WAU! WAU! as those four-legged germans would say;)
    Thanks for your understanding.

On Thu, Mar 24, 2011 at 05:41:49PM -0400, Barry Warsaw wrote:
> Certainly the normal FeedParser will see every byte of the
> message, even if it does save parts of it on disk.  Mailman 3's
> LMTP server also sees every byte

I'm afraid of it, and i hate it from the bottom of my heart, but 
it is to be expected that EMail 6 will see times where mails 
actually contain entire 3-D Blockbusters as MIME attachments. 
And the truth will not be far from that.

Thus i personally would really vote for the possibility that 
parsing can be stopped at defined boundaries so that

    # yet_parsed_object stands for whatever the parser has built so far
    target_file.write(yet_parsed_object.data())
    while True:
        x = source_file.read(8192)
        if not x:
            break
        target_file.write(x)

can be used directly (i.e. no swallowed boundary line).
Hooks are a fine thing but they are on the wrong side of the story 
for this kind of problem (unless you have full, i.e. linewise, 
control of the input side, too, and set one flag here and there.)

Have a nice weekend - it's cherry blossom, and it smells fantastic!

--
Steffen Daode Nurpmeso <sdaoden at gmail.com>
:wq steffen


From rdmurray at bitdance.com  Tue Mar 29 01:39:21 2011
From: rdmurray at bitdance.com (R. David Murray)
Date: Mon, 28 Mar 2011 19:39:21 -0400
Subject: [Email-SIG] Email6 repository, and policy framework first draft
Message-ID: <20110328233921.70B58D64A7@kimball.webabinitio.net>

I've set up the feature branch for email6:

    http://hg.python.org/features/email6

The branch inside the repo is email6.  I'll probably wind up having
subbranches unless my proposals get approved quickly :)

So far I've checked in the first draft of my proposal for the policy
framework.  I've blogged about this:

    http://www.bitdance.com/blog/2011/03/28_01_Policy_Framework_First_Draft/

Here's the text version of the blog post:


2011-03-28 Policy Framework First Draft
=======================================

Last week turned out to be mostly about tests and bugs.  As per my last
post, I moved the tests into a test package.  Then I went on to add a
bunch of `additional tests`_ developed by Michael Henry at the PyCon sprints.
More tests are always good before starting to modify code, right?

.. _additional tests: http://bugs.python.org/issue11589

Michael's tests had revealed a couple bugs, though, so I then went on to
apply the `fix`_ for those bugs, which included a `rewritten algorithm`_
for encoding strings as quoted printable.  I adapted the algorithm
proposed by Michael, then discovered a different and probably `better
algorithm`_ had already been proposed a while back and gotten lost in the
tracker.  That proposed patch was against the email package in Python2,
though, and the corresponding code in Python3 has a different interface,
so the patch wasn't easily adapted.  Since there are other changes
that need to be made to the quoted printable encoder, I have deferred
implementing the better algorithm until I get as far as touching that
code for the email6 work.

.. _fix: http://bugs.python.org/issue11590
.. _rewritten algorithm: http://bugs.python.org/issue11606
.. _better algorithm: http://bugs.python.org/issue5803

There was also a `bug`_ in the Email5 API that I wanted to fix before
starting to make API changes.  When you deal with "dirty" headers in
Email5.1, you may get back a ``Header`` object when querying a header.
Now, the normal way to deal with crazy headers in Email5 is to pass them
to ``decode_header`` to get the pairs of character sets and original bytes
from the wire out.  But ``decode_header`` wasn't accepting a ``Header``
object for ``decoding``.  My first approach was to try shifting back to
returning strings even when the header was "dirty", by wrapping them up
in encoded words with the ``unknown-8bit`` charset.  That more or less
worked, but doing it that way would mean making some other changes
to methods such as ``get_param`` to handle headers that had gotten
re-encoded into encoded words.  This was far from optimal.  The reporter
of the bug pointed out that I had carefully documented that ``Message``
would return a ``Header`` if the source header had unencoded non-ASCII
bytes in it, which made changing this behavior in a bug fix release
a non-starter.  So I gave in and just fixed ``decode_header`` to handle
``Header`` objects.  Since *all* headers in email6 will be a (new type of)
``Header`` object, programmers may as well get used to dealing with them.

.. _bug: http://bugs.python.org/issue11584

For email6 itself, there is now a `feature branch`_ where I will do
the patch development for email6 before applying the changes to the
main cpython repository.  The branch is named ``email6``, of course.
Anyone may browse or clone this repository to take a look at the current
state of development.

.. _feature branch: http://hg.python.org/features/email6

And that current state is that I have checked in the first draft of
the Policy framework.  This consists of a new module, `policy.py`_,
the associated documentation, `policy.rst`_, and a set of tests,
`test_policy.py`_.

.. _policy.py: http://hg.python.org/features/email6/file/email6/Lib/email/policy.py
.. _policy.rst: http://hg.python.org/features/email6/file/email6/Doc/library/email.policy.rst
.. _test_policy.py: http://hg.python.org/features/email6/file/email6/Lib/test/test_email/test_policy.py

The basic idea is that a ``Policy`` object is an immutable container
for a bunch of attributes and callback hooks.  You can call a ``Policy``
object to get a new one with some of the defaults changed.  And you can
add them together, with the non-default settings from the right operand
overriding those from the left operand.
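
To make those three behaviors concrete, here is a toy sketch.  It is not
the code in ``policy.py``; the class name and the attribute names are
assumptions chosen for illustration only.

```python
class SketchPolicy:
    """Toy immutable policy: attributes are fixed once constructed."""

    # Class-level defaults; these attribute names are illustrative only.
    linesep = "\n"
    max_line_length = 78
    raise_on_defect = False

    def __init__(self, **overrides):
        for name, value in overrides.items():
            if not hasattr(type(self), name):
                raise TypeError(f"unknown policy attribute: {name!r}")
            # Bypass our own __setattr__, which refuses all writes.
            object.__setattr__(self, name, value)

    def __setattr__(self, name, value):
        raise AttributeError("policy objects are immutable")

    def __call__(self, **changes):
        """Return a new policy with some of the settings changed."""
        return type(self)(**{**self.__dict__, **changes})

    def __add__(self, other):
        """Combine two policies; explicit settings on the right win."""
        return self(**other.__dict__)


smtp = SketchPolicy(linesep="\r\n")
strict = SketchPolicy(raise_on_defect=True)
strict_smtp = smtp + strict  # "\r\n" line endings and raise_on_defect=True
```

Only the explicitly overridden settings live in the instance, which is what
lets addition tell "non-default" values apart from inherited defaults.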

So far we have policies such as:

    * default
    * SMTP
    * HTML
    * Strict

*default* may get renamed *email6*. I'd prefer 'default', since that's
what I'd like it to be by the time we get to Python 3.4.  The actual
default policy when I start adding the parameter to other classes and
functions will be *email5*, though, so the name *default* for email6 is
probably not going to work.

The *SMTP* policy is just like default, but generates "wire format" line
separators (``\r\n``).  *HTML* is like *SMTP*, but does not wrap headers.
*Strict* sets a flag that will (once I implement it) cause the parser to
raise errors when it encounters defects instead of just keeping track
of them.  Using *Strict* is where you can see the utility of adding
policies together::

    >>> StrictSMTP = SMTP + Strict

You could use StrictSMTP to parse an incoming SMTP message where you
wanted your program to blow up if the message was invalid.  (When would
you ever want that?  I don't know, but someone probably will!).
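
For comparison, here is a sketch written against the ``email.policy``
module as it later shipped in the standard library (Python 3.3 and
later), which differs in detail from this first draft: the strict policy
there is spelled ``policy.strict``.  The helper name and the sample
message are made up for the example.

```python
import email
from email import errors, policy

# Same idea as SMTP + Strict, using the names from the shipped module.
strict_smtp = policy.SMTP + policy.strict

# A multipart message with no boundary parameter counts as a defect.
BROKEN = "Content-Type: multipart/mixed\n\nbody\n"


def parse_strictly(text):
    """Return None on success, or the defect that stopped the parse."""
    try:
        email.message_from_string(text, policy=strict_smtp)
    except errors.MessageDefect as defect:
        return defect
    return None
```

Under the combined policy the parser raises the defect instead of
recording it on the message.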

So far I've only defined one hook, ``register_defect``.  You could
subclass ``Policy`` and define your own ``register_defect`` method that
would, say, log all defects to a log file, thus giving you some idea of
the quality of the email being processed by your program, even if you
did nothing else with the defect info.
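
A minimal sketch of that logging idea, again assuming the ``email.policy``
module as it later shipped in the standard library (Python 3.3 and
later), where the hook has the signature ``register_defect(obj, defect)``.
The logger name, class name and helper are hypothetical.

```python
import email
import logging
from email import policy

log = logging.getLogger("email_defects")  # hypothetical logger name


class LoggingPolicy(policy.EmailPolicy):
    """Log each defect, then record it on the message as usual."""

    def register_defect(self, obj, defect):
        log.warning("defect: %s", type(defect).__name__)
        super().register_defect(obj, defect)


def parse_logged(text):
    """Parse text with LoggingPolicy and return the message object."""
    return email.message_from_string(text, policy=LoggingPolicy())
```

Because the override still calls the base implementation, the defects
remain available on the parsed message as well as in the log.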

Now we'll see what the Email SIG thinks of this implementation, and
meanwhile I'll be adding policy arguments to the parser and generator
classes.