From rdmurray at bitdance.com Tue Mar 1 21:40:57 2011 From: rdmurray at bitdance.com (R. David Murray) Date: Tue, 01 Mar 2011 15:40:57 -0500 Subject: [Email-SIG] API thoughts Message-ID: <20110301204058.54C96249A9D@kimball.webabinitio.net> This is a long email, for which my apologies. I hope you all will manage to find some time to read it and provide feedback, as it speaks to fundamental design issues. My subconscious seems to have been very busy last night, since in the shower this morning it presented me with a whole bunch of thoughts about the email API. This was triggered, I think, by Barry's question about __version__, my response that we might want an 'api version' declaration, and some comments made during the email 5.1 discussion by Steven D'Arapano (I think) about how Message is really the idealized representation of an email message. Let me start by saying that I think we can all agree that the fundamental design of the email package is excellent: we have a Parser which handles taking input from the outside world and turning it into a Message, and we have a Generator which handles taking a Message and turning it into something the outside world can handle. In the focus of the original development the "outside world" was, sensibly, RFC 822/2822 encoded byte streams. The idealized message consists of some meta information (addressee, recipient, date, etc, etc) and a body. The body, the content, can be arbitrarily complex. The purpose of the message is to convey some of that meta information and all of the arbitrarily complex body content from the sender to the recipient. Everything else is an implementation detail :) So, if we are writing a program and we want to compose such a message, it makes sense that we can build up this idealized message from its component pieces by attaching objects representing those pieces to the Message. At that stage we care nothing about how it needs to be transformed to get from point A to point B. 
If we want to look at a message, we again don't care about how it was transformed to get from point A to point B, we just want to be able to access the content in its original form.

In today's "outside world" we have more to worry about than just RFC822/2822/5322. The "outside world" could be an http transmission medium. It could (if we re-design things right:) be a SIP session. It could be a disk-based data store, where an RFC822-like message format is being used to store data. I'm sure there are other contexts as well.

So keeping the external representation concerns separate from the idealized message model makes sense. The email4/5 API doesn't do this as successfully as it could, especially in a Python3 context.

The application program dealing with the idealized message doesn't really care what character set any given piece of a header is encoded in, it really just wants to deal with complete unicode strings. The application program also really doesn't care about the MIME type of a piece of content, it just wants to manage an object that has methods that allow it to manipulate that image, or that audio file, or what have you. Of course, it also needs to know what type of object it is handling in an incoming message, but the mime type is only one piece of the information that determines that (albeit usually the most important one).

(Yes, some applications *do* care about internal details...but those are special cases and we can provide additional APIs that allow access at that level for those applications that need it, as we have discussed previously.)

We propose to create a new API to make all of this easier for the application programmer. What doesn't change is the fundamental structure of the package: a message in some transmission format is fed to a Parser, which produces a Message object. A Message object can be fed to a Generator, which produces a transmission format object.
Now, I lost sight of this a bit while I was working on the email6 header classes, as Barry at least will remember, but I do think it is important, and I want to keep it in the forefront of my mind as I work on adding the proposed policy framework.

So, and here is the point of this email, how does the policy framework integrate into this design? I said that the policy pulls together the tunable bits of the email package's algorithms. What does this mean? What are the tunable bits? Here are some candidates:

    maximum header line length on serialization
    line ending character on serialization
    whether or not to raise an exception if a defect is encountered during parsing
    how much transformation of untouched original data is permissible when re-serializing a message
    can the serialized form contain any non-ASCII data?
    what classes to use to represent various MIME types.

These are all decisions that can be made one way or another by an application program using the current package. Often, however, modifying the default is not easy or convenient. Note that the last one can only be decided by an application program when constructing a message, not when parsing one.

Here are some other things that it might be useful to be able to control:

    what string to use as the continuation whitespace when needing to add some
    what classes to use to represent various structured headers
    what exactly counts as a defect
    should headers be RFC2047 encoded on serialization, or should another encoding be used?[*]

    [*] There are current real-world use cases for this: there are nntp servers that use utf-8 for headers, and the http protocol uses latin-1 (or sometimes, I think, utf-8)

This list breaks down into items that affect the Parser, ones that affect the Generator, and ones that affect both the Parser and the Message.
(Well, the "how much transformation" affects all three in the sense that the data has to be preserved by both the Parser and the Message in order for the Generator to be able to implement it, but I think we can take it as a given that we are going to preserve that data.)

The pieces that are shared between the Parser and the Message are really about the Message: how are the sub-objects represented? How are the structured headers represented? So we could consider that the Parser is a *consumer* of those pieces of policy, but that they are defined on the Message, not on the Parser.

What this means is that the policy controlling each of the major components (parser, message, generator) are in principle independent.

The design of the policy framework envisions having, for example, an 'HTTP' policy that would, say, expect and generate latin-1 encoded headers, and generate headers without line breaks, using CRLF for the line termination. Initially I thought one would declare a policy and that the Message object would remember that policy, but that you could override it when, say, calling the generator.

Re-thinking it now, though, I think there are actually two distinct components here: the I/O policy(s), and the Message construction policy. That is, the things that the HTTP policy cares about are all Parser or Generator controls. The only things the Message (should) care about is how to represent its components.

The Message is thus independent of any policy *except* the header/mime classes, while the Parser and Generator can be consumers of the header/mime class policy used to construct the Message. It nevertheless makes sense to group the parser and generator policy controls together, since that is how we conceptually think of them ('HTTP' implies a coherent set of input and output policies).

So, I think the "policy framework" is actually two things: the header/mime-types registry, and the Parser/Generator policies.
Let's have 'policy' refer to only the I/O policy, and call the other the email class registry. This narrower definition of policy is a straightforward enhancement of the current API. It makes these "knobs" more easily controlled, and makes it easier to add new knobs without complicating the API. I propose that I write up this policy API as a distinct proposal/patch (with the work I've already done, this is more than half completed). This would add policy keywords to the Parser and Generator classes, and probably to the as_string method of Message. The real meat of email6, then, is the header/mime-types registry, and the changes in the API of the resulting Message objects. The parser currently accepts a _factory argument that specifies the object to be used in creating the Message. I propose that we deprecate this argument, but that any code using it gets the old behavior of the parser (using _factory to create the class for any new sub-objects). Then we introduce a new argument, 'factory'. This new argument would expect a callable that takes a mime-type as its argument, and returns an appropriate class. The parser would be re-written so that it could use this factory, and the backward compatibility case would be trivial to implement. In theory the classes returned by the registry/factory are arbitrary, but in practice we will need to define the minimal API that they should provide. By specifying the API separately from the concrete implementation in email6, we will allow third parties to write classes that can play well with programs expecting to operate on email6 Messages. This will allow, for example, an MUA to provide custom classes to enhance presentation, while still allowing the message to be submitted to smtplib for transmission. I guess I'm proposing, then, that there be an API version definition, with two values as of Python3.3: email5 API, and email6 API. We'll figure out how we name and interrogate these formally later. 
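To make the two halves concrete, here is a minimal sketch, not a design. Everything in it is an assumption made for illustration: the names IOPolicy, clone, ClassRegistry and ImagePart are hypothetical, and the current Parser and Generator accept no such objects. It only shows an immutable bag of I/O knobs next to a 'factory' callable that maps a mime-type to a class.

```python
from dataclasses import dataclass, replace
from email.message import Message
from typing import Optional


@dataclass(frozen=True)
class IOPolicy:
    """Hypothetical Parser/Generator knobs (names invented for this sketch)."""
    max_line_length: Optional[int] = 78   # None means "never wrap"
    linesep: str = "\n"
    raise_on_defect: bool = False
    header_charset: str = "ascii"

    def clone(self, **changes):
        """Return a modified copy; the policy itself stays immutable."""
        return replace(self, **changes)


DEFAULT = IOPolicy()
# An 'HTTP'-style policy as described above: latin-1 headers, no line
# wrapping, CRLF line termination.
HTTP = DEFAULT.clone(max_line_length=None, linesep="\r\n",
                     header_charset="latin-1")


class ImagePart(Message):
    """Hypothetical class used to represent image parts."""


class ClassRegistry:
    """Hypothetical 'factory': a callable mapping a mime-type to a class."""

    def __init__(self, default=Message):
        self._default = default
        self._classes = {}

    def register(self, mime_type, cls):
        self._classes[mime_type.lower()] = cls

    def __call__(self, mime_type):
        mime_type = mime_type.lower()
        maintype = mime_type.partition("/")[0]
        # Exact match first, then a main-type wildcard, then the default.
        for key in (mime_type, maintype + "/*"):
            if key in self._classes:
                return self._classes[key]
        return self._default


factory = ClassRegistry()
factory.register("image/*", ImagePart)
```

With this arrangement the Message never sees IOPolicy at all; only the registry decides which classes appear inside it, which is the separation argued for above.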
The Header registry in this vision is accessed through the Message class. I have various thoughts about how this will work, but I'm going to leave those for later, since this email is already long enough. I also have some additional thoughts about backward compatibility, but it is going to require some experimentation to see if they are realistic. --David From v+python at g.nevcal.com Tue Mar 1 22:58:50 2011 From: v+python at g.nevcal.com (Glenn Linderman) Date: Tue, 01 Mar 2011 13:58:50 -0800 Subject: [Email-SIG] API thoughts In-Reply-To: <20110301204058.54C96249A9D@kimball.webabinitio.net> References: <20110301204058.54C96249A9D@kimball.webabinitio.net> Message-ID: <4D6D6C1A.2070200@g.nevcal.com> On 3/1/2011 12:40 PM, R. David Murray wrote: > This is a long email, for which my apologies. I hope you all will > manage to find some time to read it and provide feedback, as it speaks > to fundamental design issues. Indeed. Good to discuss before designing with ready-mix. > Everything else is an implementation detail :) Agreed. > We propose to create a new API to make all of this easier for > the application programmer. YES!! > [*] There are current real-world use cases for this: there are nntp > servers that use utf-8 for headers, and the http protocol uses > latin-1 (or sometimes, I think, utf-8) All the tunables listed are relevant. The HTTP protocol standard claims to use Latin-1 + RFC 2047 encoding for non-Latin-1 characters; in practice, the browser implementations apparently use nearly _any_ encoding for headers!!! For
responses, when there is actually user-specified data involved, they use the encoding defined for the page containing the form, as the encoding of the MIME headers sent back. The "standard headers" seem to be ASCII, and somewhat immune to choice of encoding, except perhaps for those few encodings that are not ASCII supersets. (I have no clue how such are handled, if they are. Anyone want to write an EBCDIC page containing a <form> for testing?) This is useful, as it reduces the amount of character escaping likely to be required, the designer of the page chooses a character set that can represent the page, and is likely in the language of the intended recipient, who is likely to fill out the form using the same language. It would be more useful, if the browsers included a(n ASCII) header that specified the encoding of subsequent headers: they do not. Therefore, the server that receives the headers must somehow "know" the proper encoding. For the situation where the CGI (or equivalent) script both generates the page containing the <form> and receives the form data, this is simple. For the situation where the same web application designer creates the page containing the <form> and the CGI receiving the form data, and explicitly or implicitly declares the same encoding for both, this is functional, but there is the danger of someone changing the static pages to conform to a new standard encoding without realizing the consequences on the associated CGI scripts. It is also rather hard to create "form filling" applications that can send form data to a server bypassing the access of the form itself... such applications must also "know" the proper encoding, and such applications are much more likely to be generated outside the realm of the original development environment, and much less likely to be involved in any planning to change encodings inside the application <form>s and CGIs.
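A small illustration of why the receiver has to "know" the encoding. This is only a sketch: decode_header_value is a hypothetical helper, not anything in email or cgi.py, and the sample value is made up. The same bytes decode cleanly under one charset and silently turn into garbage under another, and nothing in the bytes says which is right.

```python
def decode_header_value(raw: bytes, charset: str) -> str:
    """Hypothetical helper: decode header bytes with an application-supplied charset."""
    return raw.decode(charset)


# A field value as a browser might send it back from a UTF-8 page.
raw = "José".encode("utf-8")

right = decode_header_value(raw, "utf-8")      # 'José'
wrong = decode_header_value(raw, "latin-1")    # no error, but mojibake

# An encoding that is not an ASCII superset changes even the "standard"
# header names: cp037 is one of the EBCDIC codecs shipped with Python.
ebcdic_name = "Content-Type".encode("cp037")
ascii_name = "Content-Type".encode("ascii")
```

Note that the latin-1 decode never raises, so a wrong guess cannot be detected after the fact; the charset has to come from whoever generated the page.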
To support reading byte-stream HTTP headers, therefore, it is critical that the email API accept an encoding from the application which "knows" the encoding; presently cgi.py has to pre-decode incoming headers because email does not have such a parameter. On the other hand, maybe cgi.py shouldn't use email header parsing at all... since browsers don't use RFC 2047 encoding in practice, the parsing of headers without such is straightforward. Further, HTTP data streams can be extremely large, and thus time-consuming to obtain over the wire. CGI applications cannot afford to keep large blocks of data in RAM during receipt, thus if email wishes to support CGI, it needs features for placing large blocks of data on disk instead of in RAM during the parsing phase; cgi.py presently has to preparse headers, to separate them from the data streams, which it then handles on its own, because of this issue. Hence, cgi.py does sufficient preparsing and private handling of HTTP data streams, that it seems that the only real benefit it gains from using email at all, is the handling of the complex RFC 2047 decoding... which in practice isn't used in HTTP data streams! In any case, if email wants to promulgate itself as the "one true way" to process HTTP data streams, as well as SMTP and NNTP data streams, then it needs to address the issues above. There is, by the way, room for improvement in the cgi.py handler for HTTP data streams; presently all large MIME objects are written to disk (but small ones are kept as string or byte streams), but it isn't necessarily the right disk, and the data must then be again copied, byte by byte, to its final file system location. I see that as abhorrent overhead. 
There is presently no provision for hooks that ask the CGI application what to do with the data being received, while it is being received, nor for policies to assist with better heuristics, with the goal in mind that a properly and completely received MIME object could then be renamed to its final location rather than copied.

> I guess I'm proposing, then, that there be an API version definition,
> with two values as of Python3.3: email5 API, and email6 API. We'll
> figure out how we name and interrogate these formally later.

Question: While it is pretty clear that enhanced behaviors are required to benefit new applications that use email, and while some new APIs may be incompatible with some existing APIs, might it be possible to design the new API, and then build a compatibility layer that looks like the old API on top? Such that there would be policies for the new APIs that would work like the old APIs to ease the implementation of such a layer? I'm not sure I fully understand the use of _factory or factory parameters, but for APIs that have _factory and grow a factory, could not the presence of which parameter imply any variant functionality? (OK, this question comes after not looking at the email API during all the GSOC and your implementation efforts since the last big round of discussion, but your proposals here seem to sound like it would be more possible with your current thinking than with your previous thinking.)

> The Header registry in this vision is accessed through the Message class.
> I have various thoughts about how this will work, but I'm going to leave
> those for later, since this email is already long enough. I also have
> some additional thoughts about backward compatibility, but it is going
> to require some experimentation to see if they are realistic.
Consider me an interested observer; I'll enjoy reading, thinking, and commenting about these ideas too, but sadly am unlikely to implement an email client this year :( But I have aspirations to do so, because none of the existing email clients exactly suit my preferences... (everyone should write an editor and an email client, no? I've done the former several times... what I want, though, is emacs-python, instead of emacs-lisp). Glenn From rdmurray at bitdance.com Tue Mar 1 23:59:10 2011 From: rdmurray at bitdance.com (R. David Murray) Date: Tue, 01 Mar 2011 17:59:10 -0500 Subject: [Email-SIG] API thoughts In-Reply-To: <4D6D6C1A.2070200@g.nevcal.com> References: <20110301204058.54C96249A9D@kimball.webabinitio.net> <4D6D6C1A.2070200@g.nevcal.com> Message-ID: <20110301225910.72D79249A6C@kimball.webabinitio.net> On Tue, 01 Mar 2011 13:58:50 -0800, Glenn Linderman wrote: > To support reading byte-stream HTTP headers, therefore, it is critical > that the email API accept an encoding from the application which "knows" > the encoding; presently cgi.py has to pre-decode incoming headers > because email does not have such a parameter. On the other hand, maybe > cgi.py shouldn't use email header parsing at all... since browsers don't > use RFC 2047 encoding in practice, the parsing of headers without such > is straightforward. I think it could make sense for the default input character set to be a policy parameter for the parser. Maybe not in the first version, though :) Yes, it is simple(r) to parse headers if you don't have to worry about RFC2047, but why duplicate code if you don't need to? This assumes, of course, that email6 does what cgi.py and similar programs need, but I'll try to keep my eye on that. > Further, HTTP data streams can be extremely large, and thus > time-consuming to obtain over the wire. 
CGI applications cannot afford > to keep large blocks of data in RAM during receipt, thus if email wishes > to support CGI, it needs features for placing large blocks of data on > disk instead of in RAM during the parsing phase; cgi.py presently has to > preparse headers, to separate them from the data streams, which it then > handles on its own, because of this issue. It is already in the plan to add disk caching support to the base email API, so this will get addressed. You may even be the one who suggested designing the API as a general "storage" API so that different back-ends can be hooked up. In any case, that's what I've got in mind. > There is, by the way, room for improvement in the cgi.py handler for > HTTP data streams; presently all large MIME objects are written to disk > (but small ones are kept as string or byte streams), but it isn't > necessarily the right disk, and the data must then be again copied, byte > by byte, to its final file system location. I see that as abhorrent > overhead. There is presently no provision for hooks that ask the CGI > application what to do with the data being received, while it is being > received, nor for policies to assist with better heuristics, with the > goal in mind that a properly and completely received MIME object could > then be renamed to its final location rather than copied. I think the hookable storage back end addresses this, but the concrete implementation (eventually) provided by email ought to support it as well. > > I guess I'm proposing, then, that there be an API version definition, > > with two values as of Python3.3: email5 API, and email6 API. We'll > > figure out how we name and interrogate these formally later. 
> > Question: While it is pretty clear that enhanced behaviors are required > to benefit new applications that use email, and while some new APIs may > be incompatible with some existing APIs, might it be possible to design > the new API, and then build a compatibility layer that looks like the > old API on top? Such that there would be policies for the new APIs that > would work like the old APIs to ease the implementation of such a Yes, this is what was behind my comment that I had further ideas about backward compatibility. One way is what Barry and I already discussed: a wrapper to put around an email6 object that would support the email5 API. Another approach is to have the email6 message itself support the legacy API. I haven't looked at every method, but most of them would be supportable. The tricky bit is headers: an email6 Message will return Header objects, whereas an email5 application will generally expect to get strings. (It shouldn't! But many will. Even the email package itself expects to get strings when it accesses headers.) My wild thought at this point is: what if Header subclassed string? With the exception of a few structured headers such as address headers, this might actually work pretty well. But experimentation with some at least semi-real-world examples would be needed to prove out the concept. > layer? I'm not sure I fully understand the use of _factory or factory > parameters, but for APIs that have _factory and grow a factory, could > not the presence of which parameter imply any variant functionality? I'm not sure what you are asking here. In what I outlined for the parser API, you'd get an email5-API object if you used _factory or nothing, and and email6 API object if you used factory, so yes, in that sense the parameter determines the API. But what about a library that is accepting a Message object? It needs a way to detect whether or not it has been passed an email5 API message, or an email6 one. 
> (OK, this question comes after not looking at the email API during all > the GSOC and your implementation efforts since the last big round of > discussion, but your proposals here seem to sound like it would be more > possible with your current thinking that with your previous thinking.) Well, in my previous thinking I was intending on doing much the same thing as far as backward compatibility went (having a policy that provided an email5 compatible object), I just hadn't talked about it much :) The biggest difference now is that email5 will be the default, at least in the Python3.3 release. > Consider me an interested observer; I'll enjoy reading, thinking, and > commenting about these ideas too, but sadly am unlikely to implement an > email client this year :( But I have aspirations to do so, because none > of the existing email clients exactly suit my preferences... (everyone > should write an editor and an email client, no? I've done the former > several times... what I want, though, is emacs-python, instead of > emacs-lisp). Thanks for your attention and comments. I haven't implemented an editor yet (VIM + Python has been good enough so far), but I have implemented parts of an email client, and intend to finish that project as part of working on email6, as an API test bed. --David From rdmurray at bitdance.com Wed Mar 2 01:52:51 2011 From: rdmurray at bitdance.com (R. David Murray) Date: Tue, 01 Mar 2011 19:52:51 -0500 Subject: [Email-SIG] API thoughts In-Reply-To: <20110301225910.72D79249A6C@kimball.webabinitio.net> References: <20110301204058.54C96249A9D@kimball.webabinitio.net> <4D6D6C1A.2070200@g.nevcal.com> <20110301225910.72D79249A6C@kimball.webabinitio.net> Message-ID: <20110302005251.85259249C7B@kimball.webabinitio.net> On Tue, 01 Mar 2011 17:59:10 -0500, "R. 
David Murray" wrote: > On Tue, 01 Mar 2011 13:58:50 -0800, Glenn Linderman wrote: > > To support reading byte-stream HTTP headers, therefore, it is critical > > that the email API accept an encoding from the application which "knows" > > the encoding; presently cgi.py has to pre-decode incoming headers > > because email does not have such a parameter. On the other hand, maybe > > cgi.py shouldn't use email header parsing at all... since browsers don't > > use RFC 2047 encoding in practice, the parsing of headers without such > > is straightforward. > > I think it could make sense for the default input character set to be > a policy parameter for the parser. Maybe not in the first version, > though :) Just to clarify: in the first version I check in. I'd expect to decide about that part of the API not too far in to the development process, and certainly well before 3.3. --David From rdmurray at bitdance.com Wed Mar 2 02:45:46 2011 From: rdmurray at bitdance.com (R. David Murray) Date: Tue, 01 Mar 2011 20:45:46 -0500 Subject: [Email-SIG] API thoughts In-Reply-To: <4D6D959E.3000800@g.nevcal.com> References: <20110301204058.54C96249A9D@kimball.webabinitio.net> <4D6D6C1A.2070200@g.nevcal.com> <20110301225910.72D79249A6C@kimball.webabinitio.net> <4D6D959E.3000800@g.nevcal.com> Message-ID: <20110302014546.310242497E1@kimball.webabinitio.net> On Tue, 01 Mar 2011 16:55:58 -0800, Glenn Linderman wrote: > On 3/1/2011 2:59 PM, R. David Murray wrote: > > On Tue, 01 Mar 2011 13:58:50 -0800, Glenn Linderman wrote: > Another reason is if the existing code handles many cases that are not > needed, and cannot be optimized for the case that is needed. A "fast > path" reimplementation can eliminate the cases that are not needed, and > speed the result. That, of course, depends on the internals of the > parsing of headers in the email package, and how much overhead RFC 2047 > adds to that, which I haven't investigated and don't know. 
Happily, > when uploading big files, headers are a tiny fraction of time spent. > Sadly, when using large fill-in-the-blanks forms, header parsing can be > a significant fraction of the time spent. I think the overhead if there are no encoded words in the header should be minimal (probably a re scan, but possibly not even that, we'll see). This could also be controlled by the policy (ie: the HTTP policy could cause the header parser to skip the check-for-rfc2047-encoded-words step). > Presently, the cgi.py stream API only provides a open-file-like handle > to the data... so it can be read, written, and sought, but not assigned > to a specific filesystem, renamed, or moved using os facilities. So a > broader API seems to be necessary for cgi.py; if that were available in > email, that would be helpful for cgi.py. Yeah, additions to the cgi API are probably required to support this properly. > Hmm. And while it might be more complex to handle structured headers, > in fact they come in a character sequences, so a mapping to string is > not impossible. The real issue is if those headers had another API in > email5 (I could look that up, I guess), but perhaps that API could also > be supported along with a subclass of string. They don't. The issue is that what we would like is for the email6 API for the address header to be that it looks like a list of Address objects. So msg['To'][0] would yield an address object. But if we also want the header to look like a string, that won't work, because as a string that should yield the first character of the body of the header. Now, a sensible application would process the list of addresses in a To header by passing it to util.getaddresses, but you can bet that there are applications that don't do that. A compromise would be to have an 'addresses' method that returned the list of addresses. 
Perhaps this would even be sensible in the context of email6 by itself: it would mean that all headers had a uniform base API (they act like strings) and all structured information is accessed via special methods. > OK, what I was asking boils down to if the Message object can support > both APIs, the application doesn't need to care. New applications would > probably want to use the new APIs, of course. But they could be passed > between old and new applications (or fragments thereof) if they support > both. It certainly wouldn't hurt to introduce the concept of a version > for the object, although in itself, that would only be accessible via a > new API, so old applications wouldn't think to use it... Yeah, that would be an ideal world. Let's see how close we can get :) --David From rdmurray at bitdance.com Wed Mar 2 17:23:27 2011 From: rdmurray at bitdance.com (R. David Murray) Date: Wed, 02 Mar 2011 11:23:27 -0500 Subject: [Email-SIG] API thoughts In-Reply-To: <4D6DAD3F.2090306@g.nevcal.com> References: <20110301204058.54C96249A9D@kimball.webabinitio.net> <4D6D6C1A.2070200@g.nevcal.com> <20110301225910.72D79249A6C@kimball.webabinitio.net> <4D6D959E.3000800@g.nevcal.com> <20110302014546.310242497E1@kimball.webabinitio.net> <4D6DAD3F.2090306@g.nevcal.com> Message-ID: <20110302162327.7714224153E@kimball.webabinitio.net> On Tue, 01 Mar 2011 18:36:47 -0800, Glenn Linderman wrote: > On 3/1/2011 5:45 PM, R. David Murray wrote: > > On Tue, 01 Mar 2011 16:55:58 -0800, Glenn Linderman wrote: > >> On 3/1/2011 2:59 PM, R. David Murray wrote: > >>> On Tue, 01 Mar 2011 13:58:50 -0800, Glenn Linderman wrote: > >> Hmm. And while it might be more complex to handle structured headers, > >> in fact they come in a character sequences, so a mapping to string is > >> not impossible. The real issue is if those headers had another API in > >> email5 (I could look that up, I guess), but perhaps that API could also > >> be supported along with a subclass of string. > > They don't. 
The issue is that what we would like is for the email6 API > > for the address header to be that it looks like a list of Address objects. > > So msg['To'][0] would yield an address object. But if we also want the > > header to look like a string, that won't work, because as a string that > > should yield the first character of the body of the header. > > > > Now, a sensible application would process the list of addresses in a To > > header by passing it to util.getaddresses, but you can bet that there > > are applications that don't do that. > > > > A compromise would be to have an 'addresses' method that returned the > > list of addresses. Perhaps this would even be sensible in the context of > > email6 by itself: it would mean that all headers had a uniform base API > > (they act like strings) and all structured information is accessed via > > special methods. > > While msg['To'] producing a structured result might not be possible > when subclassing string, you mention one possible alternative, an > additional method... seems like you mean msg['To'].addresses()? It > would also be possible to make msg.p['To'] for parsed/structured > results. I'm not sure which would be easier to implement, or more > flexible under the covers to do caching of parsed/structured results. > Of course there are several headers dealing with lists of addresses, as > you are well aware, so msg.addresses() wouldn't work without some > specification of the header. Yes, exactly msg['To'].addresses (might as well use a property). I think I prefer this to a separate retrieval method, since not all headers are structured headers, and it is not clear what the "parsed" version of a non-structured header would be (a plain string?). 
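A minimal sketch of that compromise, under stated assumptions: AddressHeader is a hypothetical class invented here, not an email6 class, and the addresses are made up. The only existing API used is email.utils.getaddresses. The header value is a str subclass, so legacy string code keeps working, and the structured view hangs off a property.

```python
from email.utils import getaddresses


class AddressHeader(str):
    """Hypothetical header value: a plain string with structure on the side."""

    @property
    def addresses(self):
        # getaddresses() takes a list of header values and returns
        # a list of (realname, address) pairs.
        return getaddresses([str(self)])


to = AddressHeader("Glenn <glenn@example.com>, barry@example.org")

first_char = to[0]        # string semantics: the first character
parsed = to.addresses     # structured semantics: list of (name, addr) pairs
```

Indexing still yields a character rather than an Address object, which is exactly the conflict described above; the property sidesteps it instead of resolving it.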
--David From barry at python.org Wed Mar 2 21:46:24 2011 From: barry at python.org (Barry Warsaw) Date: Wed, 2 Mar 2011 15:46:24 -0500 Subject: [Email-SIG] API thoughts In-Reply-To: <20110301204058.54C96249A9D@kimball.webabinitio.net> References: <20110301204058.54C96249A9D@kimball.webabinitio.net> Message-ID: <20110302154624.5dea1bd7@limelight.wooz.org> On Mar 01, 2011, at 03:40 PM, R. David Murray wrote: >So, and here is the point of this email, how does the policy framework >integrate into this design? [...] >This list breaks down into items that affect the Parser, ones that affect >the Generator, and ones that affect both the Parser and the Message. >(Well, the "how much transformation" affects all three in the sense that >the data has to be preserved by both the Parser and the Message in order >for the Generator to be able to implement it, but I think we can take >it as a given that we are going to preserve that data.) > >The pieces that are shared between the Parser and the Message are really >about the Message: how are the sub-objects represented? How are the >structured headers represented? So we could consider that the Parser >is a *consumer* of those pieces of policy, but that they are defined on >the Message, not on the Parser. > >What this means is that the policy controlling each of the major >components (parser, message, generator) are in principle independent. [...] >Re-thinking it now, though, I think there are actually two distinct >components here: the I/O policy(s), and the Message construction policy. [...] >So, I think the "policy framework" is actually two things: the >header/mime-types registry, and the Parser/Generator policies. Let's have >'policy' refer to only the I/O policy, and call the other the email >class registry. +1 This makes a lot of sense, and I'm glad you've been thinking about this more deeply than I have since we last bandied it about. 
At the time, I thought a single policy hierarchy would probably be fine, but you've laid out a good argument for keeping them separate, and in fact not even calling the latter a 'policy'. Here's another distinction: Policy objects should be composable. This would allow for a standard library of policies that could be mixed and matched for specific applications, and might even include some higher level policies like 'CGI' or 'NNTP'. E.g. my applications might combine a standard 'don't-check-rfc-2047' policy with a 'use-only-CRNL' and 'die-on-defect'. I wonder too, how sophisticated policy objects really need to be. Are they just bags of attributes with some defaults, properties for access, maybe some validation, and composability? As for the registry, I don't think you need anything near that. You just need to say "when you see this mime-type, create an object using this callable". Multiple registrations might be useful, but I don't think composability is. >The real meat of email6, then, is the header/mime-types registry, and >the changes in the API of the resulting Message objects. The parser >currently accepts a _factory argument that specifies the object to be used >in creating the Message. I propose that we deprecate this argument, >but that any code using it gets the old behavior of the parser (using >_factory to create the class for any new sub-objects). Then we introduce >a new argument, 'factory'. This new argument would expect a callable >that takes a mime-type as its argument, and returns an appropriate class. >The parser would be re-written so that it could use this factory, and >the backward compatibility case would be trivial to implement. +1. The underscore name in _factory is a historical wart that's not needed any more. I'm not even sure it makes much sense any more in Message subclasses. It *does* still make sense in e.g. add_header() where there's a potential name collision between the arguments and the **params. 
We should evaluate these more carefully given today's API and clean this up if possible (modulo all b/c considerations). >In theory the classes returned by the registry/factory are arbitrary, >but in practice we will need to define the minimal API that they >should provide. By specifying the API separately from the concrete >implementation in email6, we will allow third parties to write classes >that can play well with programs expecting to operate on email6 Messages. >This will allow, for example, an MUA to provide custom classes to enhance >presentation, while still allowing the message to be submitted to smtplib >for transmission. +1 >I guess I'm proposing, then, that there be an API version definition, >with two values as of Python3.3: email5 API, and email6 API. We'll >figure out how we name and interrogate these formally later. > >The Header registry in this vision is accessed through the Message class. >I have various thoughts about how this will work, but I'm going to leave >those for later, since this email is already long enough. I also have >some additional thoughts about backward compatibility, but it is going >to require some experimentation to see if they are realistic. Cool. Really great stuff David. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: not available URL: From barry at python.org Wed Mar 2 21:52:52 2011 From: barry at python.org (Barry Warsaw) Date: Wed, 2 Mar 2011 15:52:52 -0500 Subject: [Email-SIG] API thoughts In-Reply-To: <4D6D6C1A.2070200@g.nevcal.com> References: <20110301204058.54C96249A9D@kimball.webabinitio.net> <4D6D6C1A.2070200@g.nevcal.com> Message-ID: <20110302155252.5b58619c@limelight.wooz.org> On Mar 01, 2011, at 01:58 PM, Glenn Linderman wrote: >(everyone should write an editor and an email client, no? Is there really any difference? 
http://www.catb.org/~esr/jargon/html/Z/Zawinskis-Law.html

That's also the proof that the email package is the most important one in Python because it will eventually be used by every Python application ever written.

-Barry

From sdaoden at googlemail.com Wed Mar 2 11:19:25 2011
From: sdaoden at googlemail.com (Steffen Daode Nurpmeso)
Date: Wed, 2 Mar 2011 11:19:25 +0100
Subject: [Email-SIG] email6 and Python 3.3
In-Reply-To: <20110228213235.609A3239561@kimball.webabinitio.net>
References: <20110228201133.8EECE249BE5@kimball.webabinitio.net> <20110228154829.16c89a32@limelight.wooz.org> <20110228213235.609A3239561@kimball.webabinitio.net>
Message-ID: <20110302101925.GA64097@sherwood.local>

> On Mon, Feb 28, 2011 at 04:32:35PM -0500, R. David Murray wrote:
> Well, fortunately I've been enjoying it, and the increased recognition
> is certainly one of the rewards, so thank you.

> On Mon, 28 Feb 2011 15:48:29 -0500, Barry Warsaw wrote:
> Just wait 'til the hate mail starts. Fortunately, most of that's got raw
> 8-bit in the headers, so you're in luck. :)

Increasing recognition with a non hate mail!
Thank you for making my thing possible - out of the box.

From sdaoden at googlemail.com Wed Mar 2 21:40:39 2011
From: sdaoden at googlemail.com (Steffen Daode Nurpmeso)
Date: Wed, 2 Mar 2011 21:40:39 +0100
Subject: [Email-SIG] API thoughts
In-Reply-To: <20110301204058.54C96249A9D@kimball.webabinitio.net>
References: <20110301204058.54C96249A9D@kimball.webabinitio.net>
Message-ID: <20110302204039.GA43276@sherwood.local>

I've also read the updated EMAIL-SIG DesignThoughts. But if "what goes in .defects[]" will be configurable i would hope for a generic is_malformed() and maybe is_processable() or the like, i.e. state versus (translatable?) user-info.
(The more i think about it the more i agree with David (i hope i don't lie about that) that it's a waste of time to try to convert malformed data to a compliant state, especially if the package is - by design - capable to spit out the data the very same way it came in. Someone will take care - and throw it away.) I also go for lazy parsing when designing an email package. (Pluggable) File-based backend. Besides that all of this, and including the things David explained in the issue tracker, sounds like smoked tofu to me. ;-) Unfortunately my non-hate mail seems to have been mistreated as spam 8-}, therefore i wrote all of the above just to thank David once again for making the email and mailbox packages usable already in Python 3.2. Thanks. From barry at python.org Wed Mar 2 22:12:06 2011 From: barry at python.org (Barry Warsaw) Date: Wed, 2 Mar 2011 16:12:06 -0500 Subject: [Email-SIG] API thoughts In-Reply-To: <20110302014546.310242497E1@kimball.webabinitio.net> References: <20110301204058.54C96249A9D@kimball.webabinitio.net> <4D6D6C1A.2070200@g.nevcal.com> <20110301225910.72D79249A6C@kimball.webabinitio.net> <4D6D959E.3000800@g.nevcal.com> <20110302014546.310242497E1@kimball.webabinitio.net> Message-ID: <20110302161206.3a61d67a@limelight.wooz.org> On Mar 01, 2011, at 08:45 PM, R. David Murray wrote: >They don't. The issue is that what we would like is for the email6 API >for the address header to be that it looks like a list of Address objects. >So msg['To'][0] would yield an address object. But if we also want the >header to look like a string, that won't work, because as a string that >should yield the first character of the body of the header. Here's where things get really interesting because you won't actually know what msg[header][0] could return for any arbitrary value of 'header'. For structured headers like To, msg['To'] can return an ordered sequence of address objects, but what about msg['Received'] or msg['X-Happy-Fun-Ball']? 
The same will go for anything like .addresses. I'm not sure what the implications of this for the API are, but it's important to keep in mind (I know RDM knows this) that structured headers need extra parsing and will have more sophisticated objects representing them. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: not available URL: From rdmurray at bitdance.com Thu Mar 3 01:40:36 2011 From: rdmurray at bitdance.com (R. David Murray) Date: Wed, 02 Mar 2011 19:40:36 -0500 Subject: [Email-SIG] bug report In-Reply-To: <4CC9FB26.2020100@gmail.com> References: <4CC9FB26.2020100@gmail.com> Message-ID: <20110303004036.A713D239549@kimball.webabinitio.net> On Fri, 29 Oct 2010 00:37:26 +0200, Tobias Koeck wrote: > 'ascii' codec can't encode character u'\xfc' in position 40: ordinal > not in range(128) > Traceback (most recent call last): > File "/usr/lib/calibre/calibre/gui2/device.py", line 588, in > _send_mails > attachment_name = attachment_names[i]) > File "/usr/lib/calibre/calibre/utils/smtp.py", line 179, in > compose_mail > attachment_name=attachment_name) > File "/usr/lib/calibre/calibre/utils/smtp.py", line 29, in create_mail > msg = MIMEText(text) > File "/usr/lib/python2.6/email/mime/text.py", line 30, in __init__ > self.set_payload(_text, _charset) > File "/usr/lib/python2.6/email/message.py", line 224, in set_payload > self.set_charset(charset) > File "/usr/lib/python2.6/email/message.py", line 260, in set_charset > self._payload = self._payload.encode(charset.output_charset) > UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in > position 40: ordinal not in range(128) Please submit a bug report at bugs.python.org with additional details if you can (ie: what was the input to MIMEText that triggered this error, and what version of python are you using?) 
--David From rdmurray at bitdance.com Thu Mar 3 01:50:20 2011 From: rdmurray at bitdance.com (R. David Murray) Date: Wed, 02 Mar 2011 19:50:20 -0500 Subject: [Email-SIG] API thoughts In-Reply-To: <20110302204039.GA43276@sherwood.local> References: <20110301204058.54C96249A9D@kimball.webabinitio.net> <20110302204039.GA43276@sherwood.local> Message-ID: <20110303005020.6135B2001CD@kimball.webabinitio.net> On Wed, 02 Mar 2011 21:40:39 +0100, Steffen Daode Nurpmeso wrote: > But if "what goes in .defects[]" will be configurable i would hope > for a generic is_malformed() and maybe is_processable() or the > like, i.e. state versus (translatable?) user-info. I'm not sure what you are asking for here. I think "if msg.is_malformed()" is spelled "if msg.defects". That is, if the defects list is non-empty, the message is technically malformed. Of course, that information by itself isn't necessarily useful, which is why defects is a list of defects. "is_processable" lies in the eyes of the application. What defects is it capable of dealing with? The email package can't know that. So, again, that's why defects is a list. Let me clarify what I mean by the policy controlling "what, exactly, is a defect". The idea here is that when parsing an email, each deviance from the RFCs counts as a defect (the current email package, by the way, only detects a small number of such defects!). But when parsing, say, an http stream, non-ascii characters in headers are perfectly legal. So it seems to make sense that the HTTP policy would change what counts as a defect during the operation of the parser. > (The more i think about it the more i agree with David (i hope > i don't lie about that) that it's a waste of time to try to > convert malformed data to a compliant state, especially if the > package is - by design - capable to spit out the data the very > same way it came in. Someone will take care - and throw it away.) 
Well, I think we may provide some tools to do such "fixups" when it is possible and the application wants it. But they should be app-requested transformations, not automatic ones. > Unfortunately my non-hate mail seems to have been mistreated as > spam 8-}, therefore i wrote all of the above just to thank David > once again for making the email and mailbox packages usable > already in Python 3.2. Thanks. You are welcome :) --David From rdmurray at bitdance.com Thu Mar 3 02:23:41 2011 From: rdmurray at bitdance.com (R. David Murray) Date: Wed, 02 Mar 2011 20:23:41 -0500 Subject: [Email-SIG] API thoughts In-Reply-To: <20110302154624.5dea1bd7@limelight.wooz.org> References: <20110301204058.54C96249A9D@kimball.webabinitio.net> <20110302154624.5dea1bd7@limelight.wooz.org> Message-ID: <20110303012341.DF05922B74A@kimball.webabinitio.net> On Wed, 02 Mar 2011 15:46:24 -0500, Barry Warsaw wrote: > On Mar 01, 2011, at 03:40 PM, R. David Murray wrote: > >So, I think the "policy framework" is actually two things: the > >header/mime-types registry, and the Parser/Generator policies. Let's have > >'policy' refer to only the I/O policy, and call the other the email > >class registry. > > +1 > > This makes a lot of sense, and I'm glad you've been thinking about this more > deeply than I have since we last bandied it about. At the time, I thought a > single policy hierarchy would probably be fine, but you've laid out a good > argument for keeping them separate, and in fact not even calling the latter > a 'policy'. Here's another distinction: > > Policy objects should be composable. This would allow for a standard library > of policies that could be mixed and matched for specific applications, and > might even include some higher level policies like 'CGI' or 'NNTP'. E.g. my > applications might combine a standard 'don't-check-rfc-2047' policy with a > 'use-only-CRNL' and 'die-on-defect'. 
Yes, my current implementation of policy objects allows you to say things like: policy = HTTP + Strict where HTTP is the obvious and 'Strict' is a policy that sets the "raise on defect" flag. > I wonder too, how sophisticated policy objects really need to be. Are they > just bags of attributes with some defaults, properties for access, maybe some > validation, and composability? Pretty much. I think they will also contain some callable methods, to provide hooks where a policy subclass can implement a custom policy. My current implementation has such a hook for registering defects, which would allow a custom policy to, for example, log the defects in addition to or instead of putting them into the defects list. > As for the registry, I don't think you need anything near that. You just need > to say "when you see this mime-type, create an object using this callable". > Multiple registrations might be useful, but I don't think composability is. Well, I'm thinking that a minimal sort of composability *is* useful. One of the annoying things about class hierarchies is that if you want to add a feature to the base class, you have to make new subclasses for *all* of the classes in the hierarchy (unless you monkey patch). What I was thinking of was to have the registry have a 'base class' slot that got used as the base class for all the mime-type classes, composed on the fly at instantiation time (and similarly for the headers). That way if you wanted to add features to all the classes in the hierarchy, you could register your custom 'base class' and not need to touch anything else. But since the API for the registry is now a callable, and especially if we specify it as returning callables, then doing such composition could be left to the application (perhaps with a recipe in the docs). Composing registries can thus also be left to the application. 
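The HTTP + Strict composition and the defect-registration hook described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the Policy class, its attribute names (raise_on_defect, linesep, allow_8bit_headers) and the register_defect signature are all hypothetical and are not taken from the actual implementation.

```python
class Policy:
    """Toy policy object: a bag of attributes with defaults (hypothetical)."""

    _defaults = {
        "raise_on_defect": False,      # a 'Strict' policy flips this
        "linesep": "\n",
        "allow_8bit_headers": False,
    }

    def __init__(self, **overrides):
        unknown = set(overrides) - set(self._defaults)
        if unknown:
            raise TypeError("unknown policy settings: %s" % sorted(unknown))
        # Remember what was set explicitly so that '+' can merge it later.
        self._overrides = dict(overrides)
        for name, value in {**self._defaults, **overrides}.items():
            setattr(self, name, value)

    def __add__(self, other):
        # Settings set explicitly on the right-hand policy win.
        return type(self)(**{**self._overrides, **other._overrides})

    def register_defect(self, obj, defect):
        """Hook: a subclass could log the defect instead of storing it.

        'defect' is expected to be an exception instance.
        """
        if self.raise_on_defect:
            raise defect
        obj.defects.append(defect)


HTTP = Policy(linesep="\r\n", allow_8bit_headers=True)
Strict = Policy(raise_on_defect=True)
policy = HTTP + Strict
```

A subclass that overrides register_defect to write to a log would then compose with HTTP or Strict in the same way, since '+' preserves the type of the left-hand operand.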
email6 itself should have only one, I think, or if there are two the other will be the email5 back-compat registry and there'd be no reason to compose with it.

I'm not sure what you mean by multiple registrations. Can you give an example?

> >The real meat of email6, then, is the header/mime-types registry, and
> >the changes in the API of the resulting Message objects. The parser
> >currently accepts a _factory argument that specifies the object to be used
> >in creating the Message. I propose that we deprecate this argument,
> >but that any code using it gets the old behavior of the parser (using
> >_factory to create the class for any new sub-objects). Then we introduce
> >a new argument, 'factory'. This new argument would expect a callable
> >that takes a mime-type as its argument, and returns an appropriate class.
> >The parser would be re-written so that it could use this factory, and
> >the backward compatibility case would be trivial to implement.
>
> +1. The underscore name in _factory is a historical wart that's not needed
> any more. I'm not even sure it makes much sense any more in Message
> subclasses. It *does* still make sense in e.g. add_header() where there's a
> potential name collision between the arguments and the **params. We should
> evaluate these more carefully given today's API and clean this up if possible
> (modulo all b/c considerations).

Ah, so *that's* what those underscores are for. I always wondered. Yeah, I think we can do a lot of cleanup here.

> Cool. Really great stuff David.

Thanks.

--David

From rdmurray at bitdance.com Thu Mar 3 02:41:12 2011
From: rdmurray at bitdance.com (R. David Murray)
Date: Wed, 02 Mar 2011 20:41:12 -0500
Subject: [Email-SIG] email6 funding
Message-ID: <20110303014112.A4F3E21AE4B@kimball.webabinitio.net>

So, now that I've cleared my reply backlog, on to the exciting news. Some of you may have seen Jesse Noller's retweet of the tweet from Paul Leroux of QNX.
This is big news for me (and for the email-sig :): QNX wants to fund me to do the email6 development. We are still working out the details, but I think you can expect to see email6 development go into overdrive in the near future. Like, right after PyCon. We're preparing things at my consulting firm to allow me to spend a significant amount of my time working on email6. I am *seriously* excited by this, and very grateful to QNX. Anyone interested in an email6 BOF at PyCon or a brainstorming session during the Sprints afterward, please let me know. --David From janssen at parc.com Thu Mar 3 02:57:00 2011 From: janssen at parc.com (Bill Janssen) Date: Wed, 2 Mar 2011 17:57:00 PST Subject: [Email-SIG] email6 funding In-Reply-To: <20110303014112.A4F3E21AE4B@kimball.webabinitio.net> References: <20110303014112.A4F3E21AE4B@kimball.webabinitio.net> Message-ID: <12386.1299117420@parc.com> And RIM just bought QNX, so I'd expect to see interest in Outlook compatibility. Interesting. Bill From sdaoden at googlemail.com Thu Mar 3 16:28:32 2011 From: sdaoden at googlemail.com (Steffen Daode Nurpmeso) Date: Thu, 3 Mar 2011 16:28:32 +0100 Subject: [Email-SIG] API thoughts In-Reply-To: <20110303005020.6135B2001CD@kimball.webabinitio.net> References: <20110301204058.54C96249A9D@kimball.webabinitio.net> <20110302204039.GA43276@sherwood.local> <20110303005020.6135B2001CD@kimball.webabinitio.net> Message-ID: <20110303152832.GA17870@sherwood.local> On Wed, Mar 02, 2011 at 07:50:20PM -0500, R. David Murray wrote: > That is, if the defects list is non-empty, > the message is technically malformed. Of course, that information by > itself isn't necessarily useful, which is why defects is a list > of defects. > "is_processable" lies in the eyes of the application. > What defects is it capable of dealing with? The email package > can't know that. So, again, that's why defects is a list. > > Let me clarify what I mean by the policy controlling "what, exactly, is > a defect". 
The idea here is that when parsing an email, each deviance
> from the RFCs counts as a defect (the current email package, by the way,
> only detects a small number of such defects!). But when parsing, say,
> an http stream, non-ascii characters in headers are perfectly legal.
> So it seems to make sense that the HTTP policy would change what counts
> as a defect during the operation of the parser.

So i would hope for '.all_defects[]' and (policy-adjusted) '.defects[]'. I would hope for '.had_header_defects(policy_only=True)', '.had_payload_defects(policy_only=True)'. Doing so would fill the huge hole in between 'not len(defects)' and the detailed inspection of a defects list which consists of a highly differentiated tree of classes. The parser has to parse, and does encounter all of these anyway, and an application cannot re-collect this (dropped) information except with expensive effort, i.e. at least choosing a different, stricter policy followed by another parse of the bogus mail. In the end it is my belief that a framework should bring light onto all aspects of a thing, such that no other framework is ever needed, but especially not on a lower level (except the framework is so designed that it allows replacement of its own low-level interface, say). And i don't think there can be a higher level interface than message_from_(bytes|string)().

From rdmurray at bitdance.com Thu Mar 3 17:13:41 2011
From: rdmurray at bitdance.com (R. David Murray)
Date: Thu, 03 Mar 2011 11:13:41 -0500
Subject: [Email-SIG] API thoughts
In-Reply-To: <20110303152832.GA17870@sherwood.local>
References: <20110301204058.54C96249A9D@kimball.webabinitio.net> <20110302204039.GA43276@sherwood.local> <20110303005020.6135B2001CD@kimball.webabinitio.net> <20110303152832.GA17870@sherwood.local>
Message-ID: <20110303161341.F297D249F78@kimball.webabinitio.net>

On Thu, 03 Mar 2011 16:28:32 +0100, Steffen Daode Nurpmeso wrote:
> On Wed, Mar 02, 2011 at 07:50:20PM -0500, R.
David Murray wrote: > > That is, if the defects list is non-empty, > > the message is technically malformed. Of course, that information by > > itself isn't necessarily useful, which is why defects is a list > > of defects. > > "is_processable" lies in the eyes of the application. > > What defects is it capable of dealing with? The email package > > can't know that. So, again, that's why defects is a list. > > > > Let me clarify what I mean by the policy controlling "what, exactly, is > > a defect". The idea here is that when parsing an email, each deviance > > from the RFCs counts as a defect (the current email package, by the way, > > only detects a small number of such defects!). But when parsing, say, > > an http stream, non-ascii characters in headers are perfectly legal. > > So it seems to make sense that the HTTP policy would change what counts > > as a defect during the operation of the parser. > > So i would hope for '.all_defects[]' and (policy-adjusted) > '.defects[]'. I would hope for > '.had_header_defects(policy_only=True)', > '.had_payload_defects(policy_only=True)'. Well, what is a defect for an HTTP parse is not the same as what is a defect for an email parse, so I don't know what "all defects" would consist of. The recovery decisions the parser makes can also be affected by the policy, so there can't, as far as I can see, be a single list of "all defects" that applies to all parses. Currently the email package does not report header defects. When it does, my plan is that each Header will have its own defect list, and likewise each message body (using a recursive definition). How the defects list on the Message object interacts with this is an interesting API question worthy of discussion. Perhaps we do, after all, have some sort of "has_defects" method that queries the constituent parts, and perhaps a function that returns a list of parts with defects, possibly divided between headers and body as you suggest. 
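A "has_defects" query over the constituent parts could be as small as the sketch below. It assumes only what the package already offers (a defects list on each Message and walk() for recursion); parts_with_defects and has_defects are hypothetical helper names, and per-header defect lists, which do not exist yet, are not covered.

```python
import email


def parts_with_defects(msg):
    """Return the (sub)parts of msg whose own defects list is non-empty."""
    return [part for part in msg.walk() if part.defects]


def has_defects(msg):
    """True if the message or any of its subparts recorded a defect."""
    return bool(parts_with_defects(msg))


good = email.message_from_string("Subject: hi\n\nbody\n")

# Declares multipart but gives no boundary parameter, which the
# current parser records as a defect on the message.
bad = email.message_from_string("Content-Type: multipart/mixed\n\nbody\n")

good_is_clean = not has_defects(good)
bad_is_flagged = has_defects(bad)
```

Splitting the result between header and body defects, as suggested, would then be a matter of having each Header carry its own list and extending the walk to include it.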
> Doing so would fill the huge hole in between 'not len(defects)'
> and the detailed inspection of a defects list which consists of
> a highly differentiated tree of classes.

Yeah, the number of different defect classes involved in this scheme worries me a little bit.

> The parser has to parse- and does encounter all of these anyway,
> and an application cannot re-collect this (dropped) information
> except with expensive effort, i.e. at least choosing a different,
> stricter policy followed by another parse of the bogus mail.

Why recollect? The list is there (and, as I indicated above, will be associated with the part that contains the error). The list of defects will be *all* the defects detected by that policy: all RFC deviance (well, perhaps not quite all...see below). Defects don't normally raise errors, so there's no reason not to look for all of the relevant ones (and indeed, we are probably only detecting the ones that actually affect the parsing). That is, if you parse an HTTP stream, encountering a non-ASCII character is *not* a defect. It doesn't make any sense to me to report an "if this were an email this would be a defect" defect. And if the header for some strange reason included an RFC2047 encoded word that was invalidly formed...well, in an HTTP parse that would *technically* violate the RFC, but in practice it really means that the data should just be passed through as is. That is, it's not a defect, and we would be wasting time even *looking* for RFC2047 encoded words. (Unless someone finds a browser or server that generates them!) In other words, in the base package I don't think there are "strict" and "less strict" parsing policies; rather there are *different* parsing policies depending on the context. As far as I can see, it makes no sense to parse an HTTP stream, and then reparse it as if it were an email stream.
Now, it might be useful to design a "very_strict" policy that did extra work looking for RFC defects that a normal parse wouldn't detect (I can't think of any off the top of my head, but the email RFCs are so complex that I'm sure there are some), but in that case if you parsed it with the less-strict (normal) policy those defects would *not* be noticed by the parser. In any case, I think such a validating parser/policy is out of scope for the current package. --David From barry at python.org Fri Mar 4 03:52:31 2011 From: barry at python.org (Barry Warsaw) Date: Thu, 3 Mar 2011 21:52:31 -0500 Subject: [Email-SIG] email6 funding In-Reply-To: <20110303014112.A4F3E21AE4B@kimball.webabinitio.net> References: <20110303014112.A4F3E21AE4B@kimball.webabinitio.net> Message-ID: <20110303215231.41b55ddd@neurotica.wooz.org> On Mar 02, 2011, at 08:41 PM, R. David Murray wrote: >So, now that I've cleared my reply backlog, on to the exciting news. > >Some of you may have seen Jesse Noller's retweet of the tweet from >Paul Leroux of QNX. This is big news for me (and for the email-sig :): >QNX wants to fund me to do the email6 development. > >We are still working out the details, but I think you can expect to >see email6 development go into overdrive in the near future. Like, >right after PyCon. We're preparing things at my consulting firm to >allow me to spend a significant amount of my time working on email6. > >I am *seriously* excited by this, and very grateful to QNX. What can I say other than: AWESOME! Thanks QNX! >Anyone interested in an email6 BOF at PyCon or a brainstorming >session during the Sprints afterward, please let me know. o/ I probably won't have time to sprint on email this year, but I would love to have a BOF. -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: not available URL: From barry at python.org Fri Mar 4 03:55:59 2011 From: barry at python.org (Barry Warsaw) Date: Thu, 3 Mar 2011 21:55:59 -0500 Subject: [Email-SIG] API thoughts In-Reply-To: <20110303012341.DF05922B74A@kimball.webabinitio.net> References: <20110301204058.54C96249A9D@kimball.webabinitio.net> <20110302154624.5dea1bd7@limelight.wooz.org> <20110303012341.DF05922B74A@kimball.webabinitio.net> Message-ID: <20110303215559.572fcede@neurotica.wooz.org> On Mar 02, 2011, at 08:23 PM, R. David Murray wrote: >Pretty much. I think they will also contain some callable methods, >to provide hooks where a policy subclass can implement a custom policy. >My current implementation has such a hook for registering defects, which >would allow a custom policy to, for example, log the defects in addition >to or instead of putting them into the defects list. Makes sense. >I'm not sure what we you mean by multiple registrations. Can you give >an example? I really meant multiple registries, mostly thinking about how to avoid some global state. But Python already has some global registries, so maybe that's not too bad in this case. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: not available URL: From rdmurray at bitdance.com Fri Mar 4 14:33:04 2011 From: rdmurray at bitdance.com (R. David Murray) Date: Fri, 04 Mar 2011 08:33:04 -0500 Subject: [Email-SIG] API thoughts In-Reply-To: <20110303215559.572fcede@neurotica.wooz.org> References: <20110301204058.54C96249A9D@kimball.webabinitio.net> <20110302154624.5dea1bd7@limelight.wooz.org> <20110303012341.DF05922B74A@kimball.webabinitio.net> <20110303215559.572fcede@neurotica.wooz.org> Message-ID: <20110304133304.861442499B0@kimball.webabinitio.net> On Thu, 03 Mar 2011 21:55:59 -0500, Barry Warsaw wrote: > On Mar 02, 2011, at 08:23 PM, R. 
David Murray wrote: > >I'm not sure what we you mean by multiple registrations. Can you give > >an example? > > I really meant multiple registries, mostly thinking about how to avoid some > global state. But Python already has some global registries, so maybe that's > not too bad in this case. Ah, yes. Well, so far my thought is that there is a global registry for the email package itself, but since email package access to that registry will be through the 'factory', there is nothing that says that has to be the only registry used by an application. The existence of the email package global registry will allow the addition of classes to the "default" registry by libraries (if we dare :) and applications, while access through the factory means that an application is free to manage a completely independent registry if it prefers. Or perhaps it is better to think about the default email package registry as just that, the *default* registry, since I think it's only specialness will be that it is the registry that is used by default. But that's just my current thought, if anyone can think of a better design I'm all ears. I should note that one design concern I have in all this is that it so far looks like importing email will, under this registry design, end up importing pretty much *all* of the email classes (and there will be more of them than in the current package). I'm so far ignoring that issue, treating it as a premature optimization, but if anyone has any clever ideas or other thoughts, let me know. --David From paull at qnx.com Fri Mar 4 16:01:56 2011 From: paull at qnx.com (Paul Leroux) Date: Fri, 4 Mar 2011 10:01:56 -0500 Subject: [Email-SIG] email6 funding In-Reply-To: <20110303215231.41b55ddd@neurotica.wooz.org> References: <20110303014112.A4F3E21AE4B@kimball.webabinitio.net> <20110303215231.41b55ddd@neurotica.wooz.org> Message-ID: <1CF662C832BF6F4AADE933869AB1B701696E24@neptune.ott.qnx.com> Thanks Barry. 
QNX will have a booth at Pycon, and Andy will be there. Feel free to drop by and say hello to him. - Paul -----Original Message----- From: Barry Warsaw [mailto:barry at python.org] Sent: March 3, 2011 9:53 PM To: R. David Murray Cc: email-sig at python.org; Paul Leroux; Andy Gryc Subject: Re: [Email-SIG] email6 funding On Mar 02, 2011, at 08:41 PM, R. David Murray wrote: >So, now that I've cleared my reply backlog, on to the exciting news. > >Some of you may have seen Jesse Noller's retweet of the tweet from >Paul Leroux of QNX. This is big news for me (and for the email-sig :): >QNX wants to fund me to do the email6 development. > >We are still working out the details, but I think you can expect to >see email6 development go into overdrive in the near future. Like, >right after PyCon. We're preparing things at my consulting firm to >allow me to spend a significant amount of my time working on email6. > >I am *seriously* excited by this, and very grateful to QNX. What can I say other than: AWESOME! Thanks QNX! >Anyone interested in an email6 BOF at PyCon or a brainstorming >session during the Sprints afterward, please let me know. o/ I probably won't have time to sprint on email this year, but I would love to have a BOF. -Barry From barry at python.org Fri Mar 4 16:16:00 2011 From: barry at python.org (Barry Warsaw) Date: Fri, 4 Mar 2011 10:16:00 -0500 Subject: [Email-SIG] email6 funding In-Reply-To: <1CF662C832BF6F4AADE933869AB1B701696E24@neptune.ott.qnx.com> References: <20110303014112.A4F3E21AE4B@kimball.webabinitio.net> <20110303215231.41b55ddd@neurotica.wooz.org> <1CF662C832BF6F4AADE933869AB1B701696E24@neptune.ott.qnx.com> Message-ID: <20110304101600.4407b12c@neurotica.wooz.org> On Mar 04, 2011, at 10:01 AM, Paul Leroux wrote: >Thanks Barry. QNX will have a booth at Pycon, and Andy will be there. >Feel free to drop by and say hello to him. I will! Cheers, -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: not available URL: From barry at python.org Fri Mar 4 17:02:28 2011 From: barry at python.org (Barry Warsaw) Date: Fri, 4 Mar 2011 11:02:28 -0500 Subject: [Email-SIG] API thoughts In-Reply-To: <20110304133304.861442499B0@kimball.webabinitio.net> References: <20110301204058.54C96249A9D@kimball.webabinitio.net> <20110302154624.5dea1bd7@limelight.wooz.org> <20110303012341.DF05922B74A@kimball.webabinitio.net> <20110303215559.572fcede@neurotica.wooz.org> <20110304133304.861442499B0@kimball.webabinitio.net> Message-ID: <20110304110228.206f870f@neurotica.wooz.org> On Mar 04, 2011, at 08:33 AM, R. David Murray wrote: >Ah, yes. Well, so far my thought is that there is a global registry >for the email package itself, but since email package access to that >registry will be through the 'factory', there is nothing that says that >has to be the only registry used by an application. The existence of >the email package global registry will allow the addition of classes >to the "default" registry by libraries (if we dare :) and applications, >while access through the factory means that an application is free >to manage a completely independent registry if it prefers. Or perhaps >it is better to think about the default email package registry as >just that, the *default* registry, since I think it's only specialness >will be that it is the registry that is used by default. I think that's a great place to start. >But that's just my current thought, if anyone can think of a better >design I'm all ears. > >I should note that one design concern I have in all this is that it so >far looks like importing email will, under this registry design, end up >importing pretty much *all* of the email classes (and there will be more >of them than in the current package). I'm so far ignoring that issue, >treating it as a premature optimization, but if anyone has any clever >ideas or other thoughts, let me know. 
Yeah, that's a problem. Maybe we (the Python community) should invest in good lazy importing support for Python 3.3? I know that this has been reinvented several times already. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: not available URL: From sdaoden at googlemail.com Mon Mar 7 21:06:08 2011 From: sdaoden at googlemail.com (Steffen Daode Nurpmeso) Date: Mon, 7 Mar 2011 21:06:08 +0100 Subject: [Email-SIG] API thoughts Message-ID: <20110307200608.GA31032@sherwood.local> I was never involved in discussions, so that the topics i address may have been defined for EMAIL 6 already etc., but because i've not found anything in the archives of the list back in 2010 i add yet another feature request which really worries me. I find the interface a bit inconsistent in respect to replace_header() (replaces the first header found), __delitem__() (drops them all), __setitem__() (appends) in any case. (I personally would through these __accessor__ things away, they taste a bit strange when used to access email payload.) And i would provide a series of functions which can be used to get/set/modify header fields and bodies: i would check wether the argument is a list and if, it would mean "all bodies of a field". This is of course very hard to implement if it's done gracefully, i.e. with modification-detection, order-preservation etc. Another, easier to implement, idea would be (yet) an(other) iterator which supports in-place editing. Perfect: it could yield a (to be invented) class which offers methods like .field(), .bodies() (all [bodies] - maybe even as sub-iterator), .remove_field() etc... Doing it like this would offer the possibility to easily detect in-place editing of header bodies etc... All of these are just suggestions and my very personal point of view, of course. 
But one thing is true, and that's that it is currently really hard to remove or replace just one body of a field, especially if there are multiple bodies for a field. -- Steffen Daode From barry at python.org Mon Mar 7 23:15:29 2011 From: barry at python.org (Barry Warsaw) Date: Mon, 7 Mar 2011 17:15:29 -0500 Subject: [Email-SIG] API thoughts In-Reply-To: <20110307200608.GA31032@sherwood.local> References: <20110307200608.GA31032@sherwood.local> Message-ID: <20110307171529.4dc9631c@neurotica.wooz.org> On Mar 07, 2011, at 09:06 PM, Steffen Daode Nurpmeso wrote: >I find the interface a bit inconsistent in respect to >replace_header() (replaces the first header found), __delitem__() >(drops them all), __setitem__() (appends) in any case. >(I personally would through these __accessor__ things away, they >taste a bit strange when used to access email payload.) I personally like this part of the API, and I think it's held up well under years of use. In general you don't care about header order, so using various combinations of del, .get_all(), and __setitem__ work fine. The semantics of message-as-dict API, header ordering, the various header methods, etc. was thought out and discussed, and I don't have a problem with them. >And i would provide a series of functions which can be used >to get/set/modify header fields and bodies: >i would check wether the argument is a list and if, it would mean >"all bodies of a field". This is of course very hard to implement >if it's done gracefully, i.e. with modification-detection, >order-preservation etc. > >Another, easier to implement, idea would be (yet) an(other) >iterator which supports in-place editing. Perfect: it could yield >a (to be invented) class which offers methods like .field(), >.bodies() (all [bodies] - maybe even as sub-iterator), >.remove_field() etc... >Doing it like this would offer the possibility to easily detect >in-place editing of header bodies etc... 
> >All of these are just suggestions and my very personal point of >view, of course. >But one thing is true, and that's that it is currently really hard >to remove or replace just one body of a field, especially if there >are multiple bodies for a field. Well, replace one header retaining original order is a bit difficult, but I've rarely had to do that. Still, it would probably make sense to add such functionality -- *if* it can be done without complicating the API or the implementation. I think it could too, by adding an index argument to .replace_header(), and using .get_all() to get an ordered list of the headers of interest. Cheers, -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: not available URL: From barry at python.org Mon Mar 7 23:17:33 2011 From: barry at python.org (Barry Warsaw) Date: Mon, 7 Mar 2011 17:17:33 -0500 Subject: [Email-SIG] unixfrom and __str__() Message-ID: <20110307171733.79cc269f@neurotica.wooz.org> One other thing I'm reminded of: we should definitely switch the parity of the 'unixfrom' value in __str__(). IOW, do *not* include the envelope header by default in str(msg). -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: not available URL: From sdaoden at googlemail.com Tue Mar 8 15:32:51 2011 From: sdaoden at googlemail.com (Steffen Daode Nurpmeso) Date: Tue, 8 Mar 2011 15:32:51 +0100 Subject: [Email-SIG] API thoughts In-Reply-To: <20110307171529.4dc9631c@neurotica.wooz.org> References: <20110307171529.4dc9631c@neurotica.wooz.org> Message-ID: <20110308143251.GA61190@sherwood.local> Barry Warsaw wrote: > I personally like this part of the API, and I think it's held up well under > years of use. :-) msg[f] is indeed and really an elegant and understand-at-a-glance way to access headers. 
(Possible restriction: it would be graceful if it would return and
take a list.)

> Well, replace one header retaining original order is a bit difficult, but I've
> rarely had to do that.
[...]
> I think it could too, by adding an index argument to .replace_header(),
> and using .get_all() to get an ordered list of the headers of interest.

... and give me a way to also delete just one body of a field and
i'll be lucky.
Maybe simply 'Message._headers = {normalized_field = [bodies]}'?
But, why not .delete_all_of(0, 2, 5), realized by a walk in equal
spirit to .get_all().

(My thought was that a new Proxy class can be added very easily,
requiring only one new method in Message and
without affecting the remaining interface,
whatever status David's local EMAIL 6 branch is currently in and
whatever approach he will have chosen in the end.
Anyway, and unless i missed something, this is the current way:

    def _bewitch_msg(self):
        """Handle Python 3.2.0/3.3a0 issue 11401 email/message.py error"""
        if sys.hexversion > 0x030300A1 or sys.hexversion > 0x030200F1:
            return
        for f in self._msg:
            had_repl = False
            new_ab = []
            ab = self._msg.get_all(f)
            for b in ab:
                if not len(b):
                    had_repl = True
                    b = ' '
                new_ab.append(b)
            if had_repl:
                del self._msg[f]
                for b in new_ab:
                    self._msg[f] = b

At best the very same could be achieved (faster and with smaller
memory footprint):

    for p in self._msg.proxy_iter():
        for (idx, body) in p:
            if not len(body):
                p[idx] = ' '
)

From barry at python.org  Tue Mar  8 18:10:51 2011
From: barry at python.org (Barry Warsaw)
Date: Tue, 8 Mar 2011 12:10:51 -0500
Subject: [Email-SIG] API thoughts
In-Reply-To: <20110308143251.GA61190@sherwood.local>
References: <20110307171529.4dc9631c@neurotica.wooz.org>
	<20110308143251.GA61190@sherwood.local>
Message-ID: <20110308121051.35b81289@neurotica.wooz.org>

On Mar 08, 2011, at 03:32 PM, Steffen Daode Nurpmeso wrote:

>msg[f] is indeed and really an elegant and understand-at-a-glance
>way to access headers.
>(Possible restriction: it would be graceful if it would return and
>take a list.)

Actually, I disagree. :)  From experience, look at .get_payload().  It tries
to manage both scalar payloads and list payloads (for multiparts), and it
sucks.  In hindsight (and email6) I hope that .get_payload() will be split
into separate API methods, one for simple payloads like image or audio data,
and another for multipart access.

So for headers, I think setitem/getitem/delitem should be reserved for simple
manipulation with well defined semantics (as they currently are), and new API
methods should be added for full access to headers when multiple ones are
present.

>... and give me a way to also delete just one body of a field and
>i'll be lucky.

That's a good idea too.

>Maybe simply 'Message._headers = {normalized_field = [bodies]}'?

I'm not sure what that means, but yeah, you definitely don't want to be
messing with that private attribute.

>But, why not .delete_all_of(0, 2, 5), realized by a walk in equal
>spirit to .get_all().
>
>(My thought was that a new Proxy class can be added very easily,
>requiring only one new method in Message and
>without affecting the remaining interface,
>whatever status David's local EMAIL 6 branch is currently in and
>whatever approach he will have chosen in the end.

It's an interesting idea.  Why don't you flesh that out and propose something
concrete, with a working implementation if possible?

Anyway, rewriting headers is not that hard:

    #! /usr/bin/env python3

    from email import message_from_string as mfs

    msg = mfs("""\
    From: aperson at example.com
    X-Header: aardvark
    To: bperson at example.com
    X-Header: beaver
    Subject: foo
    X-Header: cougar
    X-Header: dingo

    """)

    def yummy_toppings():
        for topping in ('duck', 'cheese', 'black olive', 'anchovy'):
            yield topping

    toppings = yummy_toppings()

    # Collect every header in order, swapping in a new X-Header value.
    new_headers = []
    for header, value in msg.items():
        if header.lower() == 'x-header':
            new_headers.append(('X-Header', next(toppings)))
        else:
            new_headers.append((header, value))

    # Drop all existing headers (keys() is a snapshot list) ...
    for header in msg.keys():
        del msg[header]

    # ... and re-add them in their original order.
    for header, value in new_headers:
        msg[header] = value

    print(msg)

Cheers,
-Barry

From sdaoden at googlemail.com  Thu Mar 24 17:10:11 2011
From: sdaoden at googlemail.com (Steffen Daode Nurpmeso)
Date: Thu, 24 Mar 2011 17:10:11 +0100
Subject: [Email-SIG] I miss size() (and some latest frustration)
Message-ID: <20110324161010.GD69753@sherwood.local>

I'm stressing this list again, but i stumbled over a missing
[message_]size().
http://wiki.python.org/moin/Email%20SIG/DesignThoughts makes it
a prerequisite for the new EMail package that

    The API needs to at a minimum have hooks available for an
    application to store data on disk rather than holding
    everything in memory.

It would be great if the message (file) size would also be
provided as a public method, so that code-flow decisions can be
made dependent upon the plain size of a message.
(The size is known without parsing for many real-life message
objects anyway or can be detected *cheap*.  True, e.g., for
all Message objects which are created by mailbox.py.)

It's also so unfortunate that 'headersonly' of Parser is in fact
treated as "a backwards compatibility hack", effectively consuming
the entire input nonetheless.
And *DesignThoughts* treats lazy parsing/partial loading as an "interesting idea" only, though i can think about many cases where it is a good thing to parse a Message{Headers[/Part/Part/Part...]} sequentially. E.g., why should a spam detector load an entire message if it only wants to check addresses against some white-/blacklists and simply throw away bad hits. Even more, why should a companies dispatcher read all the content if it's only about to rewrite addresses and dispatch the mail to some other internal server. (Of course - hey, it's you, you know *such* more about this stuff than i do.) Waiting is an electric experience ... Have fun. -- Steffen Daode Nurpmeso :wq steffen From barry at python.org Thu Mar 24 22:41:49 2011 From: barry at python.org (Barry Warsaw) Date: Thu, 24 Mar 2011 17:41:49 -0400 Subject: [Email-SIG] I miss size() (and some latest frustration) In-Reply-To: <20110324161010.GD69753@sherwood.local> References: <20110324161010.GD69753@sherwood.local> Message-ID: <20110324174149.78391d3a@neurotica.wooz.org> On Mar 24, 2011, at 05:10 PM, Steffen Daode Nurpmeso wrote: >It would be great if the message (file) size would also be >provided as a public method, so that code-flow decisions can be >made dependend upon the plain size of a message. >(The size is known without parsing for many real-life message >objects anyway or can be detected *cheap*. True, e.g., for >all Message objects which are created by mailbox.py.) Certainly the normal FeedParser will see every byte of the message, even if it does save parts of it on disk. Mailman 3's LMTP server also sees every byte and tucks the size away on an .original_size attribute of its Message subclass. But how would you handle it when you are creating the message yourself? I think there are too many places you'd have to hook to get an accurate reading, or you'd have to essentially serialize it via a generator before you'd know, so it's less than helpful. 
It may indeed be possible to ask some external process what the size of the message is, but it would likely be a hint you couldn't necessarily trust. (I.e. the server might only have an approximate size.) So, I'm not sure whether the email package can have a consistent notion of a message's 'size'. Perhaps though it ought to define an attribute for when the message is created by a parser, but let it be writable so that e.g. your application could get it from an IMAP server or whatever, and stick it in the attribute. >It's also so unfortunate that 'headersonly' of Parser is in fact treated as >"a backwards compatibility hack", effectively consuming the entire input >nonetheless. And *DesignThoughts* treats lazy parsing/partial loading as an >"interesting idea" only, though i can think about many cases where it is a >good thing to parse a Message{Headers[/Part/Part/Part...]} sequentially. > >E.g., why should a spam detector load an entire message if it only wants to >check addresses against some white-/blacklists and simply throw away bad >hits. Even more, why should a companies dispatcher read all the content if >it's only about to rewrite addresses and dispatch the mail to some other >internal server. (Of course - hey, it's you, you know *such* more about this >stuff than i do.) Do you have suggestions for how the email package can help with these use cases? Do you have specific API or implementation proposals? Cheers, -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: not available URL: From v+python at g.nevcal.com Thu Mar 24 23:54:48 2011 From: v+python at g.nevcal.com (Glenn Linderman) Date: Thu, 24 Mar 2011 15:54:48 -0700 Subject: [Email-SIG] I miss size() (and some latest frustration) In-Reply-To: <20110324174149.78391d3a@neurotica.wooz.org> References: <20110324161010.GD69753@sherwood.local> <20110324174149.78391d3a@neurotica.wooz.org> Message-ID: <4D8BCBB8.7050300@g.nevcal.com> On 3/24/2011 2:41 PM, Barry Warsaw wrote: > On Mar 24, 2011, at 05:10 PM, Steffen Daode Nurpmeso wrote: > >> It would be great if the message (file) size would also be >> provided as a public method, so that code-flow decisions can be >> made dependend upon the plain size of a message. >> (The size is known without parsing for many real-life message >> objects anyway or can be detected *cheap*. True, e.g., for >> all Message objects which are created by mailbox.py.) > Certainly the normal FeedParser will see every byte of the message, even if it > does save parts of it on disk. Mailman 3's LMTP server also sees every byte > and tucks the size away on an .original_size attribute of its Message > subclass. > > But how would you handle it when you are creating the message yourself? I > think there are too many places you'd have to hook to get an accurate reading, > or you'd have to essentially serialize it via a generator before you'd know, > so it's less than helpful. > > It may indeed be possible to ask some external process what the size of the > message is, but it would likely be a hint you couldn't necessarily trust. > (I.e. the server might only have an approximate size.) > > So, I'm not sure whether the email package can have a consistent notion of a > message's 'size'. Perhaps though it ought to define an attribute for when the > message is created by a parser, but let it be writable so that e.g. 
your > application could get it from an IMAP server or whatever, and stick it in the > attribute. When created by a parser, it could have the notion of size-seen-so-far, or bytes-fed. Once the whole message has been processed, the size of the message would be known, as well as of each piece. Incomplete messages, such as those from IMAP servers for which only partial requests have been made for pieces, could only get the concept of "total size" from the server, if it provides it. Since POP servers do, I think IMAP would also, but I'm not an IMAP expert. >> It's also so unfortunate that 'headersonly' of Parser is in fact treated as >> "a backwards compatibility hack", effectively consuming the entire input >> nonetheless. And *DesignThoughts* treats lazy parsing/partial loading as an >> "interesting idea" only, though i can think about many cases where it is a >> good thing to parse a Message{Headers[/Part/Part/Part...]} sequentially. >> >> E.g., why should a spam detector load an entire message if it only wants to >> check addresses against some white-/blacklists and simply throw away bad >> hits. Even more, why should a companies dispatcher read all the content if >> it's only about to rewrite addresses and dispatch the mail to some other >> internal server. (Of course - hey, it's you, you know *such* more about this >> stuff than i do.) > Do you have suggestions for how the email package can help with these use > cases? Do you have specific API or implementation proposals? For message parsing, it seems like allowing registered callbacks for various pieces would be handy... "Call me when you parse this type of a header" (or body part, etc.). -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From sdaoden at googlemail.com  Fri Mar 25 20:15:17 2011
From: sdaoden at googlemail.com (Steffen Daode Nurpmeso)
Date: Fri, 25 Mar 2011 20:15:17 +0100
Subject: [Email-SIG] I miss size() (and some latest frustration)
In-Reply-To: <20110324174149.78391d3a@neurotica.wooz.org>
References: <20110324174149.78391d3a@neurotica.wooz.org>
Message-ID: <20110325191517.GE86511@sherwood.local>

On Thu, Mar 24, 2011 at 05:41:49PM -0400, Barry Warsaw wrote:
> On Mar 24, 2011, at 05:10 PM, Steffen Daode Nurpmeso wrote:
> So, I'm not sure whether the email package can have a consistent notion of a
> message's 'size'.

> Do you have suggestions for how the email package can help with these use
> cases?  Do you have specific API or implementation proposals?

An incremental package must of course have a notion of a "current
state of a message", so that all methods of an object must first
check whether they're applicable - anyway!?
Methods which can be used in multiple states need to document how
they react in each of those anyway (if behaviour changes).
So that there may be .current_parse_state() returning
a to-be-defined enum.
Or size() may return a tuple (Bool_is_final_size, current_size)
(but that's really ugly).

Besides size(), the most simple way would be to extend the FeedParser
so that it could stop in a defined way at all boundaries of
a message (i.e. Headers,Part,Part...).  That would be a state().
It would need to be restartable, i.e., .close() may remain and
return an entire message, but .last_part() or so/etc. must be added.
.feed() must return something useful, too.  E.g.:

    dataf = SOMERAWDATA.get_fileobject()
    while 1:
        l = dataf.readline()
        ..
        parser_state = fp.feed(l)
        if parser_state == fp.BOUNDARY_SEEN:
            ..
            break
    ..
    # This is a header object
    # (Or, simply: Message without payload)
    headerobject = fp.get_headers()
    rewrite_headers(headerobject)
    datachunk = prepare_as_sendfile_header_object(headerobject)
    call_sendfile_with_headers_and_unchanged_rest_of_dataf

Interestingly FeedParser has almost all capabilities which are
required to do all that internally, but it does not offer it to
the outside.  8-)

Anyway, EMail is capable of many things, but it does not expose
them to the outside, so that one gets stuck soon if a special task
is to be performed.
email.message_from_xy() is a fantastic abstraction of a complex
set of RFCs and real-life potholes.
On the other hand a programming package is not a shelter - you can
mess up any package which goes beyond some message_from_xy().
So i really think that it is acceptable to offer an interface
which gives you access to partially constructed objects as long as
it is well-defined in some manner.
--
Steffen Daode Nurpmeso :wq steffen

From sdaoden at googlemail.com  Fri Mar 25 20:19:21 2011
From: sdaoden at googlemail.com (Steffen Daode Nurpmeso)
Date: Fri, 25 Mar 2011 20:19:21 +0100
Subject: [Email-SIG] I miss size() (and some latest frustration)
In-Reply-To: <4D8BCBB8.7050300@g.nevcal.com>
References: <4D8BCBB8.7050300@g.nevcal.com>
Message-ID: <20110325191921.GA29700@sherwood.local>

On Thu, Mar 24, 2011 at 03:54:48PM -0700, Glenn Linderman wrote:
> For message parsing, it seems like allowing registered callbacks
> for various pieces would be handy... "Call me when you parse this
> type of a header" (or body part, etc.).

A completely different idea, but i also like it.
I remember that DOM did not even rock a bit unless SAX came up.
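
[Editorial sketch] The header-only use case discussed in this thread can be
approximated with the FeedParser as it exists in the standard library, if the
application itself watches for the blank line that ends the header block.
This is not the staged-parse API proposed above (no parser states, no
.get_headers()); the function name read_headers_only and the sample message
are assumptions made for illustration only.

```python
import io
from email.parser import FeedParser


def read_headers_only(fileobj):
    """Feed lines to a FeedParser until the blank line ending the headers.

    Returns a Message that carries the headers and an empty payload.
    The body is left unread in fileobj, so the caller can stream it
    elsewhere without the email package ever seeing it.
    """
    parser = FeedParser()
    # iter(readline, "") stops at end of file without any read-ahead.
    for line in iter(fileobj.readline, ""):
        if line in ("\n", "\r\n"):
            break  # header/body separator: stop before the body
        parser.feed(line)
    return parser.close()


# Hypothetical sample input standing in for a very large message.
raw = io.StringIO(
    "From: aperson@example.com\n"
    "To: bperson@example.com\n"
    "Subject: big attachment\n"
    "\n"
    "imagine several gigabytes of body here\n"
)

headers = read_headers_only(raw)
rest = raw.read()  # the untouched remainder of the input
print(headers["Subject"])
print(len(rest))
```

The trade-off is that the application, not the parser, decides where the
headers end, which is exactly the line-wise control of the input side that
the proposal above would make unnecessary.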
From barry at python.org Fri Mar 25 21:10:03 2011 From: barry at python.org (Barry Warsaw) Date: Fri, 25 Mar 2011 16:10:03 -0400 Subject: [Email-SIG] I miss size() (and some latest frustration) In-Reply-To: <4D8BCBB8.7050300@g.nevcal.com> References: <20110324161010.GD69753@sherwood.local> <20110324174149.78391d3a@neurotica.wooz.org> <4D8BCBB8.7050300@g.nevcal.com> Message-ID: <20110325161003.496e418d@neurotica.wooz.org> On Mar 24, 2011, at 03:54 PM, Glenn Linderman wrote: >When created by a parser, it could have the notion of size-seen-so-far, or >bytes-fed. Once the whole message has been processed, the size of the >message would be known, as well as of each piece. It makes sense to record this in the Message objects, but I'd want to be very careful about what that attribute is called. Using just 'size' could be misleading, either because parsing has not completed, or because they might think that it's an exact count of the serialized size. Something like 'parsed_byte_count' might be okay though. >Incomplete messages, such as those from IMAP servers for which only partial >requests have been made for pieces, could only get the concept of "total >size" from the server, if it provides it. Since POP servers do, I think IMAP >would also, but I'm not an IMAP expert. In a case like that, an attribute such as 'server_reported_size' or some such would be okay. >For message parsing, it seems like allowing registered callbacks for various >pieces would be handy... "Call me when you parse this type of a header" (or >body part, etc.). I think David's design documents to allow for extensions and callbacks based on the content-types of things seen. Cheers, -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: not available URL: From glenn at nevcal.com Fri Mar 25 22:19:00 2011 From: glenn at nevcal.com (Glenn Linderman) Date: Fri, 25 Mar 2011 14:19:00 -0700 Subject: [Email-SIG] I miss size() (and some latest frustration) In-Reply-To: <20110325161003.496e418d@neurotica.wooz.org> References: <20110324161010.GD69753@sherwood.local> <20110324174149.78391d3a@neurotica.wooz.org> <4D8BCBB8.7050300@g.nevcal.com> <20110325161003.496e418d@neurotica.wooz.org> Message-ID: <4D8D06C4.9010008@nevcal.com> On 3/25/2011 1:10 PM, Barry Warsaw wrote: >> For message parsing, it seems like allowing registered callbacks for various >> >pieces would be handy... "Call me when you parse this type of a header" (or >> >body part, etc.). > I think David's design documents to allow for extensions and callbacks based > on the content-types of things seen. I recall registration of handlers for various mime times. I don't recall callbacks (registered handlers) being available for header parsing, but no time to find and reread at the moment. Would be a good idea, though. Also, callbacks should have the capability to stop the parse. That technique could be used to implement "only parse headers" also, but it might be nicer to implement that as a flag when parsing starts. Along this line, if parsing is stopped, it would be nice to be able to retrieve the unparsed data for alternate use (some is likely to have been already retrieved from whatever data stream, and passed as a "chunk" to the parser; an early-out would leave a "partial chunk" that hasn't been processed, but may want to be processed by some other entity, even if only for logging or error reporting. -- Glenn ------------------------------------------------------------------------ Experience is that marvelous thing that enables you to recognize a mistake when you make it again. -- Franklin Jones -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rdmurray at bitdance.com Fri Mar 25 22:25:20 2011 From: rdmurray at bitdance.com (R. David Murray) Date: Fri, 25 Mar 2011 17:25:20 -0400 Subject: [Email-SIG] I miss size() (and some latest frustration) In-Reply-To: <20110325161003.496e418d@neurotica.wooz.org> References: <20110324161010.GD69753@sherwood.local> <20110324174149.78391d3a@neurotica.wooz.org> <4D8BCBB8.7050300@g.nevcal.com> <20110325161003.496e418d@neurotica.wooz.org> Message-ID: <20110325212521.3FBFB1454A2@kimball.webabinitio.net> On Fri, 25 Mar 2011 16:10:03 -0400, Barry Warsaw wrote: > >For message parsing, it seems like allowing registered callbacks for various > >pieces would be handy... "Call me when you parse this type of a header" (or > >body part, etc.). > > I think David's design documents to allow for extensions and callbacks based > on the content-types of things seen. Effectively, yes. The idea is that there is a factory that gets called whenever a mime content type or a header is instantiated, so that factory can do whatever magic it would like. The standard factories will have a lookup table for the factories for individual types, so you can alternately use a copy of the standard factory with just the headers or mime types you are interested in hooked. We'll want to refine the design when I get near to actually implementing it. -- R. David Murray http://www.bitdance.com From sdaoden at googlemail.com Sat Mar 26 16:56:53 2011 From: sdaoden at googlemail.com (Steffen Daode Nurpmeso) Date: Sat, 26 Mar 2011 16:56:53 +0100 Subject: [Email-SIG] I miss size() (and some latest frustration) In-Reply-To: <20110324174149.78391d3a@neurotica.wooz.org> References: <20110324174149.78391d3a@neurotica.wooz.org> Message-ID: <20110326155653.GA44697@sherwood.local> First of all i have to say that i am sooo prowd of myself that this mail manages to get addressed correctly right away! Wow! (Or WAU! WAU! as those four-legged germans would say;) Thanks for your understanding. 
On Thu, Mar 24, 2011 at 05:41:49PM -0400, Barry Warsaw wrote:
> Certainly the normal FeedParser will see every byte of the
> message, even if it does save parts of it on disk.  Mailman 3's
> LMTP server also sees every byte

I'm afraid of it, and i hate it from the bottom of my heart, but
it is to be expected that EMail 6 will see times where mails
actually contain entire 3-D Blockbusters as MIME attachments.
And the truth will not be far from that.
Thus i personally would really vote for the possibility that
parsing can be stopped at defined boundaries so that

    write(target_file, yet_parsed_object.data())
    while 1:
        x = source_file.read()
        target_file.write(x)

can be used directly (i.e. no swallowed boundary line).
Hooks are a fine thing but they are on the wrong side of the story
for this kind of problem (unless you have full, i.e. linewise,
control of the input side, too, and set one flag here and there.)

Have a nice weekend - it's cherry blossom, and it smells
fantastic!
--
Steffen Daode Nurpmeso :wq steffen

From rdmurray at bitdance.com  Tue Mar 29 01:39:21 2011
From: rdmurray at bitdance.com (R. David Murray)
Date: Mon, 28 Mar 2011 19:39:21 -0400
Subject: [Email-SIG] Email6 repository, and policy framework first draft
Message-ID: <20110328233921.70B58D64A7@kimball.webabinitio.net>

I've set up the feature branch for email6:

    http://hg.python.org/features/email6

The branch inside the repo is email6.  I'll probably wind up having
subbranches unless my proposals get approved quickly :)

So far I've checked in the first draft of my proposal for the
policy framework.  I've blogged about this:

    http://www.bitdance.com/blog/2011/03/28_01_Policy_Framework_First_Draft/

Here's the text version of the blog post:

2011-03-28 Policy Framework First Draft
=======================================

Last week turned out to be mostly about tests and bugs.  As per my last
post, I moved the tests into a test package.
Then I went on to add a bunch of `additional tests`_ developed by Michael
Henry at the PyCon sprints.  More tests are always good before starting
to modify code, right?

.. _additional tests: http://bugs.python.org/issue11589

Michael's tests had revealed a couple bugs, though, so I then went on to
apply the `fix`_ for those bugs, which included a `rewritten algorithm`_
for encoding strings as quoted printable.  I adapted the algorithm
proposed by Michael, then discovered a different and probably `better
algorithm`_ had already been proposed a while back and gotten lost in the
tracker.  That proposed patch was against the email package in Python2,
though, and the corresponding code in Python3 has a different interface,
so the patch wasn't easily adapted.  Since there are other changes that
need to be made to the quoted printable encoder, I have deferred
implementing the better algorithm until I get as far as touching that
code for the email6 work.

.. _fix: http://bugs.python.org/issue11590
.. _rewritten algorithm: http://bugs.python.org/issue11606
.. _better algorithm: http://bugs.python.org/issue5803

There was also a `bug`_ in the Email5 API that I wanted to fix before
starting to make API changes.  When you deal with "dirty" headers in
Email5.1, you may get back a ``Header`` object when querying a header.
Now, the normal way to deal with crazy headers in Email5 is to pass them
to ``decode_header`` to get the pairs of character sets and original
bytes from the wire out.  But ``decode_header`` wasn't accepting a
``Header`` object for ``decoding``.  My first approach was to try
shifting back to returning strings even when the header was "dirty", by
wrapping them up in encoded words with the ``unknown-8bit`` charset.
That more or less worked, but doing it that way would mean making some
other changes to methods such as ``get_param`` to handle headers that had
gotten re-encoded into encoded words.  This was far from optimal.
The reporter of the bug pointed out that I had carefully documented that
``Message`` would return a ``Header`` if the source header had unencoded
non-ASCII bytes in it, which made changing this behavior in a bug fix
release a non-starter. So I gave in and just fixed ``decode_header`` to
handle ``Header`` objects. Since *all* headers in email6 will be a (new
type of) ``Header`` object, programmers may as well get used to dealing
with them.

.. _bug: http://bugs.python.org/issue11584

For email6 itself, there is now a `feature branch`_ where I will do the
patch development for email6 before applying the changes to the main
cpython repository. The branch is named ``email6``, of course. Anyone
may browse or clone this repository to take a look at the current state
of development.

.. _feature branch: http://hg.python.org/features/email6

And that current state is that I have checked in the first draft of the
Policy framework. This consists of a new module, `policy.py`_, the
associated documentation, `policy.rst`_, and a set of tests,
`test_policy.py`_.

.. _policy.py: http://hg.python.org/features/email6/file/email6/Lib/email/policy.py
.. _policy.rst: http://hg.python.org/features/email6/file/email6/Doc/library/email.policy.rst
.. _test_policy.py: http://hg.python.org/features/email6/file/email6/Lib/test/test_email/test_policy.py

The basic idea is that a ``Policy`` object is an immutable container for
a bunch of attributes and callback hooks. You can call a ``Policy``
object to get a new one with some of the defaults changed. And you can
add them together, with the non-default settings from the right operand
overriding those from the left operand.

So far we have policies such as:

* default
* SMTP
* HTTP
* Strict

*default* may get renamed *email6*. I'd prefer 'default', since that's
what I'd like it to be by the time we get to Python 3.4. The actual
default policy when I start adding the parameter to other classes and
functions will be *email5*, though, so the name *default* for email6 is
probably not going to work.

The *SMTP* policy is just like default, but generates "wire format" line
separators (``\r\n``). *HTTP* is like *SMTP*, but does not wrap headers.
*Strict* sets a flag that will (once I implement it) cause the parser to
raise errors when it encounters defects instead of just keeping track of
them.

Using *Strict* is where you can see the utility of adding policies
together::

    >>> StrictSMTP = SMTP + Strict

You could use StrictSMTP to parse an incoming SMTP message where you
wanted your program to blow up if the message was invalid. (When would
you ever want that? I don't know, but someone probably will!)

So far I've only defined one hook, ``register_defect``. You could
subclass ``Policy`` and define your own ``register_defect`` method that
would, say, log all defects to a log file, thus giving you some idea of
the quality of the email being processed by your program, even if you
did nothing else with the defect info.

Now we'll see what the Email SIG thinks of this implementation, and
meanwhile I'll be adding policy arguments to the parser and generator
classes.
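
For anyone who wants to experiment with policy addition and the
``register_defect`` hook described above, here is a minimal sketch. It
assumes the ``email.policy`` module as it later shipped in the Python
standard library rather than the first draft in the feature branch, so
some spellings differ from the post: the strict policy is ``strict``
(lowercase), and ``clone()`` takes the place of calling a policy.
``RecordingPolicy``, ``seen_defects`` and the sample message are
hypothetical names made up for illustration.

```python
from email import errors, message_from_string, policy
from email.policy import EmailPolicy

# Adding two policies gives a new policy; non-default settings of the
# right operand override those of the left operand.
strict_smtp = policy.SMTP + policy.strict

# clone() returns a modified copy; the original policy is left unchanged.
unwrapped_smtp = policy.SMTP.clone(max_line_length=None)

seen_defects = []  # hypothetical stand-in for a log file


class RecordingPolicy(EmailPolicy):
    """Hypothetical subclass: record each defect, then handle it as usual."""

    def register_defect(self, obj, defect):
        seen_defects.append(defect)
        super().register_defect(obj, defect)


# A multipart message that never declares a boundary is defective.
DEFECTIVE = "Content-Type: multipart/mixed\n\nThis multipart has no boundary.\n"

# Non-strict parsing: defects are recorded on the message and in our list.
msg = message_from_string(DEFECTIVE, policy=RecordingPolicy())
for defect in seen_defects:
    print("recorded:", type(defect).__name__)

# Strict parsing: the first defect is raised instead of being recorded.
strict_raised = False
try:
    message_from_string(DEFECTIVE, policy=strict_smtp)
except errors.MessageDefect as exc:
    strict_raised = True
    print("strict policy raised:", type(exc).__name__)
```

Keeping the hook on the policy means the same parser code serves both
the "just keep track" and the "blow up" use cases; only the policy
object passed in changes.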