From tfarrell at owassobible.org Sat Oct 3 15:26:10 2009
From: tfarrell at owassobible.org (Timothy Farrell)
Date: Sat, 3 Oct 2009 08:26:10 -0500 (CDT)
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <14505295.7141254576102143.JavaMail.root@boaz>
Message-ID: <10506972.7161254576370614.JavaMail.root@boaz>

Back in June, David Murray posted the message below about fixing the email module. I have an interest in helping with this due to a personal project I'm working on. However, my ability to help is severely limited by my understanding of email and MIME RFCs.

David asked the question of whether or not passing strings to the feedparser is a needed behavior. I don't claim to have enough knowledge to answer the question yes or no, but I would urge us all to consider that if no answer shows up that David's patch should be put in for no better reason than that it's better than what we currently have.

David, if you would send it to me, I might be able to fix up some of the test cases.

Thanks,
-tim

--------

So, designing a new interface is one thing. Making the current interface usable in py3k is another. I presume that the latter is desirable?

I'm porting a small application that uses the email module to py3k. I've run into two problems, one of which was already reported, the other of which was not:

http://bugs.python.org/issue4661
http://bugs.python.org/issue6302

(Then there's the whole string issues relating to email and unicode organized under Issue1685453, but I'm going to ignore those for the moment.)

I'd like to try fixing these, but there are design issues involved. The fundamental one is, what format should 'message' be handling message data in? 4661 addresses this obliquely, and we've talked about this somewhat at the higher design level. But the question before me is, how to fix feedparser, message, and decode_header so that I can actually parse a message and display it correctly.

I need to be able to feed bytes to feedparser, that much is clear.
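For readers on a current Python 3, the "feed bytes to feedparser" use case can be sketched as follows. This is not the proof-of-concept patch discussed in this thread; it assumes email.parser.BytesFeedParser, which entered the standard library later (Python 3.2). The sample message, the helper name parse_in_chunks, and the chunk size are made up for illustration.

```python
from email.parser import BytesFeedParser

# Hypothetical wire-format message: ASCII headers, 8bit UTF-8 body.
RAW = (
    b"From: a@example.org\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xc3\xa9\r\n"
)


def parse_in_chunks(raw, chunk_size=16):
    """Feed raw bytes to the parser piece by piece, as a socket reader might."""
    parser = BytesFeedParser()
    for start in range(0, len(raw), chunk_size):
        parser.feed(raw[start:start + chunk_size])
    return parser.close()


msg = parse_in_chunks(RAW)
# decode=True returns the payload as bytes after undoing the transfer encoding;
# turning it into text is then an explicit step using the declared charset.
body_bytes = msg.get_payload(decode=True)
body_text = body_bytes.decode(msg.get_content_charset() or "ascii")
```

The point of the sketch is only that bytes go in at the edge and the conversion to text happens explicitly, using the charset the message itself declares.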
I've implemented a proof-of-concept fix that has feedparser handle all its input as bytes, has message decode headers and values using the ASCII codec if handed bytes, and has decode_header expect strings and consistently return bytes. With this fix in place my application works. But of course, the email module tests do not pass, and I don't know what other use cases I have broken.

My specific question, as posted in issue4661, is: is there any use case for passing strings to feedparser that is not a design error waiting to trap the programmer?

--David

From barry at python.org Sat Oct 3 16:36:51 2009
From: barry at python.org (Barry Warsaw)
Date: Sat, 3 Oct 2009 10:36:51 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <10506972.7161254576370614.JavaMail.root@boaz>
References: <10506972.7161254576370614.JavaMail.root@boaz>
Message-ID: <8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org>

On Oct 3, 2009, at 9:26 AM, Timothy Farrell wrote:

> Back in June, David Murray posted the message below about fixing the
> email module. I have an interest in helping with this due to a
> personal project I'm working on. However, my ability to help is
> severely limited by my understanding of email and MIME RFCs.
>
> David asked the question of whether or not passing strings to the
> feedparser is a needed behavior. I don't claim to have enough
> knowledge to answer the question yes or no, but I would urge us all
> to consider that if no answer shows up that David's patch should be
> put in for no better reason than that it's better than what we
> currently have.

I expect RDM to have some follow ups soon, but I'll put this forward in the meantime.

I firmly believe we need parallel feedparser APIs, one for feeding it strings and one for feeding it bytes. In all the tentative attempts at Python3-ification I've done I just keep coming back to that assessment.
I don't think it's a terrible burden either since I also firmly believe that /internally/ the email package should be bytes-oriented. So the basic model is: accept strings or bytes at the edges, process everything internally as bytes, output strings and bytes at the edges.

-Barry

From stephen at xemacs.org Sat Oct 3 17:41:48 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sun, 04 Oct 2009 00:41:48 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org>
References: <10506972.7161254576370614.JavaMail.root@boaz> <8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org>
Message-ID: <87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:

> So the basic model is: accept strings or bytes at the edges,
> process everything internally as bytes, output strings and bytes at
> the edges.

In a certain pedantic sense, that can't be right, because bytes alone can't represent strings.

Practically, you are going to need to say how a bytes or bytearray is to be interpreted as a string, and that is going to be one big mess. (MIME?)

Going the other way around you have no such problem, or rather the trivial embedding works fine, except that you have to do a range check at some point before you convert to bytes.

From tfarrell at owassobible.org Sat Oct 3 19:09:55 2009
From: tfarrell at owassobible.org (Timothy Farrell)
Date: Sat, 3 Oct 2009 12:09:55 -0500 (CDT)
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <25336492.7211254589432905.JavaMail.root@boaz>
Message-ID: <8510262.7231254589795083.JavaMail.root@boaz>

I agree with Barry insofar as accepting bytes or strings on the input with internal processing in bytes and output bytes or strings depending on the content parsed.
Forgive my ignorance...why does converting bytes to strings have to be a mess? Rather than having two Feedparsers, can't we just pass a default encoding when instantiating a feedparser and have it read from the MIME headers otherwise? If no encoding is passed and one can't be determined, simply output as bytes or try a default and raise an exception if it fails.

If providing the default encoding, no such range check is needed.

----- Original Message -----
From: "Stephen J. Turnbull"
To: "Barry Warsaw"
Cc: "Timothy Farrell" , email-sig at python.org
Sent: Saturday, October 3, 2009 10:41:48 AM GMT -06:00 US/Canada Central
Subject: Re: [Email-SIG] fixing the current email module

Barry Warsaw writes:

> So the basic model is: accept strings or bytes at the edges,
> process everything internally as bytes, output strings and bytes at
> the edges.

In a certain pedantic sense, that can't be right, because bytes alone can't represent strings.

Practically, you are going to need to say how a bytes or bytearray is to be interpreted as a string, and that is going to be one big mess. (MIME?)

Going the other way around you have no such problem, or rather the trivial embedding works fine, except that you have to do a range check at some point before you convert to bytes.

From v+python at g.nevcal.com Tue Oct 6 11:28:41 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Tue, 06 Oct 2009 02:28:41 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <8510262.7231254589795083.JavaMail.root@boaz>
References: <8510262.7231254589795083.JavaMail.root@boaz>
Message-ID: <4ACB0DC9.7080307@g.nevcal.com>

On approximately 10/3/2009 10:09 AM, came the following characters from the keyboard of Timothy Farrell:

> I agree with Barry insofar as accepting bytes or strings on the input with internal processing in bytes and output bytes or strings depending on the content parsed.
>
> Forgive my ignorance...why does converting bytes to strings have to be a mess?
> Rather than having two Feedparsers, can't we just pass a default encoding when instantiating a feedparser and have it read from the MIME headers otherwise? If no encoding is passed and one can't be determined, simply output as bytes or try a default and raise an exception if it fails.
>
> If providing the default encoding, no such range check is needed.
>
> ----- Original Message -----
> From: "Stephen J. Turnbull"
> To: "Barry Warsaw"
> Cc: "Timothy Farrell" , email-sig at python.org
> Sent: Saturday, October 3, 2009 10:41:48 AM GMT -06:00 US/Canada Central
> Subject: Re: [Email-SIG] fixing the current email module
>
> Barry Warsaw writes:
>
> > So the basic model is: accept strings or bytes at the edges,
> > process everything internally as bytes, output strings and bytes at
> > the edges.
>
> In a certain pedantic sense, that can't be right, because bytes alone
> can't represent strings.
>
> Practically, you are going to need to say how a bytes or bytearray is to
> be interpreted as a string, and that is going to be one big mess.
> (MIME?)
>
> Going the other way around you have no such problem, or rather the
> trivial embedding works fine, except that you have to do a range check
> at some point before you convert to bytes.

Email messages are bytes. Usually restricted to bytes in the range 32-127, but sometimes permitted to be 0-255 (8bit encoding). Email messages carry sufficient information to convert bytes to strings (usually; and sufficient defaults to cover the other cases adequately, even if not with 100% certainty).

So if Barry is considering that the internal form is bytes, particularly bytes encoded via email RFCs, then I can't argue with that being a reasonable internal form.... except for one problem, described two paragraphs below.

The only mess that I can see Stephen referring to is the fact that the email RFCs define rather messy encoding formats and character set specifications.
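As an illustration of the kind of encoding format and character set specification meant here, this is a small sketch that decodes an RFC 2047 header value mixing two charsets and both the Q and B encodings. It uses email.header.decode_header from the standard library; the header value and the helper name header_to_text are made up for the example.

```python
from email.header import decode_header

# Hypothetical header value: a Latin-1 word in Q encoding followed by a
# UTF-8 word in B (base64) encoding.
RAW_HEADER = "=?iso-8859-1?q?caf=E9?= =?utf-8?b?5pel5pys?="


def header_to_text(raw):
    """Join the decoded chunks into one str, honouring each chunk's charset."""
    pieces = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            # Chunks without a declared charset are treated as ASCII here,
            # which is an assumption of this sketch.
            pieces.append(chunk.decode(charset or "ascii"))
        else:
            pieces.append(chunk)
    return "".join(pieces)


subject = header_to_text(RAW_HEADER)
```

Each chunk carries its own charset, so the bytes cannot be turned into text without consulting the metadata embedded in the header itself.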
There isn't much cure for this, AFAICS, other than perhaps keeping the bytes in segmented structures, with cached metadata to speed repeated references.

Using any format other than email format means knowing how to translate that format to/from email format, and to/from API format... this means coding two translation routines instead of one.

The choice of email RFC byte formats for the internal form makes it quick and easy to produce a complete message when called for, and to defer interpretation when a message is fed in.... sometimes, and herein lies the catch....

One problem with storing messages in bytes format: it seems to me that the choice of which of several legal email bytes formats to use to represent various email parts (texts and attachments) is problematic for using email format bytes as the internal storage format.

An unsophisticated email library could assume that the transfer encoding is always 7bit, and that should be acceptable in all circumstances. A more sophisticated email library would provide support for either 7bit or 8bit transfer encodings.... but the choice of the bytes formats, and MIME type encodings of various message parts to support that difference would be significant. It seems that the present email lib provides a way to create only a 7bit or 8bit message (and apparently not binary encoding), meaning that the whole message assembly process has to be done after initiating a connection with the SMTP server, to determine whether it supports 8bit (or binary) encoding or not. A more abstract internal format could defer that choice to the generate step, keeping items as str or binary blobs prior to that step.

IIUC, 7bit requires that text and binary be encoded to remove "difficult" byte values from the byte stream, so choosing quopri or base64 is appropriate at MIME part definition time to make that choice (although an optimal sized choice could be made based on the data), in the event that generate requests 7bit.
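The "optimal sized choice" mentioned above could look roughly like this: encode the data both ways and keep whichever result is shorter. It is a sketch only, using the standard quopri and base64 modules; the function name smaller_7bit_encoding is hypothetical and not part of the email package, and a real implementation would also have to handle line endings and line length limits.

```python
import base64
import quopri


def smaller_7bit_encoding(data):
    """Return (encoding_name, encoded_bytes) for the shorter 7bit-safe form."""
    qp = quopri.encodestring(data)
    b64 = base64.encodebytes(data)
    if len(qp) <= len(b64):
        return "quoted-printable", qp
    return "base64", b64


# Mostly-ASCII text stays readable and small as quoted-printable...
text_choice, _ = smaller_7bit_encoding("Mostly ASCII with one \u00e9.\n".encode("utf-8"))
# ...while arbitrary binary data is smaller as base64.
blob_choice, _ = smaller_7bit_encoding(bytes(range(256)) * 4)
```

Encoding twice costs time, which is one reason a library might prefer a cheaper heuristic or a user-supplied preference instead.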
However, 8bit has no such requirement; it declares that there are no difficult characters except NULL, CR and LF. However, because no 8bit encodings are defined, the (inefficient, 7-bit) quopri or base64 may still have to be used to avoid lines that are too long, and to encode NULL, CR and LF. 8-bit and UTF-8 text containing no NULL characters and no long lines would qualify without encoding.

Finally, binary declares that there are no difficult characters at all. Therefore, the quopri or base64 choice could be ignored, and the raw data passed through.

Choosing a particular Content-Transfer-Encoding as the internal storage format forces transcoding to the other Content-Transfer-Encoding values on the fly after connecting to the SMTP server (using an apparently non-existent parameter to the generate method); not supporting on-the-fly transcoding would force the user to choose a particular Content-Transfer-Encoding up front, requiring the connection to the SMTP server even earlier in the process.

I observe that most of my SMTP providers do not support binary transport, but it seems that MS Exchange does. I observe that binary transport is more efficient than 7bit or 8bit. I observe that even with binary transport, the MIME headers must still be in US-ASCII, by definition, so the headers need not be generated differently for different transports... only the Content-Transfer-Encoding, and the content itself, would be affected by deferring that choice to generate time.

Perhaps binary transport, with meta-data indicating whether the user prefers quopri or base64 for parts that must be encoded for 7bit or 8bit transport, would be an appropriate storage format for the email library.
This would allow the quopri or base64 encodings to be performed on-the-fly, only if needed, by adding a new parameter to generate, that specifies the Content-Transfer-Encoding (which should default to 7bit for maximal server compatibility, or 8bit if the user specified that along the way so that backwards compatibility is preserved).

N.B. I note that the documentation for 2.6.3 section 19.1.3 MIMEText class (reproduced below) is confusing:

    class email.mime.text.MIMEText(_text[, _subtype[, _charset]])

    Module: email.mime.text

    A subclass of MIMENonMultipart, the MIMEText class is used to create MIME objects of major type text. _text is the string for the payload. _subtype is the minor type and defaults to plain. _charset is the character set of the text and is passed as a parameter to the MIMENonMultipart constructor; it defaults to us-ascii. No guessing or encoding is performed on the text data.

    Changed in version 2.4: The previously deprecated _encoding argument has been removed. Encoding happens implicitly based on the _charset argument.

The confusion is that it states there is no encoding performed, and then it states that encoding is implicit. It is not clear what it actually does, if anything. The 3.2a0 documentation further muddies the water by removing the last paragraph.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

From stephen at xemacs.org Tue Oct 6 16:18:03 2009
From: stephen at xemacs.org (Stephen J.
Turnbull)
Date: Tue, 06 Oct 2009 23:18:03 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACB0DC9.7080307@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com>
Message-ID: <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>

In the following I use Python 3 terminology: strings are Python Unicode objects, and bytes are Python bytes objects.

Glenn Linderman writes:

> Email messages are bytes. Usually restricted to bytes in the range
> 32-127, but sometimes permitted to be 0-255 (8bit encoding).

This is irrelevant to our internal representation. It is both trivial and efficient to convert the wire format (bytes) to a string internally (at least for email messages up to say 5MB).

Which internal representation makes the most sense depends on what we are going to do with that internal representation. At this point I'm not sure that strings are better than bytes, but I'm quite sure that I've seen no convincing argument that bytes are TOOWTDI.

Nor is it at all obvious to me that messages should be stored in wire format.

> Using any other format than email format, means knowing how to
> translate that format to/from email format, and to/from API
> format... this means coding two translation routines instead of
> one.

That sounds reasonable, but it's a false economy. The formats you're talking about here are the transfer encodings, and we need to be able to decode all of them, and produce all of them. Internally, they can be represented by a single format, so you need internal-to-transfer and transfer-to-internal for about six of them (7bit, 8bit, binary == Python bytes, BASE64, quoted-printable, Python string).

As for runtime economy, if conversion is done once at parse time and once at generate time it is not a big burden, not as compared to the overhead of the Python language itself.

> The choice of email RFC byte formats

By "byte format", do you mean "wire format"?
> for the internal form makes it quick and easy to produce a complete
> message when called for,

Only for certain kinds of messages, such as automated forwards and signed MIME parts, and cron's messages. For those, there are great advantages to spewing things verbatim as you got them off the wire or the disk. But even there, as long as we use the natural embedding of bytes in Unicode (ie, interpret bytes as ISO 8859/1) it's easy and not particularly inefficient to use strings.

For anything else, storing in wire format is going to require checking format (of the stored data if the format is variable, and always of the requesting API) on all attribute accesses, and conversions on many, even most attribute accesses.

> One problem with storing messages in bytes format: it seems to me that
> the choice of which of several legal email bytes formats

None of them are very happy. The email module needs to be able to both read and produce all of 7bit, 8bit, and binary, and they are in fact pretty well trivial to do.

So the question to me is "what are the primary use cases for the email module, and how do they affect the choice of internal representation?" I can't claim special expertise on "how", I'll leave that up to Barry. Here are some use cases I can think of.

1. Debugging programs using the email module. Maybe that's a +1 for internally storing textual data in string form.

2. MUA #1: Composition. Input will be strings and multimedia file names, output will be bytes. Will attributes of message objects be manipulated? Not in a conventional MUA, but an email-based MUA might find uses for that.

3. MUA #2: Reading. Input will often be bytes (spool files, IMAP data). Could be strings, though, depending on the internal format of folders. Output will be strings and multimedia objects. Lots of string processing, especially generating folder directory displays from message headers.

4. Mailing list processor. Message input will be bytes.
   Configuration input, including heading and footer texts that may be added, is likely to be strings. Header manipulation (adding topics, sequence numbers, RFC 2369 headers) most conveniently done with strings. Output will be bytes.

5. Mailing list archiver. Input will be bytes or message objects, output will be strings (typically HTML documents or XML fragments).

6. Spam/virus detection. Input may be bytes or message objects. Lots of internal string processing; in most cases the text/* parts need to be converted to strings before grepping; in some cases even images or executables may be reconstituted to look for malware signatures. Output may be a flag or signal, or the message itself may be edited (typically to provide headers recording degree of spamminess, trace headers, maybe a body heading; in some cases, a new message may be generated with the suspected spam as a message/rfc822 MIME body part).

From v+python at g.nevcal.com Tue Oct 6 21:14:37 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Tue, 06 Oct 2009 12:14:37 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <4ACB971D.9080706@g.nevcal.com>

On approximately 10/6/2009 7:18 AM, came the following characters from the keyboard of Stephen J. Turnbull:

> In the following I use Python 3 terminology: strings are Python
> Unicode objects, and bytes are Python bytes objects.
>
> Glenn Linderman writes:
>
> > Email messages are bytes. Usually restricted to bytes in the range
> > 32-127, but sometimes permitted to be 0-255 (8bit encoding).
>
> This is irrelevant to our internal representation. It is both trivial
> and efficient to convert the wire format (bytes) to a string
> internally (at least for email messages up to say 5MB).
>
> Which internal representation makes the most sense depends on what we
> are going to do with that internal representation. At this point I'm
> not sure that strings are better than bytes, but I'm quite sure that
> I've seen no convincing argument that bytes are TOOWTDI.
>
> Nor is it at all obvious to me that messages should be stored in wire format.

Yes, I interpreted, possibly misinterpreted, Barry's comment about storing things as bytes, as that he was figuring to store them in wire format.

> > Using any other format than email format, means knowing how to
> > translate that format to/from email format, and to/from API
> > format... this means coding two translation routines instead of
> > one.
>
> That sounds reasonable, but it's a false economy.

And this was actually the point I was trying to make.

> The formats you're
> talking about here are the transfer encodings, and we need to be able
> to decode all of them, and produce all of them. Internally, they can
> be represented by a single format, so you need internal-to-transfer
> and transfer-to-internal for about six of them (7bit, 8bit, binary ==
> Python bytes, BASE64, quoted-printable, Python string)

Not all formats apply to all MIME types, but I think you've enumerated the list.

> As for runtime economy, if conversion is done once at parse time and
> once at generate time it is not a big burden, not as compared to the
> overhead of the Python language itself.

I would tend to agree with that, except that if something is received/provided in a particular format, it might want to stay in that format until such time it is needed in a different format... and then the appropriate set of conversions (current format => internal format => needed format) applied as needed, avoiding all conversions when it is already in the needed format.

> > The choice of email RFC byte formats
>
> By "byte format", do you mean "wire format"?

Sure, RFC byte formats == wire format.
> > for the internal form makes it quick and easy to produce a complete
> > message when called for,
>
> Only for certain kinds of messages, such as automated forwards and
> signed MIME parts, and cron's messages. For those, there are great
> advantages to spewing things verbatim as you got them off the wire or
> the disk. But even there, as long as we use the natural embedding of
> bytes in Unicode (ie, interpret bytes as ISO 8859/1) it's easy and not
> particularly inefficient to use strings.

Two conversions are slower than none, and use 2-4 times the space in string format.

> For anything else, storing in wire format is going to require checking
> format (of the stored data if the format is variable, and always of
> the requesting API) on all attribute accesses, and conversions on
> many, even most attribute accesses.

One has to write the conversion code anyway; it is just a matter of where it is called. Once converted, meta data could be retained in its natural format.

> > One problem with storing messages in bytes format: it seems to me that
> > the choice of which of several legal email bytes formats
>
> None of them are very happy. The email module needs to be able to
> both read and produce all of 7bit, 8bit, and binary, and they are in
> fact pretty well trivial to do.
>
> So the question to me is "what are the primary use cases for the email
> module, and how do they affect the choice of internal representation?"
> I can't claim special expertise on "how", I'll leave that up to
> Barry. Here are some use cases I can think of.

Yes, this is a good question.

> 1. Debugging programs using the email module. Maybe that's a +1 for
>    internally storing textual data in string form.
>
> 2. MUA #1: Composition. Input will be strings and multimedia file
>    names, output will be bytes. Will attributes of message objects
>    be manipulated? Not in a conventional MUA, but an email-based MUA
>    might find uses for that.

I'm not sure what an email-based MUA is....
seems to me even a conventional MUA is "email-based"???

> 3. MUA #2: Reading. Input will often be bytes (spool files, IMAP
>    data). Could be strings, though, depending on the internal format
>    of folders. Output will be strings and multimedia objects. Lots
>    of string processing, especially generating folder directory
>    displays from message headers.
>
> 4. Mailing list processor. Message input will be bytes.
>    Configuration input, including heading and footer texts that may
>    be added are likely to be strings. Header manipulation (adding
>    topics, sequence numbers, RFC 2369 headers) most conveniently done
>    with strings. Output will be bytes.

But the bulk of the message parts, received in wire format, may not need to be altered to be sent along in the same wire format. Headers must be manipulated somehow, I'd think it would be convenient as strings too. Heading and footing texts are configured boilerplate, and could be cached in a variety of formats to avoid the need to convert them for each message, and could then be obtained from the cache in the appropriate format for this particular message, and prepended or appended as appropriate.

> 5. Mailing list archiver. Input will be bytes or message objects,
>    output will be strings (typically HTML documents or XML
>    fragments).

An archiver could archive wire format, and do the conversions to *ML on the fly for those messages that might be accessed that way. Depends on the expectation of the usage of the archiver... to retrieve the archived messages via email, wire format could be extremely efficient; to retrieve via HTTP, one should note that there is very little difference between .eml format (another name for wire format) and .mhtml format (which is a format IE and Opera will display natively, support in other browsers varies, mostly via addons and conversion utilities). So I'm not at all sure that this use case requires string output, although some implementations might prefer it.

> 6. Spam/virus detection.
>    Input may be bytes or message objects.
>    Lots of internal string processing; in most cases the text/* parts
>    need to be converted to strings before grepping; in some cases
>    even images or executables may be reconstituted to look for
>    malware signatures. Output may be a flag or signal, or the
>    message itself may be edited (typically to provide headers
>    recording degree of spamminess, trace headers, maybe a body
>    heading; in some cases, a new message may be generated with the
>    suspected spam as a message/rfc822 MIME body part).

So it seems to me that storing the data in the format provided, and converting it to native format when requested and caching that result, and then when generating wire format, if the needed format was not provided or cached, then converting as necessary, would be optimal to minimize conversion (time) costs. This technique would also maximally preserve the original format for use cases 3 and 5, which, for use case 3, at least, seems to be important to this list from past discussion.

To minimize memory (space) costs, the caching could be avoided (causing reconversion costs), or, at the expense of not preserving the original format, once converted, retain only the native format of the item (which is generally the smallest, for binary objects, and which is most easily manipulated, but not necessarily smallest, for text objects).

So I'd design the internal format with meta data like

    MIMEpart
        formatFlag
        metaData
        7bitData
        8bitData
        binaryData
        nativeText
        nativeBLOB

where the metaData would consist of a variety of pertinent items, obtained by decoding provided wireData or supplied along with provided nativeData. Generate could use 7bitData, 8bitData, or binaryData directly if it exists, or cache it there if it didn't already exist. binaryData would differ from nativeBLOB only by containing the appropriate MIMEheaders...
perhaps as a space optimization, it would contain only the appropriate MIMEheaders, with the binaryData being placed in nativeBLOB directly (since this is not a costly conversion, just a choice of where to store the bytes).

It could also be possible that a complete, provided, wire format message would be retained as a single BLOB, and the appropriate format data items simply be offsets and lengths within that BLOB, although with cached metaData.

Of course, there is already a design within the existing code, and the cost of wholesale redesign may be more than can be afforded.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

From stephen at xemacs.org Wed Oct 7 02:30:25 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Wed, 07 Oct 2009 09:30:25 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACB971D.9080706@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com>
Message-ID: <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>

Glenn Linderman writes:

> Yes, I interpreted, possibly misinterpreted, Barry's comment about
> storing things as bytes, as that he was figuring to store them in wire
> format.

What that means is unclear, though. Does a "header in wire format" mean before or after MIME encoding? Probably after, but that's pretty useless for the purpose of editing the header. Does it include the tag (the part before the colon) or not? Etc.

> I would tend to agree with that, except that if something is
> received/provided in a particular format, it might want to stay in that
> format until such time it is needed in a different format...
> and then
> the appropriate set of conversions (current format => internal format =>
> needed format) applied as needed, avoiding all conversions when it is
> already in the needed format.

If you mean that the email module will keep track of what form the object is currently represented by, that will eventually result in "UnicodeError: octet out of range: 161, ascii".

> two conversions are slower than none, and use 2-4 times the space in
> string format.

Let's get this correct, *then* optimize, please.

> One has to write the conversion code anyway; it is just a matter of
> where it is called. Once converted, meta data could be retained in its
> natural format.

Meta data for what? Why would you convert meta data?

> > 2. MUA #1: Composition. Input will be strings and multimedia file
> >    names, output will be bytes. Will attributes of message objects
> >    be manipulated? Not in a conventional MUA, but an email-based MUA
> >    might find uses for that.
>
> I'm not sure what an email-based MUA is.... seems to me even a
> conventional MUA is "email-based"???

Only if it's written using the Python email module.

> > 4. Mailing list processor. Message input will be bytes.
> >    Configuration input, including heading and footer texts that may
> >    be added are likely to be strings. Header manipulation (adding
> >    topics, sequence numbers, RFC 2369 headers) most conveniently done
> >    with strings. Output will be bytes.
>
> But the bulk of the message parts, received in wire format, may not need
> to be altered to be sent along in the same wire format.

That depends. For example, multimedia parts may simply be discarded, in which case it makes sense to not convert them. However, most Mailman lists do add a footer, and because of crappy Windows MUAs that don't implement MIME correctly, it's preferred to add that by concatenating as text. That simply cannot be done correctly in wire format for any character set except ISO 8859/1.
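To make the footer case concrete, here is one way to read it, as a sketch rather than Mailman's actual code: if the body arrives as bytes in some charset and the footer is configured as a string, gluing footer bytes onto the body bytes only works when both happen to use the same encoding. The sample charset (koi8-r), the texts, and the variable names are invented for illustration.

```python
# Assumed to come from the part's Content-Type header.
BODY_CHARSET = "koi8-r"

wire_body = "Привет, мир\n".encode(BODY_CHARSET)       # body as received
footer = "Отписаться: list-request@example.org\n"       # footer from configuration (str)

# Naive wire-format approach: append footer bytes in the encoding the
# configuration happened to use. The result is one byte string holding
# two different encodings.
naive_wire = wire_body + footer.encode("utf-8")
naive_text = naive_wire.decode(BODY_CHARSET, errors="replace")

# Text approach: decode with the declared charset, concatenate as
# strings, then re-encode once. The Content-Type charset of the part
# would have to be updated to match (utf-8 here).
combined_text = wire_body.decode(BODY_CHARSET) + footer
new_wire_body = combined_text.encode("utf-8")
```

Read back with the declared charset, naive_wire shows the greeting correctly but turns the footer into mojibake, whereas the text route round-trips cleanly.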
> Heading and footing texts are configured boilerplate, and could be > cached in a variety of formats to avoid the need to convert them for > each message, Premature optimization is the root of all error. > An archiver could archive wire format, Are you suggesting that the email module should mandate that? We have a severe tail-dog inversion problem here. From janssen at parc.com Wed Oct 7 03:34:32 2009 From: janssen at parc.com (Bill Janssen) Date: Tue, 6 Oct 2009 18:34:32 PDT Subject: [Email-SIG] fixing the current email module In-Reply-To: <10506972.7161254576370614.JavaMail.root@boaz> References: <10506972.7161254576370614.JavaMail.root@boaz> Message-ID: <7054.1254879272@parc.com> Timothy Farrell wrote: > Back in June, David Murray posted the message below about fixing the > email module. I have an interest in helping with this due to a > personal project I'm working on. However, my ability to help is > severely limited by my understanding of email and MIME RFCs. Tim, familiarity with email and MIME RFCs would be a big help if you want to help with the email module. Even for writing test cases. Bill From v+python at g.nevcal.com Wed Oct 7 04:52:39 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Tue, 06 Oct 2009 19:52:39 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4ACC0277.2060807@g.nevcal.com> On approximately 10/6/2009 5:30 PM, came the following characters from the keyboard of Stephen J. Turnbull: > Glenn Linderman writes: > > > Yes, I interpreted, possibly misinterpreted, Barry's comment about > > storing things as bytes, as that he was figuring to store them in wire > > format. > > What that means is unclear, though. 
Does a "header in wire format" > mean before or after MIME encoding? Probably after, but that's pretty > useless for the purpose of editing the header. Does it include the > tag (the part before the colon) or not? Etc. > > > I would tend to agree with that, except that if something is > > received/provided in a particular format, it might want to stay in that > > format until such time it is needed in a different format... and then > > the appropriate set of conversions (current format => internal format => > > needed format) applied as needed, avoiding all conversions when it is > > already in the needed format. > > If you mean that the email module will keep track of what form the > object is currently represented by, that will eventually result in > "UnicodeError: octet out of range: 161, ascii". > The above sentence does not communicate your meaning to me... or any meaning, actually. Can you explain? If conversions are avoided, then octets are unlikely to be out of range? And the email module must be aware of the form of the data in order to manipulate it in any format other than wire format, but fortunately, wire format declares the format of the data (not to say there is not buggy wire format data -- but that is an issue best avoided by avoiding as many conversions as possible). > > two conversions are slower than none, and use 2-4 times the space in > > string format. > > Let's get this correct, *then* optimize, please. > That's a nice platitude... I could have used it on you when you said > As for runtime economy, if conversion is done once at parse time and > once at generate time it is not a big burden, not as compared to the > overhead of the Python language itself. but I didn't. You can't design things totally ignoring the reality of time and space performance, and expect to get an efficient result. 
I agree one can spend too much time on premature optimization issues, and I have that tendency, but if you totally ignore time and space issues, you wind up with Vista. > > One has to write the conversion code anyway; it is just a matter of > > where it is called. Once converted, meta data could be retained in its > > natural format. > > Meta data for what? Why would you convert meta data? > Meta data for the email message... how many MIME parts, their Content-Types, etc. This is a small amount of data, but reasonably likely to be referenced multiple times during the message parsing or creation and generation process. So once it is converted from wire format, it should be kept in a useful format, as well as wire format. > > > 2. MUA #1: Composition. Input will be strings and multimedia file > > > names, output will be bytes. Will attributes of message objects > > > be manipulated? Not in a conventional MUA, but an email-based MUA > > > might find uses for that. > > > > I'm not sure what an email-based MUA is.... seems to me even a > > conventional MUA is "email-based"??? > > Only if it's written using the Python email module. > Um. Aren't we talking about use cases for the Python email module? I was trying to interpret what you were saying in that light. Sure, what a conventional (not written using the Python email module) MUA does, is mostly irrelevant, except so far as it shows use cases that might be applied to email-based (written using the Python email module) MUAs. > > > 4. Mailing list processor. Message input will be bytes. > > > Configuration input, including heading and footer texts that may > > > be added are likely to be strings. Header manipulation (adding > > > topics, sequence numbers, RFC 2369 headers) most conveniently done > > > with strings. Output will be bytes. > > > > > > > But the bulk of the message parts, received in wire format, may not need > > to be altered to be sent along in the same wire format. > > That depends. 
For example, multimedia parts may simply be discarded, > in which case it makes sense to not convert them. However, most > Mailman lists do add a footer, and because of crappy Windows MUAs that > don't implement MIME correctly, it's preferred to add that by > concatenating as text. That simply cannot be done correctly in wire > format for any character set except ISO 8859/1. > Huh? First off, which "crappy Windows MUAs" don't implement MIME correctly, and what do they do wrong? When I look at wire format emails, I'm mostly appalled by the stuff generated by Apple Mail. I have seen a few doozies from Outlook 2000, but they seem to be fixed in newer versions. Adding a header or trailer does require knowledge of the character set and encoding of the message. Given that, you can decode to str, add the header or trailer and encode back to MIME. So that's the inefficient proof of concept. In the identity or quopri encodings, it is possible to add similarly encoded headers and trailers correctly to text/plain parts through normal concatenation. Adding headers to base64 encoding requires that the encoded header be an exact number of base64 lines, or at least a multiple of 3 characters and that you shuffle the line layout through the whole base64 body... it is not clear that this is worth the work. Adding trailers to base64 encoding requires decoding the final partial encoding, noticing how much room is left on that last line, and the encoding from there on... so it is not possible to cache an encoded base64 footer, although it would be possible to cache 3 of them, and only have to tweak the merge and choose the right one of the three and then reshuffle. So since text/plain is seldom encoded in base64, and base64 is so complex to concatenate to in wire format, I'd think it would be a better choice to decode and reencode to concatenate headers or footers to base64 encoded MIME parts.... 
unless immense base64 encoded MIME parts are expected to be common enough to develop the optimized logic. text/html is trickier, whether encoded or not. You have to parse past any stuff that precedes <body>, and place the header after that, and then you have to find the </body> and place the trailer before that. And unless you run the HTML through a validity checker, you can't be sure that the trailer will even show up, much less actually at the bottom, due to the possibility of unclosed tags within the body. To parse even quopri encoded HTML gets tricky, and basically impossible for base64 encoded HTML. So the first text/html part likely will need to be decoded for adding headers and trailers, if it is an alternative to the text/plain part, or there is no text/plain part. I've seen some systems add an additional MIME part to place a trailer in, and that can be pretty effective for MUAs that will show multiple parts in-line, but there are so many MUAs out there, that it is extremely difficult to make any certain declarations regarding what the user sees as a result. And, ISO 8859/1 is an 8-bit character set, so would require encoding on a 7bit transfer. But it is not unique; if you know how to do ISO 8859/1 concatenation in wire format, then you can do the whole class of ASCII+128 more character sets in the same manner. Not to mention that ASCII itself works fine in wire format. And so does UTF-8. It is just a matter of matching the character set and the encoding. > > Heading and footing texts are configured boilerplate, and could be > > cached in a variety of formats to avoid the need to convert them for > > each message, > > Premature optimization is the root of all error. > Yeah, yeah. I said "could", not "must". I was pushing back from your declaration that: > Configuration input, including heading and footer texts that may > be added are likely to be strings. 
Such configuration texts are likely to be provided as strings, but there is nothing to prevent them from being converted to other formats. Premature optimization may or may not be the root of all error, but discarding perfectly valid design possibilities based on how the input might be supplied seems a similar error. I'm not declaring which design is best, just that there are alternatives. > > An archiver could archive wire format, > > Are you suggesting that the email module should mandate that? We have > a severe tail-dog inversion problem here. Absolutely not. I said "could", not "must". The archiver can do what it wants. The email library should provide access to the message data in all useful formats, so that the archiver can do what it wants. The archiver needs to choose its design and optimizations appropriate for its expected use cases. I was pushing back from your declaration that an archiver would always want string output.... you said: > 5. Mailing list archiver. Input will be bytes or message objects, > output will be strings (typically HTML documents or XML > fragments). -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From stephen at xemacs.org Wed Oct 7 12:33:42 2009 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Wed, 07 Oct 2009 19:33:42 +0900 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACC0277.2060807@g.nevcal.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> Message-ID: <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> Glenn Linderman writes: > > If you mean that the email module will keep track of what form the > > object is currently represented by, that will eventually result in > > "UnicodeError: octet out of range: 161, ascii". > > The above sentence does not communicate your meaning to me... or any > meaning, actually. Can you explain? Yes, that Unicode error is one that took years for Mailman to work around. If we are going to be converting different objects at different times, I'm sure we'll get to see it again in the future. Oh, joy. > If conversions are avoided, then octets are unlikely to be out of > range? Haven't looked in your spam bucket recently, I guess. Spammers regularly put 8 bit characters into headers (and into bodies in messages without a Content-Type header), for one thing. > And the email module must be aware of the form of the data in > order to manipulate it in any format other than wire format, but > fortunately, wire format declares the format of the data (not to say > there is not buggy wire format data -- but that is an issue best avoided > by avoiding as many conversions as possible). "Best" I can't speak to; you obviously are willing to accept a much higher error rate than I am. "Robust" handling of buggy wire format data means that the email module must do something sane with it before giving it to the application. 
Maybe it's reasonable to do that lazily, and/or cache the result, but access to bogus data (that the email module can determine is bogus or suspicious) must not be allowed unless the client says "hit me with your best shot" explicitly. Most clients are simply not going to be prepared for the kind of crap I see in /var/mail/turnbull every day. > I was pushing back from your declaration that an archiver would > always want string output Please don't push back; we won't get anywhere. Use cases are *examples*, not complete specifications of all possible inputs and outputs. Use cases should be simple and clear cut. If you want a different use case, state it. In fact in the real world, *all* of the archivers I know of produce text formats on disk, either deleting multimedia objects or saving them off and linking to them via URLs in the text. If you know of a different kind of archiver, add it as a use case. From phd at phd.pp.ru Wed Oct 7 13:09:58 2009 From: phd at phd.pp.ru (Oleg Broytman) Date: Wed, 7 Oct 2009 15:09:58 +0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <20091007110958.GG24702@phd.pp.ru> On Wed, Oct 07, 2009 at 07:33:42PM +0900, Stephen J. Turnbull wrote: > Haven't looked in your spam bucket recently, I guess. Spammers > regularly put 8 bit characters into headers Legitimate but stupid programs do this as well. Think of phpbb-like forums written by programmers who never understand how non-ascii can be put into Subject field or filenames - they send amazingly crippled emails. Oleg. -- Oleg Broytman http://phd.pp.ru/ phd at phd.pp.ru Programmers don't die, they just GOSUB without RETURN. 
From stephen at xemacs.org Wed Oct 7 14:19:27 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 07 Oct 2009 21:19:27 +0900 Subject: [Email-SIG] fixing the current email module In-Reply-To: <20091007110958.GG24702@phd.pp.ru> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <20091007110958.GG24702@phd.pp.ru> Message-ID: <874oqbs8fk.fsf@uwakimon.sk.tsukuba.ac.jp> Oleg Broytman writes: > On Wed, Oct 07, 2009 at 07:33:42PM +0900, Stephen J. Turnbull wrote: > > Haven't looked in your spam bucket recently, I guess. Spammers > > regularly put 8 bit characters into headers > > Legitimate but stupid programs do this as well. Sure, but Glenn may not be subscribed to any of those. *Everybody* is subscribed to spam, though. I'll-let-you-decide-what-kind-of-smiley-that-needs-ly y'rs, From matt at mondoinfo.com Wed Oct 7 18:23:24 2009 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Wed, 7 Oct 2009 11:23:24 -0500 (CDT) Subject: [Email-SIG] fixing the current email module In-Reply-To: <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <1254929486.96.16481@mint-julep.mondoinfo.com> [Stephen J. Turnbull] > Yes, that Unicode error is one that took years for Mailman to work > around. If we are going to be converting different objects at > different times, I'm sure we'll get to see it again in the future. In my opinion, the email module should never raise an exception as a result of working with a malformed message. 
Though it should certainly make the information that a message was malformed available for the calling program to check. That is, I think that it's extremely unlikely that the calling program wants to blow up as a result of a malformed message. Very probably, it wants to make what sense of the message that it can. The number of ways in which a message can be malformed is pretty large and just how (and when, as has been mentioned) any particular error will cause problems for the module is really a matter that's internal to the module. The module's user shouldn't have to say, "Over here I have to trap UnicodeErrors and over there I have to trap IndexErrors". Regards, Matt From phd at phd.pp.ru Wed Oct 7 19:07:18 2009 From: phd at phd.pp.ru (Oleg Broytman) Date: Wed, 7 Oct 2009 21:07:18 +0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <1254929486.96.16481@mint-julep.mondoinfo.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> Message-ID: <20091007170718.GA1901@phd.pp.ru> On Wed, Oct 07, 2009 at 11:23:24AM -0500, Matthew Dixon Cowles wrote: > In my opinion, the email module should never raise an exception as a > result of working with a malformed message. Though it should > certainly make the information that a message was malformed available > for the calling program to check. I disagree. email package is not a user agent, and exceptions are *the* way to indicate there are problems. > That is, I think that it's extremely unlikely that the calling > program wants to blow up as a result of a malformed message. Then the calling program must catch all exceptions and process them in a reasonable (for this particular application) way. 
But certainly email package must not dictate what ways are reasonable - they are too application-specific. > Very > probably, it wants to make what sense of the message that it can. Yes, if email parses a message in some way - ok. You can help by creating more intelligent parser(s). But if a parser stumbles upon an unparseable block - it must raise an exception. Oleg. -- Oleg Broytman http://phd.pp.ru/ phd at phd.pp.ru Programmers don't die, they just GOSUB without RETURN. From anthonybaxter at gmail.com Wed Oct 7 16:38:38 2009 From: anthonybaxter at gmail.com (Anthony Baxter) Date: Thu, 8 Oct 2009 01:38:38 +1100 Subject: [Email-SIG] fixing the current email module In-Reply-To: <874oqbs8fk.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <20091007110958.GG24702@phd.pp.ru> <874oqbs8fk.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Wed, Oct 7, 2009 at 11:19 PM, Stephen J. Turnbull wrote: > Oleg Broytman writes: > > On Wed, Oct 07, 2009 at 07:33:42PM +0900, Stephen J. Turnbull wrote: > > > Haven't looked in your spam bucket recently, I guess. Spammers > > > regularly put 8 bit characters into headers > > > > Legitimate but stupid programs do this as well. > > Sure, but Glenn may not be subscribed to any of those. *Everybody* is > subscribed to spam, though. > > I'll-let-you-decide-what-kind-of-smiley-that-needs-ly y'rs, You'd be amazed how many MUAs shipped by major companies are broken. MS Entourage, anyone? noone-mention-the-nested-multiparts-with-the-same-boundary-tag-on-both-levels-ly y'rs. 
From v+python at g.nevcal.com Wed Oct 7 19:34:05 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 07 Oct 2009 10:34:05 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4ACCD10D.4070308@g.nevcal.com> On approximately 10/7/2009 3:33 AM, came the following characters from the keyboard of Stephen J. Turnbull: > Glenn Linderman writes: > > > > If you mean that the email module will keep track of what form the > > > object is currently represented by, that will eventually result in > > > "UnicodeError: octet out of range: 161, ascii". > > > > The above sentence does not communicate your meaning to me... or any > > meaning, actually. Can you explain? > > Yes, that Unicode error is one that took years for Mailman to work > around. If we are going to be converting different objects at > different times, I'm sure we'll get to see it again in the future. Oh, > joy. > Ah, a historical remark! So that's why it was lost on me, I'm new to the Python world (but programming since 1975...) > > If conversions are avoided, then octets are unlikely to be out of > > range? > > Haven't looked in your spam bucket recently, I guess. Spammers > regularly put 8 bit characters into headers (and into bodies in > messages without a Content-Type header), for one thing. > I'm aware of that, but if conversions are not done, octets are unlikely to be _reported_ to be out of range.... 
> > And the email module must be aware of the form of the data in > > order to manipulate it in any format other than wire format, but > > fortunately, wire format declares the format of the data (not to say > > there is not buggy wire format data -- but that is an issue best avoided > > by avoiding as many conversions as possible). > > "Best" I can't speak to; you obviously are willing to accept a much > higher error rate than I am. "Robust" handling of buggy wire format > data means that the email module must do something sane with it before > giving it to the application. Maybe it's reasonable to do that > lazily, and/or cache the result, but access to bogus data (that the > email module can determine is bogus or suspicious) must not be allowed > unless the client says "hit me with your best shot" explicitly. Most > clients are simply not going to be prepared for the kind of crap I see > in /var/mail/turnbull every day. > Are you referring to most email clients, or most Python-email-library-using clients? It seems like most email clients are being hit with the same stuff you are seeing... every day... and are handling it somehow... although anti-spam filters do eliminate some of it before the end user's MUA sees it, depending on the ISP, etc. Is it your point of view, then, that incorrectly formed email should be mostly treated as SPAM? Your paragraph above could be interpreted that way. Oleg's point is also valid though, so it seems that isn't your point of view. Your "hit me with your best shot" comment indicates that you want a failure code or exception when the data is bad, and then a way to "retry accepting errors"? > > I was pushing back from your declaration that an archiver would > > always want string output > > Please don't push back; we won't get anywhere. Use cases are > *examples*, not complete specifications of all possible inputs and > outputs. Use cases should be simple and clear cut. If you want a > different use case, state it. 
In fact in the real world, *all* of the > archivers I know of produce text formats on disk, either deleting > multimedia objects or saving them off and linking to them via URLs in > the text. If you know of a different kind of archiver, add it as a > use case. > I misunderstood the purpose of your list. Sure, everything in your list is a good example of real world uses. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From matt at mondoinfo.com Wed Oct 7 22:05:35 2009 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Wed, 7 Oct 2009 15:05:35 -0500 (CDT) Subject: [Email-SIG] fixing the current email module In-Reply-To: <20091007170718.GA1901@phd.pp.ru> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> Message-ID: <1254944602.12.16665@mint-julep.mondoinfo.com> [me] > In my opinion, the email module should never raise an exception as a > result of working with a malformed message. [Oleg Broytman] > I disagree. email package is not a user agent, and exceptions are > *the* way to indicate there are problems. We may have to agree to disagree. If the email package gives up because a message is malformed, I don't know what exactly it's for. It's certainly not for parsing what arrives in my mailbox. > Then the calling program must catch all exceptions and process they > in a reasonable (for this particular application) way. Then the module's documentation would need to include a list of all exceptions that it might raise and the times that it might raise them. 
Otherwise the application developer is proceeding in the dark. Regards, Matt From phd at phd.pp.ru Wed Oct 7 22:28:13 2009 From: phd at phd.pp.ru (Oleg Broytman) Date: Thu, 8 Oct 2009 00:28:13 +0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <1254944602.12.16665@mint-julep.mondoinfo.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <1254944602.12.16665@mint-julep.mondoinfo.com> Message-ID: <20091007202813.GB6832@phd.pp.ru> On Wed, Oct 07, 2009 at 03:05:35PM -0500, Matthew Dixon Cowles wrote: > If the email package gives up > because a message is malformed, I don't know what exactly it's for. > It's certainly not for parsing what arrives in my mailbox. Then it is *your* task to enhance the code. A flow of patches with tests would be the best contribution. > > Then the calling program must catch all exceptions and process they > > in a reasonable (for this particular application) way. > > Then the module's documentation would need to include a list of all > exceptions that it might raise and the times that it might raise > them. You are also welcome to provide patches for documentation. Oleg. -- Oleg Broytman http://phd.pp.ru/ phd at phd.pp.ru Programmers don't die, they just GOSUB without RETURN. 
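For comparison with the disagreement above: later Python 3 releases of the email package (not the 2009 code under discussion) let the caller choose between the two behaviours. A minimal sketch, assuming a made-up message that claims to be multipart but names no boundary. Under the default policy the parser records defects on the Message object, and under `email.policy.strict` the first defect is raised as an exception:

```python
import email
import email.errors
import email.policy

# Hypothetical malformed input: a multipart type with no boundary parameter.
RAW = (
    "From: someone@example.com\n"
    "Content-Type: multipart/mixed\n"
    "\n"
    "body text with no MIME boundaries\n"
)

# Default policy: parsing succeeds and problems are recorded as defects.
msg = email.message_from_string(RAW)
defect_names = [type(d).__name__ for d in msg.defects]
print("recorded defects:", defect_names)

# Strict policy: the first defect is raised instead of being recorded.
try:
    email.message_from_string(RAW, policy=email.policy.strict)
    raised = None
except email.errors.MessageDefect as exc:
    raised = type(exc).__name__
print("raised under strict policy:", raised)
```

An application that wants Matt's behaviour inspects `msg.defects`; one that wants Oleg's passes the strict policy and catches `email.errors.MessageDefect`.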
From barry at python.org Thu Oct 8 03:10:24 2009 From: barry at python.org (Barry Warsaw) Date: Wed, 7 Oct 2009 21:10:24 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp> References: <10506972.7161254576370614.JavaMail.root@boaz> <8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org> <87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org> On Oct 3, 2009, at 11:41 AM, Stephen J. Turnbull wrote: > Barry Warsaw writes: > >> So the basic model is: accept strings or bytes at the edges, >> process everything internally as bytes, output strings and bytes at >> the edges. > > In a certain pedantic sense, that can't be right, because bytes alone > can't represent strings. > > Practically, you are going need to say how a bytes or bytearray is to > be interpreted as a string, and that is going to be one big mess. > (MIME?) > > Going the other way around you have no such problem, or rather the > trivial embedding works fine, except that you have to do a range check > at some point before you convert to bytes. So, I've taken at least two abortive attempts at updating the email package to Python 3, once using bytes internally and another time using strings internally. Neither one was completely satisfying (to say the least). I've also heard convincing arguments from folks in the Python community in both camps: "using anything other than strings internally is insane; no, using anything other than bytes internally is insane." As for the internal representational format, I'll amend my previous statement and say that I'll keep an open mind, but one thing that seems very clear is that we have to be able to accept strings and bytes at the incoming edges, and produce strings and bytes at the outgoing edges. In a future message, Stephen outlines some excellent use cases, to which I'll follow up when I get there. 
But I think he generally hits the nail on the head and proves that we'll have both types at the edges. That makes for very interesting API design! There's "internal" and then there's the low-level representation that the model exposes. Here I have more confidence that we need to make things much more consistent. The trick is to do that while still making things convenient. For example, we currently represent header values as 8-bit strings or Header instances. The latter can contain triples of the individual chunks, e.g. (content, language, charset). I think we need to represent header values as instances in all cases because the type checking is error prone, but even then, it makes for difficult API choices. Still, if the fundamental atom of header values in the model is the Header, and we define both byte and string APIs for headers, then the internal representation matters less since only the email package implementers need to care. But note that even in this limited case, neither bytes nor strings really works. The internal representation is that triple (and in the current model an implicit triple where charset=us-ascii). So internally the charset is carried along for the ride, as it must be. If the internal representation were just strings or bytes, we wouldn't know how to generate the other format, at least not idempotently (or as close as we can get). Just to ramble a little longer, it's been argued that we should give up on idempotency, but I'm not convinced. I think people want to see an email message they throw into the system come out the other end as closely as possible (well, /exactly/ for well-formed messages). -Barry 
From barry at python.org Thu Oct 8 03:17:47 2009 From: barry at python.org (Barry Warsaw) Date: Wed, 7 Oct 2009 21:17:47 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <8510262.7231254589795083.JavaMail.root@boaz> References: <8510262.7231254589795083.JavaMail.root@boaz> Message-ID: <9B5501D3-C9CB-46EA-843D-B07BE7E9E288@python.org> On Oct 3, 2009, at 1:09 PM, Timothy Farrell wrote: > Forgive my ignorance...why does converting bytes to strings have to > be a mess? Rather than having two Feedparsers, can't we just pass a > default encoding when instantiating a feedparser and have it read > from the MIME headers otherwise? If no encoding is passed and one > can't be determined, simply output as bytes or try a default and > raise an exception if it fails. A lot of work went into the parser the last (successful) time around to avoid exceptions as much as possible. That's why Message objects have a .defects attribute. I'm more okay with the APIs that are used to hand-craft or modify existing messages to throw exceptions when something bad happens, e.g. an unknown charset is used. But the parser itself should never throw an exception. The use case here is: Our MTA has dropped a message on disk and it could be deliberately malformed spam. We don't know that until we parse it though, so we must be able to construct a reasonable message tree from the raw bytes we read off disk. The defects the parser encounters are in fact useful information that goes into a determination of ham/spam. The key thing here is that clients of the email package are severely handicapped at handling any parsing errors. Mailman for example can't do much except log the error and throw the message into a 'bad' bucket. Whoop-de-doo! Nobody can do anything about it! 
If we can at least give the system a Message object with defects, the system can reason about it and help the human decide what to do.

The generator is probably in a similar situation. If you hand it a Message object, it must generate something. In the case of a message with defects, we can make compromises, though, such as giving up on idempotency, fixing MIME boundaries, substituting legal/known charsets, etc.

-Barry

From barry at python.org Thu Oct 8 03:25:16 2009
From: barry at python.org (Barry Warsaw)
Date: Wed, 7 Oct 2009 21:25:16 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACB0DC9.7080307@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
Message-ID: <1D2E61F6-B6DD-4D23-9CE3-25AAA4D713EE@python.org>

On Oct 6, 2009, at 5:28 AM, Glenn Linderman wrote:

> I observe that binary transport is more efficient than 7bit or 8bit.

A few principles that I think we should adopt as far as efficiency and performance go:

I am not concerned about performance. Yes, we want to make things as fast as possible, but it's more important to be as right as possible. Look at some of the tricks that the parser has to jump through to properly handle MIME nesting. Yuck, and not fast, but mostly right (it could be improved but I think we're darn close).

Memory footprint efficiency is very important, in some cases. I don't particularly care about headers or some of the more compact MIME body formats (perhaps like text/*), but some are very problematic. For example, the Twisted guys have told me that they can't use the email package because let's say you read a 10MB image/jpeg MIME part. You really can't store thousands of these in memory at a time!
So again that dictates that our APIs have to support external storage hook points, for parsing, generating, accessing MIME parts on disk or in a database, etc. It's fine if by default we store everything in memory, but we have to at least give applications the ability to parse straight from the wire, store some parts on disk, and still return Message objects that are completely consistent. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Thu Oct 8 04:05:08 2009 From: barry at python.org (Barry Warsaw) Date: Wed, 7 Oct 2009 22:05:08 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <8E7BBBBB-E9D7-43E9-B87E-B83209BFA298@python.org> On Oct 6, 2009, at 10:18 AM, Stephen J. Turnbull wrote: > In the following I use Python 3 terminology: strings are Python > Unicode objects, and bytes are Python bytes objects. Exactly. 8-bit strings are dead to us. >> for the internal form makes it quick and easy to produce a complete >> message when called for, > > Only for certain kinds of messages, such as automated forwards and > signed MIME parts, and cron's messages. For those, there are great > advantages to spewing things verbatim as you got them off the wire or > the disk. But even there, as long as we use the natural embedding of > bytes in Unicode (ie, interpret bytes as ISO 8859/1) it's easy and not > particularly inefficient to use strings. > > For anything else, storing in wire format is going to require checking > format (of the stored data if the format is variable, and always of > the requesting API) on all attribute accesses, and conversions on > many, even most attribute accesses. 
I think that's going to be the case either way. Some applications are going to want bytes, others strings, so there needs to be APIs for both. > So the question to me is "what are the primary use cases for the email > module, and how do they affect the choice of internal representation?" > I can't claim special expertise on "how", I'll leave that up to > Barry. Here are some use cases I can think of. > > 1. Debugging programs using the email module. Maybe that's a +1 for > internally storing textual data in string form. > > 2. MUA #1: Composition. Input will be strings and multimedia file > names, output will be bytes. Will attributes of message objects > be manipulated? Not in a conventional MUA, but an email-based MUA > might find uses for that. > > 3. MUA #2: Reading. Input will often be bytes (spool files, IMAP > data). Could be strings, though, depending on the internal format > of folders. Output will be strings and multimedia objects. Lots > of string processing, especially generating folder directory > displays from message headers. > > 4. Mailing list processor. Message input will be bytes. > Configuration input, including heading and footer texts that may > be added are likely to be strings. Header manipulation (adding > topics, sequence numbers, RFC 2369 headers) most conveniently done > with strings. Output will be bytes. > > 5. Mailing list archiver. Input will be bytes or message objects, > output will be strings (typically HTML documents or XML > fragments). > > 6. Spam/virus detection. Input may be bytes or message objects. > Lots of internal string processing; in most cases the text/* parts > need to be converted to strings before grepping; in some cases > even images or executables may be reconstituted to look for > malware signatures. 
Output may be a flag or signal, or the > message itself may be edited (typically to provide headers > recording degree of spamminess, trace headers, maybe a body > heading; in some cases, a new message may be generated with the > suspected spam as a message/rfc822 MIME body part). I think this is a very good list. The key thing from an application's point of view is that sometimes messages are parsed and sometimes they are crafted. When parsed, the raw input can come from a completely unknown and untrusted source such as the puking mouth of an MTA. Other times it comes from a big blob of string in a doctest. When crafted, it's almost always a program building up a message tree from scratch, or possibly the manipulation of an existing message (e.g. MIME filter). -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Thu Oct 8 04:14:43 2009 From: barry at python.org (Barry Warsaw) Date: Wed, 7 Oct 2009 22:14:43 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <07EF292E-B5D0-48BD-8F69-8CBD2B6A4486@python.org> On Oct 6, 2009, at 8:30 PM, Stephen J. Turnbull wrote: > What that means is unclear, though. Does a "header in wire format" > mean before or after MIME encoding? Probably after, but that's pretty > useless for the purpose of editing the header. Does it include the > tag (the part before the colon) or not? Etc. This is a great question. As far as headers go, sometimes you want to reason about the entire header (field name + value) and sometimes you just care about one or the other. 
Putting the field name in the Header instance means it's difficult to copy the header to other fields. Not having the field name in the instance means that some calculations (such as line length) are trickier.

> That depends. For example, multimedia parts may simply be discarded, in which case it makes sense to not convert them. However, most Mailman lists do add a footer, and because of crappy Windows MUAs that don't implement MIME correctly, it's preferred to add that by concatenating as text. That simply cannot be done correctly in wire format for any character set except ISO 8859/1.

Even then, doesn't it depend on the character set of the text you're appending too? Aren't there for example some Japanese character sets that are incompatible with iso-8859-1? Mailman punts and says if the character sets aren't identical, it cannot concatenate.

>> Heading and footing texts are configured boilerplate, and could be cached in a variety of formats to avoid the need to convert them for each message,
>
> Premature optimization is the root of all error.

I could not agree more. Plus, according to Moore's law, computers will all be 256 times faster when we finish the email package redesign than when we started it.

>> An archiver could archive wire format,
>
> Are you suggesting that the email module should mandate that? We have a severe tail-dog inversion problem here.

Right. Remember that the email package is fundamental to all of this, so it must provide the services that client applications need.

-Barry
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Thu Oct 8 04:25:51 2009 From: barry at python.org (Barry Warsaw) Date: Wed, 7 Oct 2009 22:25:51 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <7054.1254879272@parc.com> References: <10506972.7161254576370614.JavaMail.root@boaz> <7054.1254879272@parc.com> Message-ID: On Oct 6, 2009, at 9:34 PM, Bill Janssen wrote: > Timothy Farrell wrote: > >> Back in June, David Murray posted the message below about fixing the >> email module. I have an interest in helping with this due to a >> personal project I'm working on. However, my ability to help is >> severely limited by my understanding of email and MIME RFCs. > > Tim, familiarity with email and MIME RFCs would be a big help if you > want to help with the email module. Even for writing test cases. Just be forewarned that you'll end up like James T. Kirk staring up at the neural neutralizer on the Tantalus Penal Colony. You'll either be a mindless shell or in agonizing pain. Or both. going-bold-ly y'rs, -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Thu Oct 8 04:33:24 2009 From: barry at python.org (Barry Warsaw) Date: Wed, 7 Oct 2009 22:33:24 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACC0277.2060807@g.nevcal.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> Message-ID: <8D91BDEF-F0CD-4FA2-8B55-5CF2E1291A8C@python.org> On Oct 6, 2009, at 10:52 PM, Glenn Linderman wrote: > text/html is trickier, If by "trickier" you mean "impossible" then I'll agree. 
:) Or maybe "insane" is more accurate. Mailman will never try to parse text/html to concatenate a footer. In fact, if it isn't text/plain and a matching character set, it punts to MIME attachment. However... > I've seen some systems add an additional MIME part to place a > trailer in, and that can be pretty effective for MUAs that will show > multiple parts in-line, but there are so many MUAs out there, that > it is extremely difficult to make any certain declarations regarding > what the user sees as a result. It's actually easy to predict: they'll see crap that makes them unhappy. The cases where I'm wrong about that are so rare as to probably not matter . -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From janssen at parc.com Thu Oct 8 04:37:08 2009 From: janssen at parc.com (Bill Janssen) Date: Wed, 7 Oct 2009 19:37:08 PDT Subject: [Email-SIG] fixing the current email module In-Reply-To: <8E7BBBBB-E9D7-43E9-B87E-B83209BFA298@python.org> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <8E7BBBBB-E9D7-43E9-B87E-B83209BFA298@python.org> Message-ID: <13025.1254969428@parc.com> Barry Warsaw wrote: > > 5. Mailing list archiver. Input will be bytes or message objects, > > output will be strings (typically HTML documents or XML > > fragments). I use the email package to implement an email archiver, and I do bytes in and bytes out. I do threading (using header instances), and process attachments separately, which requires that they come out of the message in their native format, whatever that is -- I treat it as bytes. I also maintain a Python IMAP server which uses the email package to construct messages, and then deconstructs them to send out in response to IMAP requests. 
Bill From barry at python.org Thu Oct 8 04:40:56 2009 From: barry at python.org (Barry Warsaw) Date: Wed, 7 Oct 2009 22:40:56 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> On Oct 7, 2009, at 6:33 AM, Stephen J. Turnbull wrote: > Haven't looked in your spam bucket recently, I guess. Spammers > regularly put 8 bit characters into headers (and into bodies in > messages without a Content-Type header), for one thing. Interesting story: Launchpad (which is open source now so there are no secrets) uses XMLRPC when Mailman holds a message for moderation, storing it in Launchpad's database for display to the list (team) owner. Well, I was lazy, stupid, or both and didn't wrap the objects in a Binary over the wire, so we were getting tons of failures here. But none of them seemed to have any practical effect on user experience (read: we got zero bug reports for missing held messages). I finally found the time to debug the problem, because the failures in themselves were cryptic and common enough to cause our operations people headaches. So I cowboyed in some additional capture code and ran it for 24 hours. Guess what I found? We were essentially crapping out on /tons/ of messages with 8-bit in headers, and these messages were basically getting dropped on the floor. Why no bug reports? Because /every/ single captured message was spam. How's that for a bug having unintended positive consequences? -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig
Type: application/pgp-signature
Size: 832 bytes
Desc: This is a digitally signed message part
URL:

From barry at python.org Thu Oct 8 04:45:35 2009
From: barry at python.org (Barry Warsaw)
Date: Wed, 7 Oct 2009 22:45:35 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <1254929486.96.16481@mint-julep.mondoinfo.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
Message-ID: <3E400B08-1834-4A1B-8970-3AD20BD23765@python.org>

On Oct 7, 2009, at 12:23 PM, Matthew Dixon Cowles wrote:

> In my opinion, the email module should never raise an exception as a result of working with a malformed message. Though it should certainly make the information that a message was malformed available for the calling program to check.
>
> That is, I think that it's extremely unlikely that the calling program wants to blow up as a result of a malformed message. Very probably, it wants to make what sense of the message that it can. The number of ways in which a message can be malformed is pretty large and just how (and when, as has been mentioned) any particular error will cause problems for the module is really a matter that's internal to the module. The module's user shouldn't have to say, "Over here I have to trap UnicodeErrors and over there I have to trap IndexErrors".

I've said it before: I completely agree with you, at least for parsing. The big problem in my experience with Mailman is that you're sort of too upside down in the application to do anything about parsing errors when they occur except log it and shunt it. And that's just not very helpful.
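A minimal sketch of the "record defects, don't raise" parsing behaviour discussed above. It assumes the present-day Python 3 standard library (`email.message_from_bytes`, `Message.defects`, `email.errors.NoBoundaryInMultipartDefect`), which postdates this thread; the sample message and its addresses are made up for illustration.

```python
import email
from email import errors

# Hypothetical malformed input: it claims to be multipart but declares
# no boundary parameter, so it can never be split into parts.
RAW = (
    b"From: spammer@example.com\n"
    b"Subject: malformed on purpose\n"
    b"Content-Type: multipart/mixed\n"
    b"\n"
    b"This body can never be split into parts.\n"
)

# Parsing does not raise; problems are recorded on the Message instead.
msg = email.message_from_bytes(RAW)

for defect in msg.defects:
    print(type(defect).__name__)

# The application can still reason about whatever was recovered,
# for example as one input to a ham/spam decision.
has_boundary_defect = any(
    isinstance(d, errors.NoBoundaryInMultipartDefect) for d in msg.defects
)
subject = msg["Subject"]
```

The caller gets a usable Message plus a list of what went wrong, rather than an exception it can only log.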
However, when crafting messages from scratch, I think it /would/ be okay to raise exceptions when something is done wrong, because the application has more control over the data and is in a position to either handle the problem or for the bug to be fixed . In this case, complaining early is much better than say failing in the generator. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Thu Oct 8 04:51:03 2009 From: barry at python.org (Barry Warsaw) Date: Wed, 7 Oct 2009 22:51:03 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <20091007170718.GA1901@phd.pp.ru> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> Message-ID: <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org> On Oct 7, 2009, at 1:07 PM, Oleg Broytman wrote: > On Wed, Oct 07, 2009 at 11:23:24AM -0500, Matthew Dixon Cowles wrote: >> In my opinion, the email module should never raise an exception as a >> result of working with a malformed message. Though it should >> certainly make the information that a message was malformed available >> for the calling program to check. > > I disagree. email package is not a user agent, and exceptions are > *the* > way to indicate there are problems. By keeping the various components clear in our mind, we can see that both statements are correct in a sense. The parser and generator should never raise exceptions. The model can and probably should. > Yes, if email parse a message in some way - ok. You can help by > creating > more intelligent parser(s). 
> But if a parser stumbles upon an unparseable block - it must raise an exception.

No. It really can't. Let's say your MTA dropped a bunch of bytes in a file and in some low-level background process you read those bytes and turn them into Message trees. Now your parser throws an exception: what can you possibly do about it except throw away this unparseable jumble of bytes and log the exception?

Much much better is to soldier on and produce a Message object that has the right format, but additional information, such as a set of defects it encountered. This is what the current email package does and it has made Mailman's life infinitely better (when it all DTRT). If you have a Message with defects, you can reason about it, show partial information, attempt a repair, etc. With an exception, you're hosed.

-Barry

From barry at python.org Thu Oct 8 04:52:42 2009
From: barry at python.org (Barry Warsaw)
Date: Wed, 7 Oct 2009 22:52:42 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To:
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20091007110958.GG24702@phd.pp.ru>
	<874oqbs8fk.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <2E08B903-E388-42AF-9386-26FA3B4E4270@python.org>

On Oct 7, 2009, at 10:38 AM, Anthony Baxter wrote:

> noone-mention-the-nested-multiparts-with-the-same-boundary-tag-on-both-levels-ly

stfu. you are evil.

-Barry

(For the humor impaired, i.e. not Anthony -> :)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From mark at msapiro.net Thu Oct 8 05:31:41 2009 From: mark at msapiro.net (Mark Sapiro) Date: Wed, 7 Oct 2009 20:31:41 -0700 Subject: [Email-SIG] header info in body of message. is this normal? EOM In-Reply-To: Message-ID: ----- Original Message --------------- Subject: [Email-SIG] header info in body of message. is this normal? EOM From: Michael Lesauis Date: Tue, 18 Aug 2009 08:45:20 -0700 To: "'email-sig at python.org'" The first empty (not just whitespace, but empty) line in the message terminates the headers. -- Mark Sapiro The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan From mark at msapiro.net Thu Oct 8 05:40:13 2009 From: mark at msapiro.net (Mark Sapiro) Date: Wed, 7 Oct 2009 20:40:13 -0700 Subject: [Email-SIG] email.header.decode_header eats my spaces In-Reply-To: Message-ID: ----- Original Message --------------- Subject: [Email-SIG] email.header.decode_header eats my spaces From: 7073049749 at mymetropcs.com Date: 6 Sep 09 02:18:14 -0500 To: email-sig at python.org If you're talking about spaces between encoded words as in the space between the ?= and the =? in Subject: =?iso-8859-1?q?Hello?= =?iso-8859-1?q?World?= it's supposed to. RFC 2047, section 6.2 says in part When displaying a particular header field that contains multiple 'encoded-word's, any 'linear-white-space' that separates a pair of adjacent 'encoded-word's is ignored. (This is to allow the use of multiple 'encoded-word's to represent long strings of unencoded text, without having to separate 'encoded-word's where spaces occur in the unencoded text.) -- Mark Sapiro The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. 
Dylan From v+python at g.nevcal.com Thu Oct 8 08:54:50 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 07 Oct 2009 23:54:50 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: <13025.1254969428@parc.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <8E7BBBBB-E9D7-43E9-B87E-B83209BFA298@python.org> <13025.1254969428@parc.com> Message-ID: <4ACD8CBA.5090604@g.nevcal.com> On approximately 10/7/2009 7:37 PM, came the following characters from the keyboard of Bill Janssen: > Barry Warsaw wrote: > > >>> 5. Mailing list archiver. Input will be bytes or message objects, >>> output will be strings (typically HTML documents or XML >>> fragments). >>> > > I use the email package to implement an email archiver, and I do bytes > in and bytes out. I do threading (using header instances), and process > attachments separately, which requires that they come out of the message > in their native format, whatever that is -- I treat it as bytes. > > I also maintain a Python IMAP server which uses the email package to > construct messages, and then deconstructs them to send out in response > to IMAP requests. > > Bill > OK, so there's another nice item for the use case list. Thanks Bill for responding, I figured there had to be something like that out there. That's why I was pushing back on Stephen's cases as making too restrictive of assumptions... but now that I understand his purpose, it is appropriate just to add the additional case. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From rdmurray at bitdance.com Thu Oct 8 09:16:47 2009 From: rdmurray at bitdance.com (R. 
David Murray)
Date: Thu, 8 Oct 2009 03:16:47 -0400 (EDT)
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
Message-ID:

I'd like to try to summarize what I understand Barry to be saying (which, in this case, also reflects my understanding of what is needed), and see if I'm anywhere close to on target :)

In the following discussion, 'text' refers to Unicode data, and bytes refers to, well, bytes. (I chose to use 'text' instead of 'string' to avoid confusion.)

The email package consists of two major conceptual pieces: the API, and the internal data model.

The API needs to have facilities for accepting data in either text format or bytes format, and this data is used to generate a model of the input message (a Message). Likewise the API needs to provide facilities for serializing a Message as either bytes or text. The API also provides ways to build up a Message from pieces, or to extract information from a Message in pieces, and to modify a Message, and again input and output as both text and bytes must be supported.

The data model used by the email package is an "implementation detail", and we should not spend effort at this stage trying to optimize it for anything except memory requirements with respect to potentially large sub-objects, and even there it is more a matter of providing ways to deal with potentially large sub-objects than it is a true optimization. In general correctness and robustness are much more important than speed.
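To make the "text or bytes in, text or bytes out" requirement concrete, here is a small sketch. It assumes the present-day Python 3 standard library (`email.message_from_bytes`, `email.message_from_string`, `Message.as_bytes`, `Message.as_string`), which was not yet available when this thread was written; the sample message is invented, and it uses "\n" line endings because that is what the default policy emits.

```python
import email

# Hypothetical well-formed message, using "\n" line endings.
RAW_BYTES = (
    b"From: alice@example.com\n"
    b"To: bob@example.com\n"
    b"Subject: round trip\n"
    b"\n"
    b"Hello, Bob.\n"
)
RAW_TEXT = RAW_BYTES.decode("ascii")

# Input: the same kind of model can be built from bytes or from text.
from_bytes = email.message_from_bytes(RAW_BYTES)
from_text = email.message_from_string(RAW_TEXT)

# Output: either model can be serialized as bytes or as text.
bytes_out = from_bytes.as_bytes()
text_out = from_text.as_string()

# Piecewise access also comes in both flavours.
subject_text = from_bytes["Subject"]               # str
body_bytes = from_bytes.get_payload(decode=True)   # bytes
```

For this simple, well-formed input the serialized output matches the input exactly, which is the idempotency property the summary asks for; messier inputs are where the design questions in this thread arise.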
The data model will need to be a practical hybrid of the input data, possibly transformed in some way in some cases, and various sorts of meta-data. The current email package already works this way. An important characteristic of the model is that it be idempotent whenever sensible; that is, if a given byte stream is used to create a Message or subobject, serializing that Message or subobject as bytes should return the original byte stream whenever sensible (ie: when the data is not pathologically malformed). Likewise if a text stream is used to create a Message or subobject, serializing it as text should produce, whenever sensible, the original text stream. In particular, well-formed (per RFC) message data should always be stored and produced idempotently. An important property of the API is that both the parser that transforms an input stream into a Message and Message serialization should not raise exceptions except in the face of errors that leave no way to produce a valid Message or serialization. Instead a defects list is maintained and exposed through the API. In the face of some defects it may not be sensible to maintain idempotency. The APIs that manipulate the data model either for piecewise construction or for transformations may raise exceptions, and in most cases _should_ raise exceptions when encountering invalid data or operations. Also, as an additional note to those thinking about use cases, I'd like to point out something I know well and which Barry reminded me about recently: parts of the email package (eg: MIME and RFC822-style header parsing) are used or can be used by systems other than systems handling email. The particular cases I have run into myself are working with non-email data files that follow RFC822 rules, and handling data from NNTP (which, granted, is almost email...but only almost). In the former case you usually have text input and output, mediated by the encoding of the file(s) on disk. 
In the latter case you have all the problems of email plus a few more. Further, in the standard library the http package, urllib, the cgi module, and pydoc are all clients of the email package. --David (RDM) From v+python at g.nevcal.com Thu Oct 8 09:29:41 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Thu, 08 Oct 2009 00:29:41 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> Message-ID: <4ACD94E5.5020808@g.nevcal.com> On approximately 10/7/2009 7:40 PM, came the following characters from the keyboard of Barry Warsaw: > On Oct 7, 2009, at 6:33 AM, Stephen J. Turnbull wrote: >> Haven't looked in your spam bucket recently, I guess. Spammers >> regularly put 8 bit characters into headers (and into bodies in >> messages without a Content-Type header), for one thing. > Interesting story: Launchpad (which is open source now so there are no > secrets) uses XMLRPC when Mailman holds a message for moderation, > storing it in Launchpad's database for display to the list (team) > owner. Well, I was lazy, stupid, or both and didn't wrap the objects > in a Binary over the wire, so we were getting tons of failures here. > But none of them seemed to have any practical effect on user > experience (read: we got zero bug reports for missing held messages). > > I finally found the time to debug the problem, because the failures in > themselves were cryptic and common enough to cause our operations > people headaches. So I cowboyed in some additional capture code and > ran it for 24 hours. Guess what I found? 
>
> We were essentially crapping out on /tons/ of messages with 8-bit in headers, and these messages were basically getting dropped on the floor. Why no bug reports? Because /every/ single captured message was spam. How's that for a bug having unintended positive consequences?

Great anecdote! Spammers shooting themselves in the foot with their ignorance. But still, much too much spam gets through.

Seems to me that when there is an error in an encoded base64 MIME part, such that it can't be base64 decoded, the options for the library are:

  - return an error, the data is likely meaningless
  - allow the bytes to be retrieved, undecoded

I suppose it might be possible to skip only those 4-character sequences that don't decode properly, and try to decode the rest of the data, if it is text. But some way to flag that data were undecodable would be needed. And if it is text, then it must then undergo charset decoding (below).

The application options are to drop the attachment, or pass through the corrupted bytes, and let the next application try to make sense of it.

A quopri MIME part that can't be correctly decoded may still be mostly readable... so here it makes sense to return an error but also the data, decoded as best as possible. Application choices are basically the same. Once quopri decoded, then text parts must also face charset decoding (below).

Charset decoding: a charset should be specified, or is assumed to be ASCII by default. If a text MIME part that isn't in the right character set gets decode errors, there are several possibilities:

  - return an error, and the decoded data, with error substitutions
  - allow the bytes to be retrieved
  - decode as Latin-1 (no errors possible, but probably results in mojibake)

The application options are to drop the attachment, or choose to pass through one of the three data values.
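The three charset-decoding outcomes listed above can be sketched as follows. `decode_part` and `DecodeResult` are hypothetical names invented for this illustration and are not part of the email package; only the built-in `bytes.decode` is used.

```python
from typing import NamedTuple, Optional


class DecodeResult(NamedTuple):
    text: str                    # best-effort text, U+FFFD where decoding failed
    raw: bytes                   # the original bytes, untouched
    latin1: str                  # lossless view, but possibly mojibake
    error: Optional[Exception]   # None when the declared charset worked


def decode_part(raw: bytes, charset: Optional[str] = None) -> DecodeResult:
    """Decode a text part, reporting errors without raising them."""
    charset = charset or "ascii"   # assumed default when none is declared
    error = None
    try:
        text = raw.decode(charset)
    except LookupError as exc:
        # Unknown charset name: fall back to ASCII with substitutions.
        error = exc
        text = raw.decode("ascii", errors="replace")
    except UnicodeDecodeError as exc:
        # Bytes are invalid for the declared charset: substitute U+FFFD.
        error = exc
        text = raw.decode(charset, errors="replace")
    # Latin-1 maps every byte to a code point, so this can never fail.
    return DecodeResult(text, raw, raw.decode("latin-1"), error)


# Usage: UTF-8 bytes mislabelled as ASCII.
result = decode_part("caf\u00e9".encode("utf-8"), "ascii")
```

Returning all three values at once leaves the choice (drop, substitute, pass raw bytes, or accept mojibake) to the application, as the message suggests.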

For headers, the choices are basically the same as for text MIME parts,
but some headers that contain meta data (rather than just text like
Subject:) may be critical to proper decoding of other data, and so errors
in some headers can cause incorrect behaviour of other headers or of an
associated MIME part.

And I agree that an API to retrieve any MIME part as undecoded bytes is
appropriate; and one to retrieve it as decoded strings is appropriate for
text MIME parts.  Not sure that non-text MIME parts need to support being
returned as strings.

Headers could possibly be a quadruple instead of a triple, with the 4th
item being the wire format if received?  (If constructed, no wire format
would be expected until it is generated.)  That would help with
idempotency, as if a header contains non-ASCII characters, there are many
choices of heuristic to encode that are all proper, so it is unlikely two
different algorithms would preserve idempotency.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From phd at phd.pp.ru Thu Oct 8 11:18:40 2009 From: phd at phd.pp.ru (Oleg Broytman) Date: Thu, 8 Oct 2009 13:18:40 +0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org> Message-ID: <20091008091840.GB28906@phd.pp.ru> On Wed, Oct 07, 2009 at 10:51:03PM -0400, Barry Warsaw wrote: > On Oct 7, 2009, at 1:07 PM, Oleg Broytman wrote: > >> On Wed, Oct 07, 2009 at 11:23:24AM -0500, Matthew Dixon Cowles wrote: >>> In my opinion, the email module should never raise an exception as a >>> result of working with a malformed message. Though it should >>> certainly make the information that a message was malformed available >>> for the calling program to check. >> >> I disagree. email package is not a user agent, and exceptions are >> *the* >> way to indicate there are problems. > > By keeping the various components clear in our mind, we can see that > both statements are correct in a sense. The parser and generator should > never raise exceptions. The model can and probably should. Are you going to parse any garbage and create a Message (probably an empty Message) with one defect "cannot parse it at all"? >> But if a parser stumbles upon an >> unparseable >> block - it must raises an exception. > > No. It really can't. Let's say your MTA dropped a bunch of bytes in a > file and in some low-level background process you read those bytes and > turn them into Message trees. 
Now your parser throws an exception: what > can you possibly do about it except throw away this unparseable jumble of > bytes and log the exception? I don't disagree with that. If a parser can parse an input in some way - let's consider the input a malformed message and create a Message with defects. What I disagree with is that if a parser cannot parse input garbage at all it must raise an exception. And if a parser can raise an exception any calling program must be prepared to catch such exceptions. Oleg. -- Oleg Broytman http://phd.pp.ru/ phd at phd.pp.ru Programmers don't die, they just GOSUB without RETURN. From phd at phd.pp.ru Thu Oct 8 11:22:32 2009 From: phd at phd.pp.ru (Oleg Broytman) Date: Thu, 8 Oct 2009 13:22:32 +0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <20091008091840.GB28906@phd.pp.ru> References: <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org> <20091008091840.GB28906@phd.pp.ru> Message-ID: <20091008092232.GC28906@phd.pp.ru> On Thu, Oct 08, 2009 at 01:18:40PM +0400, Oleg Broytman wrote: > What I disagree with is that if a parser cannot parse input garbage at > all it must raise an exception. Sorry for my bad wording. What I disagree with is that if a parser cannot parse input garbage at all it must NOT raise an exception. My opinion is - it must raise an exception. Oleg. -- Oleg Broytman http://phd.pp.ru/ phd at phd.pp.ru Programmers don't die, they just GOSUB without RETURN. From stephen at xemacs.org Thu Oct 8 12:46:50 2009 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Thu, 08 Oct 2009 19:46:50 +0900 Subject: [Email-SIG] fixing the current email module In-Reply-To: <1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org> References: <10506972.7161254576370614.JavaMail.root@boaz> <8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org> <87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp> <1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org> Message-ID: <87ocoiqi1x.fsf@uwakimon.sk.tsukuba.ac.jp> Barry Warsaw writes: > I've also heard convincing arguments from folks in the Python > community in both camps: "using anything other than strings > internally is insane; no, using anything other than bytes > internally is insane." They're both right, of course. The problem is figuring out who is right when. ;-) > For example, we currently represent header values as 8-bit strings or > Header instances. The latter can contain triples of the individual > chunks, e.g. (content, language, charset). I think we need represent > header values as instances in all cases because the type checking is > error prone, but even then, it makes for difficult API choices. Agreed on both the need and the difficulty. > Just to ramble a little longer, it's been argued that we should give > up on idempotency, but I'm not convinced. If we can't achieve ... ah, isn't "invertibility" what you mean here? ... "idempotency", then we're dropping information somewhere along the line. Also, there are part types (pgp-signed, I'm looking at you) where it's absolutely essential that we be able to roundtrip the body byte for byte. So I'm -1 on giving up. From stephen at xemacs.org Thu Oct 8 13:25:41 2009 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Thu, 08 Oct 2009 20:25:41 +0900 Subject: [Email-SIG] fixing the current email module In-Reply-To: <20091007170718.GA1901@phd.pp.ru> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> Message-ID: <87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp> Oleg Broytman writes: > On Wed, Oct 07, 2009 at 11:23:24AM -0500, Matthew Dixon Cowles wrote: > > In my opinion, the email module should never raise an exception as a > > result of working with a malformed message. Though it should > > certainly make the information that a message was malformed available > > for the calling program to check. > > I disagree. email package is not a user agent, and exceptions are *the* > way to indicate there are problems. Although practicality beats purity. The email package has access to the wire format, and knows what to do with most of it. It should DTRT where that is possible, and punt where not. By "punt" I mean return a special object containing as much of the meta data for an object as it could recover, along with the data itself as a blob. I would suggest that module utilities that require access to the parsed form of data be designed as object methods. The special objects produced when broken wire format is encountered wouldn't have those methods, and thus they'd fail the duck type test. But that makes sense: that "duck" can't quack anyway. So this gives our (== Matt and me) desideratum that email never raises (it's the Python runtime that will raise AttributeError), and also Oleg's (in part, anyway): an exception *will* be raised. 
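
A minimal sketch of the "punt" idea described above, in Python 3.  All
names here (ParsedPart, UnparsedBlob, parse_part, show) are hypothetical,
and the toy header parser is only an assumption made for illustration; it
is not how the email package works.  The point is that the broken case is
returned, not raised, and simply lacks the methods that need a parsed
form.

```python
class ParsedPart:
    """A part whose wire format was understood; this duck can quack."""

    def __init__(self, headers, body):
        self.headers = headers
        self.body = body

    def get_text(self, charset="ascii"):
        return self.body.decode(charset)


class UnparsedBlob:
    """Returned (never raised) when the wire format is broken."""

    def __init__(self, raw, headers, defects):
        self.raw = raw            # the data itself, as a blob
        self.headers = headers    # whatever meta data was recovered
        self.defects = defects    # why parsing stopped


def parse_part(raw):
    """Toy parser: 'Name: value' lines, a blank line, then the body."""
    head, sep, body = raw.partition(b"\n\n")
    headers = {}
    for line in head.split(b"\n"):
        name, colon, value = line.partition(b":")
        if not colon:
            return UnparsedBlob(raw, headers, ["bad header line: %r" % line])
        headers[name.strip().lower()] = value.strip()
    if not sep:
        return UnparsedBlob(raw, headers, ["missing header/body separator"])
    return ParsedPart(headers, body)


def show(part):
    """Client code: try the parsed accessor, fall back to the blob."""
    try:
        return part.get_text()
    except AttributeError:
        # Raised by the Python runtime, not by the parser; the object
        # that explains what went wrong is still in hand.
        return "<%d undecoded bytes, defects=%r>" % (len(part.raw), part.defects)
```

Because the blob object survives the AttributeError, the client can still
inspect part.defects and part.raw afterwards.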

I think (== hope) that this will sufficiently localize the issues that
even though only AttributeError would ever be raised, it will be
obvious what went wrong.

 > Then the calling program must catch all exceptions

That is just unreasonable.  There are too many ways for things to go
wrong.  If you have just one exception for all problems, it's easy to
catch them all, but then the client doesn't know what went wrong, and
has to partially parse the unparsable itself.  That's nuts; the reason
for using the email module is to delegate that in the first place, and
besides, to the extent it's possible, the module has presumably done
that.  OTOH, a long list of precise exceptions is a maintenance burden
both on the email module and on client programmers.

 > Yes, if email parse a message in some way - ok. You can help by creating
 > more intelligent parser(s). But if a parser stumbles upon an unparseable
 > block - it must raises an exception.

No, that's the last thing you want it to do.  Suppose you have

    Content-Type: multipart/alternative

        Content-Type: text/plain

        Content-Type: text/html; body-parseable=no

Clearly you want (a) a vanilla email client to just grab the
text/plain part, and (b) a client written by somebody whose boss uses
BustedMUA[tm] to be able to try to parse the text/html part, using the
special rules that apply to the jumble produced by BustedMUA.

In other cases, you might be able to find a valid part terminator, but
the header of that part was hosed.  So the whole part becomes a blob,
but the parser should resync at that point, and start parsing
following parts.

I can think of no input for which the parser should *ever* throw an
exception.  Utilities that depend on a particular object's parsed form
might have to do so, but even then it should be avoided if at all
possible.

From stephen at xemacs.org Thu Oct 8 13:40:47 2009
From: stephen at xemacs.org (Stephen J.
Turnbull)
Date: Thu, 08 Oct 2009 20:40:47 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACCD10D.4070308@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACCD10D.4070308@g.nevcal.com>
Message-ID: <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>

Glenn Linderman writes:

 > > > If conversions are avoided, then octets are unlikely to be out of
 > > > range?
 > >
 > > Haven't looked in your spam bucket recently, I guess.  Spammers
 > > regularly put 8 bit characters into headers (and into bodies in
 > > messages without a Content-Type header), for one thing.
 >
 > I'm aware of that, but if conversions are not done, octets are unlikely
 > to be _reported_ to be out of range....

Conversions will eventually be done.  "Best it were done quickly."

 > > Most clients are simply not going to be prepared for the kind of
 > > crap I see in /var/mail/turnbull every day.
 >
 > Are you referring to most email clients, or most
 > Python-email-library-using clients?

Sorry.  When I mean "MUA" I try to say "MUA".  By "client", I'm
referring to the higher level logic that is going to be calling the
email module.

 > Is it your point of view, then, that incorrectly formed email should be
 > mostly treated as SPAM?

Heavens no!  Not by the email module, anyway!  The email module should
not know about spam (but see Barry's "we're having spam for Launchpad"
post: if you're that good, anything goes!), except maybe at a very
high level.

 > Your "hit me with your best shot" comment indicates that you want a
 > failure code or exception when the data is bad, and then a way to
 > "retry accepting errors"?

My current thinking is that the email module should return an object
representing a partial parse.
The way that you find out if it is partial is to try to access some data that "should" be in the object. If the parse succeeded, the accessor returns the data (which might be empty). If the parse did not succeed, you get an AttributeError. (This is just a paraphrase of what I wrote in response to Oleg.) From barry at python.org Thu Oct 8 14:28:04 2009 From: barry at python.org (Barry Warsaw) Date: Thu, 8 Oct 2009 08:28:04 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org> Message-ID: <15946EEF-1991-4F43-90E2-3D30715A15B7@python.org> On Oct 8, 2009, at 3:16 AM, R. David Murray wrote: > I'd like to try to summarize what I understand Barry to be saying > (which, > in this case, also reflects my understanding of what is needed), and > see if I'm anywhere close to on target :) Spot on, IMO! I can only quibble about one thing, though I think it's just in the phrasing of what you wrote (or the way I read it), not in your understanding. > An important property of the API is that both the parser that > transforms > an input stream into a Message and Message serialization should not > raise > exceptions except in the face of errors that leave no way to produce a > valid Message or serialization. I'd say it differently, since we all know you can encounter errors leaving invalid Messages. The parser and generator should only raise exceptions when its basic assumptions (embodied as assertions probably) of the internal model are broken. In almost all cases, I think those would be "bugs" :). 
It may be in fact that the best you can do is produce a Message object with no headers and a big massive body containing everything else, and a huge defects list. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From phd at phd.pp.ru Thu Oct 8 14:31:33 2009 From: phd at phd.pp.ru (Oleg Broytman) Date: Thu, 8 Oct 2009 16:31:33 +0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <20091008123133.GA3059@phd.pp.ru> On Thu, Oct 08, 2009 at 08:25:41PM +0900, Stephen J. Turnbull wrote: > Oleg Broytman writes: > > I disagree. email package is not a user agent, and exceptions are *the* > > way to indicate there are problems. > > Although practicality beats purity. > > The email package has access to the wire format, and knows what to do > with most of it. It should DTRT where that is possible, and punt > where not. By "punt" I mean return a special object containing as > much of the meta data for an object as it could recover, along with > the data itself as a blob. The special object is an instance of an exception class ;) > I would suggest that module utilities that require access to the > parsed form of data be designed as object methods. The special > objects produced when broken wire format is encountered wouldn't have > those methods, and thus they'd fail the duck type test. But that > makes sense: that "duck" can't quack anyway. 
> > So this gives our (== Matt and me) desideratum that email never raises > (it's the Python runtime that will raise AttributeError), and also > Oleg's (in part, anyway): an exception *will* be raised. > > I think (== hope) that this will sufficiently localize the issues that > even though only AttributeError would even be raised, it will be > obvious what went wrong. Not exactly. One can see an AttributeError, but what was the cause? why a parser has created a broken object? AttributeError doesn't preserve information from parser. > I can think of no input for which the parser should *ever* throw an > exception. Are you saying that even a random garbage would be parsed to a Message of some kind? No headers, a single unparsed body?.. Oleg. -- Oleg Broytman http://phd.pp.ru/ phd at phd.pp.ru Programmers don't die, they just GOSUB without RETURN. From barry at python.org Thu Oct 8 15:00:31 2009 From: barry at python.org (Barry Warsaw) Date: Thu, 8 Oct 2009 09:00:31 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACD94E5.5020808@g.nevcal.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> Message-ID: <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> On Oct 8, 2009, at 3:29 AM, Glenn Linderman wrote: > Great anecdote! Spammers shooting themselves in the foot with their > ignorance. Indeed. It constantly surprises me that spam would be so malformed, but I guess it could make perverse sense if say, you were trying to DoS a spam filter. 
> Seems to me that when there is an error in an encoded base64 MIME
> part, such that it can't be base64 decoded, the options for the
> library are:
>    return an error, the data is likely meaningless
>    allow the bytes to be retrieved, undecoded
> I suppose it might be possible to skip only those 4-character
> sequences that don't decode properly, and try to decode the rest of
> the data, if it is text.  But some way to flag that data were
> undecodable would be needed.
> And if it is text, then it must then undergo charset decoding (below).

Note that while I'm adamant that the parser and generator not raise
exceptions, what the model does is a different matter.  Ideally,
accessing data from the model would never raise an exception either,
but mutating the model could.  This is just basic Postel's Law.

> The application options are to drop the attachment, or pass through
> the corrupted bytes, and let the next application try to make sense
> of it.

Exactly, and it's not for the email package to say which is right.

Here's a use case: I've got a Message that was parsed from wire input
and I want to mangle the Subject heading to add the list prefix.  I
know exactly what charset the prefix is in because that's data I
control.  When I ask for the original Subject value, I'm handed an
instance that I can use to try to figure out how to add the prefix.

First thing I'll ask it is "are you a single chunk in my prefix
charset (or compatible)?"  If so, I can probably just prepend my
prefix onto the value.  If not, "are you composed of multiple valid
chunks in different charsets?"  If so, I know that I need to encode my
prefix, but I can still prepend it to the header value (hopefully
using the same API, and I don't care that the implementation could not
use string concatenation).

If not, then what?  Maybe I don't care if some of the chunk charsets
aren't known because I can still use the right encode+prepend
strategy.  But if the header is a gobbledegook of 8-bit bytes?
I'm pretty sure I want to be able to ask the API if that's the case rather than get an exception. The thing I'm not so sure about is what happens if my application is just naive enough to just ask for the header as a unicode and that conversion can't be made. I /think/ it should raise an exception in that case. But then when I ask for the header value as a mass of bytes, that should succeed and return me the raw input. > And I agree that APIs to retrieve any MIME part as undecoded bytes > is appropriate; and to retrieve it as decoded strings is appropriate > for text MIME parts. Not sure that non-text MIME parts need to > support being returned as strings. I hate to open another can of worms, but I've been thinking about this a lot too :). It's been discussed on list before, so nothing new here. I think the parser and MIME classes need to be hookable for decoding their contents. For example, if you have a text/* it might well make sense to support bytes() and str()/unicode() on the part instance. But if it's image/* str() makes no sense. part.decode() or something similar makes sense, but this needs to be extensible because the email package will not know how to convert every content-type. At best it will only know how to decode content-types that Python's stdlib knows about. The problem is that if the bytes came off the wire, the parser currently can only attach the most basic MIME base class. It doesn't know that an image/png should create a MIMEImagePNG instance there. This is different from hacking the model directly because the application can instantiate the right class. So the parser either has to have a hookable way for an application to go from content-type to class, or the generic MIME base class needs to be hookable in its .decode() method. > Headers could possibly be a quadruple instead of a triple, with the > 4th item being the wire format if received? (If constructed, no wire > format would be expected until it is generated.) 
That would help
> with idempotency, as if a header contains non-ASCII characters,
> there are many choices of heuristic to encode that are all proper,
> so it is unlikely two different algorithms would preserve idempotency.

I think not a quad.  I think other APIs should be used to extract the
raw data, e.g.

    >>> # return a unicode or throw an exception
    >>> text = str(header)
    >>> # should always be okay even if gibberish
    >>> raw = bytes(header)

or /something/ like that.

-Barry

From barry at python.org Thu Oct 8 15:14:05 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 8 Oct 2009 09:14:05 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <20091008091840.GB28906@phd.pp.ru>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
	<20091008091840.GB28906@phd.pp.ru>
Message-ID: <0827D8A4-CC48-4B46-9C3E-EB6282D97BD8@python.org>

On Oct 8, 2009, at 5:18 AM, Oleg Broytman wrote:

>   Are you going to parse any garbage and create a Message (probably an
> empty Message) with one defect "cannot parse it at all"?

Yes, although the most pathological stream of bytes will probably
produce a message with no headers and an undecodable body of
gibberish bytes, with a .defects list possibly one or two items long.

>   What I disagree with is that if a parser cannot parse input garbage at
> all it must raise an exception.
And if a parser can raise an exception any
> calling program must be prepared to catch such exceptions.

Python 2.6.3 (r263:75183, Oct 4 2009, 19:57:34)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from email import message_from_string
>>> with open('/dev/urandom') as wire:
...     data = wire.read(1024)
...
>>> msg = message_from_string(data)
>>> # number of headers
... len(msg)
0
>>> len(msg.get_payload())
1024
>>> msg.defects
[]

This actually makes perfect sense.  A message with no headers and a
mass of 1024 bytes in its payload is RFC valid!

-Barry

From barry at python.org Thu Oct 8 15:20:26 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 8 Oct 2009 09:20:26 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87ocoiqi1x.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <10506972.7161254576370614.JavaMail.root@boaz>
	<8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org>
	<87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org>
	<87ocoiqi1x.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <34701D4C-2F91-4969-8A4A-9067402A1E70@python.org>

On Oct 8, 2009, at 6:46 AM, Stephen J. Turnbull wrote:

> Barry Warsaw writes:
>
>> I've also heard convincing arguments from folks in the Python
>> community in both camps: "using anything other than strings
>> internally is insane; no, using anything other than bytes
>> internally is insane."
>
> They're both right, of course.  The problem is figuring out who is
> right when. ;-)

Indeed!

>> Just to ramble a little longer, it's been argued that we should give
>> up on idempotency, but I'm not convinced.
>
> If we can't achieve ... ah, isn't "invertibility" what you mean here?
> ...
"idempotency", then we're dropping information somewhere along the > line. Also, there are part types (pgp-signed, I'm looking at you) > where it's absolutely essential that we be able to roundtrip the body > byte for byte. So I'm -1 on giving up. Yeah, "idempotency" probably is not the right term, though I think historically that's what's been used. Math geeks, what's the right term here? :) I completely agree with you (of course :). The way I look at it is that we lose this important principle only when the source data lacks complete information, i.e. is defective. Although we can still invert in the face of some defects (and we should), I think we officially make no such guarantees unless the model is defect-free. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From phd at phd.pp.ru Thu Oct 8 15:22:37 2009 From: phd at phd.pp.ru (Oleg Broytman) Date: Thu, 8 Oct 2009 17:22:37 +0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <0827D8A4-CC48-4B46-9C3E-EB6282D97BD8@python.org> References: <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org> <20091008091840.GB28906@phd.pp.ru> <0827D8A4-CC48-4B46-9C3E-EB6282D97BD8@python.org> Message-ID: <20091008132237.GB3059@phd.pp.ru> On Thu, Oct 08, 2009 at 09:14:05AM -0400, Barry Warsaw wrote: > On Oct 8, 2009, at 5:18 AM, Oleg Broytman wrote: >> Are you going to parse any garbage and create a Message (probably an >> empty Message) with one defect "cannot parse it at all"? 
> > Yes, although the most pathological stream of bytes will probably > produce a message with no headers and an undecodeable body of gibberish > bytes, with a .defects list possible one or two items long. Well, then... Oleg. -- Oleg Broytman http://phd.pp.ru/ phd at phd.pp.ru Programmers don't die, they just GOSUB without RETURN. From barry at python.org Thu Oct 8 15:23:42 2009 From: barry at python.org (Barry Warsaw) Date: Thu, 8 Oct 2009 09:23:42 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <9B033421-3475-4827-8C4B-F3D0116FDEEB@python.org> On Oct 8, 2009, at 7:25 AM, Stephen J. Turnbull wrote: > The email package has access to the wire format, and knows what to do > with most of it. It should DTRT where that is possible, and punt > where not. By "punt" I mean return a special object containing as > much of the meta data for an object as it could recover, along with > the data itself as a blob. > > I would suggest that module utilities that require access to the > parsed form of data be designed as object methods. The special > objects produced when broken wire format is encountered wouldn't have > those methods, and thus they'd fail the duck type test. But that > makes sense: that "duck" can't quack anyway. This is a very interesting idea that I think I like! -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig
Type: application/pgp-signature
Size: 832 bytes
Desc: This is a digitally signed message part
URL:

From barry at python.org Thu Oct 8 15:30:12 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 8 Oct 2009 09:30:12 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <20091008123133.GA3059@phd.pp.ru>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20091008123133.GA3059@phd.pp.ru>
Message-ID: <4FD82973-A650-408E-9CD2-FE3F4DF008A7@python.org>

On Oct 8, 2009, at 8:31 AM, Oleg Broytman wrote:

>   Not exactly. One can see an AttributeError, but what was the cause? why
> a parser has created a broken object? AttributeError doesn't preserve
> information from parser.

But if you got the AttributeError, you'd still have the original
object around to ask more detailed questions about.

On first blush, what I think I like about this is that it fits in with
an interesting generic API design.  For example, if you have a message
instance (and remember, parts-is-parts-is-messages) that you think is
an image, you might just do something like:

    >>> image = msg.decoded_image

and then 'image' is the png that its Content-Type: image/png implies.
If the data wasn't actually parseable as a png, this would raise an
AttributeError and you'd then have to do:

    >>> bytes = msg.raw_bytes

to get the raw data, but you'd still have the msg object around to do
that with.

The one possible problem is that Message may have to implement a
__getattribute__() to handle this, since you can't know when the class
is written whether the data its instances will contain will be valid
or not.
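
A minimal sketch of that accessor idea, in Python 3.  The Part class is
hypothetical, and only the attribute names decoded_image and raw_bytes
come from the text above.  As a deliberate simplification the sketch uses
__getattr__, which Python calls only after normal attribute lookup has
failed, rather than __getattribute__; and "parseable as a png" is
approximated by checking the 8-byte PNG signature, which is an assumption
made to keep the example short.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


class Part:
    """Toy stand-in for a message part parsed off the wire."""

    def __init__(self, content_type, raw_bytes):
        self.content_type = content_type
        self.raw_bytes = raw_bytes      # always available

    def __getattr__(self, name):
        # Reached only when normal lookup fails, so content_type and
        # raw_bytes never pass through here.
        if name == "decoded_image":
            if (self.content_type == "image/png"
                    and self.raw_bytes.startswith(PNG_SIGNATURE)):
                # A real implementation would hand back a decoded image
                # object; the validated bytes stand in for it here.
                return self.raw_bytes
            raise AttributeError("payload is not a parseable image/png")
        raise AttributeError(name)


def image_or_raw(part):
    """Client code: prefer the decoded image, else fall back to bytes."""
    try:
        return ("image", part.decoded_image)
    except AttributeError:
        # The part object is still around to ask for the raw data.
        return ("raw", part.raw_bytes)
```

Whether the attribute exists is decided per instance at access time,
which is the property the paragraph above is after.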
>> I can think of no input for which the parser should *ever* throw an >> exception. > > Are you saying that even a random garbage would be parsed to a > Message > of some kind? No headers, a single unparsed body?.. Sure, why not? It's valid RFC 822 :) -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From tfarrell at owassobible.org Thu Oct 8 15:27:35 2009 From: tfarrell at owassobible.org (Timothy Farrell) Date: Thu, 08 Oct 2009 08:27:35 -0500 Subject: [Email-SIG] fixing the current email module In-Reply-To: References: <10506972.7161254576370614.JavaMail.root@boaz> <7054.1254879272@parc.com> Message-ID: <4ACDE8C7.2040309@owassobible.org> Barry Warsaw wrote: > On Oct 6, 2009, at 9:34 PM, Bill Janssen wrote: > >> Timothy Farrell wrote: >> >>> Back in June, David Murray posted the message below about fixing the >>> email module. I have an interest in helping with this due to a >>> personal project I'm working on. However, my ability to help is >>> severely limited by my understanding of email and MIME RFCs. >> >> Tim, familiarity with email and MIME RFCs would be a big help if you >> want to help with the email module. Even for writing test cases. > > Just be forewarned that you'll end up like James T. Kirk staring up at > the neural neutralizer on the Tantalus Penal Colony. You'll either be > a mindless shell or in agonizing pain. Or both. > > going-bold-ly y'rs, > -Barry > That's the impression I got when I first started wading through them. Maybe I should leave this to you experts. I think I hear my mom calling. 
-tim From barry at python.org Thu Oct 8 15:48:01 2009 From: barry at python.org (Barry Warsaw) Date: Thu, 8 Oct 2009 09:48:01 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACDE8C7.2040309@owassobible.org> References: <10506972.7161254576370614.JavaMail.root@boaz> <7054.1254879272@parc.com> <4ACDE8C7.2040309@owassobible.org> Message-ID: <2E1FF440-5102-466F-BD98-BE6594223E97@python.org> On Oct 8, 2009, at 9:27 AM, Timothy Farrell wrote: >> Just be forewarned that you'll end up like James T. Kirk staring up >> at the neural neutralizer on the Tantalus Penal Colony. You'll >> either be a mindless shell or in agonizing pain. Or both. >> > That's the impression I got when I first started wading through > them. Maybe I should leave this to you experts. I think I hear my > mom calling. Tim, pay no attention to that curmudgeon up there. You should definitely take a look through at least the basic ones! -Barry From janssen at parc.com Thu Oct 8 16:30:45 2009 From: janssen at parc.com (Bill Janssen) Date: Thu, 8 Oct 2009 07:30:45 PDT Subject: [Email-SIG] fixing the current email module In-Reply-To: <2E1FF440-5102-466F-BD98-BE6594223E97@python.org> References: <10506972.7161254576370614.JavaMail.root@boaz> <7054.1254879272@parc.com> <4ACDE8C7.2040309@owassobible.org> <2E1FF440-5102-466F-BD98-BE6594223E97@python.org> Message-ID: <98907.1255012245@parc.com> Barry Warsaw wrote: > On Oct 8, 2009, at 9:27 AM, Timothy Farrell wrote: > > >> Just be forewarned that you'll end up like James T. Kirk staring up > >> at the neural neutralizer on the Tantalus Penal Colony. You'll > >> either be a mindless shell or in agonizing pain. Or both. > >> > > That's the impression I got when I first started wading through > > them. Maybe I should leave this to you experts. 
I think I hear my > > mom calling. > > Tim, pay no attention to that curmudgeon up there. You should > definitely take a look through at least the basic ones! Everyone should! Bill From stephen at xemacs.org Thu Oct 8 17:31:43 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Fri, 09 Oct 2009 00:31:43 +0900 Subject: [Email-SIG] fixing the current email module In-Reply-To: <20091008123133.GA3059@phd.pp.ru> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp> <20091008123133.GA3059@phd.pp.ru> Message-ID: <87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp> Oleg Broytman writes: > > where not. By "punt" I mean return a special object containing as > > much of the meta data for an object as it could recover, along with > > the data itself as a blob. > > The special object is an instance of an exception class ;) It could be, but it will be returned with return, not raise. ;) > > I think (== hope) that this will sufficiently localize the issues > > that even though only AttributeError would even be raised, it > > will be obvious what went wrong. > > Not exactly. One can see an AttributeError, but what was the > cause? why a parser has created a broken object? AttributeError > doesn't preserve information from parser. Who said it wouldn't? Granted, I didn't say it would, but in my Content-Type: multipart/alternative Content-Type: text/plain Content-Type: text/html; parseable=no example, I would expect the object returned to reflect that structure. In particular the object representing the second MIME part would indeed possess a valid Header member. 
I would also attach the original data (which in the case of a missing
separator might very well overrun into other parts, etc), but it would
*not* be accessible via the usual methods (eg, definitely not from
.flatten()). So in fact it's not clear to me that you could ask for
more information than that.

 > > I can think of no input for which the parser should *ever* throw an
 > > exception.
 >
 > Are you saying that even a random garbage would be parsed to a Message
 > of some kind? No headers, a single unparsed body?..

As long as it contains no NULs or high-bit-set octets, and is
separated into at least two parts, each less than 998 characters long,
by a CRLF, yes, I would definitely expect that an otherwise randomly
generated string would be parsed to a Message.

This Message should not be sendable because RFC 5322 requires the
presence of a From and a Date. However, if you were implementing a
sendmail-compatible MTA or LDA, you might very well wish to accept
such a thing on stdin, parse it to a Message, and then default the From
and Date header fields appropriately, and add a Message-ID header
field. I would, anyway, wouldn't you?

Ah, yes, that's another use case, isn't it?!

From stephen at xemacs.org Thu Oct 8 17:09:54 2009
From: stephen at xemacs.org (Stephen J.
Turnbull)
Date: Fri, 09 Oct 2009 00:09:54 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <0827D8A4-CC48-4B46-9C3E-EB6282D97BD8@python.org>
References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org> <20091008091840.GB28906@phd.pp.ru> <0827D8A4-CC48-4B46-9C3E-EB6282D97BD8@python.org>
Message-ID: <87bpkhrkfx.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:

 > >>> from email import message_from_string
 > >>> with open('/dev/urandom') as wire:
 > ...     data = wire.read(1024)
 > ...

# insert A

 > >>> msg = message_from_string(data)
 > >>> # number of headers
 > ... len(msg)
 > 0
 > >>> len(msg.get_payload())
 > 1024
 > >>> msg.defects
 > []
 >
 > This actually makes perfect sense. A message with no headers and a
 > mass of 1024 bytes in its payload is RFC valid!

If you insert at A

>>> # clear the high bit of every octet read from the wire
>>> data = "".join(chr(ord(ch) & 127) for ch in data)
>>> # optional with reasonably high probability:
>>> data = data[0:512] + "\r\n" + data[512:1024]

or similar. Otherwise not. ;-)

From stephen at xemacs.org Thu Oct 8 17:43:43 2009
From: stephen at xemacs.org (Stephen J.
Turnbull)
Date: Fri, 09 Oct 2009 00:43:43 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
Message-ID: <878wflrivk.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:
 > On Oct 8, 2009, at 3:29 AM, Glenn Linderman wrote:

 > > Headers could possibly be a quadruple instead of a triple, with the
 > > 4th item being the wire format if received?

I think the whole input format (note, not necessarily wire!) should be
saved off on the top-level Message object (possibly in a file, per
Barry's comments about that). Subobjects could then refer to pieces
of that as position ranges.

 > I think not a quad. I think other APIs should be used to extract the
 > raw data, e.g.
 >
 > >>> # return a unicode or throw an exception
 > >>> text = str(header)
 > >>> # should always be okay even if gibberish
 > >>> raw = bytes(header)
 >
 > or /something/ like that.

Does that work? I would think (especially in parallel to text) you
want bytes(header) to be the wire format. If so, you want it to raise
if it knows it contains gibberish. And again, we have the problem of
whether it should return with the field name prepended or just the
field body.

I have a feeling we should not try to decide what APIs we're going to
spell as __str__ and __bytes__ yet.

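To make the str/bytes split in the message above concrete, here is a minimal sketch under stated assumptions. `RawHeader` is a hypothetical class, not anything in the email package. It returns the field body only, which is one of the two choices left open above, and strict ASCII decoding stands in for real RFC 2047 handling.

```python
class RawHeader:
    """Hypothetical header wrapper; the names are illustrative only."""

    def __init__(self, name, raw_body):
        self.name = name            # field name as text, e.g. "Subject"
        self.raw_body = raw_body    # field body exactly as received (bytes)

    def __bytes__(self):
        # Always succeeds: hands back the stored octets untouched.
        return self.raw_body

    def __str__(self):
        # Strict ASCII is a stand-in for real header decoding;
        # undecodable input raises UnicodeDecodeError.
        return self.raw_body.decode("ascii")


ok = RawHeader("Subject", b"hello world")
junk = RawHeader("Subject", b"\xff\xfe gibberish")

text = str(ok)        # decodes cleanly
raw = bytes(junk)     # fine even though the body is gibberish
```

Under this reading bytes() never raises. The alternative raised in the message, where bytes() means validated wire format and may raise, would need a separate accessor for the stored input.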
From janssen at parc.com Thu Oct 8 17:35:49 2009 From: janssen at parc.com (Bill Janssen) Date: Thu, 8 Oct 2009 08:35:49 PDT Subject: [Email-SIG] fixing the current email module In-Reply-To: <87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp> <20091008123133.GA3059@phd.pp.ru> <87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <319.1255016149@parc.com> I should point out that I also store lots of metadata in the registered MIME format text/rfc822-headers (defined in RFC 1892), data that doesn't necessarily conform to the specific set of headers mentioned in RFC822. It would be nice if the header support in the email package would also support reading and writing that format. And MIME multipart is sometimes used in applications other than email. It would be nice if the MIME parsing part of the email module could be used for those purposes, as well -- basically without some of the headers defined in 2822 and 2821. I think of those two as lower-level standalone libraries used by the higher-level email library. 
Bill From v+python at g.nevcal.com Thu Oct 8 09:33:47 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Thu, 08 Oct 2009 00:33:47 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org> Message-ID: <4ACD95DB.4040800@g.nevcal.com> On approximately 10/8/2009 12:16 AM, came the following characters from the keyboard of R. David Murray: > I'd like to try to summarize what I understand Barry to be saying Good summary! Deleted all but one point that I'd like to have clarified... > The API also provides ways to build up a Message from pieces, or to > extract information from a Message in pieces, and to modify a Message, > and again input and output as both text and bytes must be supported. And I agree that APIs to retrieve any MIME part as undecoded bytes is appropriate; and to retrieve it as decoded strings is appropriate for text MIME parts. Not sure that non-text MIME parts need to support being returned as strings. So there must be APIs that support obtaining text and (same or different) APIs that support obtaining bytes for a given MIME part. However, I think it is proper that a MIME part that is not flagged as text/* might produce an error if asked for as text. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. 
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

From phd at phd.pp.ru Thu Oct 8 18:54:05 2009
From: phd at phd.pp.ru (Oleg Broytman)
Date: Thu, 8 Oct 2009 20:54:05 +0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp> <20091008123133.GA3059@phd.pp.ru> <87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <20091008165405.GA12047@phd.pp.ru>

On Fri, Oct 09, 2009 at 12:31:43AM +0900, Stephen J. Turnbull wrote:
> Oleg Broytman writes:
> > > I can think of no input for which the parser should *ever* throw an
> > > exception.
> >
> > Are you saying that even a random garbage would be parsed to a Message
> > of some kind? No headers, a single unparsed body?..
>
> As long as it contains no NULs or high-bit-set octets, and is
> separated into at least two parts, each less than 998 characters long,
> by a CRLF

   After all, you can think of input that should make a parser raise an
exception, can't you?

> This Message should not be sendable because RFC 5322 requires the
> presence of a From and a Date. However, if you were implementing a
> sendmail-compatible MTA or LDA, you might very well wish to accept
> such a thing on stdin, parse it to a Message, and then default the
> From and Date header fields appropriately, and add a Message-ID header
> field. I would, anyway, wouldn't you?
>
> Ah, yes, that's another use case, isn't it?!

   Absolutely. We're talking about parsing data that is not necessarily
from SMTP, and even less necessarily sendable.

Oleg.
-- 
     Oleg Broytman            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

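The MTA/LDA use case quoted above (accept an unsendable message, then default the missing fields) can be sketched with today's standard library. This is only an illustration: `fill_required_headers`, the default address, and the `example.invalid` domain are made up for the example.

```python
from email import message_from_string
from email.utils import formatdate, make_msgid


def fill_required_headers(msg, default_from):
    """Add From, Date and Message-ID only where they are missing."""
    if "From" not in msg:
        msg["From"] = default_from
    if "Date" not in msg:
        msg["Date"] = formatdate(localtime=True)
    if "Message-ID" not in msg:
        # A fixed domain avoids a hostname lookup in this sketch.
        msg["Message-ID"] = make_msgid(domain="example.invalid")
    return msg


# Parses without complaint, but is not sendable as it stands.
msg = message_from_string("Subject: test\n\nbody text\n")
fill_required_headers(msg, "postmaster@example.invalid")
```

Header lookups on a Message are case-insensitive, so the membership tests also catch variants such as `FROM:` or `date:`.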
From stephen at xemacs.org Thu Oct 8 19:29:32 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 09 Oct 2009 02:29:32 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <34701D4C-2F91-4969-8A4A-9067402A1E70@python.org>
References: <10506972.7161254576370614.JavaMail.root@boaz> <8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org> <87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp> <1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org> <87ocoiqi1x.fsf@uwakimon.sk.tsukuba.ac.jp> <34701D4C-2F91-4969-8A4A-9067402A1E70@python.org>
Message-ID: <87y6nlpzer.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:

 > Yeah, "idempotency" probably is not the right term, though I think
 > historically that's what's been used. Math geeks, what's the right
 > term here? :)

"Invertibility" *is* the math term. "Roundtrip" is more likely to make
sense to real people.

 > I completely agree with you (of course :).

Other way around, I'm sure.

What-about-the-curmudgeon-behind-the-curtain-ly y'rs,

From stephen at xemacs.org Thu Oct 8 21:06:40 2009
From: stephen at xemacs.org (Stephen J.
Turnbull) Date: Fri, 09 Oct 2009 04:06:40 +0900 Subject: [Email-SIG] fixing the current email module In-Reply-To: <319.1255016149@parc.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp> <20091008123133.GA3059@phd.pp.ru> <87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp> <319.1255016149@parc.com> Message-ID: <87vdippuwv.fsf@uwakimon.sk.tsukuba.ac.jp> Bill Janssen writes: > I should point out that I also store lots of metadata in the registered > MIME format text/rfc822-headers (defined in RFC 1892), data that doesn't > necessarily conform to the specific set of headers mentioned in RFC822. > It would be nice if the header support in the email package would also > support reading and writing that format. I'm not sure what you're saying here. RFC 822 is inclusive. More or less, if it looks like a header, it is a header, and we need to parse it at least into field name and field body, whether RFC 822 defines more specific syntax for it or not. Is that all, or do you mean you want it to give that MIME format special treatment, such as a method for converting a Message object containing a parsed RFC 822 message to a Message object containing a multipart/report message and a text/rfc822-headers subobject, ready to have the text/plain and message/delivery-status parts filled in per RFC 1892? > And MIME multipart is sometimes used in applications other than email. > It would be nice if the MIME parsing part of the email module could be > used for those purposes, as well -- basically without some of the > headers defined in 2822 and 2821. Ditto, here. 
I would expect that you could feed an HTTP stream containing headers and content to the Message constructor and get something sensible back. Dunno what Barry thinks of that, though. From stephen at xemacs.org Thu Oct 8 21:31:36 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Fri, 09 Oct 2009 04:31:36 +0900 Subject: [Email-SIG] fixing the current email module In-Reply-To: <20091008165405.GA12047@phd.pp.ru> References: <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp> <20091008123133.GA3059@phd.pp.ru> <87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp> <20091008165405.GA12047@phd.pp.ru> Message-ID: <87skdtptrb.fsf@uwakimon.sk.tsukuba.ac.jp> Oleg Broytman writes: > On Fri, Oct 09, 2009 at 12:31:43AM +0900, Stephen J. Turnbull wrote: > > Oleg Broytman writes: > > > > I can think of no input for which the parser should *ever* throw an > > > > exception. > > > > > > Are you saying that even a random garbage would be parsed to a Message > > > of some kind? No headers, a single unparsed body?.. > > > > As long as it contains no NULs or high-bit-set octets, and is > > separated into at least two parts, each less than 998 characters long, > > by a CRLF > > After all, you can think of input that should make a parser to raise an > exception, can't you? No, to throw an error on the example above would be a felony, life sentence. Throwing an error on something that had 8-bit octets in it probably wouldn't be a crime, but I'd sue, and any jury in the land would award treble damages. Better try for a change of venue to Moscow. 
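The "parser never throws" position can be tried against the current standard library. This is a hedged sketch, not a claim about the 2009 code under discussion: `message_from_bytes` was added to the email package later, and a fixed byte pattern stands in for /dev/urandom so the run is repeatable.

```python
from email import message_from_bytes

# 1024 octets that include NULs and high-bit-set bytes.
garbage = bytes(range(256)) * 4

msg = message_from_bytes(garbage)       # parses without raising

header_count = len(msg)                 # number of header fields found
payload = msg.get_payload()             # everything ended up in the body
defect_names = [type(d).__name__ for d in msg.defects]

# The 7-bit masking and CRLF insertion suggested in the thread.
masked = bytes(b & 127 for b in garbage)
masked = masked[:512] + b"\r\n" + masked[512:]
msg7 = message_from_bytes(masked)
```

Inspecting `defect_names` shows what, if anything, the parser chose to record about the input instead of raising.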
From barry at python.org Thu Oct 8 21:49:58 2009 From: barry at python.org (Barry Warsaw) Date: Thu, 8 Oct 2009 15:49:58 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <87bpkhrkfx.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org> <20091008091840.GB28906@phd.pp.ru> <0827D8A4-CC48-4B46-9C3E-EB6282D97BD8@python.org> <87bpkhrkfx.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <81800AC4-C935-4EC8-890A-3FB499A2BB95@python.org> On Oct 8, 2009, at 11:09 AM, Stephen J. Turnbull wrote: > Barry Warsaw writes: > >>>>> from email import message_from_string >>>>> with open('/dev/urandom') as wire: >> ... data = wire.read(1024) >> ... > > # insert A > >>>>> msg = message_from_string(data) >>>>> # number of headers >> ... len(msg) >> 0 >>>>> len(msg.get_payload()) >> 1024 >>>>> msg.defects >> [] >> >> This actually makes perfect sense. A message with no headers and a >> mass of 1024 bytes in its payload is RFC valid! > > If you insert at A > >>>> wire = "".join(chr(ord(ch) & 127) for ch in wire) >>>> # optional with reasonably high probability: >>>> wire = wire[0:512] + "\r\n" + wire[512:1024] > > or similar. Otherwise not. ;-) Right! That makes it legal. What's interesting of course is that the parser can (and I submit, still should) handle the stream even without that. -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Thu Oct 8 21:52:02 2009 From: barry at python.org (Barry Warsaw) Date: Thu, 8 Oct 2009 15:52:02 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <878wflrivk.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> <878wflrivk.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Oct 8, 2009, at 11:43 AM, Stephen J. Turnbull wrote: > I think the whole input format (note, not necessarily wire!) should be > saved off on the top-level Message object (possibly in a file, per > Barry's comments about that). Subobjects could then refer to to > pieces of that as position ranges. I haven't made up my mind about that (it's been suggested before). The tricky thing will be keeping that cache in sync with any other model changes through the approved API. IOW, if I overwrite a message's payload, that input format should probably be blown away. > I have a feeling we should not try to decide what APIs we're going to > spell as __str__ and __bytes__ yet. Very good point. -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Thu Oct 8 21:53:00 2009 From: barry at python.org (Barry Warsaw) Date: Thu, 8 Oct 2009 15:53:00 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <319.1255016149@parc.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp> <20091008123133.GA3059@phd.pp.ru> <87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp> <319.1255016149@parc.com> Message-ID: <175C60BB-0E64-40FE-9401-F70E23598506@python.org> On Oct 8, 2009, at 11:35 AM, Bill Janssen wrote: > I should point out that I also store lots of metadata in the > registered > MIME format text/rfc822-headers (defined in RFC 1892), data that > doesn't > necessarily conform to the specific set of headers mentioned in > RFC822. > It would be nice if the header support in the email package would also > support reading and writing that format. > > And MIME multipart is sometimes used in applications other than email. > It would be nice if the MIME parsing part of the email module could be > used for those purposes, as well -- basically without some of the > headers defined in 2822 and 2821. > > I think of those two as lower-level standalone libraries used by the > higher-level email library. I agree with this use case. -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Thu Oct 8 21:54:10 2009 From: barry at python.org (Barry Warsaw) Date: Thu, 8 Oct 2009 15:54:10 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACD95DB.4040800@g.nevcal.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org> <4ACD95DB.4040800@g.nevcal.com> Message-ID: <440B5F4C-E210-46F0-B647-240CDF091F4D@python.org> On Oct 8, 2009, at 3:33 AM, Glenn Linderman wrote: > Not sure that non-text MIME parts need to support being returned as > strings. I don't think they do. But e.g. an image/* MIME part should support returning the decoded image data. > So there must be APIs that support obtaining text and (same or > different) APIs that support obtaining bytes for a given MIME part. > However, I think it is proper that a MIME part that is not flagged > as text/* might produce an error if asked for as text. +1 -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Thu Oct 8 21:54:50 2009 From: barry at python.org (Barry Warsaw) Date: Thu, 8 Oct 2009 15:54:50 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <87y6nlpzer.fsf@uwakimon.sk.tsukuba.ac.jp> References: <10506972.7161254576370614.JavaMail.root@boaz> <8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org> <87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp> <1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org> <87ocoiqi1x.fsf@uwakimon.sk.tsukuba.ac.jp> <34701D4C-2F91-4969-8A4A-9067402A1E70@python.org> <87y6nlpzer.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <7B06E031-38F1-4047-A88B-E867B6738D52@python.org> On Oct 8, 2009, at 1:29 PM, Stephen J. Turnbull wrote: > Barry Warsaw writes: > >> Yeah, "idempotency" probably is not the right term, though I think >> historically that's what's been used. Math geeks, what's the right >> term here? :) > > "Invertability" *is* the math term. "Roundtrip" is more likely to > make > sense to real people. Thanks. +1 for roundtrip. >> I completely agree with you (of course :). > > Other way around, I'm sure. > > What-about-the-curmudgeon-behind-the-curtain-ly y'rs, :) -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From rdmurray at bitdance.com Thu Oct 8 21:55:18 2009 From: rdmurray at bitdance.com (R. 
David Murray) Date: Thu, 8 Oct 2009 15:55:18 -0400 (EDT) Subject: [Email-SIG] fixing the current email module In-Reply-To: <81800AC4-C935-4EC8-890A-3FB499A2BB95@python.org> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org> <20091008091840.GB28906@phd.pp.ru> <0827D8A4-CC48-4B46-9C3E-EB6282D97BD8@python.org> <87bpkhrkfx.fsf@uwakimon.sk.tsukuba.ac.jp> <81800AC4-C935-4EC8-890A-3FB499A2BB95@python.org> Message-ID: On Thu, 8 Oct 2009 at 15:49, Barry Warsaw wrote: > On Oct 8, 2009, at 11:09 AM, Stephen J. Turnbull wrote: > >> Barry Warsaw writes: >> >> > > > > from email import message_from_string >> > > > > with open('/dev/urandom') as wire: >> > ... data = wire.read(1024) >> > ... >> >> # insert A >> >> > > > > msg = message_from_string(data) >> > > > > # number of headers >> > ... len(msg) >> > 0 >> > > > > len(msg.get_payload()) >> > 1024 >> > > > > msg.defects >> > [] >> > >> > This actually makes perfect sense. A message with no headers and a >> > mass of 1024 bytes in its payload is RFC valid! >> >> If you insert at A >> >> > > > wire = "".join(chr(ord(ch) & 127) for ch in wire) >> > > > # optional with reasonably high probability: >> > > > wire = wire[0:512] + "\r\n" + wire[512:1024] >> >> or similar. Otherwise not. ;-) > > Right! That makes it legal. > > What's interesting of course is that the parser can (and I submit, still > should) handle the stream even without that. But it should be recording a couple defects in that case, right? 
--David From barry at python.org Thu Oct 8 21:57:17 2009 From: barry at python.org (Barry Warsaw) Date: Thu, 8 Oct 2009 15:57:17 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <87vdippuwv.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp> <20091008123133.GA3059@phd.pp.ru> <87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp> <319.1255016149@parc.com> <87vdippuwv.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <6B772657-8600-41E7-9E29-8984D70121FD@python.org> On Oct 8, 2009, at 3:06 PM, Stephen J. Turnbull wrote: > Bill Janssen writes: > >> I should point out that I also store lots of metadata in the >> registered >> MIME format text/rfc822-headers (defined in RFC 1892), data that >> doesn't >> necessarily conform to the specific set of headers mentioned in >> RFC822. >> It would be nice if the header support in the email package would >> also >> support reading and writing that format. > > I'm not sure what you're saying here. RFC 822 is inclusive. More or > less, if it looks like a header, it is a header, and we need to parse > it at least into field name and field body, whether RFC 822 defines > more specific syntax for it or not. The way I read it was that certain RFC 5322 requirements should be relaxed in certain cases, e.g. line length limits. If you're mutating the model, you wouldn't necessarily (ever? always?) throw an exception for long lines. 
> Is that all, or do you mean you want it to give that MIME format > special treatment, such as a method for converting a Message object > containing a parsed RFC 822 message to a Message object containing a > multipart/report message and a text/rfc822-headers subobject, ready to > have the text/plain and message/delivery-status parts filled in per > RFC 1892? > >> And MIME multipart is sometimes used in applications other than >> email. >> It would be nice if the MIME parsing part of the email module could >> be >> used for those purposes, as well -- basically without some of the >> headers defined in 2822 and 2821. > > Ditto, here. > > I would expect that you could feed an HTTP stream containing headers > and content to the Message constructor and get something sensible > back. Dunno what Barry thinks of that, though. I think the Python community would expect the email package to support this and similar use cases. -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Thu Oct 8 22:00:16 2009 From: barry at python.org (Barry Warsaw) Date: Thu, 8 Oct 2009 16:00:16 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org> <20091008091840.GB28906@phd.pp.ru> <0827D8A4-CC48-4B46-9C3E-EB6282D97BD8@python.org> <87bpkhrkfx.fsf@uwakimon.sk.tsukuba.ac.jp> <81800AC4-C935-4EC8-890A-3FB499A2BB95@python.org> Message-ID: <53530105-6707-4457-8A56-97907980A378@python.org> On Oct 8, 2009, at 3:55 PM, R. David Murray wrote: > On Thu, 8 Oct 2009 at 15:49, Barry Warsaw wrote: >> On Oct 8, 2009, at 11:09 AM, Stephen J. Turnbull wrote: >> >>> Barry Warsaw writes: >>> > > > > from email import message_from_string >>> > > > > with open('/dev/urandom') as wire: >>> > ... data = wire.read(1024) >>> > ... >>> # insert A >>> > > > > msg = message_from_string(data) >>> > > > > # number of headers >>> > ... len(msg) >>> > 0 >>> > > > > len(msg.get_payload()) >>> > 1024 >>> > > > > msg.defects >>> > [] >>> > > This actually makes perfect sense. A message with no headers >>> and a >>> > mass of 1024 bytes in its payload is RFC valid! >>> If you insert at A >>> > > > wire = "".join(chr(ord(ch) & 127) for ch in wire) >>> > > > # optional with reasonably high probability: >>> > > > wire = wire[0:512] + "\r\n" + wire[512:1024] >>> or similar. Otherwise not. ;-) >> >> Right! That makes it legal. 
>> >> What's interesting of course is that the parser can (and I submit, >> still should) handle the stream even without that. > > But it should be recording a couple defects in that case, right? Possibly so, although on the header instances maybe, which email currently doesn't support, but that it probably should. Which makes for an interesting idea. Let's say protocol PML defines their formats in terms of RFC 5322, but with a line length limit of 10k and allows 8-bit. email would parse that just fine but might drop a few defects onto some headers. The wrapper around PML could then remove those defects since they aren't defects in that protocol. And the generator would still DTRT, though it's possible you'd need subclasses of the email package to support that. Yet another interesting API challenge then. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Thu Oct 8 22:03:51 2009 From: barry at python.org (Barry Warsaw) Date: Thu, 8 Oct 2009 16:03:51 -0400 Subject: [Email-SIG] Pycon 2010 sprint Message-ID: It's early still, but I'd like to get a sense of who might be interested in sprinting on email at Pycon 2010 in Atlanta. I think the dates will be something around the week of 22-Feb-2010. I'm sure I will have tension between wanting to sprint on email and wanting to sprint on Mailman (and possibly some Canonical stuff). I think an email sprint would only work if there were critical mass. Based on my past experience, I think we need at least three to five experts, with other interested hackers of course welcome. No need to commit right now, but something to think about. RDM, you'll probably be there right? Who else? -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From rdmurray at bitdance.com Thu Oct 8 22:18:00 2009 From: rdmurray at bitdance.com (R. David Murray) Date: Thu, 8 Oct 2009 16:18:00 -0400 (EDT) Subject: [Email-SIG] Pycon 2010 sprint In-Reply-To: References: Message-ID: On Thu, 8 Oct 2009 at 16:03, Barry Warsaw wrote: > No need to commit right now, but something to think about. RDM, you'll > probably be there right? Who else? I would expect to be, though I'll probably want to drop in on Core as well. --David (RDM) From rdmurray at bitdance.com Thu Oct 8 22:19:37 2009 From: rdmurray at bitdance.com (R. David Murray) Date: Thu, 8 Oct 2009 16:19:37 -0400 (EDT) Subject: [Email-SIG] Pycon 2010 sprint In-Reply-To: References: Message-ID: On Thu, 8 Oct 2009 at 16:03, Barry Warsaw wrote: > sprinting on email at Pycon 2010 in Atlanta. I think the dates will be > something around the week of 22-Feb-2010. I'm sure I will have tension The Sprint dates are listed as the 22nd to the 25th on the pycon website. --David (RDM) From janssen at parc.com Thu Oct 8 22:31:27 2009 From: janssen at parc.com (Bill Janssen) Date: Thu, 8 Oct 2009 13:31:27 PDT Subject: [Email-SIG] fixing the current email module In-Reply-To: <87vdippuwv.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp> <20091008123133.GA3059@phd.pp.ru> <87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp> <319.1255016149@parc.com> <87vdippuwv.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <15809.1255033887@parc.com> Stephen J. Turnbull wrote: > I'm not sure what you're saying here. 
RFC 822 is inclusive. More or > less, if it looks like a header, it is a header, and we need to parse > it at least into field name and field body, whether RFC 822 defines > more specific syntax for it or not. That's right. I was just pointing out that there might be any collection of headers, even collections without "From" or "Date". Bill From janssen at parc.com Thu Oct 8 22:32:19 2009 From: janssen at parc.com (Bill Janssen) Date: Thu, 8 Oct 2009 13:32:19 PDT Subject: [Email-SIG] fixing the current email module In-Reply-To: <6B772657-8600-41E7-9E29-8984D70121FD@python.org> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp> <20091008123133.GA3059@phd.pp.ru> <87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp> <319.1255016149@parc.com> <87vdippuwv.fsf@uwakimon.sk.tsukuba.ac.jp> <6B772657-8600-41E7-9E29-8984D70121FD@python.org> Message-ID: <15839.1255033939@parc.com> Barry Warsaw wrote: > On Oct 8, 2009, at 3:06 PM, Stephen J. Turnbull wrote: > > > Bill Janssen writes: > > > >> I should point out that I also store lots of metadata in the > >> registered > >> MIME format text/rfc822-headers (defined in RFC 1892), data that > >> doesn't > >> necessarily conform to the specific set of headers mentioned in > >> RFC822. > >> It would be nice if the header support in the email package would > >> also > >> support reading and writing that format. > > > > I'm not sure what you're saying here. RFC 822 is inclusive. More or > > less, if it looks like a header, it is a header, and we need to parse > > it at least into field name and field body, whether RFC 822 defines > > more specific syntax for it or not. 
> > The way I read it was that certain RFC 5322 requirements should be > relaxed in certain cases, e.g. line length limits. If you're mutating > the model, you wouldn't necessarily (ever? always?) throw an exception > for long lines. Yes, that's a good build. Bill From stephen at xemacs.org Thu Oct 8 23:29:23 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Fri, 09 Oct 2009 06:29:23 +0900 Subject: [Email-SIG] fixing the current email module In-Reply-To: <440B5F4C-E210-46F0-B647-240CDF091F4D@python.org> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org> <4ACD95DB.4040800@g.nevcal.com> <440B5F4C-E210-46F0-B647-240CDF091F4D@python.org> Message-ID: <87my41pob0.fsf@uwakimon.sk.tsukuba.ac.jp> Barry Warsaw writes: > On Oct 8, 2009, at 3:33 AM, Glenn Linderman wrote: > > > Not sure that non-text MIME parts need to support being returned as > > strings. > > I don't think they do. Most non-text media do support comments, though. I don't know if extracting comments is a reasonable response to a request for text from an image, but we should provide a place to put any text that the callbacks that do the actual work of decoding might return. > > However, I think it is proper that a MIME part that is not flagged > > as text/* might produce an error if asked for as text. > > +1 That doesn't preclude raising an error/returning a defect object in many or most use cases, but there may be use cases where it would be useful to allow a callback on a non-text object to return text. 
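The "error when a non-text part is asked for as text" idea discussed above can be sketched roughly as follows. This is only an illustration using the Python 3 standard-library email package as it exists today; `part_as_bytes` and `part_as_text` are hypothetical helper names (not part of the package), and raising `TypeError` is an assumption rather than anything the thread settled on.

```python
from email import message_from_bytes

# A tiny two-part message: one base64 text/plain part, one image/png part.
RAW = (
    b"MIME-Version: 1.0\n"
    b'Content-Type: multipart/mixed; boundary="XX"\n'
    b"\n"
    b"--XX\n"
    b'Content-Type: text/plain; charset="utf-8"\n'
    b"Content-Transfer-Encoding: base64\n"
    b"\n"
    b"Y2Fmw6k=\n"
    b"--XX\n"
    b"Content-Type: image/png\n"
    b"Content-Transfer-Encoding: base64\n"
    b"\n"
    b"iVBORw0KGgo=\n"
    b"--XX--\n"
)


def part_as_bytes(part):
    """Hypothetical helper: payload with base64/quoted-printable undone."""
    return part.get_payload(decode=True)


def part_as_text(part):
    """Hypothetical helper: charset-decoded text, for text/* parts only."""
    if part.get_content_maintype() != "text":
        # Assumption: a non-text part asked for as text is an error.
        raise TypeError("not a text part: " + part.get_content_type())
    charset = part.get_content_charset() or "ascii"
    return part_as_bytes(part).decode(charset)


if __name__ == "__main__":
    text_part, image_part = message_from_bytes(RAW).get_payload()
    print(part_as_text(text_part))
    print(part_as_bytes(image_part))
    try:
        part_as_text(image_part)
    except TypeError as exc:
        print("refused:", exc)
```

A callback-based design, as suggested above, could replace the `TypeError` branch with a lookup of an application-supplied decoder for the part's content type.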
From v+python at g.nevcal.com Thu Oct 8 23:59:38 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Thu, 08 Oct 2009 14:59:38 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> Message-ID: <4ACE60CA.6010907@g.nevcal.com> On approximately 10/8/2009 6:00 AM, came the following characters from the keyboard of Barry Warsaw: > On Oct 8, 2009, at 3:29 AM, Glenn Linderman wrote: >> The application options are to drop the attachment, or pass through >> the corrupted bytes, and let the next application try to make sense >> of it. > > Exactly, and it's not for the email package to say which is right. > > Here's a use case: I've got a Message that was parsed from wire input > and I want to mangle the Subject heading to add the list prefix. I > know exactly what charset the prefix is in because that's data I > control. When I ask for the original Subject value, I'm handed an > instance that I can use to try to figure out how add the prefix. > > First thing I'll ask it is "are you a single chunk in my prefix > charset (or compatible)?" If so, I can probably just prepend my > prefix onto the value. If not, "are you composed of multiple valid > chunks in different charsets?" If so, I know that I need to encode my > prefix, but I can still prepend it to the header value (hopefully > using the same API, and I don't care that the implementation could not > use string concatenation). > > If not, then what? 
Maybe I don't care if some of the chunk charsets > aren't known because I can still use the right encode+prepend > strategy. But if the header is a gobbledegook of 8-bit bytes? I'm > pretty sure I want to be able to ask the API if that's the case rather > than get an exception. The thing I'm not so sure about is what > happens if my application is just naive enough to just ask for the > header as a unicode and that conversion can't be made. I /think/ it > should raise an exception in that case. But then when I ask for the > header value as a mass of bytes, that should succeed and return me the > raw input. So for this use case, it is known that all headers are ASCII. So the operation of prepending a list prefix should not care whether the Subject: value is valid or not... it can simply prepend the list prefix, followed by SP, to the existing, raw header that already exists. The only remaining issue is line length limits, so maybe it has to use CR LF TAB instead of space, sometimes. OK, so if the prefix is not ASCII, it gets separately encoded, including a trailing SP, and then prepended to the value followed by SP or CR LF TAB depending on the line length limit. So to prepend into a text header, you shouldn't need to decode the undecodable... there should be a prepend (and possibly also an append) operation provided by the API, so that applications can tweak headers without decoding. This allows useful behavior even if new methods of encoding are invented that are not yet understood by a particular version of the email library. Asking for the header value (or whole header) in Unicode should decode the chunks that are understandable and decodable, and leave the chunks that are not understandable as ASCII-converted-to-Unicode-but-still-possibly-weirdly-encoded ... I think that is what the RFCs encourage. Asking for a header as bytes should return the wire data, if it is available, or an encoding of real data as wire data (like generate would do). 
There is no Unicode that cannot be encoded to wire format, IIUC, usually via a variety of heuristics once non-ASCII characters are included, that may produce a variety of differing results, all of which should decode back to the original data. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From v+python at g.nevcal.com Fri Oct 9 00:39:23 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Thu, 08 Oct 2009 15:39:23 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> Message-ID: <4ACE6A1B.7060702@g.nevcal.com> On approximately 10/8/2009 6:00 AM, came the following characters from the keyboard of Barry Warsaw: > On Oct 8, 2009, at 3:29 AM, Glenn Linderman wrote: >> And I agree that APIs to retrieve any MIME part as undecoded bytes is >> appropriate; and to retrieve it as decoded strings is appropriate for >> text MIME parts. Not sure that non-text MIME parts need to support >> being returned as strings. > > I hate to open another can of worms, but I've been thinking about this > a lot too :). It's been discussed on list before, so nothing new > here. I think the parser and MIME classes need to be hookable for > decoding their contents. For example, if you have a text/* it might > well make sense to support bytes() and str()/unicode() on the part > instance. But if it's image/* str() makes no sense. 
part.decode() or > something similar makes sense, but this needs to be extensible because > the email package will not know how to convert every content-type. At > best it will only know how to decode content-types that Python's > stdlib knows about. Seems like the following should be obtainable from a MIME parts: 1) wire format. Either what came in, in the parser case, or what would be generated. 2) internal headers from the MIME part 3) decoded BLOB. This means that quopri and base64 are decoded, no more and no less. This is bytes. No headers, only payload. For Content-Transfer-Encoding: binary, this is mostly a noop. 4) text/* parts should also be obtainable as str()/unicode(), payload only. This is where charset decoding is done. I think your talk in the next paragraph about hooks and other object types being produced is a generalization of 4, not 3, and generally no additional decoding needs to be done, just conversion to the right object type (or file, or file-like object). > The problem is that if the bytes came off the wire, the parser > currently can only attach the most basic MIME base class. It doesn't > know that an image/png should create a MIMEImagePNG instance there. > This is different from hacking the model directly because the > application can instantiate the right class. So the parser either has > to have a hookable way for an application to go from content-type to > class, or the generic MIME base class needs to be hookable in its > .decode() method. So either the email package can stop at 3, and 4 only for text/* parts, or it could learn more types (registered types, with well-defined corresponding objects could be potentially built-in to the email package), and/or it could become hookable for application types. Of course, for disposition to files, storing the BLOB in a file of the right name is adequate... to avoid the file, I agree that converting to a useful object type is handy. 
But maybe file-like objects would suffice, for most of the types. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From barry at python.org Fri Oct 9 00:39:56 2009 From: barry at python.org (Barry Warsaw) Date: Thu, 8 Oct 2009 18:39:56 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACE60CA.6010907@g.nevcal.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> <4ACE60CA.6010907@g.nevcal.com> Message-ID: On Oct 8, 2009, at 5:59 PM, Glenn Linderman wrote: > So to prepend into a text header, you shouldn't need to decode the > undecodable... Except that you also have to collapse Re:'s and move them to the front of the string. -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From v+python at g.nevcal.com Fri Oct 9 00:50:37 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Thu, 08 Oct 2009 15:50:37 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4ACE6CBD.2030805@g.nevcal.com> On approximately 10/8/2009 4:40 AM, came the following characters from the keyboard of Stephen J. Turnbull: > Glenn Linderman writes: > > > > > If conversions are avoided, then octets are unlikely to be out of > > > > range? > > > > > > Haven't looked in your spam bucket recently, I guess. Spammers > > > regularly put 8 bit characters into headers (and into bodies in > > > messages without a Content-Type header), for one thing. > > > > I'm aware of that, but if conversions are not done, octets are unlikely > > to be _reported_ to be out of range.... > > Conversions will eventually be done. "Best it were done quickly." > Disagree. Deferring the conversions defers failure issues to the point where the code (hopefully) somewhat understands the type of data being manipulated, and can then handle it appropriately. Converting up front causes errors in things that may never be touched or needed, so the error detection and handling is wasteful. > > > Most clients are simply not going to be prepared for the kind of > > > crap I see in /var/mail/turnbull every day. > > > > Are you referring to most email clients, or most > > Python-email-library-using clients? > > Sorry. When I mean "MUA" I try to say "MUA". 
By "client", I'm > referring to the higher level logic that is going to be calling the > email module. > Yeah, terminology between people that haven't discussed the topic before can slow communication. So for headers, which are supposed to be ASCII, or encoded via RFC rules to ASCII (no 8-bit chars), the discovery of an 8-bit char should produce a defect report, but the data should then simply be converted to Unicode as if it were Latin-1 (since there is no other knowledge available that could produce a better conversion). And if the result of that is not expected by the client (your definition), then the client should either notice the defect report and reject it based on that, or attempt to parse it, and reject it if it encounters unexpected syntax. After all, this is, for that client, "raw user input" (albeit from a remote source) so fully error checking the input is appropriate. > > Is it your point of view, then, that incorrectly formed email should be > > mostly treated as SPAM? > > Heavens no! Not by the email module, anyway! The email module should > not know about spam (but see Barry's "we're having spam for Launchpad" > post: if you're that good, anything goes!), except maybe at a very > high level. > I didn't think you'd think that, but things you were saying seemed to be implying that. > > Your "hit me with your best shot" comment indicates that you want a > > failure code or exception when the data is bad, and then a way to > > "retry accepting errors"? > > My current thinking is that the email module should return an object > representing a partial parse. The way that you find out if it is > partial is to try to access some data that "should" be in the object. > If the parse succeeded, the accessor returns the data (which might be > empty). If the parse did not succeed, you get an AttributeError. > (This is just a paraphrase of what I wrote in response to Oleg.) Yeah, or some error, anyway. 
The problem with the APIs that are spelled __str__ and __bytes__ is that there is no other way to return errors other than exceptions.... the Python way. Since the email library is trying to avoid raising exceptions in large blocks of its code, it is non-Pythonic (which is what Oleg is probably complaining about, in part). But because it needs to avoid exceptions, and is therefore non-Pythonic, it may be inappropriate to spell very many of its APIs __str__ and __bytes__, because that is Pythonic, and requires exceptions. Once you become non-Pythonic in one area, you may have to also be non-Pythonic in some other areas... -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From v+python at g.nevcal.com Fri Oct 9 01:02:47 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Thu, 08 Oct 2009 16:02:47 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> <4ACE60CA.6010907@g.nevcal.com> Message-ID: <4ACE6F97.6010605@g.nevcal.com> On approximately 10/8/2009 3:39 PM, came the following characters from the keyboard of Barry Warsaw: > On Oct 8, 2009, at 5:59 PM, Glenn Linderman wrote: > >> So to prepend into a text header, you shouldn't need to decode the >> undecodable... > > Except that you also have to collapse Re:'s and move them to the front > of the string. Well, that is a feature of some mailing list programs. Those that want to do that, will have to decode and re-encode. 
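For the "decode and re-encode" route, a rough sketch with today's stdlib `email.header` module might look like the following. `add_list_prefix` is a hypothetical name, the choice to emit "Re: [prefix] subject" is an assumption based on the "collapse Re:'s and move them to the front" description, and only the English "Re:" spelling is handled.

```python
import re
from email.header import Header, decode_header, make_header

# Assumption: only the English "Re:" marker is recognised here.
_REPLY_RE = re.compile(r"^(?:\s*re:\s*)+", re.IGNORECASE)


def add_list_prefix(raw_subject, prefix):
    """Hypothetical helper: return a re-encoded Subject value carrying
    exactly one list prefix and at most one leading "Re:"."""
    # Decode any RFC 2047 encoded-words into a single Unicode string.
    # (decode_header may raise on badly broken input; not handled here.)
    text = str(make_header(decode_header(raw_subject)))
    # Drop earlier copies of the prefix so it does not pile up.
    text = text.replace(prefix, "")
    # Collapse any run of leading "Re:" markers.
    match = _REPLY_RE.match(text)
    rest = (text[match.end():] if match else text).strip()
    new_text = ("Re: " if match else "") + prefix + " " + rest
    # Header() keeps pure ASCII as-is and falls back to UTF-8
    # encoded-words when the text is not ASCII.
    return Header(new_text).encode()


if __name__ == "__main__":
    print(add_list_prefix("Re: [list] RE: topic", "[list]"))
    print(add_list_prefix("=?utf-8?b?UmU6IGNhZsOp?=", "[list]"))
```

This is the simple-but-lossy path: the original encoding of the Subject is not preserved, which is exactly the trade-off being argued about here.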
However, there are definitely mailing lists that don't do that. Google Groups is one example that doesn't collapse, and always prepends the headers in front of Re:. Seems like all the Python lists do the collapsing (I wonder why! :) ) Other lists don't do prepending (I think the RFCs recommend not prepending in Subject, actually), of the others I'm subscribed to, that prepend, some collapse and some don't. I'm saying that there are use cases where prepending could be done without decoding; while you are positing use cases where that is insufficient, but you shouldn't have said "Except"... you should have said "There are also other use cases". And when you collapse Re:, do you also collapse various language-specific spellings of Re: ??? that is a hard problem. And don't forget removing the prior prepended text before adding the new prepended text. Actually, as long as the prepended text is ASCII, all that work can be done on the encoded value. When it is not ASCII, it may still be separated and recognizable. Still that logic is more complex than decoding, handling as Unicode, and encoding.... when it works. Just pointing out that there is more than one way to do things... -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From mark at msapiro.net Fri Oct 9 01:20:23 2009 From: mark at msapiro.net (Mark Sapiro) Date: Thu, 8 Oct 2009 16:20:23 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACE6F97.6010605@g.nevcal.com> Message-ID: Glenn Linderman wrote: > >However, there are definitely mailing lists that don't do that. Google >Groups is one example that doesn't collapse, and always prepends the >headers in front of Re:. Seems like all the Python lists do the >collapsing (I wonder why! 
:) ) Other lists don't do prepending (I think >the RFCs recommend not prepending in Subject, actually), of the others >I'm subscribed to, that prepend, some collapse and some don't. You seem to be forgetting the case where the encoded subject already contains the prefix, or do you not care if the subject just continues to grow with Re:'s and repeated prefixes? -- Mark Sapiro The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan From rdmurray at bitdance.com Fri Oct 9 01:52:10 2009 From: rdmurray at bitdance.com (R. David Murray) Date: Thu, 8 Oct 2009 19:52:10 -0400 (EDT) Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACD95DB.4040800@g.nevcal.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org> <4ACD95DB.4040800@g.nevcal.com> Message-ID: On Thu, 8 Oct 2009 at 00:33, Glenn Linderman wrote: > On approximately 10/8/2009 12:16 AM, came the following characters from the > keyboard of R. David Murray: >> I'd like to try to summarize what I understand Barry to be saying > > Good summary! Deleted all but one point that I'd like to have clarified... Thanks. I have revised my summary to take into account the feedback received. Specifically: I reworded it so that parsing/serialization never raise errors, and that text methods for binary subparts may not make sense. I added the proposal for the object/attribute error method of handling errors in the query portion of the API. 
I replaced 'idempotent' with 'invertible' (I didn't use 'roundtrip' because it isn't euphonious as an adjective...I just couldn't bring myself to write 'roundtrippable'; however, if the consensus is that it is clearer, I will use it.) I've added a page to the email wiki[1] with this version of the summary, as a 'design overview proposal'. I'll also include the revised text here. Additional comments welcome. --David PS: I also updated the release targets and dates on the wiki. [1] http://wiki.python.org/moin/Email%20SIG ----------------------------------------------------------------- The email package consists of two major conceptual pieces: the API, and the internal data model. The API needs to have facilities for accepting data in either text format or bytes format, and this data is used to generate a model of the input message (a Message). Likewise the API needs to provide facilities for serializing a Message as either bytes or text. The API also provides ways to build up a Message from pieces, or to extract information from a Message in pieces, and to modify a Message, and again input and output as both text and bytes must be supported, except that in some cases text output may not make sense (e.g., binary attachments). The data model used by the email package is an "implementation detail", and we should not spend effort at this stage trying to optimize it for anything except memory requirements with respect to potentially large sub-objects, and even there it is more a matter of providing ways to deal with potentially large sub-objects than it is a true optimization. In general, correctness and robustness are much more important than speed. The data model will need to be a practical hybrid of the input data, possibly transformed in some way in some cases, and various sorts of meta-data. The current email package already works this way. 
An important characteristic of the model is that it be invertible whenever sensible; that is, if a given byte stream is used to create a Message or subobject, serializing that Message or subobject as bytes should return the original byte stream whenever sensible (i.e., when the data is not pathologically malformed). Likewise if a text stream is used to create a Message or subobject, serializing it as text should produce, whenever sensible, the original text stream. In particular, well-formed (per RFC) message data should always come out of a round trip through the email module in exactly the format it went in. An important property of the API is that both the parser that transforms an input stream into a Message and Message serialization should not raise exceptions. Instead a defects list is maintained and exposed through the API. In the face of some defects it may not be sensible to maintain invertibility. In the worst case for parser input the resulting Message object may have no headers, a binary blob body, and a defect list, but a Message object will always be produced. The APIs that manipulate the data model either for piecewise construction or for transformations may raise exceptions, and in most cases _should_ raise exceptions when encountering invalid data or operations. APIs that query the model should return as much information as possible without throwing an exception. (The current proposal to implement this is to return objects that have defect lists, and/or raise exceptions when methods of the object are called that would have worked if the input data were valid, leaving the queryable object itself in the hands of the application so that the application has the maximum possible information available to try to handle the error if it wishes to do so.) 
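As a rough illustration of the two properties in the summary above (round-tripping well-formed input, and recording defects instead of raising while parsing), here is a minimal sketch. It assumes the Python 3 standard-library email package as it exists today; `message_from_bytes` and `Message.as_bytes` are later additions that were not available when this summary was written, and `round_trips` and `parse_with_defects` are hypothetical helper names.

```python
from email import errors, message_from_bytes

# A small, well-formed message: the round trip should be byte-for-byte.
WELL_FORMED = (
    b"From: alice@example.com\n"
    b"Subject: hello\n"
    b"\n"
    b"body text\n"
)

# Claims to be multipart but declares no boundary parameter.
MALFORMED = (
    b"Content-Type: multipart/mixed\n"
    b"\n"
    b"not really a multipart body\n"
)


def round_trips(raw):
    """Hypothetical helper: parse bytes, serialize again, compare."""
    return message_from_bytes(raw).as_bytes() == raw


def parse_with_defects(raw):
    """Hypothetical helper: parsing does not raise under the default
    policy; problems are recorded on msg.defects instead."""
    msg = message_from_bytes(raw)
    return msg, list(msg.defects)


if __name__ == "__main__":
    print("round trip preserved:", round_trips(WELL_FORMED))
    msg, defects = parse_with_defects(MALFORMED)
    for defect in defects:
        print("defect:", type(defect).__name__)
    # The body is kept as an opaque payload rather than being dropped.
    print(repr(msg.get_payload()))
```

An application that wants stricter behaviour can inspect the defect list and decide for itself whether to reject the message, which is the division of labour the summary argues for.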
From steve at pearwood.info Fri Oct 9 04:10:40 2009 From: steve at pearwood.info (Steven D'Aprano) Date: Fri, 9 Oct 2009 13:10:40 +1100 Subject: [Email-SIG] fixing the current email module In-Reply-To: <20091008091840.GB28906@phd.pp.ru> References: <8510262.7231254589795083.JavaMail.root@boaz> <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org> <20091008091840.GB28906@phd.pp.ru> Message-ID: <200910091310.41133.steve@pearwood.info> On Thu, 8 Oct 2009 08:18:40 pm Oleg Broytman wrote: > > By keeping the various components clear in our mind, we can see > > that both statements are correct in a sense. The parser and > > generator should never raise exceptions. The model can and > > probably should. > > Are you going to parse any garbage and create a Message (probably > an empty Message) with one defect "cannot parse it at all"? So long as the raw garbage is available for the caller somehow, that seems like a reasonable approach to me. That lets an application display "Unparsable message" to the user, who can then ask to "View Source" (or equivalent) to get access to the raw bytes of the message. -- Steven D'Aprano From v+python at g.nevcal.com Fri Oct 9 05:20:29 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Thu, 08 Oct 2009 20:20:29 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: References: Message-ID: <4ACEABFD.6010309@g.nevcal.com> On approximately 10/8/2009 4:20 PM, came the following characters from the keyboard of Mark Sapiro: > Glenn Linderman wrote: > >> However, there are definitely mailing lists that don't do that. Google >> Groups is one example that doesn't collapse, and always prepends the >> headers in front of Re:. Seems like all the Python lists do the >> collapsing (I wonder why! :) ) Other lists don't do prepending (I think >> the RFCs recommend not prepending in Subject, actually), of the others >> I'm subscribed to, that prepend, some collapse and some don't. 
>> > > > You seem to be forgetting the case where the encoded subject already > contains the prefix, or do you not care if the subject just continues > to grow with Re:'s and repeated prefixes? > Mark, Please read the last two paragraphs of my message you replied to, two or three more times. Here they are again for reference. > And don't forget removing the prior prepended text before adding the > new prepended text. > > Actually, as long as the prepended text is ASCII, all that work can be > done on the encoded value. When it is not ASCII, it may still be > separated and recognizable. Still that logic is more complex than > decoding, handling as Unicode, and encoding.... when it works. Just > pointing out that there is more than one way to do things... -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From tkikuchi at is.kochi-u.ac.jp Fri Oct 9 05:47:00 2009 From: tkikuchi at is.kochi-u.ac.jp (Tokio Kikuchi) Date: Fri, 09 Oct 2009 12:47:00 +0900 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACEABFD.6010309@g.nevcal.com> References: <4ACEABFD.6010309@g.nevcal.com> Message-ID: <4ACEB234.9030309@is.kochi-u.ac.jp> >> Actually, as long as the prepended text is ASCII, all that work can be >> done on the encoded value. When it is not ASCII, it may still be >> separated and recognizable. Still that logic is more complex than >> decoding, handling as Unicode, and encoding.... when it works. Just >> pointing out that there is more than one way to do things... Oh, really? Base64 is 3 to 4 octets encoding and there is no way to prepend padding. -- Tokio Kikuchi, tkikuchi at is.kochi-u.ac.jp http://weather.is.kochi-u.ac.jp/ From stephen at xemacs.org Fri Oct 9 06:27:56 2009 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Fri, 09 Oct 2009 13:27:56 +0900 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACE6CBD.2030805@g.nevcal.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACE6CBD.2030805@g.nevcal.com> Message-ID: <87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp> Glenn Linderman writes: > > Conversions will eventually be done. "Best it were done quickly." > > Disagree. Deferring the conversions defers failure issues to the point > where the code (hopefully) somewhat understands the type of data being > manipulated, and can then handle it appropriately. Converting up front > causes errors in things that may never be touched or needed, so the > error detection and handling is wasteful. That's theory; my position is based on Mailman practice. Don't believe me, ask Barry. I also spend most of my OSS time on the internationalization of XEmacs, and the experience is similar there. Best to convert everything as early as possible, or admit that you don't know how. > So for headers, which are supposed to be ASCII, or encoded via RFC rules > to ASCII (no 8-bit chars), then the discovery of an 8-bit char should be > produce a defect report, but then simply converted to Unicode as if it > were Latin-1 (since there is no other knowledge available that could > produce a better conversion). No, that is already corruption. Most clients will assume that string is valid as a header, because it's valid as a string. > And if the result of that is not expected by the client (your > definition), then the client should either notice the defect report > and reject it based on that, or attempt to parse it, and reject it > if it encounters unexpected syntax. 
After all, this is, for that > client, "raw user input" (albeit from a remote source) so fully > error checking the input is appropriate. No way. That environment would suck to program in. And it's un-Pythonic: "Errors should never pass silently." > Python way. Since the email library is trying to avoid raising > exceptions in large blocks of its code, it is non-Pythonic I disagree with that. "Unless explicitly silenced." The strategy that Barry and I favor is to signal errors lazily. So we *explicitly* silence errors (at least of the Exception kind) when parsing. If we can't parse, we look for a part terminator, encapsulate the bad stuff and move on to the rest of the input. Later, at use time, *if* the unparsable object is used, *then* the error will be raised, hopefully with enough metainformation to figure out what to do about it. I don't see what's un-Pythonic about that. From v+python at g.nevcal.com Fri Oct 9 08:26:39 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Thu, 08 Oct 2009 23:26:39 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: <87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACE6CBD.2030805@g.nevcal.com> <87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4ACED79F.6050602@g.nevcal.com> On approximately 10/8/2009 9:27 PM, came the following characters from the keyboard of Stephen J. Turnbull: > Glenn Linderman writes: > > > > Conversions will eventually be done. "Best it were done quickly." > > > > Disagree. 
Deferring the conversions defers failure issues to the point > > where the code (hopefully) somewhat understands the type of data being > > manipulated, and can then handle it appropriately. Converting up front > > causes errors in things that may never be touched or needed, so the > > error detection and handling is wasteful. > > That's theory; my position is based on Mailman practice. Don't believe > me, ask Barry. I also spend most of my OSS time on the > internationalization of XEmacs, and the experience is similar there. > Best to convert everything as early as possible, or admit that you > don't know how. > Emacs is different than email. Either you can read a file to edit it, or you can't. The Postel principle for email says to try to do the best you can, for as much as you can. > > So for headers, which are supposed to be ASCII, or encoded via RFC rules > > to ASCII (no 8-bit chars), then the discovery of an 8-bit char should be > > produce a defect report, but then simply converted to Unicode as if it > > were Latin-1 (since there is no other knowledge available that could > > produce a better conversion). > > No, that is already corruption. Most clients will assume that string > is valid as a header, because it's valid as a string. > Sure it is corruption. That's why there is a defect report. But the conversion technique is appropriate, per the Postel principle. > > And if the result of that is not expected by the client (your > > definition), then the client should either notice the defect report > > and reject it based on that, or attempt to parse it, and reject it > > if it encounters unexpected syntax. After all, this is, for that > > client, "raw user input" (albeit from a remote source) so fully > > error checking the input is appropriate. > > No way. That environment would suck to program in. And it's > un-Pythonic: "Errors should never pass silently." 
> Then the Postel principle is un-Pythonic, and to be Pythonic any incorrect email should produce an error, and be unreadable. Again, I mentioned producing a defect report. That is not passing an error silently. It is still raw user input, and should still be checked for proper syntax by the client, even if the email is well-formed and conversion produces no defect report. If you don't want to check proper syntax in your program inputs, I don't want to use your programs, they will be insecure. > > Python way. Since the email library is trying to avoid raising > > exceptions in large blocks of its code, it is non-Pythonic > > I disagree with that. "Unless explicitly silenced." The strategy > that Barry and I favor is to signal errors lazily. So we *explicitly* > silence errors (at least of the Exception kind) when parsing. If we > can't parse, we look for a part terminator, encapsulate the bad stuff > and move on to the rest of the input. Later, at use time, *if* the > unparsable object is used, *then* the error will be raised, hopefully > with enough metainformation to figure out what to do about it. > So there seem to be two techniques: 1) convert quickly, but don't raise errors... instead build metainformation structures that record the errors, and raise them later if the converted data is accessed. Because some kinds of not-quite-perfect data have alternate handling techniques, either all techniques must be performed and cached, or *some processing must be deferred until the client can decide*. 2) Store the data, and convert only if the data is accessed. When the client accesses the data, the exceptions raised allow the client to choose an appropriate processing technique for handling the not-quite-perfect data, based on the context of the client, the importance of that data item, etc. Only the result of that technique need be cached for future accesses. With both techniques, the data is given to the email library, and the errors are not seen until later... 
potentially the exact same user experience. But with technique 1, much effort is expended to convert data, parse data, and create error metainformation ready to return IF the data is accessed. (yeah, don't say it, premature optimization -- I call it design, in this case) With technique 2, little effort is required to store the data, create a state variable to indicate whether it has been converted and parsed, or not, and then IF (and only IF) the data is accessed, the conversion and parsing must be done on the first access, and instead of creating and storing metainformation about the errors, they could just be raised. > I don't see what's un-Pythonic about that. > The un-Pythonic thing is returning defect reports instead of raising errors. There is no way for a simple assignment interface to return an error, because the API for simple assignment doesn't have an in-band signaling mechanism. No "condition code" left around to be checked. And programmers often omit checking condition codes anyway, due to laziness and hubris "nothing will go wrong with THIS statement". So the Pythonic way, AFAIU, is that errors are returned out-of-band via raised exceptions. Perhaps this is why it is so hard to design a Pythonic interface to the Postel principle email handling... an out-of-band signalling system interrupts the flow of control, and the Postel principle wants to provide best-as-you-can data... and the easiest way to do Postel is to supply the not-quite-perfect data so the normal control flow can handle things, yet an out-of-band signal can't easily return to the normal control flow, and wrapping tiny try blocks around nearly every email API call is as annoying to the understanding of the control flow as putting all those if statements in the normal control flow to check "condition codes" (error codes, warning codes, defect reports, whatever you want to call them). 
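The trade-off described above, try blocks around API calls versus explicit checks of defect reports, can be sketched against the present-day standard-library email package. This is only a minimal sketch, assuming Python 3.6 or later (which postdates this thread); the sample message and the helper names `defect_names` and `strict_parse_fails` are made up for illustration, while `msg.defects`, `email.policy.default`, and `email.policy.strict` belong to the library.

```python
import email
from email import errors, policy

# Hypothetical sample: claims to be multipart but gives no boundary parameter.
RAW = b"Content-Type: multipart/mixed\r\n\r\nnot really multipart\r\n"


def defect_names(raw):
    """'Condition code' style: parse leniently, then inspect msg.defects."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    return [type(d).__name__ for d in msg.defects]


def strict_parse_fails(raw):
    """Exception style: policy.strict raises a defect instead of recording it."""
    try:
        email.message_from_bytes(raw, policy=policy.strict)
    except errors.MessageDefect:
        return True
    return False


if __name__ == "__main__":
    print(defect_names(RAW))
    print(strict_parse_fails(RAW))
```

In the first style the caller has to remember to look at the defect list; in the second the caller cannot overlook the problem, but also gets no best-effort message object back.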
Stated another way, it is hard to process potentially not-quite-perfect data without writing complex code. And because the email library wants to simplify the handling of email, it wants to limit the complexity of the client code. But when dealing with not-quite-perfect data, there is a choice of different ways to handle it, and the email library doesn't know the best choice for any particular client application... if it did, then it could make the choices, and the client could be less complex. The simplest client could be handed only perfectly structured, 100% accurately decodable email messages... its logic would be (simply, and Pythonically):

    while True:
        try:
            getEmail()
        except Exception:
            logBadEmailReceived()
        else:
            processEmail()

In order to allow defect reports to be useful, the client logic must be more complex; getEmail must be expanded to make decisions based on the content of the defect reports. More try statements must be used, at a finer granularity, or more if statements to check defect reports. The former is more Pythonic, the latter less, AFAIU. Perhaps a given client knows how it wants to handle all types of not-quite-perfect data -- should the email library allow rules to be set, so that when a situation arises, it can handle it according to the rules? This simplifies the client logic, at the cost of initialization setup, rules creation and caching, documenting the rules, adding the new APIs that don't seem to exist in today's email library. While this could perhaps simplify many clients, it cannot simplify the email library... it still has to have the code for all the variant perfect and not-quite-perfect data handling techniques, plus the complexity of rule definition and usage. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. 
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From v+python at g.nevcal.com Fri Oct 9 08:31:32 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Thu, 08 Oct 2009 23:31:32 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACEB234.9030309@is.kochi-u.ac.jp> References: <4ACEABFD.6010309@g.nevcal.com> <4ACEB234.9030309@is.kochi-u.ac.jp> Message-ID: <4ACED8C4.5070906@g.nevcal.com> On approximately 10/8/2009 8:47 PM, came the following characters from the keyboard of Tokio Kikuchi: >>> Actually, as long as the prepended text is ASCII, all that work can be >>> done on the encoded value. When it is not ASCII, it may still be >>> separated and recognizable. Still that logic is more complex than >>> decoding, handling as Unicode, and encoding.... when it works. Just >>> pointing out that there is more than one way to do things... >>> > > Oh, really? > > Base64 is 3 to 4 octets encoding and there is no way to prepend padding. > In header values, encoding is done using encoded-words. A header value consists of a sequence of ASCII words, and encoded-words. While an encoded word, that uses base64 encoding cannot easily be adjusted to prepend data into that encoded-word, additional ASCII or encoded-words can be prepended in front of the other ASCII or encoded words within the header-value. So, yes, really! -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. 
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From tkikuchi at is.kochi-u.ac.jp Fri Oct 9 10:38:03 2009 From: tkikuchi at is.kochi-u.ac.jp (Tokio Kikuchi) Date: Fri, 09 Oct 2009 17:38:03 +0900 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACED8C4.5070906@g.nevcal.com> References: <4ACEABFD.6010309@g.nevcal.com> <4ACEB234.9030309@is.kochi-u.ac.jp> <4ACED8C4.5070906@g.nevcal.com> Message-ID: <4ACEF66B.3000500@is.kochi-u.ac.jp> Glenn Linderman wrote: > On approximately 10/8/2009 8:47 PM, came the following characters from > the keyboard of Tokio Kikuchi: >>>> Actually, as long as the prepended text is ASCII, all that work can be >>>> done on the encoded value. When it is not ASCII, it may still be >>>> separated and recognizable. Still that logic is more complex than >>>> decoding, handling as Unicode, and encoding.... when it works. Just >>>> pointing out that there is more than one way to do things... >> >> Oh, really? >> >> Base64 is 3 to 4 octets encoding and there is no way to prepend padding. >> > > In header values, encoding is done using encoded-words. A header value > consists of a sequence of ASCII words, and encoded-words. While an > encoded word, that uses base64 encoding cannot easily be adjusted to > prepend data into that encoded-word, additional ASCII or encoded-words > can be prepended in front of the other ASCII or encoded words within the > header-value. > > So, yes, really! > The following two lines have equivalent header contents:

    Re: [mmjp-users 123] =?iso-2022-jp?b?GyRCRnxLXDhsGyhC?=
    Re: =?iso-2022-jp?b?W21tanAtdXNlcnMgMTIzXSAbJEJGfEtcOGwbKEI=?=

I'd like to see how you can extract the ASCII part without touching the rest of the encoded word in the second example. What we do in Mailman is treat both equally: we delete [mmjp-users 123] from the subject and prefix it again with [mmjp-users 124] (with the new sequential number). Some MUAs encode subjects like the second example, and this is beyond our control. 
Therefore, we are forced to decode the whole header content. -- Tokio Kikuchi, tkikuchi at is.kochi-u.ac.jp http://weather.is.kochi-u.ac.jp/ From phd at phd.pp.ru Fri Oct 9 12:54:33 2009 From: phd at phd.pp.ru (Oleg Broytman) Date: Fri, 9 Oct 2009 14:54:33 +0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <200910091310.41133.steve@pearwood.info> References: <8510262.7231254589795083.JavaMail.root@boaz> <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org> <20091008091840.GB28906@phd.pp.ru> <200910091310.41133.steve@pearwood.info> Message-ID: <20091009105433.GA9096@phd.pp.ru> On Fri, Oct 09, 2009 at 01:10:40PM +1100, Steven D'Aprano wrote: > On Thu, 8 Oct 2009 08:18:40 pm Oleg Broytman wrote: > > Are you going to parse any garbage and create a Message (probably > > an empty Message) with one defect "cannot parse it at all"? > > So long as the raw garbage is available for the caller somehow, that > seems like a reasonable approach to me. That lets an application > display "Unparsable message" to the user, who can then ask to "View > Source" (or equivalent) to get access to the raw bytes of the message. I don't see any difference with "raise an exception; the calling application catches the exceptions, displays or logs "Unparseable message", and displays or logs the original garbage (that can be an attribute of the exception instance)". The difference AFAIU could be between not well-formed messages and complete garbage. A not well-formed input will be parsed to a Message, and such parsing requires a clever algorithm with resynchronizations (jumps from a bad point to a recognized good point to restart parsing there). I don't know if it's possible to create such a clever algorithm; and for complete unparseable garbage I still prefer an exception. Oleg. -- Oleg Broytman http://phd.pp.ru/ phd at phd.pp.ru Programmers don't die, they just GOSUB without RETURN. 
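The alternative described above, an exception that carries the original input so the application can still offer "View Source", might look roughly like the following. This is a minimal sketch: `UnparseableMessage`, `parse_or_raise`, `display`, and the toy "first line must be a header" check are all hypothetical names invented for illustration and are not part of the email package.

```python
class UnparseableMessage(Exception):
    """Hypothetical exception that keeps the raw input for 'View Source'."""

    def __init__(self, raw, reason):
        super().__init__(reason)
        self.raw = raw


def parse_or_raise(raw):
    """Toy parser: insists that the first line looks like 'Name: value'."""
    first_line = raw.split(b"\n", 1)[0]
    if b":" not in first_line:
        raise UnparseableMessage(raw, "no header found on first line")
    name, _, value = first_line.partition(b":")
    return {name.strip().decode("ascii", "replace"): value.strip()}


def display(raw):
    """What a calling application might do with either outcome."""
    try:
        headers = parse_or_raise(raw)
    except UnparseableMessage as exc:
        # The original bytes travel with the exception instance.
        return "Unparseable message (%d raw bytes kept)" % len(exc.raw)
    return "Parsed headers: %s" % sorted(headers)
```

Either way the caller ends up with the raw bytes; the open question in the thread is only whether they arrive on an exception or on a defective Message object.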
From barry at python.org Fri Oct 9 13:56:12 2009 From: barry at python.org (Barry Warsaw) Date: Fri, 9 Oct 2009 07:56:12 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <87my41pob0.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org> <4ACD95DB.4040800@g.nevcal.com> <440B5F4C-E210-46F0-B647-240CDF091F4D@python.org> <87my41pob0.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Oct 8, 2009, at 5:29 PM, Stephen J. Turnbull wrote: > Most non-text media do support comments, though. I don't know if > extracting comments is a reasonable response to a request for text > from an image, but we should provide a place to put any text that the > callbacks that do the actual work of decoding might return. Are you talking about comments embedded in things like id3 tags and jpg comments? If so, ISTM those are outside the scope of the email package. Message objects can return decoded payloads, but I don't think it should provide the framework for looking inside those payloads. >>> However, I think it is proper that a MIME part that is not flagged >>> as text/* might produce an error if asked for as text. >> >> +1 > > That doesn't preclude raising an error/returning a defect object in > many or most use cases, but there may be use cases where it would be > useful to allow a callback on a non-text object to return text. I think we should re-cast the discussion in terms of returning raw and decoded payloads. The email package can provide methods for returning raw payloads as bytes and decoded payloads in the natural type as described by Content-Type. 
For the latter, we probably need a registration and plugin system to handle types that email doesn't know about by default, but it should also be used to handle types it does know about. That way, an application could override e.g. decoding text/html content if it wanted to. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Fri Oct 9 14:05:44 2009 From: barry at python.org (Barry Warsaw) Date: Fri, 9 Oct 2009 08:05:44 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACE6A1B.7060702@g.nevcal.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> <4ACE6A1B.7060702@g.nevcal.com> Message-ID: <3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org> On Oct 8, 2009, at 6:39 PM, Glenn Linderman wrote: > 1) wire format. Either what came in, in the parser case, or what > would be generated. > 2) internal headers from the MIME part > 3) decoded BLOB. This means that quopri and base64 are decoded, no > more and no less. This is bytes. No headers, only payload. For > Content-Transfer-Encoding: binary, this is mostly a noop. > 4) text/* parts should also be obtainable as str()/unicode(), > payload only. This is where charset decoding is done. > > I think your talk in the next paragraph about hooks and other object > types being produced is a generalization of 4, not 3, and generally > no additional decoding needs to be done, just conversion to the > right object type (or file, or file-like object). I mostly agree with that. 
I've always called #4 the "decoded payload" and #3 I've usually called the "raw payload". Maybe we can bikeshed on better terms to help inform us about the API's method/attribute names. Which brings up another point: right now Message objects have a single .get_payload() method that takes a flag to indicate whether it should be the decoded or raw payload. That's bong. These should be different interfaces. >> The problem is that if the bytes came off the wire, the parser >> currently can only attach the most basic MIME base class. It >> doesn't know that an image/png should create a MIMEImagePNG >> instance there. This is different from hacking the model directly >> because the application can instantiate the right class. So the >> parser either has to have a hookable way for an application to go >> from content-type to class, or the generic MIME base class needs to >> be hookable in its .decode() method. > > So either the email package can stop at 3, and 4 only for text/* > parts, or it could learn more types (registered types, with well- > defined corresponding objects could be potentially built-in to the > email package), and/or it could become hookable for application > types. Of course, for disposition to files, storing the BLOB in a > file of the right name is adequate... to avoid the file, I agree > that converting to a useful object type is handy. But maybe file- > like objects would suffice, for most of the types. My own preference here is that email does support #4 with a registration system to handle returning concrete payload objects based on the Content-Type. I also think that the email package probably should not implement "store-payloads-on-disk" by default, although it may provide some example implementations for simple applications (much the same way there's wsgiref for simple applications). Still, that's different than say, storing attachments in a file named by the Content-Disposition header's filename parameter. 
That latter is firmly in the domain of the application. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Fri Oct 9 14:23:23 2009 From: barry at python.org (Barry Warsaw) Date: Fri, 9 Oct 2009 08:23:23 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACE6CBD.2030805@g.nevcal.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACE6CBD.2030805@g.nevcal.com> Message-ID: <17899606-EE28-4800-A05D-95525AF90E3E@python.org> On Oct 8, 2009, at 6:50 PM, Glenn Linderman wrote: > On approximately 10/8/2009 4:40 AM, came the following characters > from the keyboard of Stephen J. Turnbull: >> Glenn Linderman writes: >> >> > > > If conversions are avoided, then octets are unlikely to be >> out of > > > range? >> > > >> > > Haven't looked in your spam bucket recently, I guess. Spammers >> > > regularly put 8 bit characters into headers (and into bodies in >> > > messages without a Content-Type header), for one thing. >> > > I'm aware of that, but if conversions are not done, octets are >> unlikely > to be _reported_ to be out of range.... >> >> Conversions will eventually be done. "Best it were done quickly." >> > > Disagree. Deferring the conversions defers failure issues to the > point where the code (hopefully) somewhat understands the type of > data being manipulated, and can then handle it appropriately. > Converting up front causes errors in things that may never be > touched or needed, so the error detection and handling is wasteful. I'm with Stephen here. 
Remember, we're saying the parser should never throw an exception, so any such conversion exception happens when you manipulate the model directly. That /has/ to error early because otherwise it is impossible to debug. > So for headers, which are supposed to be ASCII, or encoded via RFC > rules to ASCII (no 8-bit chars), then the discovery of an 8-bit char > should be produce a defect report, but then simply converted to > Unicode as if it were Latin-1 (since there is no other knowledge > available that could produce a better conversion). And if the > result of that is not expected by the client (your definition), then > the client should either notice the defect report and reject it > based on that, or attempt to parse it, and reject it if it > encounters unexpected syntax. After all, this is, for that client, > "raw user input" (albeit from a remote source) so fully error > checking the input is appropriate. Sure, but I can also think of lots of other things the client might do, including blowing away the header value and substituting their own, doing the moral equivalent of a str.replace(), etc. etc. It's not our job to decide. It's our job to provide the highest fidelity information we can and the best APIs for clients to do what they want. > The problem with the APIs that are spelled __str__ and __bytes__ is > that there is no other way to return errors other than > exceptions.... the Python way. Since the email library is trying to > avoid raising exceptions in large blocks of its code, it is non- > Pythonic (which is what Oleg is probably complaining about, in > part). But because it needs to avoid exceptions, and is therefore > non-Pythonic, it may be inappropriate to spell very many of its APIs > __str__ and __bytes__, because that is Pythonic, and requires > exceptions. Once you become non-Pythonic in one area, you may have > to also be non-Pythonic in some other areas... 
As was pointed out in a previous message, we shouldn't be too concerned with __str__ and __bytes__ right now. We'll design non- magical APIs for everything and they'll do the right thing. We'll then alias what seems appropriate as __str__ and __bytes__ and they'll be as Pythonic as makes sense. When I say that, I'm thinking about the semantic differences Message objects currently have in their dict- like-plus API (which I still think makes perfect practical sense). -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Fri Oct 9 14:25:15 2009 From: barry at python.org (Barry Warsaw) Date: Fri, 9 Oct 2009 08:25:15 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACE6F97.6010605@g.nevcal.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> <4ACE60CA.6010907@g.nevcal.com> <4ACE6F97.6010605@g.nevcal.com> Message-ID: <3685E116-E57A-43EF-89E2-1FB03D878C0E@python.org> On Oct 8, 2009, at 7:02 PM, Glenn Linderman wrote: > Well, that is a feature of some mailing list programs. Those that > want to do that, will have to decode and re-encode. > > However, there are definitely mailing lists that don't do that. > Google Groups is one example that doesn't collapse, and always > prepends the headers in front of Re:. Seems like all the Python > lists do the collapsing (I wonder why! 
:) ) Other lists don't do > prepending (I think the RFCs recommend not prepending in Subject, > actually), of the others I'm subscribed to, that prepend, some > collapse and some don't. > > I'm saying that there are use cases where prepending could be done > without decoding; while you are positing use cases where that is > insufficient, but you shouldn't have said "Except"... you should > have said "There are also other use cases". > > And when you collapse Re:, do you also collapse various language- > specific spellings of Re: ??? that is a hard problem. I don't disagree with any of that. It's all firmly in the scope of the application, not the email package. The email package just has to make it possible. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Fri Oct 9 14:27:20 2009 From: barry at python.org (Barry Warsaw) Date: Fri, 9 Oct 2009 08:27:20 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <1254929486.96.16481@mint-julep.mondoinfo.com> <20091007170718.GA1901@phd.pp.ru> <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org> <4ACD95DB.4040800@g.nevcal.com> Message-ID: <1ABDE764-850E-40E0-9491-01E9ECA78DC7@python.org> On Oct 8, 2009, at 7:52 PM, R. David Murray wrote: > [1] http://wiki.python.org/moin/Email%20SIG Fantastic David, thanks! -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Fri Oct 9 15:21:17 2009 From: barry at python.org (Barry Warsaw) Date: Fri, 9 Oct 2009 09:21:17 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACED79F.6050602@g.nevcal.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACE6CBD.2030805@g.nevcal.com> <87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACED79F.6050602@g.nevcal.com> Message-ID: On Oct 9, 2009, at 2:26 AM, Glenn Linderman wrote: > Then the Postel principle is un-Pythonic, and to be Pythonic any > incorrect email should produce an error, and be unreadable. Again, I > mentioned producing a defect report. That is not passing an error > silently. There's no conflict between principles here if you keep clear in your mind the two different patterns we're talking about. When parsing raw data, we soldier on in the face of errors as best we can, never raising exceptions, but recording defects. When manipulating the model, we throw exceptions as early as possible because these are application errors and the client controls the application. > The un-Pythonic thing is returning defect reports instead of raising > errors. There is no way for a simple assignment interface to return > an error, because the API for simple assignment doesn't have an in- > band signaling mechanism. This "assignment interface" falls under "manipulating the model". It does reveal an important point though: the parser may not be able to use the same API that model manipulation uses. It may need to use a lower-level (read: more permissive) interface to the model. 
The current parser mostly works well though because the current model doesn't do any standards checking. Wanna create a 10k Subject header? Fine! In practice this works well, so perhaps we need to think about how "RFC enforcement" can be overlaid on the model.

-Barry

From stephen at xemacs.org Fri Oct 9 17:10:18 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sat, 10 Oct 2009 00:10:18 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACED79F.6050602@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACE6CBD.2030805@g.nevcal.com> <87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACED79F.6050602@g.nevcal.com>
Message-ID: <87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp>

Glenn Linderman writes:

> Emacs is different than email. Either you can read a file to edit it,
> or you can't.

*sigh* Emacs is as powerful a programming environment as Python, and applications regularly deal with network streams (HTTP, NNTP, and SMTP most commonly, but also raw X protocol and any kind of socket supported by the platform). So, yes, it's different from email, because it's *far* more general. That's precisely why I appreciate Bill's concerns about non-email usage.

> The Postel principle for email says to try to do the best you can,
> for as much as you can.

Actually, it doesn't. It says be lenient in what you accept, strict in what you emit. You accept it ... but you don't have to do anything with it except preserve it verbatim for whoever wants it.
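
For illustration only (this is not code from the thread): the two patterns distinguished above, lenient parsing that records defects versus model manipulation that raises, can be observed in the email package shipped with current Python 3 releases, which postdates this discussion. The sample message and the example.com addresses are made up.

```python
from email import errors, message_from_bytes
from email.message import EmailMessage

# A multipart message with no boundary parameter is malformed, but the
# parser does not raise. It records defects on the message object instead.
raw = b"Content-Type: multipart/mixed\r\n\r\nnot really multipart\r\n"
parsed = message_from_bytes(raw)
defect_types = [type(d) for d in parsed.defects]
print(defect_types)

# Manipulating the model is an application action, so it fails early.
msg = EmailMessage()
msg["To"] = "first@example.com"
try:
    msg["To"] = "second@example.com"  # a second To header is rejected
    raised = False
except ValueError as exc:
    raised = True
    print("rejected:", exc)
```

The raw input is still held by the parsed message, so a client that ignores `defects` sees whatever the parser could salvage, which is exactly the behaviour being argued over here.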
> > > produce a defect report, but then simply converted to Unicode as if it > > > were Latin-1 (since there is no other knowledge available that could > > > produce a better conversion). > > > > No, that is already corruption. Most clients will assume that string > > is valid as a header, because it's valid as a string. > > Sure it is corruption. That's why there is a defect report. But > the conversion technique is appropriate, per the Postel principle. Actually, I would say you are emitting leniently, in violation of the Postel principle. You don't know what the client will do, they may eat it in a single gulp without looking at it. Thus you should avoid converting anything that you don't know what it is (unless specifically asked to do your best). > Again, I mentioned producing a defect report. That is not passing > an error silently. But if I access that Unicode object without looking at the defect report, you *will* pass the error silently. OTOH, if I look at the defect report, I won't access the Unicode object. > It is still raw user input, and should still be checked for proper > syntax by the client, Nonsense. The email module had better know a lot more about syntax than the client. If it doesn't, whack it with a 2x4 until it learns! > produces no defect report. If you don't want to check proper syntax in > your program inputs, I don't want to use your programs, they will be > insecure. So you're saying that every program that uses the email module should reproduce 100% of the functionality of the email module's parser, or it's insecure. And you imply that's an excuse for passing corrupt data to any client that asks for it. I disagree. > So there seem to be two techniques: Whatever gave you that idea? > 2) Store the data, and convert only if the data is accessed. > With technique 2, little effort is required to store the data, > create a state variable to indicate whether it has been converted Why do that? It's always "False" in technique 2. 
> and parsed, or not, and then IF (and only IF) the data is accessed, > the conversion and parsing must be done on the first access, and > instead of creating and storing metainformation about the errors, > they could just be raised. No, they cannot just be raised. If you just raise the error, then the next time you try to access unparsed data, you'll hit the error again. If you use the same handler you did before, you're in an infloop. So you need a second handler to do things differently this time or a flag ... but it's unclear to me that that flag can be a boolean. So you may as well store the defect list and information about where to restart. > So the Pythonic way, AFAIU, is that errors are returned out-of-band > via raised exceptions. Sure. But what you're missing is that "Neither rain, nor snow, nor dark of night may stop the Parser on her appointed rounds." It is not easy to write parsers, but I'll tell you one thing: it's orders of magnitude harder to write a parser that starts in the middle and works outward, than one that starts at the beginning and works forward to the end. So it's OK to write a lazy parser, but it must retain enough state so that it can work forward until the end. Because you don't know that the client will not request the last character of the message, you need to be able to try to get it, no matter what happened to the first 10GB of the message. And if an exception occurs, it must be handled by the parser itself; if not, you put the poor thing in the position of starting over at the beginning (that way lies the madness of infloops), or trying to start a parse in the middle and work out. 
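
As a rough sketch of the point about retained state (not code from the thread, and not an email-package API): a lazily decoded header can handle its own decoding error, remember that it already tried, and keep the defect, so a second access neither re-raises nor loops, and later data stays reachable. The class name `LazyHeader` and its attributes are hypothetical.

```python
class LazyHeader:
    """Hypothetical lazily decoded header; not an email-package class."""

    def __init__(self, raw: bytes):
        self.raw = raw          # wire bytes, always preserved verbatim
        self.defects = []       # filled in by the first decode attempt
        self._parsed = False    # state: has decoding been attempted yet?
        self._value = None      # decoded text, or None if undecodable

    def value(self):
        if not self._parsed:
            try:
                self._value = self.raw.decode("ascii")
            except UnicodeDecodeError as exc:
                # The error is handled here, inside the parser: record it
                # and carry on rather than letting it escape.
                self.defects.append(f"non-ASCII octet at offset {exc.start}")
            self._parsed = True
        return self._value


headers = [LazyHeader(b"Subject: caf\xe9"), LazyHeader(b"X-Last: ok")]
first = headers[0].value()   # undecodable: None, with a defect recorded
again = headers[0].value()   # cached result, no second decode attempt
last = headers[1].value()    # data after the bad header is still reachable
```

Note that the state here is more than a boolean in practice: the defect list travels with the "already parsed" flag, which is the combination argued for above.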
From v+python at g.nevcal.com Fri Oct 9 20:01:01 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Fri, 09 Oct 2009 11:01:01 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: <3685E116-E57A-43EF-89E2-1FB03D878C0E@python.org> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> <4ACE60CA.6010907@g.nevcal.com> <4ACE6F97.6010605@g.nevcal.com> <3685E116-E57A-43EF-89E2-1FB03D878C0E@python.org> Message-ID: <4ACF7A5D.6030906@g.nevcal.com> On approximately 10/9/2009 5:25 AM, came the following characters from the keyboard of Barry Warsaw: > On Oct 8, 2009, at 7:02 PM, Glenn Linderman wrote: > >> Well, that is a feature of some mailing list programs. Those that >> want to do that, will have to decode and re-encode. >> >> However, there are definitely mailing lists that don't do that. >> Google Groups is one example that doesn't collapse, and always >> prepends the headers in front of Re:. Seems like all the Python >> lists do the collapsing (I wonder why! :) ) Other lists don't do >> prepending (I think the RFCs recommend not prepending in Subject, >> actually), of the others I'm subscribed to, that prepend, some >> collapse and some don't. >> >> I'm saying that there are use cases where prepending could be done >> without decoding; while you are positing use cases where that is >> insufficient, but you shouldn't have said "Except"... you should have >> said "There are also other use cases". >> >> And when you collapse Re:, do you also collapse various >> language-specific spellings of Re: ??? that is a hard problem. > > I don't disagree with any of that. 
It's all firmly in the scope of > the application, not the email package. The email package just has to > make it possible. Yes. So since the application has such latitude to make such decisions, it seems that the email package should do minimal parsing, analysis, and decoding of incoming messages until such time as the application chooses to request particular information. So it seems there need to be APIs to retrieve and set (using your terminology from another reply) wire format header values, as well as decoded header values. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From barry at python.org Fri Oct 9 20:05:09 2009 From: barry at python.org (Barry Warsaw) Date: Fri, 9 Oct 2009 14:05:09 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACF7A5D.6030906@g.nevcal.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> <4ACE60CA.6010907@g.nevcal.com> <4ACE6F97.6010605@g.nevcal.com> <3685E116-E57A-43EF-89E2-1FB03D878C0E@python.org> <4ACF7A5D.6030906@g.nevcal.com> Message-ID: On Oct 9, 2009, at 2:01 PM, Glenn Linderman wrote: > So it seems there need to be APIs to retrieve and set (using your > terminology from another reply) wire format header values, as well > as decoded header values. Yes, I think everyone agrees that we need both low-level and higher level APIs. (Or at least I hope so! :) -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From v+python at g.nevcal.com Fri Oct 9 20:59:25 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Fri, 09 Oct 2009 11:59:25 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: <3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> <4ACE6A1B.7060702@g.nevcal.com> <3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org> Message-ID: <4ACF880D.5080305@g.nevcal.com> On approximately 10/9/2009 5:05 AM, came the following characters from the keyboard of Barry Warsaw: > On Oct 8, 2009, at 6:39 PM, Glenn Linderman wrote: >> 1) wire format. Either what came in, in the parser case, or what >> would be generated. >> 2) internal headers from the MIME part >> 3) decoded BLOB. This means that quopri and base64 are decoded, no >> more and no less. This is bytes. No headers, only payload. For >> Content-Transfer-Encoding: binary, this is mostly a noop. >> 4) text/* parts should also be obtainable as str()/unicode(), payload >> only. This is where charset decoding is done. >> >> I think your talk in the next paragraph about hooks and other object >> types being produced is a generalization of 4, not 3, and generally >> no additional decoding needs to be done, just conversion to the right >> object type (or file, or file-like object). > I mostly agree with that. I've always called #4 the "decoded payload" > and #3 I've usually called the "raw payload". 
Maybe we can bikeshed > on better terms to help inform us about the API's method/attribute names. It would be good though to have standardized terms for easier communication. Maybe as they are chosen, they could be added to that Wiki RDM set up? My only problem with "raw" and "decoded" payload, is that there are 3 payload formats, not 2, so there needs to be a 3rd term, corresponding to #1, #3, and #4, above. #2 is somewhat orthogonal from the payload. To me, "raw" conjures up #1, not #3. If Content-Transfer-Encoding is 7bit, 8bit, or binary, then 2 is the same as 1, it is just a terminology change. Only for Content-Transfer-Encoding of quoted-printable or base64 must work be done to convert from #1 to #3. If Content-Type is text/*, then the transformation from 2 to 3 is more than a cast, but for many other formats, it is mostly a cast. > Which brings up another point: right now Message objects have a single > .get_payload() method that takes a flag to indicate whether it should > be the decoded or raw payload. That's bong. These should be > different interfaces. Separate APIs would be clearer, but for compatibility, should .get_payload() be retained, with the flag? Fortunately, there is only one result value in any case, so it is just a matter of what the type of that output value is, and how it should be handled. Perhaps the flag parameter should be extended to allow retrieval of all three payload formats instead of only two? .get_payload could be converted to call the appropriate specific APIs, should it be desired to invent separate APIs for each payload format. >>> The problem is that if the bytes came off the wire, the parser >>> currently can only attach the most basic MIME base class. It >>> doesn't know that an image/png should create a MIMEImagePNG instance >>> there. This is different from hacking the model directly because >>> the application can instantiate the right class. 
So the parser >>> either has to have a hookable way for an application to go from >>> content-type to class, or the generic MIME base class needs to be >>> hookable in its .decode() method. >> >> So either the email package can stop at 3, and 4 only for text/* >> parts, or it could learn more types (registered types, with >> well-defined corresponding objects could be potentially built-in to >> the email package), and/or it could become hookable for application >> types. Of course, for disposition to files, storing the BLOB in a >> file of the right name is adequate... to avoid the file, I agree that >> converting to a useful object type is handy. But maybe file-like >> objects would suffice, for most of the types. > > My own preferences here is that email does support #4 with a > registration system to handle returning concrete payload objects based > on the Content-Type. Sure, a registration system is fine. It could work for any type that has a method that can be registered, that accepts a binary BLOB and returns an appropriate typed and functioning object that can manipulate that type. That would mean that the application would have to make all the registration calls up front, instead of making the API calls when the objects are retrieved. Basically, if the email package doesn't have a registration system that the application can use, the application has to invent its own, so this is work that could benefit all applications. I suppose the default registration for text/* would be to convert from whatever to Unicode, and the default registration for all other Content-Type would be to pass back bytes(). Or maybe a few other common types, for which specific types are available, some specific image/* types, perhaps, that seems to have MIME types defined for them, although perhaps people may still prefer to register, say, a PIL type, for images, so I agree the email package should only provide default registrations. 
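
A minimal sketch of what such a registration system might look like, assuming a plain dictionary keyed by content type; the names `register` and `convert` are hypothetical and do not exist in the email package. The default registration decodes text/* to str, and every other type falls back to returning the bytes unchanged.

```python
from typing import Callable, Dict

# Hypothetical registry: content type -> converter that turns the decoded
# payload bytes (#3 above) into a typed object (#4 above).
_converters: Dict[str, Callable[[bytes, str], object]] = {}


def register(content_type: str, converter: Callable[[bytes, str], object]) -> None:
    _converters[content_type.lower()] = converter


def convert(content_type: str, payload: bytes, charset: str = "us-ascii"):
    ctype = content_type.lower()
    maintype = ctype.split("/", 1)[0]
    # Most specific match first, then "maintype/*", then the bytes default.
    for key in (ctype, maintype + "/*"):
        if key in _converters:
            return _converters[key](payload, charset)
    return payload


# Default registration: text/* is decoded to str using the part's charset.
register("text/*", lambda data, charset: data.decode(charset))

text = convert("text/plain", "caf\u00e9".encode("utf-8"), "utf-8")
blob = convert("image/png", b"\x89PNG")
```

An application preferring, say, an imaging library's type for image/* would call `register("image/*", ...)` with its own converter. A module-level dictionary like this one is shared by every thread.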
On the other hand, I'm not sure how the registration system should work with threads, if different threads want different registrations...

Actually, although it is not common practice to have encodings other than the RFC defined base64 and quoted-printable, a registration system for converting from #1 to #3, with appropriate defaults for base64, quoted-printable, binary, 7bit, 8bit, would be appropriate, and would provide a framework for allowing easy extensions to the encodings. Future mail RFCs may define some, but more likely, applications that wish to use email transports, where both ends are application controlled, might wish to define other encodings... the RFCs do allow for x-* encodings that are user defined. If a registration system is created for #3 to #4 encodings, the same mechanism could likely be used for the registration system for #1 to #3 encodings, so there would be added flexibility at very little cost.

> I also think that the email package probably should not implement
> "store-payloads-on-disk" by default, although it may provide some
> example implementations for simple applications (much the same way
> there's wsgiref for simple applications).

Thinking about this, I agree that storing payloads on disk should not be the default action. However, if an application wants to control its memory consumption, the receipt of a large email could negatively impact that desire. It might be appropriate to place individual MIME parts on disk, as they are parsed, if the application indicates a threshold part size and/or threshold aggregate size, beyond which parts should be placed in cache. Along with that, the temporary storage location in which to place them would have to be configured.

> Still, that's different than say, storing attachments in a file
> named by the Content-Disposition header's filename parameter. That
> latter is firmly in the domain of the application.
I again agree that this should not be the default action, but I assume that an API should be provided such that an application could tell the email package to place the content in the header's filename parameter. If such an API doesn't already exist, it seems it would be a helpful extension, and if the part was already cached on disk because of the above thresholds, the email package could possibly use rename instead of file copy to achieve the goal. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From v+python at g.nevcal.com Fri Oct 9 21:40:33 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Fri, 09 Oct 2009 12:40:33 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: <17899606-EE28-4800-A05D-95525AF90E3E@python.org> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACE6CBD.2030805@g.nevcal.com> <17899606-EE28-4800-A05D-95525AF90E3E@python.org> Message-ID: <4ACF91B1.2090303@g.nevcal.com> On approximately 10/9/2009 5:23 AM, came the following characters from the keyboard of Barry Warsaw: > On Oct 8, 2009, at 6:50 PM, Glenn Linderman wrote: > >> On approximately 10/8/2009 4:40 AM, came the following characters >> from the keyboard of Stephen J. Turnbull: >>> Glenn Linderman writes: >>> >>> > > > If conversions are avoided, then octets are unlikely to be >>> out of > > > range? >>> > > >>> > > Haven't looked in your spam bucket recently, I guess. 
>>> > > Spammers regularly put 8 bit characters into headers (and into bodies in
>>> > > messages without a Content-Type header), for one thing.
>>> >
>>> > I'm aware of that, but if conversions are not done, octets are unlikely
>>> > to be _reported_ to be out of range....
>>>
>>> Conversions will eventually be done. "Best it were done quickly."
>>>
>>
>> Disagree. Deferring the conversions defers failure issues to the
>> point where the code (hopefully) somewhat understands the type of
>> data being manipulated, and can then handle it appropriately.
>> Converting up front causes errors in things that may never be touched
>> or needed, so the error detection and handling is wasteful.
>
> I'm with Stephen here. Remember, we're saying the parser should never
> throw an exception, so any such conversion exception happens when you
> manipulate the model directly. That /has/ to error early because
> otherwise it is impossible to debug.

I suspect we are talking with different terminology somehow, here. At least it seems that way, between myself and Stephen. So let me return to ground zero, and ask some very basic questions, to see what, if anything, I am missing in my understanding of Stephen's, and perhaps your, terminology.

Let me speak in terms of parsing incoming wire-format messages, because the creation of a valid email from API calls should be straightforward.

I see the necessary job of the parser as being to receive chunks of the message, parse the headers into individual headers (based mostly on CR LF TAB detection), and find the end of the headers. Then, in order to properly handle the body, it needs to find several specific headers, or supply defaults for them if lacking. They include validation of the MIME-Version, determining the Content-Type, and Content-Transfer-Encoding. Other headers do not need to be decoded at parse time, if I understand things, just parsed into buckets (a list to preserve order, with possibly an index of some sort for performance if necessary).
The 3 headers mentioned should be fully validated and decoded, so that parsing the body can proceed. Parsing the body finds one or more MIME parts, and for each part, a list of its headers should be created. Content-Type and Content-Transfer-Encoding should again be fully validated and decoded, so that parsing the body of each part can proceed recursively. The leaf MIME parts should have their wire format data stored also. Do you agree with that minimal functionality of message parsing? If content boundaries cannot be found, then the parsing will fail, and a defect report generated for that part, and any higher-level parts that include it, because they will also be incomplete. That is just a parse-error flag, in the tree of MIME parts, AFAICT. I see the further validation and decoding of the MIME tree for the message to be all based on API calls by the application to manipulate the model, which should be able to raise exceptions as needed, and could have fully Pythonic interfaces. If the client wishes to have all headers, header values, and charset decoding validated before doing model manipulations, then it should call email package APIs that are provided to do that individually, per MIME part, or recursively over the model (and which might raise exceptions). If the client wishes to have all leaf MIME parts decoded from wire format to "raw payload" or "decoded payload", before manipulating the model, then it should call the email package APIs that are provided to do that individually, per MIME part, or recursively over the model (and which might raise exceptions). Is there any other functionality that should be performed? If so, why? It seems that Stephen is perhaps saying that the functionality in the above two paragraphs should be performed during parsing. Is that what is being said? I can hardly believe it, if so. 
Since there are multiple ways to interpret not-quite-perfect data, application guidance is required for those choices, and the creation of defect reports along the way would be a bookkeeping headache. >> So for headers, which are supposed to be ASCII, or encoded via RFC >> rules to ASCII (no 8-bit chars), then the discovery of an 8-bit char >> should be produce a defect report, but then simply converted to >> Unicode as if it were Latin-1 (since there is no other knowledge >> available that could produce a better conversion). And if the result >> of that is not expected by the client (your definition), then the >> client should either notice the defect report and reject it based on >> that, or attempt to parse it, and reject it if it encounters >> unexpected syntax. After all, this is, for that client, "raw user >> input" (albeit from a remote source) so fully error checking the >> input is appropriate. > > Sure, but I can also think of lots of other things the client might > do, including blowing away the header value and substituting their > own, doing the moral equivalent of a str.replace(), etc. etc. It's > not our job to decide. It our job to provide the highest fidelity > information we can and the best APIs for clients to do what they want. Exactly. So if the client is going to blow away the header value, no point to validate and decode it. If the client is going to send it on, the client can choose to validate before sending, or just send what was received, whether or not it was valid. This depends on the purpose and functionality of the client. >> The problem with the APIs that are spelled __str__ and __bytes__ is >> that there is no other way to return errors other than exceptions.... >> the Python way. Since the email library is trying to avoid raising >> exceptions in large blocks of its code, it is non-Pythonic (which is >> what Oleg is probably complaining about, in part). 
>> But because it needs to avoid exceptions, and is therefore non-Pythonic, it may be
>> inappropriate to spell very many of its APIs __str__ and __bytes__,
>> because that is Pythonic, and requires exceptions. Once you become
>> non-Pythonic in one area, you may have to also be non-Pythonic in
>> some other areas...
>
> As was pointed out in a previous message, we shouldn't be too
> concerned with __str__ and __bytes__ right now. We'll design
> non-magical APIs for everything and they'll do the right thing. We'll
> then alias what seems appropriate as __str__ and __bytes__ and they'll
> be as Pythonic as makes sense. When I say that, I'm thinking about
> the semantic differences Message objects currently have in their
> dict-like-plus API (which I still think makes perfect practical sense).

OK, it seems we all understand the limitations of the __str__, __bytes__, and assignment type APIs: they must either succeed, or raise exceptions. Can we agree that clients should only use such APIs when success is assured, or raising exceptions is acceptable? And that if a client complains about an exception in a case where they thought success should have been assured, it is not a bug if they misunderstood?

Clearly the email package should document the conditions for which success can be assured, if there are any... and that it is fair game to raise exceptions if those conditions are not met.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From v+python at g.nevcal.com Fri Oct 9 22:26:19 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Fri, 09 Oct 2009 13:26:19 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: <87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACE6CBD.2030805@g.nevcal.com> <87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACED79F.6050602@g.nevcal.com> <87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4ACF9C6B.4020508@g.nevcal.com> On approximately 10/9/2009 8:10 AM, came the following characters from the keyboard of Stephen J. Turnbull: > Glenn Linderman writes: > > > Emacs is different than email. Either you can read a file to edit it, > > or you can't. > > *sigh* Emacs is as powerful a programming environment as Python, and > applications regularly deal with network streams (HTTP, NNTP, and SMTP > most commonly, but also raw X protocol and any kind of socket > supported by the platform). So, yes, it's different from email, > because it's *far* more general. That's precisely why I appreciate > Bill's concerns about non-email usage. > OK, yes, Emacs is an operating system. I am an Emacs user. And yes, I know Emacs can read email (I used it to read and write email, but found it seriously lacking for the way I handle email, and annoying that the email buffers and edit buffers were all in the same buffer pool, and I quit using it for email). And I know it can be programmed, and I've done a little of that, but I hate Lisp, so I mostly Google for the packages that do what I need, and don't try to create my own. 
> > The Postel principle for email says to try to do the best you can, > > for as much as you can. > > Actually, it doesn't. It says be lenient in what you accept, strict > in what you emit. You accept it ... but you don't have to do > anything with it except preserve it verbatim for whoever wants it. > Yes, that is what it says, I agree. But unless you do the best you can, for as much as you can, no one is going to want it, so they are basically the same. > > > > produce a defect report, but then simply converted to Unicode as if it > > > > were Latin-1 (since there is no other knowledge available that could > > > > produce a better conversion). > > > > > > No, that is already corruption. Most clients will assume that string > > > is valid as a header, because it's valid as a string. > > > > Sure it is corruption. That's why there is a defect report. But > > the conversion technique is appropriate, per the Postel principle. > > Actually, I would say you are emitting leniently, in violation of the > Postel principle. You can say that, but I don't have to believe it. I'm talking about accepting; the message has arrived, it is here, the client is trying to look at it, and I'm talking about ways the client can look at not-quite-perfect data, knowing that it is not quite perfect, but still being able to see it. I'm not at all talking about emitting data. You seem to be calling the email package helping the client to accept not-quite-perfect data, as a form of emitting data. It is not. > You don't know what the client will do, they may > eat it in a single gulp without looking at it. Thus you should avoid > converting anything that you don't know what it is (unless > specifically asked to do your best). > The email package cannot police the client... if it chooses to "eat it in a single gulp without looking at it" then it may get indigestion. 
I never suggested that "converting to Unicode as if it were Latin-1" should be done without informing the client, or being requested by the client to do that via a special API call... I was only talking about an appropriate method of doing conversions in the presence of not-quite-perfect data input, so that the client, and possibly even a human, can try to make some sense out of the not-quite-perfect data.

> > Again, I mentioned producing a defect report. That is not passing
> > an error silently.
>
> But if I access that Unicode object without looking at the defect
> report, you *will* pass the error silently. OTOH, if I look at the
> defect report, I won't access the Unicode object.
>

If those are the only two choices you see, then you are not doing your whole job. If you ignore defect reports, you are ignorant (blunt, but not intended to be offensive). If you treat all defect reports as fatal errors, then you are not being lenient in what you accept (non-Postel).

> > It is still raw user input, and should still be checked for proper
> > syntax by the client,
>
> Nonsense. The email module had better know a lot more about syntax
> than the client. If it doesn't, whack it with a 2x4 until it learns!
>

I think we are talking at cross purposes here. I find it quite difficult to follow where you cross the boundary between talking about one sort of email package client, and then switch to another type, or switch to the responsibilities of the email package.

A client which is an MUA is just going to present the best possible data to a human user, and is done. A client which is an email archiver preserves the data for presenting via other MUAs. An application which is using email as a transport has specific goals, which require specific content. You were mentioning clients. It is this sort of client I thought you were talking about, and to which I responded. If such a client doesn't validate the syntax of that content, it isn't much of an application.
The email module does not, and cannot, understand the application domain; it can only validate that the message has proper (or improper) structure. The transported content is fully the responsibility of the application to validate, parse, and manipulate. The email module may detect if the transport causes garbling in the structure of the message, and may be able to warn the application about such garbling. But that may not prevent the application from finding its content within even a garbled email, and so it may still be able to validate, parse, and manipulate that content. Such clients may transfer content either in headers or in MIME parts... in any case, whatever client specific content is expected in those headers or MIME parts should be validated by the client.

> > produces no defect report. If you don't want to check proper syntax in
> > your program inputs, I don't want to use your programs, they will be
> > insecure.
>
> So you're saying that every program that uses the email module should
> reproduce 100% of the functionality of the email module's parser, or
> it's insecure. And you imply that's an excuse for passing corrupt
> data to any client that asks for it.
>
> I disagree.
>

I'm glad you disagree with what you thought I was saying, because that isn't what I was saying, and I also disagree with your paraphrase of what I was saying. The email package should parse email. Where it finds not-quite-perfect data, the client may get involved to choose a path for interpreting the not-quite-perfect data... or to reject the not-quite-perfect data. Once the data from the email is discovered, then the client must operate on the data. An MUA would simply display it to a human. Other clients would attempt to interpret the content. The interpretation of the content requires the client to parse, validate the syntax of, and manipulate the content. An example would be a program that does appointments via email.
If it finds an appointment in a known format, it enters it into the calendar. The email package knows nothing about appointments or calendars (of the sort that hold appointments). It cannot help, only the client can do that part of the job.

> > So there seem to be two techniques:
>
> Whatever gave you that idea?
>

I'm not sure what you are asking here.

> > 2) Store the data, and convert only if the data is accessed.
>
> > With technique 2, little effort is required to store the data,
> > create a state variable to indicate whether it has been converted
>
> Why do that? It's always "False" in technique 2.
>

The first time it is always false. Subsequent requests can leverage the work done by the first request, if results were created and cached.

> > and parsed, or not, and then IF (and only IF) the data is accessed,
> > the conversion and parsing must be done on the first access, and
> > instead of creating and storing metainformation about the errors,
> > they could just be raised.
>
> No, they cannot just be raised. If you just raise the error, then the
> next time you try to access unparsed data, you'll hit the error
> again. If you use the same handler you did before, you're in an
> infloop. So you need a second handler to do things differently this
> time or a flag ... but it's unclear to me that that flag can be a
> boolean. So you may as well store the defect list and information
> about where to restart.
>

From the point of view of the email package, the errors can just be raised. Then the client can make choices, and use other APIs or other parameters to the API to direct the email package to attempt a different technique to access the data. If the technique is successful, then progress is made. If unsuccessful, another error is raised by the different technique. If there are more techniques, repeat.
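A minimal sketch of that flow, seen from the client's side. Nothing here is email package API: `LazyHeader`, its `text()` method, and `first_successful` are hypothetical names used only for illustration, and "technique" is reduced to "which charset to try". The value is decoded on first access, the result is cached, and a failed technique simply raises so the client can pick the next one.

```python
class LazyHeader:
    """Hypothetical holder for one raw header value (not an email-package class)."""

    def __init__(self, raw):
        self.raw = raw      # bytes exactly as received, never modified
        self._cache = {}    # charset -> decoded text, filled on first access

    def text(self, charset="ascii"):
        # Decode lazily; a decoding error propagates to the caller.
        if charset not in self._cache:
            self._cache[charset] = self.raw.decode(charset)
        return self._cache[charset]


def first_successful(header, charsets):
    """Try each charset once, in order; return (text, charset) or (None, None)."""
    for charset in charsets:
        try:
            return header.text(charset), charset
        except (UnicodeDecodeError, LookupError):
            continue        # this technique failed, move on to the next one
    return None, None       # out of techniques; the caller has to remember that


header = LazyHeader("Grüße".encode("utf-8"))
text, used = first_successful(header, ["ascii", "utf-8"])
```

Because the list of techniques is finite and each is tried once, the loop terminates whether or not any of them succeeds.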
When out of techniques, and no success, then the client needs to remember (possibly with the help of APIs of the email package) that it cannot interpret this data in a useful manner. If it then continues to attempt to access the data using failed techniques, and goes into an infinite loop, then the client has a bug. > > So the Pythonic way, AFAIU, is that errors are returned out-of-band > > via raised exceptions. > > Sure. But what you're missing is that "Neither rain, nor snow, nor > dark of night may stop the Parser on her appointed rounds." I haven't forgotten that, but clearly we haven't been communicating effectively. That may be partly my fault, partly because I'm relatively new to Python and to the email package (having only experimented with it using Python 2.6, not coded inside it, to date), but I'm trying... I'm hoping to write some email processing programs using the Python email package, and so I do have a strong interest in this topic. I'm hoping I don't have to start from scratch and write my own email package, because Python's isn't functional enough, or doesn't perform well enough. Being new to Python, I've chosen to focus on building my applications with Python 3, understanding that there are fewer fully functional pieces in that arena to date, and since email is one that has some rough edges because of the Unicode strings, I'm trying to participate where I can. > It is not > easy to write parsers, but I'll tell you one thing: it's orders of > magnitude harder to write a parser that starts in the middle and works > outward, than one that starts at the beginning and works forward to > the end. > Yes, I have learned that in my 34 years of programming. I agree. > So it's OK to write a lazy parser, but it must retain enough state so > that it can work forward until the end. 
Because you don't know that > the client will not request the last character of the message, you > need to be able to try to get it, no matter what happened to the first > 10GB of the message. And if an exception occurs, it must be handled > by the parser itself; if not, you put the poor thing in the position > of starting over at the beginning (that way lies the madness of > infloops), or trying to start a parse in the middle and work out. > Are you speaking about parsing the message into MIME parts, or parsing a particular MIME part contained within the message, or both? -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From v+python at g.nevcal.com Fri Oct 9 22:43:59 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Fri, 09 Oct 2009 13:43:59 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACEF66B.3000500@is.kochi-u.ac.jp> References: <4ACEABFD.6010309@g.nevcal.com> <4ACEB234.9030309@is.kochi-u.ac.jp> <4ACED8C4.5070906@g.nevcal.com> <4ACEF66B.3000500@is.kochi-u.ac.jp> Message-ID: <4ACFA08F.9080307@g.nevcal.com> On approximately 10/9/2009 1:38 AM, came the following characters from the keyboard of Tokio Kikuchi: > Glenn Linderman wrote: > >> On approximately 10/8/2009 8:47 PM, came the following characters from >> the keyboard of Tokio Kikuchi: >> >>>>> Actually, as long as the prepended text is ASCII, all that work can be >>>>> done on the encoded value. When it is not ASCII, it may still be >>>>> separated and recognizable. Still that logic is more complex than >>>>> decoding, handling as Unicode, and encoding.... when it works. Just >>>>> pointing out that there is more than one way to do things... >>>>> >>> Oh, really? >>> >>> Base64 is 3 to 4 octets encoding and there is no way to prepend padding. >>> >>> >> In header values, encoding is done using encoded-words. 
A header value >> consists of a sequence of ASCII words, and encoded-words. While an >> encoded word, that uses base64 encoding cannot easily be adjusted to >> prepend data into that encoded-word, additional ASCII or encoded-words >> can be prepended in front of the other ASCII or encoded words within the >> header-value. >> >> So, yes, really! >> >> > Following two lines have equivalent header contents: > > Re: [mmjp-users 123] =?iso-2022-jp?b?GyRCRnxLXDhsGyhC?= > Re: =?iso-2022-jp?b?W21tanAtdXNlcnMgMTIzXSAbJEJGfEtcOGwbKEI=?= > > I'd like to see how you can extract ascii part without touching rest of > the encoded word in the second example. > I can't, and I didn't say I could. > What we do in mailman is that both are treated equally and delete > [mmjp-users 123] from the subject and prefix again by [mmjp-users 124] > (with new sequential number). Some MUA encode subjects like the second > example and this is beyond our control. Therefore, we are forced to > decode the whole part of header content. > Yes, if the MUA has created the second encoding, decoding is required in order to replace the header prefix. If the MUA has created the first encoding, then decoding would not be required in order to replace the header prefix, but the logic to detect which case and handle them separately, results in more complexity in the application. What I said, was that prefixing a header value with additional text didn't require decoding, and that is true. What you are saying, is that you want to do more than prefix a header value with additional text. What you are saying is that you would rather choose to keep the application logic simple, by assuming or requiring that the existing header value is able to be decoded. If that is sufficient for your application, it is a reasonable choice. What do you do with messages for which the header you wish to modify cannot be decoded? 
Some options would be:

1) bounce the message
2) discard the message
3) determine if the header value is partially able to be decoded, and if the part that can be decoded contains the data you wish to modify, modify it, and simply preserve and pass through the parts that could not be decoded.
4) if the header value cannot be decoded at all, or the parts that can be decoded do not contain the data you wish to modify, then you could possibly choose to simply prefix information into the header in that case, again preserving and passing through the parts that could not be decoded (or, in this case, the whole value).

Perhaps you can think of other alternatives besides these; feel free to suggest some. Naturally, doing options 3 or 4 above requires more complex logic for the application than options 1 or 2. The requirements of your application should determine the types of choices you make. For example, if a new or non-standard charset appears, an application that requires the ability to decode the header, but hasn't been updated to understand the charset, will fail to decode it. Yet, if it has logic like 3 or 4, it may be more successful, and would be a more robust application.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

From tkikuchi at is.kochi-u.ac.jp Sat Oct 10 00:08:22 2009
From: tkikuchi at is.kochi-u.ac.jp (Tokio Kikuchi)
Date: Sat, 10 Oct 2009 07:08:22 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACFA08F.9080307@g.nevcal.com>
References: <4ACEABFD.6010309@g.nevcal.com> <4ACEB234.9030309@is.kochi-u.ac.jp> <4ACED8C4.5070906@g.nevcal.com> <4ACEF66B.3000500@is.kochi-u.ac.jp> <4ACFA08F.9080307@g.nevcal.com>
Message-ID: <4ACFB456.6010106@is.kochi-u.ac.jp>

What you said in message-id: <4ACE6F97.6010605 at g.nevcal.com> was:

> When it is not ASCII, it may still be
> separated and recognizable.
and our discussion might be concluded that this is true 'not really, but only theoretically.' Your suggestions 1)-4) are not acceptable to Japanese users at all.

I couldn't resist writing because the discussion was important in designing Mailman's subject prefixing and numbering. I'll shut up my mouth again because I'm so busy.

Sorry for disturbing,
--
???? tkikuchi at is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/
?780-8520 ?????????????

From rdmurray at bitdance.com Sat Oct 10 01:20:54 2009
From: rdmurray at bitdance.com (R. David Murray)
Date: Fri, 9 Oct 2009 19:20:54 -0400 (EDT)
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACF9C6B.4020508@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACE6CBD.2030805@g.nevcal.com> <87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACED79F.6050602@g.nevcal.com> <87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACF9C6B.4020508@g.nevcal.com>
Message-ID:

On Fri, 9 Oct 2009 at 13:26, Glenn Linderman wrote:
> On approximately 10/9/2009 8:10 AM, came the following characters from the
> keyboard of Stephen J. Turnbull:
>> Glenn Linderman writes:
>> > > > produce a defect report, but then simply converted to Unicode as if
>> > > > it were Latin-1 (since there is no other knowledge available that
>> > > > could produce a better conversion).
>> > >
>> > > No, that is already corruption. Most clients will assume that string
>> > > is valid as a header, because it's valid as a string.
>> >
>> > Sure it is corruption. That's why there is a defect report. But
>> > the conversion technique is appropriate, per the Postel principle.
>>
>> Actually, I would say you are emitting leniently, in violation of the
>> Postel principle.
>
> You can say that, but I don't have to believe it. I'm talking about
> accepting; the message has arrived, it is here, the client is trying to look
> at it, and I'm talking about ways the client can look at not-quite-perfect
> data, knowing that it is not quite perfect, but still being able to see it.
> I'm not at all talking about emitting data. You seem to be calling the email
> package helping the client to accept not-quite-perfect data, as a form of
> emitting data. It is not.

IMO, the appropriate way for the email package to provide the API you are talking about is to provide the client with a way to get at the raw byte string, which I think everyone agrees on. If the client wants to decode it as if it were latin-1 to process it, it can then do that.

--David (RDM)

From v+python at g.nevcal.com Sat Oct 10 02:54:20 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Fri, 09 Oct 2009 17:54:20 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To:
References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACE6CBD.2030805@g.nevcal.com> <87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACED79F.6050602@g.nevcal.com> <87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACF9C6B.4020508@g.nevcal.com>
Message-ID: <4ACFDB3C.5040307@g.nevcal.com>

On approximately 10/9/2009 4:20 PM, came the following characters from the keyboard of R. David Murray:
> On Fri, 9 Oct 2009 at 13:26, Glenn Linderman wrote:
>> On approximately 10/9/2009 8:10 AM, came the following characters
>> from the keyboard of Stephen J.
Turnbull: >>> Glenn Linderman writes: >>> > > > produce a defect report, but then simply converted to Unicode >>> as if > > > it were Latin-1 (since there is no other knowledge >>> available that > > > could produce a better conversion). >>> > > > > No, that is already corruption. Most clients will assume >>> that string >>> > > is valid as a header, because it's valid as a string. >>> > > Sure it is corruption. That's why there is a defect report. But >>> > the conversion technique is appropriate, per the Postel principle. >>> >>> Actually, I would say you are emitting leniently, in violation of the >>> Postel principle. >> >> You can say that, but I don't have to believe it. I'm talking about >> accepting; the message has arrived, it is here, the client is trying >> to look at it, and I'm talking about ways the client can look at >> not-quite-perfect data, knowing that it is not quite perfect, but >> still being able to see it. I'm not at all talking about emitting >> data. You seem to be calling the email package helping the client to >> accept not-quite-perfect data, as a form of emitting data. It is not. > > IMO, the appropriate way for the email package to provide the API you > are talking about is it provide the client with a way to get at the raw > byte string, which I think everyone agrees on. If the client wants to > decode it as if it were latin-1 to process it, it can then do that. That certainly works, but it isn't very helpful... that forces the client application to reproduce the logic to parse the header value and decode the parts that can be decoded successfully, and that is exactly the sort of thing Stephen was complaining about when he thought I was suggesting that to be a requirement (but he was confused about what I was suggesting). -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. 
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From v+python at g.nevcal.com Sat Oct 10 03:12:38 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Fri, 09 Oct 2009 18:12:38 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACFB456.6010106@is.kochi-u.ac.jp> References: <4ACEABFD.6010309@g.nevcal.com> <4ACEB234.9030309@is.kochi-u.ac.jp> <4ACED8C4.5070906@g.nevcal.com> <4ACEF66B.3000500@is.kochi-u.ac.jp> <4ACFA08F.9080307@g.nevcal.com> <4ACFB456.6010106@is.kochi-u.ac.jp> Message-ID: <4ACFDF86.8040104@g.nevcal.com> On approximately 10/9/2009 3:08 PM, came the following characters from the keyboard of Tokio Kikuchi: > What you said in message-id: <4ACE6F97.6010605 at g.nevcal.com> was: > >> When it is not ASCII, it may still be >> separated and recognizable. >> > and our discussion might be concluded that this is true 'not really, but > only theoretically.' Your suggestions 1)-4) are not accesptable to > Japanese users at all. > There's something I don't understand here, and I'll hope you'll take a few moments to explain... If a message with an encoded header arrives (like your number 2 sample) but it cannot be decoded, what action _is_ acceptable to Japanese users? And what action is implemented in Mailman (if different)? I can think of a 5th technique... don't modify the header, and send it through unchanged. Now I think I've covered the gamut of possibilities, so if there is one I've missed, I'm extremely interested to learn (or be reminded) of it. What I meant by "may still be separated and recognizable", is, in fact, somewhat theoretical. Since I can't type Japanese, I'll just use a single accented non-ASCII character in my explanation, but here goes: Message A arrives at Mailman for distribution. No subject prefix or numbering. Since it is Mailman doing it, Mailman could notice that the prefix is like [abcd?fg 126] and must be encoded. 
Mailman could encode the prefix as a separate encoded word than the rest of the subject value. Let's assume that it does. The rest cannot be guaranteed, because we have no control over the MUA of the person that replies. But it might come back in the same manner... one encoded word with the prefix and then the rest of the subject line, possibly encoded, possibly not. If it does, then if the first encoded word can be decoded, and the prefix and numbering recognized, then the modification to assign a new number can be done, whether or not the remaining part of the subject line can be decoded or not. So that is what I meant, by the above. It isn't a guarantee in any manner. It could realistically happen, though, if an MUA simply adds "Re: " to the front of the stuff that it is passed (or an encoded word with an appropriate translation for "Re: "). MUAs or mailing list handlers that decode, process, and reencode, will probably not produce headers with that pattern, but more likely like the one you showed in example 2. MUAs or mailing list handlers that attempt to retain what was sent (idempotency or invertibility), would be more likely to do what I describe, and are more robust when faced with new character sets that they don't understand how to decode. > I couldn't resist writing because the discussion was important in > designing Mailman's subject prefixing and numbering. I'll shut up my > mouth again because I'm so busy. > > Sorry for disturbing Thanks for your contribution; I hope for one more, at least. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From rdmurray at bitdance.com Sat Oct 10 03:25:56 2009 From: rdmurray at bitdance.com (R. 
David Murray) Date: Fri, 9 Oct 2009 21:25:56 -0400 (EDT) Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACFDB3C.5040307@g.nevcal.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACE6CBD.2030805@g.nevcal.com> <87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACED79F.6050602@g.nevcal.com> <87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACF9C6B.4020508@g.nevcal.com> <4ACFDB3C.5040307@g.nevcal.com> Message-ID: On Fri, 9 Oct 2009 at 17:54, Glenn Linderman wrote: > On approximately 10/9/2009 4:20 PM, came the following characters from the > keyboard of R. David Murray: >> On Fri, 9 Oct 2009 at 13:26, Glenn Linderman wrote: >> > On approximately 10/9/2009 8:10 AM, came the following characters from >> > the keyboard of Stephen J. Turnbull: >> > > Glenn Linderman writes: >> > > > > > produce a defect report, but then simply converted to Unicode >> > > as if > > > it were Latin-1 (since there is no other knowledge >> > > available that > > > could produce a better conversion). >> > > > > > > No, that is already corruption. Most clients will assume >> > > that string >> > > > > is valid as a header, because it's valid as a string. >> > > > > Sure it is corruption. That's why there is a defect report. But >> > > > the conversion technique is appropriate, per the Postel principle. >> > > >> > > Actually, I would say you are emitting leniently, in violation of the >> > > Postel principle. >> > >> > You can say that, but I don't have to believe it. 
I'm talking about >> > accepting; the message has arrived, it is here, the client is trying to >> > look at it, and I'm talking about ways the client can look at >> > not-quite-perfect data, knowing that it is not quite perfect, but still >> > being able to see it. I'm not at all talking about emitting data. You >> > seem to be calling the email package helping the client to accept >> > not-quite-perfect data, as a form of emitting data. It is not. >> >> IMO, the appropriate way for the email package to provide the API you >> are talking about is it provide the client with a way to get at the raw >> byte string, which I think everyone agrees on. If the client wants to >> decode it as if it were latin-1 to process it, it can then do that. > > That certainly works, but it isn't very helpful... that forces the client > application to reproduce the logic to parse the header value and decode the > parts that can be decoded successfully, and that is exactly the sort of thing > Stephen was complaining about when he thought I was suggesting that to be a > requirement (but he was confused about what I was suggesting). I wasn't clear, sorry :). The current API has a 'decode_header' function, which doesn't do the byte-to-unicode decode (yeah, there's another naming problem here...we have two types of decoding and only one word for both) but instead returns (bytes, charset) tuples. This piece of the API is broken in python3, and I don't think it is the right API going forward, but that _kind_ of API is what I meant by 'getting at the raw byte string': the byte string that failed the bytes-to-unicode decoding, not the entire header (though there will also be a way to get that if you need it, I presume.) 
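For illustration only, here is roughly what that kind of API looks like with `email.header.decode_header` as it behaves in later Python 3 releases (an assumption about the reader's version, not the py3k state discussed in this thread). For a value that contains encoded-words it returns the undecoded bytes of each chunk together with its charset, leaving the bytes-to-unicode step to the caller. The two sample subjects are the ones Tokio posted earlier in the thread; `to_text` is a hypothetical helper.

```python
from email.header import decode_header

SUBJECT_1 = "Re: [mmjp-users 123] =?iso-2022-jp?b?GyRCRnxLXDhsGyhC?="
SUBJECT_2 = "Re: =?iso-2022-jp?b?W21tanAtdXNlcnMgMTIzXSAbJEJGfEtcOGwbKEI=?="


def to_text(header_value):
    """Join the chunks from decode_header() into one str (hypothetical helper)."""
    parts = []
    for chunk, charset in decode_header(header_value):
        if isinstance(chunk, bytes):
            # Chunks with no charset are the plain ASCII runs between encoded-words.
            chunk = chunk.decode(charset or "ascii")
        parts.append(chunk)
    return "".join(parts)


for subject in (SUBJECT_1, SUBJECT_2):
    print(decode_header(subject))   # the (chunk, charset) pairs
    print(to_text(subject))         # the visible text, identical for both subjects
```

A chunk whose charset is unknown or whose bytes do not decode would raise in `to_text`; that is the point at which a client could fall back to the raw bytes of just that chunk.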
--David (RDM) From v+python at g.nevcal.com Sat Oct 10 05:46:23 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Fri, 09 Oct 2009 20:46:23 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACE6CBD.2030805@g.nevcal.com> <87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACED79F.6050602@g.nevcal.com> <87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACF9C6B.4020508@g.nevcal.com> <4ACFDB3C.5040307@g.nevcal.com> Message-ID: <4AD0038F.4000705@g.nevcal.com> On approximately 10/9/2009 6:25 PM, came the following characters from the keyboard of R. David Murray: > On Fri, 9 Oct 2009 at 17:54, Glenn Linderman wrote: >> On approximately 10/9/2009 4:20 PM, came the following characters >> from the keyboard of R. David Murray: >>> On Fri, 9 Oct 2009 at 13:26, Glenn Linderman wrote: >>> > On approximately 10/9/2009 8:10 AM, came the following characters >>> from > the keyboard of Stephen J. Turnbull: >>> > > Glenn Linderman writes: >>> > > > > > produce a defect report, but then simply converted to >>> Unicode > > as if > > > it were Latin-1 (since there is no other >>> knowledge > > available that > > > could produce a better >>> conversion). >>> > > > > > > No, that is already corruption. Most clients will >>> assume > > that string >>> > > > > is valid as a header, because it's valid as a string. >>> > > > > Sure it is corruption. That's why there is a defect >>> report. But >>> > > > the conversion technique is appropriate, per the Postel >>> principle. >>> > > > > Actually, I would say you are emitting leniently, in >>> violation of the >>> > > Postel principle. 
> > You can say that, but I don't have to >>> believe it. I'm talking about > accepting; the message has >>> arrived, it is here, the client is trying to > look at it, and I'm >>> talking about ways the client can look at > not-quite-perfect data, >>> knowing that it is not quite perfect, but still > being able to see >>> it. I'm not at all talking about emitting data. You > seem to be >>> calling the email package helping the client to accept > >>> not-quite-perfect data, as a form of emitting data. It is not. >>> >>> IMO, the appropriate way for the email package to provide the API you >>> are talking about is it provide the client with a way to get at the >>> raw >>> byte string, which I think everyone agrees on. If the client wants to >>> decode it as if it were latin-1 to process it, it can then do that. >> >> That certainly works, but it isn't very helpful... that forces the >> client application to reproduce the logic to parse the header value >> and decode the parts that can be decoded successfully, and that is >> exactly the sort of thing Stephen was complaining about when he >> thought I was suggesting that to be a requirement (but he was >> confused about what I was suggesting). > > I wasn't clear, sorry :). The current API has a 'decode_header' > function, > which doesn't do the byte-to-unicode decode (yeah, there's another naming > problem here...we have two types of decoding and only one word for both) > but instead returns (bytes, charset) tuples. This piece of the API is > broken in python3, and I don't think it is the right API going forward, > but that _kind_ of API is what I meant by 'getting at the raw byte > string': the byte string that failed the bytes-to-unicode decoding, > not the entire header (though there will also be a way to get that if > you need it, I presume.) Yeah, that'd be better. 
Of course, when returning Unicode strings, there would be no particular need to identify the various charsets in which the header was transmitted, except for invertibility and error handling, unless the client wanted to track that for some reason. If the goal is to preserve invertibility, then maybe tuples like (str, charset, defect) would be better.... where defect would be None for good data, but if defect were "non-ASCII", then you'd know the str was converted as if it were charset [Latin-1 in my book, but if email package had rules or the API had parameters for how to deal with non-ASCII stuff, some other charset could be specified, perhaps, but if that fails it might still have to fall back to Latin-1]; if defect were "ASCII", then you'd know that the str looked like an encoded word, but couldn't be decoded because the charset wasn't recognized, or the decoding via that charset failed, so the encoded word was supplied. Correspondingly, a header value could be set by supplying such a list, even with defect values as described above, to permit invertibility, and passing on what was obtained, so that if there are overriding local conventions (yep, such things used to be used, and maybe still are in some areas), that the data would be preserved as best as possible, and so that the email package could support creation of messages according to the local conventions. I'd hope that a separate tuple would be used for each encoded-word, or, if charset ASCII and defect None, then it would describe a run of ASCII between encoded words. Yes, an encoded word can be encoded in ASCII for rare use (if the input word looks like an encoded word), so that would cause a sequence of charset ASCII, defect None tuples, but otherwise a plain ASCII header value would have a single entry in the list of tuples. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. 
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

From stephen at xemacs.org Sat Oct 10 15:59:03 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sat, 10 Oct 2009 22:59:03 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACF9C6B.4020508@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACE6CBD.2030805@g.nevcal.com> <87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACED79F.6050602@g.nevcal.com> <87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACF9C6B.4020508@g.nevcal.com>
Message-ID: <87iqenl594.fsf@uwakimon.sk.tsukuba.ac.jp>

I'm running out of time to work on this (yeah, I know it's the weekend, but my life is like that lately). I think we're converging, though, so I'd like to try and tie some of those ends together.

Glenn Linderman writes:
> On approximately 10/9/2009 8:10 AM, came the following characters from
> the keyboard of Stephen J. Turnbull:

> > Actually, I would say you are emitting leniently, in violation of the
> > Postel principle.
>
> You can say that, but I don't have to believe it. I'm talking about
> accepting; the message has arrived, it is here, the client is trying to
> look at it, and I'm talking about ways the client can look at
> not-quite-perfect data, knowing that it is not quite perfect, but still
> being able to see it. I'm not at all talking about emitting data.

It would be indeed, if the corrupt data is stored in the place where correctly decoded data normally is stored, and is accessible in the same way. But I gather that's not what you were talking about, my mistake.
> You seem to be calling the email package helping the client to > accept not-quite-perfect data, as a form of emitting data. It is > not. No, I was confused by the way you wrote. Saving the data *somewhere* is absolutely necessary; not losing data is the #1 commandment of low-level mail processing. Surely the email module is subject to that commandment. *Nobody* is talking about losing any data yet, except Barry indirectly when he says that some people think about giving up on invertibility (often called "idempotency"), and even he is quite adamant that he's not going to give up on that. So when you wrote about saving and converting to text form, without mentioning the specific APIs, I assumed you meant the "mainline" APIs for parsing and accessing parts of a correctly formatted message. > The email package cannot police the client... if it chooses to "eat it > in a single gulp without looking at it" then it may get indigestion. I > never suggested that "converting to Unicode as if it were Latin-1" > should be done without informing the client, or being requested by the > client to do that via a special API call... Well, maybe I misread it, but it certainly looked like that to me. I would not object to that special API call defaulting to ISO 8859/1. > If you ignore defect reports, you are ignorant (blunt, but not intended > to be offensive). What I worried about is that if defect reports are present, *but displayable data is also present*, programmers *will* simply display it, for example in producing a prototype program. It will be impossible to determine without very close analysis of that program that an early version became a production version without adding appropriate checks. In practice, this bug will be discovered when some end user's installation breaks. It seems that you agree with this, and because the special API call is necessary, it will be easy to identify whether proper care is being taken or not. Right?
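For illustration only, here is a minimal sketch of the situation described above, using the standard library email package as it exists today: a broken message is parsed, defects are recorded on the message object, and the payload is nonetheless readable through the ordinary accessor. The helper `payload_if_clean` is a hypothetical client-side guard, not something the package provides, and the sample message is invented.

```python
import email
from email import errors

# A deliberately broken message: it claims to be multipart but has no boundary.
RAW = (
    "From: alice@example.com\n"
    "Content-Type: multipart/mixed\n"
    "\n"
    "some body text\n"
)

msg = email.message_from_string(RAW)

# The parser records problems on the message object instead of raising.
has_boundary_defect = any(
    isinstance(d, errors.NoBoundaryInMultipartDefect) for d in msg.defects
)

# The payload is still available through the normal accessor, which is the
# case where a careless client could display it without checking anything.
body = msg.get_payload()


def payload_if_clean(message):
    """Hypothetical guard: refuse to return data from a message with defects."""
    if message.defects:
        raise ValueError("message has defects: %r" % (message.defects,))
    return message.get_payload()
```

A client that always goes through a guard of this kind makes the "ignore the defect list" mistake visible in code review, which is the property argued for above.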
> > > It is still raw user input, and should still be checked for proper > > > syntax by the client, > > > > Nonsense. The email module had better know a lot more about syntax > > than the client. If it doesn't, whack it with a 2x4 until it learns! > > I think we are talking at cross purposes here. I find it quite > difficult to follow where you cross the boundary between talking about > one sort of email package client, and then switch to another type, or > switch to the responsibilities of the email package. Excuse me? The "raw user input" you referred to above is material that the client software receives from the email package. The email package should give it to the client in the "normal" (convenient) way only if it can certify that it conforms to the appropriate standard. That standard should be specified in the API documentation. Any more detailed structure, of course, is the responsibility of the client. > An application which is using email as a transport, has specific goals, > which require specific content. You were mentioning clients. I've already said that when I speak of an MUA, I write "MUA". In speaking of the calling program, which might even be a user running the module via the Python interpreter, I write "client". It's a very convenient way to describe the user of an API, in contrast to the provider of the API (the implementation). > If such a client doesn't validate the syntax of that content, it > isn't much of an application. If that MUA or email application uses RFC 822 addresses, it should be able to rely on the email module to parse those addresses correctly, or provide a defect report. One might even go so far as to suggest that it be able to parse the (non-RFC, but very common) "+" notation for separating the "mailbox" from "additional data" used for VERP and challenge-response applications. That would have to be documented, but if so documented client applications like the MUA should be able to rely on it (and you can bet many will). 
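As a rough sketch of the "+" notation mentioned above: `email.utils.parseaddr` is a real standard-library function, while `split_plus_address` is a hypothetical helper written for this illustration. Treating the text after the first "+" as additional data is a local convention, not something the RFCs define, and the addresses are made up.

```python
from email.utils import parseaddr


def split_plus_address(header_value):
    """Return (mailbox, extra, domain) for a single address.

    'extra' is None when the local part contains no '+'.
    Raises ValueError when no usable address can be found.
    """
    _display_name, addr_spec = parseaddr(header_value)
    if "@" not in addr_spec:
        raise ValueError("no address found in %r" % (header_value,))
    # Split on the last '@' so the domain is everything after it.
    local_part, domain = addr_spec.rsplit("@", 1)
    # Split on the first '+': mailbox before it, additional data after it.
    mailbox, plus, extra = local_part.partition("+")
    return mailbox, (extra if plus else None), domain


example = split_plus_address("Announce List <announce+bounce-42@example.org>")
```

Here the ValueError stands in for the defect report; a real API would presumably record a defect object rather than raise.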
Application domain syntax of course is not the email module's problem whether it arrives by email or Pony Express, and I'm really confused why you're going so far afield. > > No, they cannot just be raised. If you just raise the error, then the > > next time you try to access unparsed data, you'll hit the error > > again. If you use the same handler you did before, you're in an > > infloop. So you need a second handler to do things differently this > > time or a flag ... but it's unclear to me that that flag can be a > > boolean. So you may as well store the defect list and information > > about where to restart. > > From the point of view of the email package, the errors can just be > raised. Then the client can make choices, and use other APIs or other > parameters to the API to direct the email package to attempt a different > technique to access the data. The problem is that by this point some of the state of the parse may be lost. We can't say "just raise", we need to say "interrupt the parse, preserve state, and then raise". Python does absolutely nothing to help with the problem of preserving the state. We also need to determine just what state to preserve. > Yes, I have learned that in my 34 years of programming. I agree. > > > So it's OK to write a lazy parser, but it must retain enough state so > > that it can work forward until the end. [...] > > Are you speaking about parsing the message into MIME parts, or parsing a > particular MIME part contained within the message, or both? Both. I *believe* (but it needs to be checked) that in a correctly formed multipart MIME object (message or part), any internal structure is context-free within the MIME boundaries. If that is so, then individual parts of the object can be stored in raw form and parsed lazily. 
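The idea of storing a part in raw form and parsing it lazily can be sketched as follows. `LazyPart` is a hypothetical class invented for this illustration; only `email.message_from_bytes` is an existing standard-library call. The sketch assumes the raw bytes of a part have already been isolated between MIME boundaries.

```python
import email


class LazyPart:
    """Hypothetical wrapper: keep the raw bytes, parse only on first access."""

    def __init__(self, raw):
        self.raw = raw        # always retained, so no data is ever lost
        self._parsed = None   # filled in on demand

    @property
    def is_parsed(self):
        return self._parsed is not None

    @property
    def parsed(self):
        # Parse at most once; later accesses reuse the cached result.
        if self._parsed is None:
            self._parsed = email.message_from_bytes(self.raw)
        return self._parsed


part = LazyPart(b"Content-Type: text/plain\n\nhello\n")
before = part.is_parsed
content_type = part.parsed.get_content_type()
after = part.is_parsed
```

Because the raw bytes stay attached to the object, the wire format can still be reproduced exactly whether or not the part was ever parsed.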
Similarly, for any MIME or RFC 822 object, the object can be parsed into header section and body section, and each can be stored and parsed lazily, subject to the condition that the header section must be sufficiently parsed to identify all headers that might affect parsing the body part before the body part is parsed. That "condition" is the context. From stephen at xemacs.org Sat Oct 10 17:40:50 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sun, 11 Oct 2009 00:40:50 +0900 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACFDF86.8040104@g.nevcal.com> References: <4ACEABFD.6010309@g.nevcal.com> <4ACEB234.9030309@is.kochi-u.ac.jp> <4ACED8C4.5070906@g.nevcal.com> <4ACEF66B.3000500@is.kochi-u.ac.jp> <4ACFA08F.9080307@g.nevcal.com> <4ACFB456.6010106@is.kochi-u.ac.jp> <4ACFDF86.8040104@g.nevcal.com> Message-ID: <87fx9rl0jh.fsf@uwakimon.sk.tsukuba.ac.jp> Glenn Linderman writes: > On approximately 10/9/2009 3:08 PM, came the following characters from > the keyboard of Tokio Kikuchi: > > Your suggestions 1)-4) are not acceptable to Japanese users at > > all. > If a message with an encoded header arrives (like your number 2 sample) > but it cannot be decoded, what action _is_ acceptable to Japanese > users? And what action is implemented in Mailman (if different)? I know a fair bit about Japanese (both the language and the users), and I'm having difficulty understanding what Tokio means, given your list of hypotheses. I suspect he's basically rejecting the hypothesis that it can't be decoded -- if it can't be decoded, then learn how to do so! > I can think of a 5th technique... don't modify the header, and send > it through unchanged. Now I think I've covered the gamut of > possibilities, I agree. However, I think we're way out of bounds here. We already know how to decode anything that RFC 2047 can throw at us in charsets that Python can handle.
Anything that can't be decoded then is seriously malformed from the point of view of the mailing list users. So why are we discussing this? We don't even know what our mainline APIs are going to look like, why are we discussing forcibly operating on broken input? [[ Aside: > with an appropriate translation for "Re: "). "Re" is a Latin abbreviation; there is no appropriate translation. ;-) ]] > MUAs or mailing list handlers that attempt to retain what was sent > (idempotency or invertibility), would be more likely to do what I > describe, and are more robust when faced with new character sets > that they don't understand how to decode. Maybe they are, but the email module doesn't know or care about what they do. Let's stick within what the email module is supposed to handle. From v+python at g.nevcal.com Sat Oct 10 22:01:46 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Sat, 10 Oct 2009 13:01:46 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: <87iqenl594.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACE6CBD.2030805@g.nevcal.com> <87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACED79F.6050602@g.nevcal.com> <87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACF9C6B.4020508@g.nevcal.com> <87iqenl594.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4AD0E82A.5000603@g.nevcal.com> On approximately 10/10/2009 6:59 AM, came the following characters from the keyboard of Stephen J. Turnbull: > I'm running out of time to work on this (yeah, I know it's the > weekend, but my life is like that lately). I think we're converging, > though, so I'd like try and tie some of those ends together. 
> I think we are converging too... mostly terminology issues, and assumptions were causing a bit of misunderstandings. > Glenn Linderman writes: > > On approximately 10/9/2009 8:10 AM, came the following characters from > > the keyboard of Stephen J. Turnbull: > > > > Actually, I would say you are emitting leniently, in violation of the > > > Postel principle. > > > > You can say that, but I don't have to believe it. I'm talking about > > accepting; the message has arrived, it is here, the client is trying to > > look at it, and I'm talking about ways the client can look at > > not-quite-perfect data, knowing that it is not quite perfect, but still > > being able to see it. I'm not at all talking about emitting data. > > It would be indeed, if the corrupt data is stored in the place where > correctly decoded data normally is stored, and is accessible in the > same way. But I gather that's not what you were talking about, my > mistake. > Well, the client tells us where to store it, and we can't prevent it from being the same place. But accessible in the same way? Not. Some extra parameter or different API, would surely be required to get not-quite-perfect data. > > You seem to be calling the email package helping the client to > > accept not-quite-perfect data, as a form of emitting data. It is > > not. > > No, I was confused by the way you wrote. Saving the data *somewhere* > is absolutely necessary; not losing data is the #1 commandment of > low-level mail processing. Surely the email module is subject to that > commandment. *Nobody* is talking about losing any data yet, except > Barry indirectly when he says that some people think giving up on > invertibility (often called "idempotency"), and even he is quite > adamant that he's not going to give up on that. 
> > So when you wrote about saving and converting to text form, without > mentioning that the specific APIs, I assumed you meant the "mainline" > APIs for parsing and accessing parts of a correctly formatted message. > Mostly, I hadn't bothered about APIs yet; I'm not yet very familiar with the existing ones, because neither nPOPuk nor SeaMonkey nor Thunderbird, the only email programs that I have looked at source code for, use the Python email package! So while I'm reasonably familiar with the RFCs, and quite familiar with nPOPuk source, and have looked at a small fraction of the SeaMonkey/Thunderbird source code (and been amazed at how big it is), and have examined email from a large variety of sources comparing it to the RFCs to see where it goes wrong and why it doesn't display in SeaMonkey/Thunderbird the same way as in Outlook/Outlook Express (or other programs), and have found Outlook 2000 and Apple Mail to be quite creative in interpreting the RFCs, I'm new to the Python email package. > > The email package cannot police the client... if it chooses to "eat it > > in a single gulp without looking at it" then it may get indigestion. I > > never suggested that "converting to Unicode as if it were Latin-1" > > should be done without informing the client, or being requested by the > > client to do that via a special API call... > > Well, maybe I misread it, but it certainly looked like that to me. I > would not object to that special API call defaulting to ISO 8859/1. > > > If you ignore defect reports, you are ignorant (blunt, but not intended > > to be offensive). > > What I worried about is that if defect reports are present, *but > displayable data is also present*, programmers *will* simply display > it, for example in producing a prototype program. It will be > impossible to determine without very close analysis of that program > that an early version became a production version without adding > appropriate checks. 
In practice, this bug will be discovered when > some end user's installation breaks. > > It seems that you agree with this, and because the special API call is > necessary, it will be easy to identify whether proper care is being > taken or not. Right? > Well, yes and no. I think that the email package should require that some special action needs to be taken by the client to request not-quite-perfect data, either a special parameter value, or different API, etc. But there is nothing that says that some client might not pass that all the time, and ignore the defect reports. Whether that is easy to identify or not, and whether the email package wants to require that the normal APIs be tried before the not-quite-perfect APIs, are issues for discussion. Ultimately, the email package cannot enforce that proper care is taken by the client; only code reviews of the client can encourage that. > > > > It is still raw user input, and should still be checked for proper > > > > syntax by the client, > > > > > > Nonsense. The email module had better know a lot more about syntax > > > than the client. If it doesn't, whack it with a 2x4 until it learns! > > > > I think we are talking at cross purposes here. I find it quite > > difficult to follow where you cross the boundary between talking about > > one sort of email package client, and then switch to another type, or > > switch to the responsibilities of the email package. > > Excuse me? The "raw user input" you referred to above is material > that the client software receives from the email package. The email > package should give it to the client in the "normal" (convenient) way > only if it can certify that it conforms to the appropriate standard. > Yes, agreed. And a special way or ways to get various algorithms for attempting to interpret not-quite-perfect data, when the client thinks that might be useful. Then the client has "tweaked" user input. > That standard should be specified in the API documentation.
Any more > detailed structure, of course, is the responsibility of the client. > Right. And it is the more detailed structure that I was referring to... Even if the structure of the email is incorrect, if the client can find its input among the various attempts to obtain data from the not-quite-perfect email message, and can validate and check its input, it may choose to process it even if the email message is imperfect... it should probably note somewhere that the email message from which the data was obtained was not perfect, but really, that is up to the client to figure out, based on its requirements. > > An application which is using email as a transport, has specific goals, > > which require specific content. You were mentioning clients. > > I've already said that when I speak of an MUA, I write "MUA". In > speaking of the calling program, which might even be a user running > the module via the Python interpreter, I write "client". It's a very > convenient way to describe the user of an API, in contrast to the > provider of the API (the implementation). > Yep, so I think my "application" and your "client" are the same thing. I'm trying to use your term as I continue responding in these threads; it is reasonable. > > If such a client doesn't validate the syntax of that content, it > > isn't much of an application. > > If that MUA or email application uses RFC 822 addresses, it should be > able to rely on the email module to parse those addresses correctly, > or provide a defect report. One might even go so far as to suggest > that it be able to parse the (non-RFC, but very common) "+" notation > for separating the "mailbox" from "additional data" used for VERP and > challenge-response applications. That would have to be documented, > but if so documented client applications like the MUA should be able > to rely on it (and you can bet many will). > Hmm. This is an interesting digression...
"+", according to the RFCs, is just another of the legal characters that can be found before the @ in an unquoted email address... the list is !#$%&'*+-/=?^_`{}|~ in addition to the alphanumerics. How a particular email server interprets the "stuff before the @" is pretty much up to it... so as long as it does something appropriate, it can interpret all or a fraction of it as a mailbox name, or could it intuit a mailbox name from the body content if it wants, or even from a special header. So yeah, particular interpretations of the address is non-RFC stuff. > Application domain syntax of course is not the email module's problem > whether it arrives by email or Pony Express, and I'm really confused > why you're going so far afield. > Just to point out that good data can be obtained from bad email messages, I think, and that that is a use case. > > > No, they cannot just be raised. If you just raise the error, then the > > > next time you try to access unparsed data, you'll hit the error > > > again. If you use the same handler you did before, you're in an > > > infloop. So you need a second handler to do things differently this > > > time or a flag ... but it's unclear to me that that flag can be a > > > boolean. So you may as well store the defect list and information > > > about where to restart. > > > > From the point of view of the email package, the errors can just be > > raised. Then the client can make choices, and use other APIs or other > > parameters to the API to direct the email package to attempt a different > > technique to access the data. > > The problem is that by this point some of the state of the parse may > be lost. We can't say "just raise", we need to say "interrupt the > parse, preserve state, and then raise". Python does absolutely > nothing to help with the problem of preserving the state. We also > need to determine just what state to preserve. > > > Yes, I have learned that in my 34 years of programming. I agree. 
> > > > > So it's OK to write a lazy parser, but it must retain enough state so > > > that it can work forward until the end. [...] > > > > Are you speaking about parsing the message into MIME parts, or parsing a > > particular MIME part contained within the message, or both? > > Both. I *believe* (but it needs to be checked) that in a correctly > formed multipart MIME object (message or part), any internal structure > is context-free within the MIME boundaries. If that is so, then > individual parts of the object can be stored in raw form and parsed > lazily. > > Similarly, for any MIME or RFC 822 object, the object can be parsed > into header section and body section, and each can be stored and > parsed lazily, subject to the condition that the header section must > be sufficiently parsed to identify all headers that might affect > parsing the body part before the body part is parsed. That > "condition" is the context. > Neither of these context conditions apply to correctly formed MIME trees, but are the only context I'm aware of that can affect parsing of MIME parts, AFAIK (and I just reread most of the MIME RFCs in the last few days). The only context for parsing MIME parts that I'm aware of is that when determining the end of a nested MIME part, that the search for ending delimiter must include searching for any higher-level delimiter as well... to handle the case where the inner delimiter got lost. So one should search for CR LF --, and then examine the stuff after the -- to match first the innermost delimiter, and then the next outermost, etc., and if finding a match, considering that it is the end of all the parts nested within the delimiter found, the inner ones being considered truncated, since their own delimiter was not found. Unexpected end-of-data should also mark all unterminated nested MIME parts as incomplete, of course. The only other cross-part context that I am aware of is Content-ID references. 
That doesn't affect parsing, but rather semantic interpretation, after parsing, validation, and decoding is complete. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From v+python at g.nevcal.com Sat Oct 10 22:20:02 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Sat, 10 Oct 2009 13:20:02 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: <87fx9rl0jh.fsf@uwakimon.sk.tsukuba.ac.jp> References: <4ACEABFD.6010309@g.nevcal.com> <4ACEB234.9030309@is.kochi-u.ac.jp> <4ACED8C4.5070906@g.nevcal.com> <4ACEF66B.3000500@is.kochi-u.ac.jp> <4ACFA08F.9080307@g.nevcal.com> <4ACFB456.6010106@is.kochi-u.ac.jp> <4ACFDF86.8040104@g.nevcal.com> <87fx9rl0jh.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4AD0EC72.6040704@g.nevcal.com> On approximately 10/10/2009 8:40 AM, came the following characters from the keyboard of Stephen J. Turnbull: > Glenn Linderman writes: > > On approximately 10/9/2009 3:08 PM, came the following characters from > > the keyboard of Tokio Kikuchi: > > > > Your suggestions 1)-4) are not accesptable to Japanese users at > > > all. > > > If a message with an encoded header arrives (like your number 2 sample) > > but it cannot be decoded, what action _is_ acceptable to Japanese > > users? And what action is implemented in Mailman (if different)? > > I know a fair bit about Japanese (both the language and the users), > and I'm having difficulty understanding what Tokio means, given your > list of hypotheses. I suspect he's basically rejecting the hypothesis > that it can't be decoded -- if it can't be decoded, then learn how to > do so! > > > I can think of a 5th technique... don't modify the header, and send > > it through unchanged. Now I think I've covered the gamut of > > possibilities, > > I agree. However, I think we're way out of bounds here. 
We already > know how to decode anything that RFC 2047 can throw at us in charsets > that Python can handle. Anything that can't be decoded then is > seriously malformed from the point of view of the mailing list users. > So why are we discussing this? We don't even know what our mainline > APIs are going to look like, why are we discussing forcibly operating > on broken input? > Use case generation. If the only way to access header values is to successfully, fully, decode them, then some uses may be rendered impossible, or at least difficult, even by choice of APIs. > [[ Aside: > > > with an appropriate translation for "Re: "). > > "Re" is a Latin abbreviation; there is no appropriate translation. ;-) > Nonetheless, I have seen both Re: and Fwd: translated to other languages (besides Latin or geek) :) Communication to people with MUAs that do such translations tend to accumulate an alternating Re: XRe: Re: XRe: Re: subject line because neither MUA will recognize the other translation. > ]] > > > MUAs or mailing list handlers that attempt to retain what was sent > > (idempotency or invertibility), would be more likely to do what I > > describe, and are more robust when faced with new character sets > > that they don't understand how to decode. > > Maybe they are, but the email module doesn't know or care about what > they do. Let's stick within what the email module is supposed to > handle Yep, this is just use case exploration. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From rdmurray at bitdance.com Sat Oct 10 23:20:59 2009 From: rdmurray at bitdance.com (R. 
David Murray) Date: Sat, 10 Oct 2009 17:20:59 -0400 (EDT) Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACF880D.5080305@g.nevcal.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> <4ACE6A1B.7060702@g.nevcal.com> <3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org> <4ACF880D.5080305@g.nevcal.com> Message-ID: On Fri, 9 Oct 2009 at 11:59, Glenn Linderman wrote: > On approximately 10/9/2009 5:05 AM, came the following characters from the > keyboard of Barry Warsaw: >> On Oct 8, 2009, at 6:39 PM, Glenn Linderman wrote: >> > 1) wire format. Either what came in, in the parser case, or what would >> > be generated. >> > 2) internal headers from the MIME part >> > 3) decoded BLOB. This means that quopri and base64 are decoded, no more >> > and no less. This is bytes. No headers, only payload. For >> > Content-Transfer-Encoding: binary, this is mostly a noop. >> > 4) text/* parts should also be obtainable as str()/unicode(), payload >> > only. This is where charset decoding is done. >> > >> > I think your talk in the next paragraph about hooks and other object >> > types being produced is a generalization of 4, not 3, and generally no >> > additional decoding needs to be done, just conversion to the right >> > object type (or file, or file-like object). >> I mostly agree with that. I've always called #4 the "decoded payload" and >> #3 I've usually called the "raw payload". Maybe we can bikeshed on better >> terms to help inform us about the API's method/attribute names. > > It would be good though to have standardized terms for easier communication. 
> Maybe as they are chosen, they could be added to that Wiki RDM set up? I didn't set it up, Barry did. I just started adding stuff ;) > My only problem with "raw" and "decoded" payload, is that there are 3 payload > formats, not 2, so there needs to be a 3rd term, corresponding to #1, #3, and > #4, above. #2 is somewhat orthogonal from the payload. > > To me, "raw" conjures up #1, not #3. I think I understand why Barry uses it for #3: it's the 'raw data' that went in to get transfer-encoded in the first place. But clearly the term is ambiguous. I have set up two more documents on the wiki. One is UseCases[1], and I've tried to copy into it all of the use cases that have been mentioned in this discussion, plus a few more. Edits welcome. The other is a Glossary[2]. I think most of it accurately reflects the consensus here, but in it I'm proposing to use the term 'transfer-decoded' for #3, and 'transfer-encoded' as an alternative to 'wire-format' just for symmetry. Comments and suggestions welcome. Any other terms of art we should record? 
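To make the three payload forms concrete, here is a small sketch using the standard library email package as it exists today. The variable names follow the proposed glossary terms and are otherwise arbitrary; the sample message is invented, and its body is built with `base64.b64encode` rather than typed by hand.

```python
import base64
import email

TEXT = "caf\u00e9"  # sample text containing one non-ASCII character

# #1: wire format, i.e. headers plus a base64 transfer-encoded body.
RAW = (
    b'Content-Type: text/plain; charset="utf-8"\n'
    b"Content-Transfer-Encoding: base64\n"
    b"\n"
    + base64.b64encode(TEXT.encode("utf-8"))
    + b"\n"
)

msg = email.message_from_bytes(RAW)

# Still transfer-encoded: the base64 text exactly as it travelled.
transfer_encoded = msg.get_payload()

# #3: transfer-decoded. base64 is undone, nothing more; the result is bytes.
transfer_decoded = msg.get_payload(decode=True)

# #4: decoded payload. Charset decoding is applied to obtain a str.
decoded = transfer_decoded.decode(msg.get_content_charset())
```

Note that the existing `decode=True` flag only undoes the Content-Transfer-Encoding, which is one reason the word "decoded" alone is ambiguous.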
--David [1] http://wiki.python.org/moin/Email%20SIG/UseCases [2] http://wiki.python.org/moin/Email%20SIG/Glossary From v+python at g.nevcal.com Sun Oct 11 00:58:38 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Sat, 10 Oct 2009 15:58:38 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> <4ACE6A1B.7060702@g.nevcal.com> <3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org> <4ACF880D.5080305@g.nevcal.com> Message-ID: <4AD1119E.60409@g.nevcal.com> On approximately 10/10/2009 2:20 PM, came the following characters from the keyboard of R. David Murray: > On Fri, 9 Oct 2009 at 11:59, Glenn Linderman wrote: >> On approximately 10/9/2009 5:05 AM, came the following characters >> from the keyboard of Barry Warsaw: >>> On Oct 8, 2009, at 6:39 PM, Glenn Linderman wrote: >>> > 1) wire format. Either what came in, in the parser case, or what >>> would > be generated. >>> > 2) internal headers from the MIME part >>> > 3) decoded BLOB. This means that quopri and base64 are decoded, >>> no more > and no less. This is bytes. No headers, only payload. >>> For > Content-Transfer-Encoding: binary, this is mostly a noop. >>> > 4) text/* parts should also be obtainable as str()/unicode(), >>> payload > only. This is where charset decoding is done. >>> > > I think your talk in the next paragraph about hooks and other >>> object > types being produced is a generalization of 4, not 3, and >>> generally no > additional decoding needs to be done, just >>> conversion to the right > object type (or file, or file-like object). >>> I mostly agree with that. 
I've always called #4 the "decoded >>> payload" and >>> #3 I've usually called the "raw payload". Maybe we can bikeshed on >>> better >>> terms to help inform us about the API's method/attribute names. >> >> It would be good though to have standardized terms for easier >> communication. Maybe as they are chosen, they could be added to that >> Wiki RDM set up? > > I didn't set it up, Barry did. I just started adding stuff ;) OK. I seem to have an account there, so made some edits. >> My only problem with "raw" and "decoded" payload, is that there are 3 >> payload formats, not 2, so there needs to be a 3rd term, >> corresponding to #1, #3, and #4, above. #2 is somewhat orthogonal >> from the payload. >> >> To me, "raw" conjures up #1, not #3. > > I think I understand why Barry uses it for #3: it's the 'raw data' that > went in to get transfer-encoded in the first place. But clearly the > term is ambiguous. I found it so. > I have set up two more documents on the wiki. One is UseCases[1], and > I've > tried to copy into it all of the use cases that have been mentioned in > this discussion, plus a few more. Edits welcome. I hadn't seen UTF-16/-32/-BE/-LE mentioned in this discussion, but the MIME RFCs do mention use cases that require them, so I added it to RFC822 handling, but it might be better in HTTP handling? Or maybe elsewhere? > The other is a Glossary[2]. I think most of it accurately reflects the > consensus here, but in it I'm proposing to use the term > 'transfer-decoded' > for #3, and 'transfer-encoded' as an alternative to 'wire-format' just > for symmetry. Comments and suggestions welcome. I like the distinction you made that 'wire format' is "in the wild", not known to be RFC compliant, and 'transfer-encoded' be the generated type, and compliant. I would think that if we get data as far as 'transfer-decoded', that we've (mostly) proven that the received 'wire format' is compliant, or can be made compliant. 
(I switched conformant to compliant, not finding the former at dictionary.com, and not liking conformable which I found there, as it seems to imply able to be changed to conform, in my head, although not in the definition). > Any other terms of art we should record? > > --David > > [1] http://wiki.python.org/moin/Email%20SIG/UseCases > [2] http://wiki.python.org/moin/Email%20SIG/Glossary > -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From stephen at xemacs.org Sun Oct 11 02:47:39 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sun, 11 Oct 2009 09:47:39 +0900 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4AD0EC72.6040704@g.nevcal.com> References: <4ACEABFD.6010309@g.nevcal.com> <4ACEB234.9030309@is.kochi-u.ac.jp> <4ACED8C4.5070906@g.nevcal.com> <4ACEF66B.3000500@is.kochi-u.ac.jp> <4ACFA08F.9080307@g.nevcal.com> <4ACFB456.6010106@is.kochi-u.ac.jp> <4ACFDF86.8040104@g.nevcal.com> <87fx9rl0jh.fsf@uwakimon.sk.tsukuba.ac.jp> <4AD0EC72.6040704@g.nevcal.com> Message-ID: <878wfilpsk.fsf@uwakimon.sk.tsukuba.ac.jp> Glenn Linderman writes: > On approximately 10/10/2009 8:40 AM, came the following characters from > the keyboard of Stephen J. Turnbull: > > So why are we discussing this? We don't even know what our mainline > > APIs are going to look like, why are we discussing forcibly operating > > on broken input? > > Use case generation. If the only way to access header values is to > successfully, fully, decode them, then some uses may be rendered > impossible, or at least difficult, even by choice of APIs. Since invertibility is a requirement, "successfully fully decoding" a header field is not a prerequisite to accessing it. 
The question of "what should we do about broken mail" at this point has three components: (1) To what level do we (ie, the email module) promise to parse conforming wire format into useful objects? (2) For nonconforming input, when is it OK to raise an error and return to the calling client rather than handle it ourselves? (3) What is the API for accessing and/or mutating unparsed data, and requesting a reparse? I don't think we should go any farther than that. > > "Re" is a Latin abbreviation; there is no appropriate translation. ;-) > > > > Nonetheless, I have seen both Re: and Fwd: translated to other languages > (besides Latin or geek) :) Sure. This is an aspect of question (1): is this the responsibility of the email module? > > Maybe they are, but the email module doesn't know or care about what > > they do. Let's stick within what the email module is supposed to > > handle > > Yep, this is just use case exploration. But since by definition this is broken input, discussing what applications are going to want to do with it is inappropriate, IMO. We don't care if the app is going to prefix, suffix, or crucifix it. We need to specify (a) what object will hold the raw data we couldn't handle (b) how a calling client can retrieve the raw data (c) how the client can replace (or more generally mutate) that data (d) how the client can request a reparse from us if it attempted to repair the breakage at a low level rather than parse it Manipulations of text or bytes are in principle not the responsibility of the email module IMO; that will be done *by* the client *using* raw Python, not methods provided by email. I don't see how discussion of *what* manipulations can be done with one hand up our nose is anything but useless bikeshedding. 
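Points (a) through (d) can be sketched in a few lines. This is only an illustration under stated assumptions: the class name UnparsedField and its methods get_raw, set_raw and reparse are hypothetical, not an existing email-package API, and "parsing" is reduced to a charset decode so the example stays short. The repair step is done by the client with plain Python, as argued above.

```python
class UnparsedField:
    """Hypothetical holder for a header field the parser could not handle."""

    def __init__(self, raw: bytes):
        self._raw = raw        # (a) the object that holds the raw data
        self.value = None      # parsed text, or None while still unparsed
        self.defects = []      # problems seen during (re)parse attempts

    def get_raw(self) -> bytes:
        """(b) Let the calling client retrieve the raw data."""
        return self._raw

    def set_raw(self, raw: bytes) -> None:
        """(c) Let the client replace the raw data; the old parse is dropped."""
        self._raw = raw
        self.value = None

    def reparse(self, charset: str = "ascii") -> bool:
        """(d) Re-run the (stand-in) parser; report success or failure."""
        try:
            self.value = self._raw.decode(charset)
        except UnicodeDecodeError as exc:
            self.defects.append(exc)
            return False
        return True


field = UnparsedField(b"Subject: caf\xe9")
first_try = field.reparse()          # fails: 0xE9 is not ASCII

# Client-side repair with raw Python, not a method provided by email:
repaired = field.get_raw().decode("latin-1").encode("utf-8")
field.set_raw(repaired)
second_try = field.reparse("utf-8")
```

The holder never guesses at a repair itself; it only stores, hands out, accepts and reparses bytes, which keeps the heuristics on the client side.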
If we decide that the email module can usefully provide sufficiently general facilities that would be convenient and hard to implement by general client programmers (eg, the Mailman Developers collective wisdom about foreign equivalents for "re" and "fwd" is surely greater than that of the average American programmer), we will do it by calling low-level methods to get and put the data, and raw Python to manipulate it as text or bytes. From stephen at xemacs.org Sun Oct 11 05:23:36 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sun, 11 Oct 2009 12:23:36 +0900 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4AD0E82A.5000603@g.nevcal.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACE6CBD.2030805@g.nevcal.com> <87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACED79F.6050602@g.nevcal.com> <87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACF9C6B.4020508@g.nevcal.com> <87iqenl594.fsf@uwakimon.sk.tsukuba.ac.jp> <4AD0E82A.5000603@g.nevcal.com> Message-ID: <877hv2likn.fsf@uwakimon.sk.tsukuba.ac.jp> Glenn Linderman writes: > On approximately 10/10/2009 6:59 AM, came the following characters from > the keyboard of Stephen J. Turnbull: > > Glenn Linderman writes: > > > On approximately 10/9/2009 8:10 AM, came the following characters from > > > the keyboard of Stephen J. Turnbull: > > correctly decoded data normally is stored, and is accessible in the > > same way. But I gather that's not what you were talking about, my > > mistake. > > Well, the client tells us where to store it, and we can't prevent it > from being the same place. Huh? No way! We decide where our data is stored. 
This isn't C where you pass around arbitrary pointers for efficiency. In particular, strings (whether Unicode or bytes) are not mutable. So the client can keep a copy if it likes, but once it hands us raw message text as bytes, after that we decide where we put parsed pieces and/or slices of the unparsed original. > > So when you wrote about saving and converting to text form, without > > mentioning the specific APIs, I assumed you meant the "mainline" > > APIs for parsing and accessing parts of a correctly formatted message. > > Mostly, I hadn't bothered about APIs yet; You may not bother about APIs, but it sure looks like you do to me. You can't talk about where to store stuff without touching the API. > I think that the email package should require that some special action > needs to be taken by the client to request not-quite-perfect data, > either a special parameter value, or different API, etc. That's all I need to hear, until we're ready to write specs for that API. (Note that a special parameter value is part of the API in a sense, if we specify and document what it means, so I tend to use API for that, too, not just for whole functions.) > But there is nothing that says that some client might not pass that > all the time, and ignore the defect reports. Whether that is easy > to identify or not, and whether the email package wants to require > that the normal APIs be tried before the not-quite-perfect APIs are > issues for discussion. The answers are obvious to me: yes and no. You can identify whether a particular API has been used with standard text search tools like M-x occur. (For non-Emacsers, that is an Emacs command that finds all occurrences of a particular string in the buffer.) If a program wants to call the quick & dirty APIs first, that's none of our business, except that if parsing is being done lazily we should be careful to update the defect list, so that the program can check it when it wants to.
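A rough sketch of that last point, assuming a hypothetical LazyMessage class (none of these names exist in the email package): the unparsed accessor has its own distinct name so a text search finds every use, and asking for the defect list forces the lazy parse so the list is never stale.

```python
class LazyMessage:
    """Hypothetical message whose Subject is parsed only when needed."""

    def __init__(self, raw_subject: bytes):
        self._raw_subject = raw_subject
        self._subject = None
        self._parsed = False
        self._defects = []

    def _ensure_parsed(self) -> None:
        # Stand-in for real parsing: an ASCII decode that may fail.
        if self._parsed:
            return
        self._parsed = True
        try:
            self._subject = self._raw_subject.decode("ascii")
        except UnicodeDecodeError:
            self._defects.append("Subject is not ASCII")

    def subject(self):
        """Parsed API: text, or None if the field could not be parsed."""
        self._ensure_parsed()
        return self._subject

    def subject_unparsed(self) -> bytes:
        """Unparsed API: separate name, never triggers a parse."""
        return self._raw_subject

    @property
    def defects(self):
        """Force the pending parse so the caller sees a current list."""
        self._ensure_parsed()
        return list(self._defects)


msg = LazyMessage(b"caf\xe9")
raw = msg.subject_unparsed()   # quick & dirty access first
found = msg.defects            # checked later, whenever the program wants
```

Whether defects should be computed on demand like this or eagerly at feed time is a design choice the sketch does not settle.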
> Ultimately, the email package cannot enforce that proper care is taken > by the client; only code reviews of the client can encourage that. My point is not to enforce anything, not even code reviews. But by having separate APIs for parsed and unparsed data, code review can be made easier and more accurate. > Yes, agreed. And a special way or ways to get various algorithms for > attempting to interpret not-quite-perfect data, when the client thinks > that might be useful. I don't think we should be talking about special ways (plural) or "not-quite-perfect" data. At this point in the design process, we have *parsed* and *unparsed* data. Heuristic algorithms for recovering from unparseable input can be layered on top of these two sets of APIs, when we have *real* use cases for them. For example, I don't think your use case of prepending a mailing list's topic or serial number to an unparseable subject is realistic; in all lists I know of such a message would be held for moderation, or even discarded outright as spam. And again: > Right. And it is the more detailed structure that I was referring to... But why? There is no need to discuss it at this point, and bringing it up is confusing as all get-out. > How a particular email server interprets the "stuff before the @" is > pretty much up to it... so as long as it does something appropriate, it > can interpret all or a fraction of it as a mailbox name, or it could > intuit a mailbox name from the body content if it wants, or even from a > special header. So yeah, particular interpretations of the address are > non-RFC stuff. Right. To riff on the RFC vs. not theme ["Barry, pick up the bass line, need more bottom here!"], I think we should pick a list of RFCs we "promise" to implement as "defining" email; if we reserve any structures as "too obscure for us to parse," we should say so (and reference chapter and verse of the Holy RFC).
On the other hand, of course as we discover common use cases for which precise specifications can be given, we should be flexible and implement them. But there should be no rush. Which RFCs? First of all, the STD 11 series (RFCs 733, 822, 2822, 5322). Here we have to worry about the standard's recommended format vs. the obsolete format because of the Postel principle. AFAIK, there is no reason not to insist on *producing* strictly RFC 5322 conformant messages, but I think we should implement both strict and lax parsers. The lax parser is for "daily use", the strict parser for validation. Second, the basic MIME structure RFCs: 2045-2049, 2231. (Some of these have been at least partially superseded by now, I think.) The mailing list header RFCs: 2369 and 2919. Not RFCs, per se, but an auxiliary module should provide the registered IANA data for the above RFCs. Strictly speaking outside of the email module, but we make use of URLs (RFC 3986 -- superseded?) and mimetypes data (this overlaps substantially with the "registered IANA data"). We need to coordinate with the responsible maintainers for those. Ditto coordinating with modules that we share a lot of structure with, the "not email but very similar" like HTTP (RFC 2616), and netnews (NNTP = RFC 3977 and RFC 1036). Which extensions? Er, don't you think the above is enough for now? > Just to point out that good data can be obtained from bad email > messages, I think, and that that is a use case. But we already know that, and the basic idea of how to treat bad data (send it to a locked room without any supper). No need to rehash that, AFAICS from your use case. > The only context for parsing MIME parts that I'm aware of is that when > determining the end of a nested MIME part, Indeed, but this is Postel principle stuff, not about parsing correct syntax. First we need to decide what to do with correct syntax, then come up with belt and suspenders algorithms for broken mail.
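For illustration only, the strict/lax split proposed above can be tried with the policy objects that later Python 3 releases ship in the standard library (email.policy did not exist when this thread was written, so this shows one way the idea can look, not what was being specified here). The sample message is made up: it declares multipart/mixed but gives no boundary parameter.

```python
import email
from email import errors, policy

# Nonconforming input: multipart content type with no boundary parameter.
RAW = b"Content-Type: multipart/mixed\r\n\r\nno boundary here\r\n"

# Lax ("daily use"): parsing succeeds and problems are recorded as defects.
lax = email.message_from_bytes(RAW, policy=policy.default)
lax_defects = [type(d).__name__ for d in lax.defects]

# Strict ("validation"): the same input makes the parser raise instead.
try:
    email.message_from_bytes(RAW, policy=policy.strict)
    strict_error = None
except errors.MessageDefect as exc:
    strict_error = type(exc).__name__

print("lax defects:", lax_defects)
print("strict error:", strict_error)
```

In this arrangement there is one parser and the policy decides whether a defect is collected or raised, rather than two separate parser implementations.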
> The only other cross-part context that I am aware of is Content-ID > references. That doesn't affect parsing, but rather semantic > interpretation, after parsing, validation, and decoding is complete. I wasn't thinking of those, but that's a good point. Those will need to be kept in a mapping at a higher level of the representation, probably top-level, I guess. From turnbull at sk.tsukuba.ac.jp Sun Oct 11 05:52:54 2009 From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull) Date: Sun, 11 Oct 2009 12:52:54 +0900 Subject: [Email-SIG] fixing the current email module In-Reply-To: References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> <4ACE6A1B.7060702@g.nevcal.com> <3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org> <4ACF880D.5080305@g.nevcal.com> Message-ID: <8763amlh7t.fsf@uwakimon.sk.tsukuba.ac.jp> R. David Murray writes: > I have set up two more documents on the wiki. One is UseCases[1], [...]. > The other is a Glossary[2]. Thank you, very much! > I think most of it accurately reflects the consensus here, but in > it I'm proposing to use the term 'transfer-decoded' for #3, and > 'transfer-encoded' as an alternative to 'wire-format' just for > symmetry. Comments and suggestions welcome. 'Wire-format' means "you can cat it to the wire", ie, RFC-conforming (in fact, it's the only meaning in the RFCs by definition), and for email itself it's always bytes AFAIK (Mama don' 'low no XML roun' here, Lord, Lord!). That's not true of all our applications, though, especially stuff like doctests. 
There are also some RFCs we use such as BASE64 (specifically relevant to transfer encodings) that are defined in terms of characters, not bytes, so 'transfer-encoded' is slightly different from 'wire-format'. I think in general that kind of comment should be applied directly to the Glossary, but what deserves general discussion is "how pedantic do we want to be?" I think the distinction made here between 'wire-format' and 'transfer-encoded' is useful *to us*, and in general I lean toward "high pedantry" (cf. how much smoke and how little fire Glenn and I are generating!) WDOT? From stephen at xemacs.org Sun Oct 11 06:01:50 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sun, 11 Oct 2009 13:01:50 +0900 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4AD1119E.60409@g.nevcal.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> <4ACE6A1B.7060702@g.nevcal.com> <3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org> <4ACF880D.5080305@g.nevcal.com> <4AD1119E.60409@g.nevcal.com> Message-ID: <874oq6lgsx.fsf@uwakimon.sk.tsukuba.ac.jp> Glenn Linderman writes: > (I switched conformant to compliant, Conformant is in common use. You might be more comfortable with conforming. Richard Stallman points out that you comply with the law, but you conform to a standard. I think it's useful to make that semantic distinction, cf. RFC 2119 MUST vs. SHOULD or MAY.
From v+python at g.nevcal.com Sun Oct 11 06:37:48 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Sat, 10 Oct 2009 21:37:48 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: <874oq6lgsx.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> <4ACE6A1B.7060702@g.nevcal.com> <3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org> <4ACF880D.5080305@g.nevcal.com> <4AD1119E.60409@g.nevcal.com> <874oq6lgsx.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4AD1611C.6030406@g.nevcal.com> On approximately 10/10/2009 9:01 PM, came the following characters from the keyboard of Stephen J. Turnbull: > Glenn Linderman writes: > > (I switched conformant to compliant, > > Conformant is in common use. You might be more comfortable with > conforming. > > Richard Stallman points out that you comply with the law, but you > conform to a standard. I think it's useful to make that semantic > distinction, cf. RFC 2119 MUST vs. SHOULD or MAY. > conformant is not in the dictionaries I've consulted. Conforming is mostly a verb, not an adjective. Richard Stallman is a great programmer, but conformable and compliant are synonyms. I don't like the word conformable, but if you appreciate his distinction, then we should use the word conformable even though I don't like it. But we shouldn't use the letter sequence conformant, because although I know what you mean by it, it appears not to be a word, and English is hard enough for ESL folks when they can find the words in the dictionary. 
-- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From rdmurray at bitdance.com Sun Oct 11 07:12:27 2009 From: rdmurray at bitdance.com (R. David Murray) Date: Sun, 11 Oct 2009 01:12:27 -0400 (EDT) Subject: [Email-SIG] fixing the current email module In-Reply-To: <874oq6lgsx.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> <4ACE6A1B.7060702@g.nevcal.com> <3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org> <4ACF880D.5080305@g.nevcal.com> <4AD1119E.60409@g.nevcal.com> <874oq6lgsx.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Sun, 11 Oct 2009 at 13:01, Stephen J. Turnbull wrote: > Glenn Linderman writes: > > (I switched conformant to compliant, > > Conformant is in common use. You might be more comfortable with > conforming. > > Richard Stallman points out that you comply with the law, but you > conform to a standard. I think it's useful to make that semantic > distinction, cf. RFC 2119 MUST vs. SHOULD or MAY. Indeed. My regular dictionary doesn't have it, but WordWeb does: http://www.wordwebonline.com/en/CONFORMANT Seems to be a 'term of art' in computing rather than a regular English word, and the most appropriate word in the context in which I used it. 
But perhaps it should be added to the Glossary itself :) --David From v+python at g.nevcal.com Sun Oct 11 07:15:49 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Sat, 10 Oct 2009 22:15:49 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: <877hv2likn.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACE6CBD.2030805@g.nevcal.com> <87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACED79F.6050602@g.nevcal.com> <87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACF9C6B.4020508@g.nevcal.com> <87iqenl594.fsf@uwakimon.sk.tsukuba.ac.jp> <4AD0E82A.5000603@g.nevcal.com> <877hv2likn.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4AD16A05.8020302@g.nevcal.com> On approximately 10/10/2009 8:23 PM, came the following characters from the keyboard of Stephen J. Turnbull: > Glenn Linderman writes: > > On approximately 10/10/2009 6:59 AM, came the following characters from > > the keyboard of Stephen J. Turnbull: > > > Glenn Linderman writes: > > > > On approximately 10/9/2009 8:10 AM, came the following characters from > > > > the keyboard of Stephen J. Turnbull: > > > > correctly decoded data normally is stored, and is accessible in the > > > same way. But I gather that's not what you were talking about, my > > > mistake. > > > > Well, the client tells us where to store it, and we can't prevent it > > from being the same place. > > Huh? No way! We decide where our data is stored. This isn't C where > you pass around arbitrary pointers for efficiency. In particular, > strings (whether Unicode or bytes) are not mutable. 
So the client can > keep a copy if it likes, but once it hands us raw message text as > bytes, after that we decide where we put parsed pieces and/or slices > of the unparsed original. > Yes, email package can figure out where to store its copy, client figures out where to store its copy. We're getting better at communicating, but not 100% there yet :) I was thinking of the case where the client asks the email package for data, and stores it in its variable; you seem to be thinking of the case where the client gives the email package data. > > > So when you wrote about saving and converting to text form, without > > > mentioning that the specific APIs, I assumed you meant the "mainline" > > > APIs for parsing and accessing parts of a correctly formatted message. > > > > Mostly, I hadn't bothered about APIs yet; > > You may not bother about APIs, but it sure looks like you do to me. > You can't talk about where to store stuff without touching the API. > Well, I'm sure there will be APIs; the names and parameters is what I haven't bothered about yet, much, except if the discussion seemed to require such. > > I think that the email package should require that some special action > > needs to be taken by the client to request not-quite-perfect data, > > either a special parameter value, or different API, etc. > > That's all I need to hear, until we're ready to write specs for that > API. (Note that a special parameter value is part of the API in a > sense, if we specify and document what it means, so I tend to use API > for that, too, not just for whole functions.) > Yes, I was just trying to be clear that it could be either case. > > But there is nothing that says that some client might not pass that > > all the time, and ignore the defect reports. Whether that is easy > > to identify or not, and whether the email package wants to require > > that the normal APIs be tried before the not-quite-perfect APIs are > > issues for discussion. 
> > The answers are obvious to me: yes and no. You can identify whether a > particular API has been used with standard text search tools like M-x > occur. (For non-Emacsers, that is an Emacs command that finds all > occurrences of a particular string in the buffer.) If a program wants > to call the quick & dirty APIs first, that's none of our business, > except that if parsing is being done lazily we should be careful to > update the defect list, so that the program can check it when it > wants to. > > > Ultimately, the email package cannot enforce that proper care is taken > > by the client; only code reviews of the client can encourage that. > > My point is not to enforce anything, not even code reviews. But by > having separate APIs for parsed and unparsed data, code review can be > made easier and more accurate. > You have to analyze the control flow as well, not just search for existence of the API. In normal code, that should be straightforward, but there is no guarantee that the client doesn't use spaghetti code, or even obfuscated code, where the analysis would be hard. The API call could exist, but never be invoked; the API call could take parameters that never have particular values of interest at run-time. Hence, it may or may not be easy to search the client code and figure it out. But I agree with your stated point: we can't enforce anything about the client code, unless we write it ourselves, or have some sort of authority over it. I intend to write a client, so I'll have control over that one, and don't plan to obfuscate it. > > Yes, agreed. And a special way or ways to get various algorithms for > > attempting to interpret not-quite-perfect data, when the client thinks > > that might be useful. > > I don't think we should be talking about special ways (plural) or > "not-quite-perfect" data. At this point in the design process, we > have *parsed* and *unparsed* data.
Heuristic algorithms for > recovering from unparsable input can be layered on top of these two > sets of APIs, when we have *real* use cases for them. For example, I > don't think your use case of prepending a mailing list's topic or > serial number to an unparseable subject is realistic; in all lists I > know of such a message would be held for moderation, or even discarded > outright as spam. > So if the subject is unparseable, what is the moderator to do? He can't read the subject if it unparseable. Perhaps he can read the body, but it might be in the same unparseable charset. Let's say he can read the body, and the message seems to be valid for the list, and he marks it to be forwarded to list members. Now what is the mailing list to do, it still can't parse the subject? And if there is no moderator, it still may not be spam, just a mailing list manager that doesn't understand a valid charset, likely because it predates the definition of the charset. > And again: > > > Right. And it is the more detailed structure that I was referring to... > > But why? There is no need to discuss it at this point, and bringing > it up is confusing as all get-out. > The more we understand/discuss about how different client can function, the better we can design the email package. We'll still not likely cover all the possibilities, but we don't want to have tunnel vision and declare that because Mailman works this way, that all mailing list managers work this way, or that because we haven't discussed that some client might do something this way, that it won't. So I have no problem bringing clients into the discussion, to make sure that we don't preclude their reasonable behaviors as use cases. > > How a particular email server interprets the "stuff before the @" is > > pretty much up to it... 
so as long as it does something appropriate, it > > can interpret all or a fraction of it as a mailbox name, or it could > > intuit a mailbox name from the body content if it wants, or even from a > > special header. So yeah, particular interpretations of the address are > > non-RFC stuff. > > Right. To riff on the RFC vs. not theme ["Barry, pick up the bass > line, need more bottom here!"], I think we should pick a list of RFCs > we "promise" to implement as "defining" email; if we reserve any > structures as "too obscure for us to parse," we should say so (and > reference chapter and verse of the Holy RFC). On the other hand, of > course as we discover common use cases for which precise > specifications can be given, we should be flexible and implement them. > But there should be no rush. > > Which RFCs? > > First of all, the STD 11 series (RFCs 733, 822, 2822, 5322). Here we > have to worry about the standard's recommended format vs. the obsolete > format because of the Postel principle. AFAIK, there is no reason not > to insist on *producing* strictly RFC 5322 conformant messages, but I > think we should implement both strict and lax parsers. The lax parser > is for "daily use", the strict parser for validation. > > Second, the basic MIME structure RFCs: 2045-2049, 2231. (Some of > these have been at least partially superseded by now, I think.) > > The mailing list header RFCs: 2369 and 2919. > > Not RFCs, per se, but an auxiliary module should provide the > registered IANA data for the above RFCs. > > Strictly speaking outside of the email module, but we make use of URLs > (RFC 3986 -- superseded?) and mimetypes data (this overlaps > substantially with the "registered IANA data"). We need to coordinate > with the responsible maintainers for those. > > Ditto coordinating with modules that we share a lot of structure with, > the "not email but very similar" like HTTP (RFC 2616), and netnews > (NNTP = RFC 3977 and RFC 1036). > > Which extensions?
> > Er, don't you think the above is enough for now? > It's a good list, yes. > > Just to point out that good data can be obtained from bad email > > messages, I think, and that that is a use case. > > But we already know that, and the basic idea of how to treat bad data > (send it to a locked room without any supper). No need to rehash > that, AFAICS from your use case. > Locked room is the first pass; unlocking it belongs to the heuristics, for determined clients. The use case wasn't at http://wiki.python.org/moin/Email%20SIG/UseCases so I've added it there, as "Handling pathological data #2" -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From v+python at g.nevcal.com Sun Oct 11 07:49:25 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Sat, 10 Oct 2009 22:49:25 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: <878wfilpsk.fsf@uwakimon.sk.tsukuba.ac.jp> References: <4ACEABFD.6010309@g.nevcal.com> <4ACEB234.9030309@is.kochi-u.ac.jp> <4ACED8C4.5070906@g.nevcal.com> <4ACEF66B.3000500@is.kochi-u.ac.jp> <4ACFA08F.9080307@g.nevcal.com> <4ACFB456.6010106@is.kochi-u.ac.jp> <4ACFDF86.8040104@g.nevcal.com> <87fx9rl0jh.fsf@uwakimon.sk.tsukuba.ac.jp> <4AD0EC72.6040704@g.nevcal.com> <878wfilpsk.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4AD171E5.40307@g.nevcal.com> On approximately 10/10/2009 5:47 PM, came the following characters from the keyboard of Stephen J. Turnbull: > Glenn Linderman writes: > > On approximately 10/10/2009 8:40 AM, came the following characters from > > the keyboard of Stephen J. Turnbull: > > > > So why are we discussing this? We don't even know what our mainline > > > APIs are going to look like, why are we discussing forcibly operating > > > on broken input? > > > > Use case generation. 
If the only way to access header values is to > > successfully, fully, decode them, then some uses may be rendered > > impossible, or at least difficult, even by choice of APIs. > > Since invertibility is a requirement, "successfully fully decoding" a > header field is not a prerequisite to accessing it. > > The question of "what should we do about broken mail" at this point > has three components: > > (1) To what level do we (ie, the email module) promise to parse > conforming wire format into useful objects? > > (2) For nonconforming input, when is it OK to raise an error and > return to the calling client rather than handle it ourselves? > > (3) What is the API for accessing and/or mutating unparsed data, and > requesting a reparse? > > I don't think we should go any farther than that. > I agree with your three components; but I think the answer to (3) requires discussion/speculation about what clients might want to do when faced with errors, otherwise the API won't likely help them much, without reimplementing email package logic. It is easy to design "sufficient", but unhelpful, APIs. So I've been willing to discuss such things. Maybe at too much length, and maybe with insufficient clarity that that is what I'm discussing, for which I apologize. But I don't think that not discussing it helps to answer (3). > > > "Re" is a Latin abbreviation; there is no appropriate translation. ;-) > > > > > > > Nonetheless, I have seen both Re: and Fwd: translated to other languages > > (besides Latin or geek) :) > > Sure. This is an aspect of question (1): is this the responsibility > of the email module? > I don't think the old RFCs even discuss the use of Re: and Fwd:, nor whether they should be collapsed or translated, or even used at all. Just checked: RFC 822 had an example that showed Re:, but RFC 2822 does discuss it a bit, and suggests not adding duplicate Re:. Fwd: is not mentioned at all, in those two RFCs.
So no, adding and collapsing Re:/Fwd: is not the responsibility of the email package. But making it easy to do so, might be, as it is a common client operation. Lots of email style guides discuss it. > > > Maybe they are, but the email module doesn't know or care about what > > > they do. Let's stick within what the email module is supposed to > > > handle > > > > Yep, this is just use case exploration. > > But since by definition this is broken input, discussing what > applications are going to want to do with it is inappropriate, IMO. > We don't care if the app is going to prefix, suffix, or crucifix it. > We need to specify > > (a) what object will hold the raw data we couldn't handle > (b) how a calling client can retrieve the raw data > (c) how the client can replace (or more generally mutate) that data > (d) how the client can request a reparse from us if it attempted to > repair the breakage at a low level rather than parse it > > Manipulations of text or bytes are in principle not the responsibility > of the email module IMO; that will be done *by* the client *using* raw > Python, not methods provided by email. I don't see how discussion of > *what* manipulations can be done with one hand up our nose is anything > but useless bikeshedding. > > If we decide that the email module can usefully provide sufficiently > general facilities that would be convenient and hard to implement by > general client programmers (eg, the Mailman Developers collective > wisdom about foreign equivalents for "re" and "fwd" is surely greater > than that of the average American programmer), we will do it by > calling low-level methods to get and put the data, and raw Python to > manipulate it as text or bytes Except it may be perfectly valid input using a standard that post-dates the application. Doing something reasonable with it is appropriate. 
The email RFCs go to great lengths to make new features work reasonably in old clients that have limited understanding; with fallback interpretations for unknown MIME subtypes and even MIME types, and ensuring that some type of reasonable interpretation might be done. The RFCs define ways that new MIME types and subtypes might be defined, and new charsets, it seems reasonable to attempt to accommodate the possibility that such may actually be defined in the future. If we don't discuss some of the possibilities, we'll never learn enough to "decide that the email module can usefully provide sufficiently general facilities that would be convenient and hard to implement by general client programmers" :) To me, "hard" would mean that they would have to rewrite portions of logic that already exists in the email package, and then tweak it slightly to compensate for not-quite-perfect data, or maybe I should switch to saying "not-quite-perfect-or-possibly-later-standardized data" :) -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. 
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From v+python at g.nevcal.com Sun Oct 11 07:51:50 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Sat, 10 Oct 2009 22:51:50 -0700 Subject: [Email-SIG] fixing the current email module In-Reply-To: References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> <4ACE6A1B.7060702@g.nevcal.com> <3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org> <4ACF880D.5080305@g.nevcal.com> <4AD1119E.60409@g.nevcal.com> <874oq6lgsx.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4AD17276.10205@g.nevcal.com> On approximately 10/10/2009 10:12 PM, came the following characters from the keyboard of R. David Murray: > But perhaps it should be added to the Glossary itself :) That would, to me, make it more acceptable for use. Like I said, I knew what was meant, but tried several printed and internet dictionaries, and didn't find it. Didn't try wordwebonline, as you might suppose! -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From stephen at xemacs.org Sun Oct 11 10:25:39 2009 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Sun, 11 Oct 2009 17:25:39 +0900 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4AD171E5.40307@g.nevcal.com> References: <4ACEABFD.6010309@g.nevcal.com> <4ACEB234.9030309@is.kochi-u.ac.jp> <4ACED8C4.5070906@g.nevcal.com> <4ACEF66B.3000500@is.kochi-u.ac.jp> <4ACFA08F.9080307@g.nevcal.com> <4ACFB456.6010106@is.kochi-u.ac.jp> <4ACFDF86.8040104@g.nevcal.com> <87fx9rl0jh.fsf@uwakimon.sk.tsukuba.ac.jp> <4AD0EC72.6040704@g.nevcal.com> <878wfilpsk.fsf@uwakimon.sk.tsukuba.ac.jp> <4AD171E5.40307@g.nevcal.com> Message-ID: <87zl7yjq0s.fsf@uwakimon.sk.tsukuba.ac.jp> Glenn Linderman writes: > > (3) What is the API for accessing and/or mutating unparsed data, and > > requesting a reparse? > > > > I don't think we should go any farther than that. > > I agree with your three components; but I think the answer to (3) > requires discussion/speculation of what clients might want to do when > faced with errors, I could be wrong, but I don't think it does. We don't need to implement YAGNIs. > otherwise the API won't likely help them much, without > reimplementing email package logic. (1) That's why I propose parsing as much as possible, but no more. The parts that are in the email package will not only be implemented and available, but they will already have been done. What hasn't been done yet, the email module doesn't know how to do anyway. (2) DRY simply doesn't apply. The logic for dealing with erroneous data is not the same as dealing with conforming data. If it were, we would have succeeded in the first place. > Except it may be perfectly valid input using a standard that post-dates > the application. Doing something reasonable with it is appropriate. I have no idea what you're thinking of. If it's a standard we implement, we'll handle it. If it isn't, it's not our problem. Discussing "possibilities" is out of the realm of "useful" already. Useful is "Existing client X does Y, and Z does it too. We can do Y for them, faster, better, cheaper." 
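As a concrete illustration of "parse as much as possible, but no more", here is a minimal sketch of how the standard-library email package in a current Python 3 treats one kind of nonconforming input. It assumes the default (compat32) policy; the sample message and the variable names are made up for the example, and nothing here is meant as the API the thread is trying to design.

```python
import email
from email import errors

# Claims to be multipart but declares no boundary: nonconforming input.
raw = (
    "Subject: broken\n"
    "Content-Type: multipart/mixed\n"
    "\n"
    "this body cannot be split into parts\n"
)
msg = email.message_from_string(raw)

# The parser does not raise; it records what went wrong on the object.
defect_types = [type(d) for d in msg.defects]

# The part it could not parse is kept verbatim as a plain string payload.
bad_chunk = msg.get_payload()

# Headers that did parse remain usable.
subject = msg["Subject"]

# Serialising again writes the unparsed payload back out unchanged.
round_trip = msg.as_string()
```

A client that wants to repair the breakage would work on `bad_chunk` with ordinary string or bytes operations and then parse again; how a future API should expose that step is exactly the open question (3).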
From stephen at xemacs.org Sun Oct 11 10:42:07 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sun, 11 Oct 2009 17:42:07 +0900 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4AD1611C.6030406@g.nevcal.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> <4ACE6A1B.7060702@g.nevcal.com> <3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org> <4ACF880D.5080305@g.nevcal.com> <4AD1119E.60409@g.nevcal.com> <874oq6lgsx.fsf@uwakimon.sk.tsukuba.ac.jp> <4AD1611C.6030406@g.nevcal.com> Message-ID: <87y6nijp9c.fsf@uwakimon.sk.tsukuba.ac.jp> Glenn Linderman writes: > conformant is not in the dictionaries I've consulted. Try these (top 3 goggle results for "conformant"): conformant- WordWeb dictionary definition (computing) conforming to a particular specification or standard "In this paper we present a new approach to conformant planning". Nearest ... www.wordwebonline.com/en/CONFORMANT - Cached - Similar - conformant - Definition from the Merriam-Webster Online Dictionary conformant can be found at Merriam-WebsterUnabridged.com. Click here to start your free trial! Click here to search for another word in the Merriam-Webster ... www.merriam-webster.com/dictionary/conformant - Cached - Similar - Conformance The notion of TEI conformance is intended as an aid in describing the format and contents of a particular document or set of documents. ... 
www.tei-c.org/Guidelines/P4/html/CF.html - Cached - Similar - A quick look at some of the results show that the word "conformant" is typically used in a section called "conformance", which defines what criteria are used to determine if an application is following the standard or not. OTOH, the fact that the top three results are dictionary definitions suggests an awful lot of people are looking up the word in dictionaries.... > Conforming is mostly a verb, not an adjective. Goggling gives "Results 1 - 10 of about 3,680,000 for conforming application," but " Results 1 - 10 of about 324,000 for conformant application." Looks like "conforming" is the preferred adjectival form. > but conformable and compliant are synonyms. When used to mean "submissive." "Conformable" won't do. > English is hard enough for ESL folks when they can find the > words in the dictionary. Compliant does seem to be the winner. "Results 1 - 10 of about 13,900,000 for compliant application." Conformant or conforming is better IMHO but much less popular. Tie goes to the lusers, as usual. From stephen at xemacs.org Sun Oct 11 11:11:19 2009 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Sun, 11 Oct 2009 18:11:19 +0900 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4AD16A05.8020302@g.nevcal.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACE6CBD.2030805@g.nevcal.com> <87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACED79F.6050602@g.nevcal.com> <87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACF9C6B.4020508@g.nevcal.com> <87iqenl594.fsf@uwakimon.sk.tsukuba.ac.jp> <4AD0E82A.5000603@g.nevcal.com> <877hv2likn.fsf@uwakimon.sk.tsukuba.ac.jp> <4AD16A05.8020302@g.nevcal.com> Message-ID: <87ws32jnwo.fsf@uwakimon.sk.tsukuba.ac.jp> Glenn Linderman writes: > On approximately 10/10/2009 8:23 PM, came the following characters from > the keyboard of Stephen J. Turnbull: > > I don't think your use case of prepending a mailing list's topic or > > serial number to an unparseable subject is realistic; in all lists I > > know of such a message would be held for moderation, or even discarded > > outright as spam. > > So if the subject is unparseable, what is the moderator to do? That's her problem, not ours. I can think of a number of things she can do, starting with bouncing the mail back to sender with a note that it was broken, please fix. If the moderator is me, I might load the mail into XEmacs and see if Gnus can grok it. Etc. If and when we discover there are "best practices" for this situation, we should help automate them. Until then, "it broke -- here are all the pieces" is what we should say, IMO. > The more we understand/discuss about how different client can function, > the better we can design the email package. Sure, but about this level of discussion ... 
"Although never is often better than *right* now" applies, I think. From barry at python.org Mon Oct 12 22:18:32 2009 From: barry at python.org (Barry Warsaw) Date: Mon, 12 Oct 2009 16:18:32 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACE6CBD.2030805@g.nevcal.com> <87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACED79F.6050602@g.nevcal.com> <87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACF9C6B.4020508@g.nevcal.com> Message-ID: On Oct 9, 2009, at 7:20 PM, R. David Murray wrote: > IMO, the appropriate way for the email package to provide the API you > are talking about is it provide the client with a way to get at the > raw > byte string, which I think everyone agrees on. If the client wants to > decode it as if it were latin-1 to process it, it can then do that. I agree. I'm running out of time to participate in this lengthy thread, but I just wanted to say that of the 3 accessors (raw, transport-decoded, fully-decoded) I'm not sure transport-decoded is all that interesting. I wouldn't support it directly in the API. I think they library's clients are mostly going to be interested in raw or fully-decoded values, and there will be plenty of library utilities to get from raw to transport-decoded if they really want it. -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Mon Oct 12 22:19:34 2009 From: barry at python.org (Barry Warsaw) Date: Mon, 12 Oct 2009 16:19:34 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACFDB3C.5040307@g.nevcal.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACE6CBD.2030805@g.nevcal.com> <87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACED79F.6050602@g.nevcal.com> <87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACF9C6B.4020508@g.nevcal.com> <4ACFDB3C.5040307@g.nevcal.com> Message-ID: <1D493DD5-7EE0-486B-BD8D-24FC5BB0B7A0@python.org> On Oct 9, 2009, at 8:54 PM, Glenn Linderman wrote: > That certainly works, but it isn't very helpful... that forces the > client application to reproduce the logic to parse the header value > and decode the parts that can be decoded successfully, and that is > exactly the sort of thing Stephen was complaining about when he > thought I was suggesting that to be a requirement (but he was > confused about what I was suggesting). There are/will be utilities in the email package to make this easy. I don't think there's a ton of benefit to be had by supporting transport-decoded directly in the Message or Header API. -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Mon Oct 12 22:30:28 2009 From: barry at python.org (Barry Warsaw) Date: Mon, 12 Oct 2009 16:30:28 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <87iqenl594.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACE6CBD.2030805@g.nevcal.com> <87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACED79F.6050602@g.nevcal.com> <87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACF9C6B.4020508@g.nevcal.com> <87iqenl594.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <2F987CAA-8FC6-406E-A825-E12B97659A60@python.org> On Oct 10, 2009, at 9:59 AM, Stephen J. Turnbull wrote: > Both. I *believe* (but it needs to be checked) that in a correctly > formed multipart MIME object (message or part), any internal structure > is context-free within the MIME boundaries. If that is so, then > individual parts of the object can be stored in raw form and parsed > lazily. I too /think/ that's correct. There are some MIME content-types that cause parts to be related (e.g. multipart/alternative and multipart/related), but those are all operating at a higher level. In practice it probably makes sense to parse all the headers right away. Content-Type has the most bearing on parsing the rest of the stuff, so by that time you already need to parse parameters to e.g. get the boundary. Early on I claimed that headers were so manageable in practice that we could implement an ordered-dictionary with duplicates as a simple list, with linear searching and nobody would notice. I think nobody has noticed ;). 
Lazy parsing of the body does make sense. You only need to parse enough to find end boundaries, or recurse into parsing an embedded part. This is how the parser currently works anyway. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Mon Oct 12 22:41:48 2009 From: barry at python.org (Barry Warsaw) Date: Mon, 12 Oct 2009 16:41:48 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <4ACF880D.5080305@g.nevcal.com> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> <4ACE6A1B.7060702@g.nevcal.com> <3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org> <4ACF880D.5080305@g.nevcal.com> Message-ID: <87C79F21-2E8D-460E-992D-AE6050F0C394@python.org> On Oct 9, 2009, at 2:59 PM, Glenn Linderman wrote: > It would be good though to have standardized terms for easier > communication. Maybe as they are chosen, they could be added to > that Wiki RDM set up? I like raw, transfer-decoded, decoded (or maybe fully-decoded). As I've mentioned before, I don't think the Message or Header APIs need to directly support transfer-decoded. > Separate APIs would be clearer, but for compatibility, > should .get_payload() be retained, with the flag? No. It was a mistake that should be taken out back and shot. I would propose a radical suggestion: we treat backward compatibility the way Python 3 did. Nice to keep, but we can throw it over the side in order to fix the warts. We'll worry about migration strategy later. 
Aside: I would really like to have a much more @property based API where appropriate. E.g. Message.get_content_type() would be Message.content_type. And in this case we'd probably have message.payload_bytes or some such. Decoding may require additional parameters so it will probably be a method. > Sure, a registration system is fine. It could work for any type > that has a method that can be registered, that accepts a binary BLOB > and returns an appropriate typed and functioning object that can > manipulate that type. That would mean that the application would > have to make all the registration calls up front, instead of making > the API calls when the objects are retrieved. Basically, if the > email package doesn't have a registration system that the > application can use, the application has to invent its own, so this > is work that could benefit all applications. I'm sure there will be lots of default content-types registered, and there ought to be a "default" or fallback converter that can be overridden. It should also be possible for third party extensions to add additional converters. Models for this would be timezone additions for datetime, and codecs. > Actually, although it is not common practice to have encodings other > than the RFC defined base64 and quoted-printable, a registration > system for converting from #1 to #3, with appropriate defaults for > base64, quoted-printable, binary, 7bit, 8bit, would be appropriate, That makes sense. -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Mon Oct 12 22:45:09 2009 From: barry at python.org (Barry Warsaw) Date: Mon, 12 Oct 2009 16:45:09 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> <4ACE6A1B.7060702@g.nevcal.com> <3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org> <4ACF880D.5080305@g.nevcal.com> Message-ID: On Oct 10, 2009, at 5:20 PM, R. David Murray wrote: > The other is a Glossary[2]. I think most of it accurately reflects > the > consensus here, but in it I'm proposing to use the term 'transfer-decoded' > for #3, and 'transfer-encoded' as an alternative to 'wire-format' just > for symmetry. Comments and suggestions welcome. wire-format is potentially misleading because the RFCs define line-endings as CRLF, but we accept system native line-endings, and sometimes output them too. ready-for-another-can-of-worms-yum!-ly y'rs, -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Mon Oct 12 22:47:31 2009 From: barry at python.org (Barry Warsaw) Date: Mon, 12 Oct 2009 16:47:31 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <878wfilpsk.fsf@uwakimon.sk.tsukuba.ac.jp> References: <4ACEABFD.6010309@g.nevcal.com> <4ACEB234.9030309@is.kochi-u.ac.jp> <4ACED8C4.5070906@g.nevcal.com> <4ACEF66B.3000500@is.kochi-u.ac.jp> <4ACFA08F.9080307@g.nevcal.com> <4ACFB456.6010106@is.kochi-u.ac.jp> <4ACFDF86.8040104@g.nevcal.com> <87fx9rl0jh.fsf@uwakimon.sk.tsukuba.ac.jp> <4AD0EC72.6040704@g.nevcal.com> <878wfilpsk.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <19D59CF5-9027-4544-AE8A-207AAA26F6D3@python.org> On Oct 10, 2009, at 8:47 PM, Stephen J. Turnbull wrote: > The question of "what should we do about broken mail" at this point > has three components: > > (1) To what level do we (ie, the email module) promise to parse > conforming wire format into useful objects? > > (2) For nonconforming input, when is it OK to raise an error and > return to the calling client rather than handle it ourselves? > > (3) What is the API for accessing and/or mutating unparsed data, and > requesting a reparse? > > I don't think we should go any farther than that. Agreed! -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From barry at python.org Mon Oct 12 22:54:17 2009 From: barry at python.org (Barry Warsaw) Date: Mon, 12 Oct 2009 16:54:17 -0400 Subject: [Email-SIG] fixing the current email module In-Reply-To: <877hv2likn.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACE6CBD.2030805@g.nevcal.com> <87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACED79F.6050602@g.nevcal.com> <87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACF9C6B.4020508@g.nevcal.com> <87iqenl594.fsf@uwakimon.sk.tsukuba.ac.jp> <4AD0E82A.5000603@g.nevcal.com> <877hv2likn.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <82DD0453-E0D5-4853-9678-27D7E0FEB9CC@python.org> On Oct 10, 2009, at 11:23 PM, Stephen J. Turnbull wrote: > Right. To riff on the RFC vs. not theme ["Barry, pick up the bass > line, need more bottom here!"], I think we should pick a list of RFCs > we "promise" to implement as "defining" email; if we reserve any > structures as "too obscure for us to parse," we should say so (and > reference chapter and verse of the Holy RFC). On the other hand, of > course as we discover common use cases for which precise > specifications can be given, we should be flexible and implement them. > But there should be no rush. Although of course Rush is the most awesomest band EVAR. But I'm slappin' and poppin' to your groove here my bruthah. > Which RFCs? > > First of all, the STD 11 series (RFCs 733, 822, 2822, 5322). Here we > have to worry about the standard's recommended format vs. the obsolete > format because of the Postel principle. 
AFAIK, there is no reason not > to insist on *producing* strictly RFC 5322 conformant messages, but I > think we should implement both strict and lax parsers. The lax parser > is for "daily use", the strict parser for validation. > > Second, the basic MIME structure RFCs: 2045-2049, 2231. (Some of > these have been at least partially superseded by now, I think.) > > The mailing list header RFCs: 2369 and 2919. Yep, yep, and yep. > Not RFCs, per se, but an auxiliary module should provide the > registered IANA data for the above RFCs. > > Strictly speaking outside of the email module, but we make use of URLs > (RFC 3986 -- superseded?) and mimetypes data (this overlaps > substantially with the "registered IANA data"). We need to coordinate > with the responsible maintainers for those. > > Ditto coordinating with modules that we share a lot of structure with, > the "not email but very similar" like HTTP (RFC 2616), and netnews > (NNTP = 3977 and RFC 1036). > > Which extensions? > > Er, don't you think the above is enough for now? Surely is, at least until that U$1M grant from the PSF comes through. Oh wait, we blew that on lunch at Pycon 2009. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 832 bytes Desc: This is a digitally signed message part URL: From stephen at xemacs.org Tue Oct 13 06:07:00 2009 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Tue, 13 Oct 2009 13:07:00 +0900 Subject: [Email-SIG] fixing the current email module In-Reply-To: <87C79F21-2E8D-460E-992D-AE6050F0C394@python.org> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> <4ACE6A1B.7060702@g.nevcal.com> <3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org> <4ACF880D.5080305@g.nevcal.com> <87C79F21-2E8D-460E-992D-AE6050F0C394@python.org> Message-ID: <871vl8hr8b.fsf@uwakimon.sk.tsukuba.ac.jp> Barry Warsaw writes: > I would proposal a radical suggestion: we treat backward compatibility > the way Python 3 did. Nice to keep, but we can throw it over the side > in order to fix the warts. We'll worry about migration strategy later. +1 > Aside: I would really like to have a much more @property based API > where appropriate. +1 > E.g. Message.get_content_type() would be Message.content_type. And > in this case we'd probably have message.payload_bytes or some such. > Decoding may require additional parameters so it will probably be a > method. Maybe, but in general those parameters can be deduced from the metadata. If we can use those defaults often enough, then the default-decoded version can be a property too. We would have to provide alternatives, though. I've seen Shift JIS encoded Japanese labelled "ISO-2022-JP", and apparently many Japanese MUAs actually decode that to Japanese! Not suggesting that we should do the same, but probably the generic function that is used to decode should be exposed as a method so that clients who encounter such nonsense can deal with it, and override any of the metadata. 
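The property-versus-method split discussed above can be sketched on top of today's standard-library email package. This is a minimal sketch under stated assumptions: `PartView` and its attribute names (`payload_bytes`, `text`, `decode`) are hypothetical, borrowed loosely from the names floated in the thread, and the only real library calls used are `get_content_type()`, `get_content_charset()` and `get_payload(decode=True)`.

```python
import base64
import email


class PartView:
    """Hypothetical property-based wrapper around a parsed message part."""

    def __init__(self, msg):
        self._msg = msg

    @property
    def content_type(self):
        return self._msg.get_content_type()

    @property
    def payload_bytes(self):
        # Transfer-decoded payload: base64 / quoted-printable already undone.
        return self._msg.get_payload(decode=True)

    def decode(self, charset=None, errors="strict"):
        # A method, not a property: the caller may override the declared charset.
        charset = charset or self._msg.get_content_charset() or "ascii"
        return self.payload_bytes.decode(charset, errors)

    @property
    def text(self):
        # Default decoding, deduced from the part's own metadata.
        return self.decode()


# Shift JIS bytes mislabelled as ISO-2022-JP, as in the case described above.
japanese = "\u65e5\u672c\u8a9e"
body = base64.b64encode(japanese.encode("shift_jis")).decode("ascii")
wire = (
    "Content-Type: text/plain; charset=ISO-2022-JP\n"
    "Content-Transfer-Encoding: base64\n"
    "\n" + body + "\n"
)
view = PartView(email.message_from_string(wire))

try:
    declared = view.text            # trusts the (wrong) label
except UnicodeDecodeError:
    declared = None                 # the label did not match the bytes
repaired = view.decode(charset="shift_jis")   # client overrides the metadata
```

The design point is only that the common case can stay a plain attribute access while the escape hatch remains an explicit call with parameters.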
From andrewm at object-craft.com.au Mon Oct 19 06:39:06 2009 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Mon, 19 Oct 2009 15:39:06 +1100 Subject: [Email-SIG] fixing the current email module In-Reply-To: <1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org> References: <10506972.7161254576370614.JavaMail.root@boaz> <8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org> <87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp> <1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org> Message-ID: <20091019043906.542AE59C086@longblack.object-craft.com.au> >Just to ramble a little longer, it's been argued that we should give >up on idempotency, but I'm not convinced. I think people want to see >an email message they throw into the system come out the other end as >closely as possible (well, /exactly/ for well-formed messages). I, for one, would be disappointed if we lost idempotency. If people want a use-case, think of SpamBayes, where we read the message, do our best to analyse it, then insert a header or two. If this mangled messages, the email module would be nearly useless to SB. 
-- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Mon Oct 19 06:50:26 2009 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Mon, 19 Oct 2009 15:50:26 +1100 Subject: [Email-SIG] fixing the current email module In-Reply-To: <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACCD10D.4070308@g.nevcal.com> <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <20091019045027.12F3E59C086@longblack.object-craft.com.au> > > Your "hit me with your best shot" comment indicates that you want a > > failure code or exception when the data is bad, and then a way to > > "retry accepting errors"? > >My current thinking is that the email module should return an object >representing a partial parse. The way that you find out if it is >partial is to try to access some data that "should" be in the object. >If the parse succeeded, the accessor returns the data (which might be >empty). If the parse did not succeed, you get an AttributeError. >(This is just a paraphrase of what I wrote in response to Oleg.) I agree - try to extract as much intelligence as we can from the malformed message, and hold the unparseable bits in a "bad chunk" object. If possible, when reserialising the message, e-mail the bad chunk verbatim, or possibly with minor fixes to keep the containing MIME structure legal if we have to. But I'd rather see "garbage-in and same garbage-out", than "garbage-in and even worse garbage out". Maybe the parsing should be lazy where possible: don't recurse deeper into the structure if all we're doing is looking at a top level header, for instance. 
-- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Mon Oct 19 07:05:10 2009 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Mon, 19 Oct 2009 16:05:10 +1100 Subject: [Email-SIG] fixing the current email module In-Reply-To: References: <8510262.7231254589795083.JavaMail.root@boaz> <4ACB0DC9.7080307@g.nevcal.com> <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACB971D.9080706@g.nevcal.com> <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp> <4ACC0277.2060807@g.nevcal.com> <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp> <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org> <4ACD94E5.5020808@g.nevcal.com> <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org> <4ACE6A1B.7060702@g.nevcal.com> <3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org> <4ACF880D.5080305@g.nevcal.com> Message-ID: <20091019050510.C238259C086@longblack.object-craft.com.au> >wire-format is potentially misleading because the RFCs define line- >endings as CRLF, but we accept system native line-endings, and >sometimes output them too. And, in some contexts, when forwarding e-mail it is important that we emit exactly the line endings we received, without trying to be "helpful" and "fix" them. But, in the case of text content inserted into a message, I think we should convert the system line endings into CRLF (possibly with some way to override this - a "literal" mode). -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From rdmurray at bitdance.com Thu Oct 22 00:58:42 2009 From: rdmurray at bitdance.com (R. 
David Murray) Date: Wed, 21 Oct 2009 18:58:42 -0400 (EDT) Subject: [Email-SIG] invertability and idempotence In-Reply-To: <20091019043906.542AE59C086@longblack.object-craft.com.au> References: <10506972.7161254576370614.JavaMail.root@boaz> <8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org> <87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp> <1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org> <20091019043906.542AE59C086@longblack.object-craft.com.au> Message-ID: On Mon, 19 Oct 2009 at 15:39, Andrew McNamara wrote: >> Just to ramble a little longer, it's been argued that we should give >> up on idempotency, but I'm not convinced. I think people want to see >> an email message they throw into the system come out the other end as >> closely as possible (well, /exactly/ for well-formed messages). > > I, for one, would be disappointed if we lost idempotency. If people want > a use-case, think of SpamBayes, where we read the message, do our best > to analyse it, then insert a header or two. If this mangled messages, > the email module would be nearly useless to SB. You are referring here to invertibility, rather than idempotence. But it turns out that idempotence does have a meaning in the context of the email module, so I think I need to remove 'deprecated' from my glossary[1] entry for it, and explain what it means in the context of the email module. For background, see issue 7119[2]. Here's what I propose: _invertibility_ applies to the data path into the parser and out of the generator. That is: generate(parse(msg)) == msg should be true whenever possible. On the other hand, when _constructing_ a message, sometimes not all data is filled in (in the example above, it is the MIME boundary marker). In that case, it is important (I think, please discuss :) that generating the message maintain _idempotency_: once you have generated the message, then if you have not further mutated the message, generating the message again should produce the _same_ output. 
That is: generate(msg) == generate(msg) even though the state of msg may change after the _first_ generate call. --David [1] http://wiki.python.org/moin/Email%20SIG/Glossary [2] http://bugs.python.org/issue7119 From andrewm at object-craft.com.au Thu Oct 22 06:58:24 2009 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 22 Oct 2009 15:58:24 +1100 Subject: [Email-SIG] invertability and idempotence In-Reply-To: References: <10506972.7161254576370614.JavaMail.root@boaz> <8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org> <87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp> <1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org> <20091019043906.542AE59C086@longblack.object-craft.com.au> Message-ID: <20091022045824.ABEBD600111@longblack.object-craft.com.au> >You are referring here to invertability, rather than idempotence. The discussion had referred to idempotency up until that point, and I didn't want to introduce new terminology. But referring to this: > generate(parse(msg)) == msg as "idempotency" is perfectly valid in my opinion (as in, applying an operation multiple times produces the same result). -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From stephen at xemacs.org Thu Oct 22 10:00:13 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Thu, 22 Oct 2009 17:00:13 +0900 Subject: [Email-SIG] invertability and idempotence In-Reply-To: <20091022045824.ABEBD600111@longblack.object-craft.com.au> References: <10506972.7161254576370614.JavaMail.root@boaz> <8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org> <87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp> <1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org> <20091019043906.542AE59C086@longblack.object-craft.com.au> <20091022045824.ABEBD600111@longblack.object-craft.com.au> Message-ID: <87eiovdfjm.fsf@uwakimon.sk.tsukuba.ac.jp> Andrew McNamara writes: > The discussion had referred to idempotency up until that point, and I > didn't want to introduce new terminology. 
But referring to this: > > > generate(parse(msg)) == msg > > as "idempotency" is perfectly valid in my opinion (as in, applying an > operation multiple times produces the same result). That would be generate(generate(msg)) == generate(msg) or parse(parse(email)) == parse(email). The input and output of these functions are of *different types*, they cannot possibly be idempotent. I'm +1 on changing to use "invertible", -0 on continuing to use "idempotent" (since it's the traditional idiom), and -1 on using "idempotent" to mean "is deterministic", ie, generate(msg) == generate(msg). If msg changes state in an irrelevant way, it would be nice to produce the same output from generate. But that is not "idempotency". And we would need to specify precisely what irrelevant means. For example, if a client of the Message class decides to specify the MIME boundary explicitly, then the output of generate has to change IMO. OTOH, many MIME implementations put the time of day or the generating process into the MIME boundary. This is unnecessary (boundaries need to be unique only message-wide, and the email package can adjust the boundary to not conflict with message content, eg, Emacs/Gnus uses something like "-=-=-=-=-" by default), and I would hope that email avoids such practices when possible. 
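The three properties being distinguished in this thread can be checked separately. What follows is a minimal Python sketch, assuming a current Python 3 `email` package rather than the 2009 code under discussion. `parse`, `generate`, and `roundtrip` are hypothetical wrappers named after the pseudocode in the messages, and the sample message is invented for illustration.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def parse(text):
    """Hypothetical wrapper: wire text -> Message object."""
    return email.message_from_string(text)


def generate(msg):
    """Hypothetical wrapper: Message object -> wire text."""
    return msg.as_string()


def roundtrip(text):
    """The composite operation, which maps text to text."""
    return generate(parse(text))


# Invented sample: a simple, well-formed message.
wire = "From: a@example.com\nTo: b@example.com\nSubject: hi\n\nbody\n"

# Invertibility: generating what was parsed gives back the input.
invertible = roundtrip(wire) == wire

# Idempotence only makes sense for the composite, whose input and output
# have the same type: applying it twice equals applying it once.
idempotent = roundtrip(roundtrip(wire)) == roundtrip(wire)

# Determinism: generating a constructed message twice, without mutating it
# in between, gives the same text even though no boundary was set by hand.
built = MIMEMultipart()
built.attach(MIMEText("hello"))
deterministic = generate(built) == generate(built)

print(invertible, idempotent, deterministic)
```

Note that invertibility implies idempotence of the composite, but not the other way round, and that the determinism check says nothing about either: it concerns a message that was constructed, never parsed.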
From andrewm at object-craft.com.au Thu Oct 22 11:42:43 2009 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 22 Oct 2009 20:42:43 +1100 Subject: [Email-SIG] invertability and idempotence In-Reply-To: <87eiovdfjm.fsf@uwakimon.sk.tsukuba.ac.jp> References: <10506972.7161254576370614.JavaMail.root@boaz> <8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org> <87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp> <1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org> <20091019043906.542AE59C086@longblack.object-craft.com.au> <20091022045824.ABEBD600111@longblack.object-craft.com.au> <87eiovdfjm.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <20091022094243.41A25600111@longblack.object-craft.com.au> > > didn't want to introduce new terminology. But referring to this: > > > > > generate(parse(msg)) == msg > > > > as "idempotency" is perfectly valid in my opinion (as in, applying an > > operation multiple times produces the same result). > >That would be generate(generate(msg)) == generate(msg) or >parse(parse(email)) == parse(email). The input and output of >these functions are of *different types*, they cannot possibly be >idempotent. You're splitting hairs - the operation "generate(parse(X))" is idempotent, and that's what I was referring to. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From barry at python.org Thu Oct 22 13:36:12 2009 From: barry at python.org (Barry Warsaw) Date: Thu, 22 Oct 2009 07:36:12 -0400 Subject: [Email-SIG] invertability and idempotence In-Reply-To: References: <10506972.7161254576370614.JavaMail.root@boaz> <8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org> <87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp> <1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org> <20091019043906.542AE59C086@longblack.object-craft.com.au> Message-ID: On Oct 21, 2009, at 6:58 PM, R. 
David Murray wrote: > But it turns out that idempotence does have a meaning in the context > of the email module, so I think I need to remove 'depreciated' from > my glossary[1] entry for it, and explain what it means in the context > of the email module. I think you're onto something here. > For background, see issue 7119[2]. > > Here's what I propose: _invertability_ applies to the data path > into the parser and out of the generator. That is: > > generate(parse(msg)) == msg > > should be true whenever possible. Agreed, where 'msg' in this context means the message text or bytes. > On the other hand, when _constructing_ a message, sometimes not all > data > is filled in (in the example above, it is the MIME boundary marker). > In that case, it is important (I think, please discuss :) that > generating > the message maintain _idempotency_: once you have generated the > message, > then if you have not further mutated the message, generating the > message > again should produce the _same_ output. That is: > > generate(msg) == generate(msg) > > even though the state of msg may change after the _first_ generate > call. "Idempotent" means: "multiple applications of the operation do not change the result". So here where the operation is to take a message object and generate a stream of text or bytes, this should absolutely return the same stream if the object is not mutated between calls. I think it's fair though that if the model is manipulated in any way, we make no guarantees of idempotency, though we should strive for minimal differences. -Barry From stephen at xemacs.org Thu Oct 22 20:09:43 2009 From: stephen at xemacs.org (Stephen J.
Turnbull) Date: Fri, 23 Oct 2009 03:09:43 +0900 Subject: [Email-SIG] invertability and idempotence In-Reply-To: <20091022094243.41A25600111@longblack.object-craft.com.au> References: <10506972.7161254576370614.JavaMail.root@boaz> <8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org> <87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp> <1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org> <20091019043906.542AE59C086@longblack.object-craft.com.au> <20091022045824.ABEBD600111@longblack.object-craft.com.au> <87eiovdfjm.fsf@uwakimon.sk.tsukuba.ac.jp> <20091022094243.41A25600111@longblack.object-craft.com.au> Message-ID: <874oprcnbs.fsf@uwakimon.sk.tsukuba.ac.jp> Andrew McNamara writes: > > > didn't want to introduce new terminology. But referring to this: > > > > > > > generate(parse(msg)) == msg > > > > > > as "idempotency" is perfectly valid in my opinion (as in, applying an > > > operation multiple times produces the same result). > > > >That would be generate(generate(msg)) == generate(msg) or > >parse(parse(email)) == parse(email). The input and output of > >these functions are of *different types*, they cannot possibly be > >idempotent. > > You're splitting hairs - the operation "generate(parse(X))" is > idempotent, and that's what I was referring to. Yes and no. The equation above does imply idempotency, but it is a much stronger statement: generate(parse()) is the identity. That stronger statement could be useful in practice, but it could also be expensive to implement. That tension could engender flamewars if the requirement is expressed by the word "idempotency" but the intent is "identity". For example, suppose that for MIME multipart messages, generate() uses "$%$%$%$%$%$" as the separator as long as no component contains that string. Then generate(parse(msg)) will be *equivalent* but not *identical* to msg for most messages received from non-Python-email- using MUAs. generate(parse()) is idempotent, though. 
I don't think the folks who ask for "idempotency" would be satisfied with that! As I said earlier, if we're going to use the word "idempotent" to mean "invertible", that's established practice, so we footnote the Humpty-Dumpty-ism, and I can live with that. But if we're going to try to be more accurate, let's be fully accurate.
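The fixed-separator example can be made concrete with a toy model. The sketch below does not use the email package at all. The wire format (first line names the boundary, parts are separated by that boundary on its own line), the names `parse`, `generate`, and `roundtrip`, and the sample text are all assumptions made for illustration, and the check that no part contains the separator is left out.

```python
FIXED_BOUNDARY = "$%$%$%$%$%$"  # the separator used in the example above


def parse(text):
    """Toy parser: return the list of parts, forgetting the boundary used."""
    boundary, _, body = text.partition("\n")
    return body.split("\n" + boundary + "\n")


def generate(parts):
    """Toy generator: always emits FIXED_BOUNDARY (collision check omitted)."""
    return FIXED_BOUNDARY + "\n" + ("\n" + FIXED_BOUNDARY + "\n").join(parts)


def roundtrip(text):
    """generate(parse(text)), which maps wire text to wire text."""
    return generate(parse(text))


original = "XYZ\nhello\nXYZ\nworld"  # as if written by some other MUA
once = roundtrip(original)
twice = roundtrip(once)

print(once == original)                # False: the boundary was rewritten
print(twice == once)                   # True: the composite is idempotent
print(parse(once) == parse(original))  # True: the two texts are equivalent
```

In this toy, asking for "idempotency" is satisfied while the original bytes are lost on the first pass, which is the gap between the word and the intent described above.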