From tfarrell at owassobible.org  Sat Oct  3 15:26:10 2009
From: tfarrell at owassobible.org (Timothy Farrell)
Date: Sat, 3 Oct 2009 08:26:10 -0500 (CDT)
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <14505295.7141254576102143.JavaMail.root@boaz>
Message-ID: <10506972.7161254576370614.JavaMail.root@boaz>

Back in June, David Murray posted the message below about fixing the email module.  I have an interest in helping with this due to a personal project I'm working on.  However, my ability to help is severely limited by my understanding of email and MIME RFCs.

David asked whether passing strings to the feedparser is a needed behavior.  I don't claim to have enough knowledge to answer yes or no, but I would urge us all to consider that, if no answer shows up, David's patch should be put in for no better reason than that it's better than what we currently have.

David, if you would send it to me, I might be able to fix up some of the test cases.

Thanks,
-tim


--------
So, designing a new interface is one thing.  Making the current
interface usable in py3k is another.  I presume that the latter
is desirable?

I'm porting a small application that uses the email module to py3k.
I've run into two problems, one of which was already reported, the other
of which was not:

     http://bugs.python.org/issue4661
     http://bugs.python.org/issue6302

(Then there's the whole string issues relating to email and unicode
organized under Issue1685453, but I'm going to ignore those for the
moment.)

I'd like to try fixing these, but there are design issues involved.
The fundamental one is, what format should 'message' be handling message
data in?  4661 addresses this obliquely, and we've talked about this
somewhat at the higher design level.  But the question before me is,
how to fix feedparser, message, and decode_header so that I can actually
parse a message and display it correctly.

I need to be able to feed bytes to feedparser, that much is clear.
I've implemented a proof-of-concept fix that has feedparser handle all
its input as bytes, has message decode headers and values using the
ASCII codec if handled bytes, and has decode_header expect strings and
consistently return bytes.

With this fix in place my application works.  But of course, the
email module tests do not pass, and I don't know what other use
cases I have broken.

My specific question, as posted in issue4661, is: is there any
use case for passing strings to feedparser that is not a design
error waiting to trap the programmer?

--David

From barry at python.org  Sat Oct  3 16:36:51 2009
From: barry at python.org (Barry Warsaw)
Date: Sat, 3 Oct 2009 10:36:51 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <10506972.7161254576370614.JavaMail.root@boaz>
References: <10506972.7161254576370614.JavaMail.root@boaz>
Message-ID: <8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org>

On Oct 3, 2009, at 9:26 AM, Timothy Farrell wrote:

> Back in June, David Murray posted the message below about fixing the  
> email module.  I have an interest in helping with this due to a  
> personal project I'm working on.  However, my ability to help is  
> severely limited by my understanding of email and MIME RFCs.
>
> David asked the question of whether or not passing strings to the  
> feedparser is a needed behavior.  I don't claim to have enough  
> knowledge to answer the question yes or no, but I would urge us all  
> to consider that if no answer shows up that David's patch should be  
> put in for no better reason than that it's better than what we  
> currently have.

I expect RDM to have some follow ups soon, but I'll put this forward  
in the meantime.

I firmly believe we need parallel feedparser APIs, one for feeding it  
strings and one for feeding it bytes.  In all the tentative attempts  
at Python3-ification I've done I just keep coming back to that  
assessment.  I don't think it's a terrible burden either since I also  
firmly believe that /internally/ the email package should be bytes- 
oriented.  So the basic model is: accept strings or bytes at the  
edges, process everything internally as bytes, output strings and  
bytes at the edges.
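
A minimal sketch of what "parallel feedparser APIs" can look like.  It
uses the FeedParser and BytesFeedParser classes found in the current
Python 3 standard library, which postdate this thread; it is an
illustration of the model, not the design being proposed here.  The
sample message and the helper names parse_bytes/parse_text are made up.

```python
from email.feedparser import BytesFeedParser, FeedParser

# Hypothetical sample message in wire format.
RAW = b"From: a@example.com\r\nSubject: hi\r\n\r\nbody\r\n"


def parse_bytes(chunks):
    """Feed bytes chunks (e.g. straight off a socket) to the bytes parser."""
    parser = BytesFeedParser()
    for chunk in chunks:
        parser.feed(chunk)
    return parser.close()


def parse_text(chunks):
    """Feed str chunks (already decoded by the caller) to the str parser."""
    parser = FeedParser()
    for chunk in chunks:
        parser.feed(chunk)
    return parser.close()


# Chunks may split anywhere, even in the middle of a header line.
msg_b = parse_bytes([RAW[:10], RAW[10:]])
msg_s = parse_text([RAW.decode("ascii")])

# Output at the edges: the same message object can be rendered either way.
as_text = msg_b.as_string()
as_wire = msg_b.as_bytes()
```

Both entry points produce the same kind of message object, so the choice
of input type does not leak into the rest of the application.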

-Barry


From stephen at xemacs.org  Sat Oct  3 17:41:48 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sun, 04 Oct 2009 00:41:48 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org>
References: <10506972.7161254576370614.JavaMail.root@boaz>
	<8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org>
Message-ID: <87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:

 > So the basic model is: accept strings or bytes at the edges,
 > process everything internally as bytes, output strings and bytes at
 > the edges.

In a certain pedantic sense, that can't be right, because bytes alone
can't represent strings.

Practically, you are going to need to say how a bytes or bytearray is to
be interpreted as a string, and that is going to be one big mess.
(MIME?)

Going the other way around you have no such problem, or rather the
trivial embedding works fine, except that you have to do a range check
at some point before you convert to bytes.
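
A small sketch of the "trivial embedding" and the range check, assuming
the embedding meant here is the one-to-one mapping of byte values 0-255
onto code points U+0000-U+00FF (the latin-1 codec).  The function names
are hypothetical.

```python
def bytes_to_str(data: bytes) -> str:
    """Embed bytes in a str: byte value N becomes code point N."""
    return data.decode("latin-1")


def str_to_bytes(text: str) -> bytes:
    """Reverse the embedding, checking the range first."""
    for index, char in enumerate(text):
        if ord(char) > 0xFF:
            raise ValueError(
                "code point U+%04X at index %d does not fit in one byte"
                % (ord(char), index)
            )
    return text.encode("latin-1")


every_byte = bytes(range(256))
round_tripped = str_to_bytes(bytes_to_str(every_byte))
```

The bytes-to-str direction can never fail, which is the asymmetry being
pointed out: only the conversion back to bytes needs a check.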


From tfarrell at owassobible.org  Sat Oct  3 19:09:55 2009
From: tfarrell at owassobible.org (Timothy Farrell)
Date: Sat, 3 Oct 2009 12:09:55 -0500 (CDT)
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <25336492.7211254589432905.JavaMail.root@boaz>
Message-ID: <8510262.7231254589795083.JavaMail.root@boaz>

I agree with Barry insofar as accepting bytes or strings on the input with internal processing in bytes and output bytes or strings depending on the content parsed.

Forgive my ignorance...why does converting bytes to strings have to be a mess?  Rather than having two Feedparsers, can't we just pass a default encoding when instantiating a feedparser and have it read from the MIME headers otherwise?  If no encoding is passed and one can't be determined, simply output as bytes or try a default and raise an exception if it fails.

If providing the default encoding, no such range check is needed.
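
One way to read the suggestion above, as a rough sketch: take the
charset from the part's MIME headers, fall back to a caller-supplied
default, and hand back bytes when neither is available.  The helper
name decode_text and its default_charset parameter are hypothetical;
get_payload(decode=True) and get_content_charset() are existing methods
of the standard library's Message class.

```python
import email


def decode_text(part, default_charset=None):
    """Return the part's body as str if a charset is known, else as bytes.

    Raises UnicodeDecodeError (or LookupError for an unknown charset
    name) if the chosen charset does not fit the data.
    """
    raw = part.get_payload(decode=True)  # undoes base64/quoted-printable
    charset = part.get_content_charset() or default_charset
    if charset is None:
        return raw
    return raw.decode(charset)


# Hypothetical sample parts: one declares its charset, one does not.
declared = email.message_from_bytes(
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xc3\xa9\r\n"
)
undeclared = email.message_from_bytes(b"Subject: x\r\n\r\ncaf\xe9\r\n")
```

Note that this only covers a single non-multipart body; headers and
multipart containers need separate handling.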

----- Original Message -----
From: "Stephen J. Turnbull" <stephen at xemacs.org>
To: "Barry Warsaw" <barry at python.org>
Cc: "Timothy Farrell" <tfarrell at owassobible.org>, email-sig at python.org
Sent: Saturday, October 3, 2009 10:41:48 AM GMT -06:00 US/Canada Central
Subject: Re: [Email-SIG] fixing the current email module

Barry Warsaw writes:

 > So the basic model is: accept strings or bytes at the edges,
 > process everything internally as bytes, output strings and bytes at
 > the edges.

In a certain pedantic sense, that can't be right, because bytes alone
can't represent strings.

Practically, you are going to need to say how a bytes or bytearray is to
be interpreted as a string, and that is going to be one big mess.
(MIME?)

Going the other way around you have no such problem, or rather the
trivial embedding works fine, except that you have to do a range check
at some point before you convert to bytes.


From v+python at g.nevcal.com  Tue Oct  6 11:28:41 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Tue, 06 Oct 2009 02:28:41 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <8510262.7231254589795083.JavaMail.root@boaz>
References: <8510262.7231254589795083.JavaMail.root@boaz>
Message-ID: <4ACB0DC9.7080307@g.nevcal.com>

On approximately 10/3/2009 10:09 AM, came the following characters from 
the keyboard of Timothy Farrell:
> I agree with Barry insofar as accepting bytes or strings on the input with internal processing in bytes and output bytes or strings depending on the content parsed.
>
> Forgive my ignorance...why does converting bytes to strings have to be a mess?  Rather than having two Feedparsers, can't we just pass a default encoding when instantiating a feedparser and have it read from the MIME headers otherwise?  If no encoding is passed and one can't be determined, simply output as bytes or try a default and raise an exception if it fails.
>
> If providing the default encoding, no such range check is needed.
>
> ----- Original Message -----
> From: "Stephen J. Turnbull" <stephen at xemacs.org>
> To: "Barry Warsaw" <barry at python.org>
> Cc: "Timothy Farrell" <tfarrell at owassobible.org>, email-sig at python.org
> Sent: Saturday, October 3, 2009 10:41:48 AM GMT -06:00 US/Canada Central
> Subject: Re: [Email-SIG] fixing the current email module
>
> Barry Warsaw writes:
>
>  > So the basic model is: accept strings or bytes at the edges,
>  > process everything internally as bytes, output strings and bytes at
>  > the edges.
>
> In a certain pedantic sense, that can't be right, because bytes alone
> can't represent strings.
>
> Practically, you are going to need to say how a bytes or bytearray is to
> be interpreted as a string, and that is going to be one big mess.
> (MIME?)
>
> Going the other way around you have no such problem, or rather the
> trivial embedding works fine, except that you have to do a range check
> at some point before you convert to bytes.

Email messages are bytes.  Usually restricted to bytes in the range 
32-127, but sometimes permitted to be 0-255 (8bit encoding).

Email messages carry sufficient information to convert bytes to strings 
(usually; and sufficient defaults to cover the other cases adequately, 
even if not with 100% certainty).

So if Barry is considering that the internal form is bytes, particularly 
bytes encoded via email RFCs, then I can't argue with that being a 
reasonable internal form.... except for one problem, 2 paragraphs below.

The only mess that I can see Stephen referring to is the fact that the 
email RFCs define rather messy encoding formats and character set 
specifications.  There isn't much cure for this, AFAICS, other than 
perhaps keeping the bytes in segmented structures, with cached metadata 
to speed repeated references.  Using any other format than email format, 
means knowing how to translate that format to/from email format, and 
to/from API format... this means coding two translation routines instead 
of one.

The choice of email RFC byte formats for the internal form makes it 
quick and easy to produce a complete message when called for, and to 
defer interpretation when a message is fed in.... sometimes, and herein 
lies the catch....

One problem with storing messages in bytes format: it seems to me that 
the choice of which of several legal email bytes formats to represent 
various email parts (texts and attachments) is problematical for using 
email format bytes as the internal storage format.  An unsophisticated 
email library could assume that the transfer encoding is always 7bit, 
and that should be acceptable in all circumstances.  A more 
sophisticated email library would provide support for either 7bit or 
8bit transfer encodings.... but the choice of the bytes formats, and 
MIME type encodings of various message parts to support that difference 
would be significant.  It seems that the present email lib provides a 
way to create only a 7bit or 8bit message (and apparently not binary 
encoding), meaning that the whole message assembly process has to be 
done after initiating a connection with the SMTP server, to determine 
whether it supports 8bit (or binary) encoding or not.  A more abstract 
internal format could defer that choice to the generate step, keeping 
items as str or binary blobs prior to that step.

IIUC, 7bit requires that text and binary be encoded to remove 
"difficult" byte values from the byte stream, so choosing quopri or 
base64 is appropriate at MIME part definition time to make that choice 
(although an optimal sized choice could be made based on the data), in 
the event that generate requests 7bit.

However, 8bit has no such requirement, it declares that there are no 
difficult characters except NULL, CR and LF.  However, because no 8bit 
encodings are defined, the (inefficient, 7-bit) quopri or base64 may 
still have to be used to avoid lines that are too long, and to encode 
NULL, CR and LF.  8 bit and UTF-8 text containing no NULL characters and 
no long lines would qualify without encoding.

Finally, binary declares that there are no difficult characters at all.  
Therefore, the quopri or base64 choice could be ignored, and the raw 
data passed through.
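
The three cases above can be sketched as a classifier that reports the
narrowest transfer encoding a block of data already satisfies without
any quopri or base64 step.  The function name is hypothetical, and the
rules are my reading of RFC 2045 (no NUL, CR and LF only as CRLF line
breaks, lines of at most 998 octets for 7bit and 8bit), so treat it as
an approximation.

```python
MAX_LINE = 998  # octets per line, not counting the trailing CRLF


def narrowest_transfer_encoding(data: bytes) -> str:
    """Return "7bit", "8bit" or "binary" for unencoded data."""
    if b"\x00" in data:
        return "binary"
    for line in data.split(b"\r\n"):
        # After splitting on CRLF, any remaining CR or LF is a bare one.
        if len(line) > MAX_LINE or b"\r" in line or b"\n" in line:
            return "binary"
    if all(octet < 0x80 for octet in data):
        return "7bit"
    return "8bit"
```

Data classified as "binary" here must be encoded with quopri or base64
before it can travel over a 7bit or 8bit transport.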

Choosing a particular Content-Transfer-Encoding as the internal storage 
format forces transcoding to the other Content-Transfer-Encoding values 
on the fly after connecting to the SMTP server (using an apparently 
non-existent parameter to the generate method); not supporting 
on-the-fly transcoding would force the user to choose a particular 
Content-Transfer-Encoding up front, requiring the connection to the 
SMTP server even earlier in the process.

I observe that most of my SMTP providers do not support binary 
transport, but it seems that MS Exchange does.

I observe that binary transport is more efficient than 7bit or 8bit.

I observe that even with binary transport, the MIME headers must still 
be in US-ASCII, by definition, so the headers need not be generated 
differently for different transports... only the 
Content-Transfer-Encoding, and the content itself, would be affected by 
deferring that choice to generate time.

Perhaps binary transport, with meta-data indicating whether the user 
prefers quopri or base64 for parts that must be encoded for 7bit or 8bit 
transport, would be an appropriate storage format for the email 
library.  This would allow the quopri or base64 encodings to be 
performed on-the-fly, only if needed, by adding a new parameter to 
generate, that specifies the Content-Transfer-Encoding (which should 
default to 7bit for maximal server compatibility, or 8bit if the user 
specified that along the way so that backwards compatibility is preserved).
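
A rough sketch of that idea: keep the raw data, and pick the encoding
only when the transport's capability is known.  Everything named here
(encode_for_transport, the transport and preferred parameters) is
hypothetical and is not a parameter of the existing generate machinery;
only the base64 and quopri modules are real.

```python
import base64
import quopri

_RANK = {"7bit": 0, "8bit": 1, "binary": 2}


def _needs(data: bytes) -> str:
    """Narrowest transport the unencoded data could use as-is."""
    lines = data.split(b"\r\n")
    if b"\x00" in data or any(
        len(line) > 998 or b"\r" in line or b"\n" in line for line in lines
    ):
        return "binary"
    return "7bit" if all(octet < 0x80 for octet in data) else "8bit"


def encode_for_transport(data, transport="7bit", preferred="base64"):
    """Return (content_transfer_encoding, body) for the given transport."""
    if transport not in _RANK:
        raise ValueError("unknown transport: %r" % (transport,))
    needed = _needs(data)
    if _RANK[needed] <= _RANK[transport]:
        return needed, data  # raw data already fits, pass it through
    if preferred == "quoted-printable":
        return preferred, quopri.encodestring(data)
    if preferred == "base64":
        return preferred, base64.encodebytes(data)
    raise ValueError("unknown preferred encoding: %r" % (preferred,))
```

With this shape the message can be assembled before the SMTP connection
exists; only the final call needs to know what the server supports.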


N.B.  I note that the documentation for 2.6.3 section 19.1.3 MIMEText 
class (reproduced below) is confusing:

/class /email.mime.text.MIMEText(/_text/[, /_subtype/[, /_charset/]])
<http://docs.python.org/library/email.mime.html#email.mime.text.MIMEText>

    Module: email.mime.text

    A subclass of MIMENonMultipart
    <http://docs.python.org/library/email.mime.html#email.mime.nonmultipart.MIMENonMultipart>,
    the MIMEText
    <http://docs.python.org/library/email.mime.html#email.mime.text.MIMEText>
    class is used to create MIME objects of major type /text/. /_text/
    is the string for the payload. /_subtype/ is the minor type and
    defaults to /plain/. /_charset/ is the character set of the text and
    is passed as a parameter to the MIMENonMultipart
    <http://docs.python.org/library/email.mime.html#email.mime.nonmultipart.MIMENonMultipart>
    constructor; it defaults to us-ascii. No guessing or encoding is
    performed on the text data.

    Changed in version 2.4: The previously deprecated /_encoding/
    argument has been removed. Encoding happens implicitly based on the
    /_charset/ argument.


The confusion is that it states there is no encoding performed, and then 
it states that encoding is implicit.  It is not clear what it actually 
does, if anything.  The 3.2a0 documentation further muddies the water by 
removing the last paragraph.
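
One way to find out what MIMEText actually does is to ask the resulting
object.  The sketch below is written against current Python 3 releases,
not the 2.6.3 version whose documentation is quoted above, so the
behavior it reveals may differ from what that documentation described.

```python
from email.mime.text import MIMEText

# Default charset, pure ASCII text.
ascii_part = MIMEText("hello")

# Explicit charset, non-ASCII text.
utf8_part = MIMEText("h\u00e9llo", "plain", "utf-8")

for part in (ascii_part, utf8_part):
    print(part.get_content_charset(), part["Content-Transfer-Encoding"])

# The original text is recoverable by undoing the transfer encoding
# and then the charset encoding.
recovered = utf8_part.get_payload(decode=True).decode("utf-8")
```

In the releases I have checked, the charset argument does select a
transfer encoding implicitly, which matches the "Changed in version
2.4" note rather than the "no ... encoding is performed" sentence.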


-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From stephen at xemacs.org  Tue Oct  6 16:18:03 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 06 Oct 2009 23:18:03 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACB0DC9.7080307@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
Message-ID: <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>

In the following I use Python 3 terminology: strings are Python
Unicode objects, and bytes are Python bytes objects.

Glenn Linderman writes:

 > Email messages are bytes.  Usually restricted to bytes in the range 
 > 32-127, but sometimes permitted to be 0-255 (8bit encoding).

This is irrelevant to our internal representation.  It is both trivial
and efficient to convert the wire format (bytes) to a string
internally (at least for email messages up to say 5MB).

Which internal representation makes the most sense depends on what we
are going to do with that internal representation.  At this point I'm
not sure that strings are better than bytes, but I'm quite sure that
I've seen no convincing argument that bytes are TOOWTDI.

Nor is it at all obvious to me that messages should be stored in wire format.

 > Using any other format than email format, means knowing how to
 > translate that format to/from email format, and to/from API
 > format... this means coding two translation routines instead of
 > one.

That sounds reasonable, but it's a false economy.  The formats you're
talking about here are the transfer encodings, and we need to be able
to decode all of them, and produce all of them.  Internally, they can
be represented by a single format, so you need internal-to-transfer
and transfer-to-internal for about six of them (7bit, 8bit, binary ==
Python bytes, BASE64, quoted-printable, Python string)

As for runtime economy, if conversion is done once at parse time and
once at generate time it is not a big burden, not as compared to the
overhead of the Python language itself.

 > The choice of email RFC byte formats

By "byte format", do you mean "wire format"?

 > for the internal form makes it quick and easy to produce a complete
 > message when called for,

Only for certain kinds of messages, such as automated forwards and
signed MIME parts, and cron's messages.  For those, there are great
advantages to spewing things verbatim as you got them off the wire or
the disk.  But even there, as long as we use the natural embedding of
bytes in Unicode (ie, interpret bytes as ISO 8859/1) it's easy and not
particularly inefficient to use strings.

For anything else, storing in wire format is going to require checking
format (of the stored data if the format is variable, and always of
the requesting API) on all attribute accesses, and conversions on
many, even most attribute accesses.

 > One problem with storing messages in bytes format: it seems to me that 
 > the choice of which of several legal email bytes formats

None of them are very happy.  The email module needs to be able to
both read and produce all of 7bit, 8bit, and binary, and they are in
fact pretty well trivial to do.

So the question to me is "what are the primary use cases for the email
module, and how do they affect the choice of internal representation?"
I can't claim special expertise on "how", I'll leave that up to
Barry.  Here are some use cases I can think of.

1.  Debugging programs using the email module.  Maybe that's a +1 for
    internally storing textual data in string form.

2.  MUA #1: Composition.  Input will be strings and multimedia file
    names, output will be bytes.  Will attributes of message objects
    be manipulated?  Not in a conventional MUA, but an email-based MUA
    might find uses for that.

3.  MUA #2: Reading.  Input will often be bytes (spool files, IMAP
    data).  Could be strings, though, depending on the internal format
    of folders.  Output will be strings and multimedia objects.  Lots
    of string processing, especially generating folder directory
    displays from message headers.

4.  Mailing list processor.  Message input will be bytes.
    Configuration input, including heading and footer texts that may
    be added are likely to be strings.  Header manipulation (adding
    topics, sequence numbers, RFC 2369 headers) most conveniently done
    with strings.  Output will be bytes.

5.  Mailing list archiver.  Input will be bytes or message objects,
    output will be strings (typically HTML documents or XML
    fragments).

6.  Spam/virus detection.  Input may be bytes or message objects.
    Lots of internal string processing; in most cases the text/* parts
    need to be converted to strings before grepping; in some cases
    even images or executables may be reconstituted to look for
    malware signatures.  Output may be a flag or signal, or the
    message itself may be edited (typically to provide headers
    recording degree of spamminess, trace headers, maybe a body
    heading; in some cases, a new message may be generated with the
    suspected spam as a message/rfc822 MIME body part).
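
Several of the use cases above (folder displays, list tagging, spam
headers) come down to editing headers as strings.  A minimal sketch
with the standard email.header helpers follows; the wrapper names
header_to_str and str_to_header and the "[list]" tag are hypothetical.

```python
from email.header import Header, decode_header, make_header


def header_to_str(raw_value):
    """Decode an RFC 2047 encoded header value into a plain str."""
    return str(make_header(decode_header(raw_value)))


def str_to_header(text, charset="utf-8"):
    """Encode a str back into a header value safe for the wire."""
    return Header(text, charset).encode()


# Mailing-list style manipulation: decode, edit as text, re-encode.
incoming = "=?utf-8?q?caf=C3=A9?="
subject = header_to_str(incoming)
tagged = str_to_header("[list] " + subject)
```

The string form is only needed while the header is being edited; what
goes back into the message is the re-encoded value.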


From v+python at g.nevcal.com  Tue Oct  6 21:14:37 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Tue, 06 Oct 2009 12:14:37 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <4ACB971D.9080706@g.nevcal.com>

On approximately 10/6/2009 7:18 AM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> In the following I use Python 3 terminology: strings are Python
> Unicode objects, and bytes are Python bytes objects.
>
> Glenn Linderman writes:
>
>  > Email messages are bytes.  Usually restricted to bytes in the range 
>  > 32-127, but sometimes permitted to be 0-255 (8bit encoding).
>
> This is irrelevant to our internal representation.  It is both trivial
> and efficient to convert the wire format (bytes) to a string
> internally (at least for email messages up to say 5MB).
>
> Which internal representation makes the most sense depends on what we
> are going to do with that internal representation.  At this point I'm
> not sure that strings are better than bytes, but I'm quite sure that
> I've seen no convincing argument that bytes are TOOWTDI.
>
> Nor is it at all obvious to me that messages should be stored in wire format.
>   

Yes, I interpreted, possibly misinterpreted, Barry's comment about 
storing things as bytes, as that he was figuring to store them in wire 
format.


>  > Using any other format than email format, means knowing how to
>  > translate that format to/from email format, and to/from API
>  > format... this means coding two translation routines instead of
>  > one.
>
> That sounds reasonable, but it's a false economy.  

And this was actually the point I was trying to make.


> The formats you're
> talking about here are the transfer encodings, and we need to be able
> to decode all of them, and produce all of them.  Internally, they can
> be represented by a single format, so you need internal-to-transfer
> and transfer-to-internal for about six of them (7bit, 8bit, binary ==
> Python bytes, BASE64, quoted-printable, Python string)
>   

Not all formats apply to all MIME types, but I think you've enumerated 
the list.

> As for runtime economy, if conversion is done once at parse time and
> once at generate time it is not a big burden, not as compared to the
> overhead of the Python language itself.
>   

I would tend to agree with that, except that if something is 
received/provided in a particular format, it might want to stay in that 
format until such time it is needed in a different format... and then 
the appropriate set of conversions (current format => internal format => 
needed format) applied as needed, avoiding all conversions when it is 
already in the needed format.

>  > The choice of email RFC byte formats
>
> By "byte format", do you mean "wire format"?
>   

Sure, RFC byte formats == wire format.

>  > for the internal form makes it quick and easy to produce a complete
>  > message when called for,
>
> Only for certain kinds of messages, such as automated forwards and
> signed MIME parts, and cron's messages.  For those, there are great
> advantages to spewing things verbatim as you got them off the wire or
> the disk.  But even there, as long as we use the natural embedding of
> bytes in Unicode (ie, interpret bytes as ISO 8859/1) it's easy and not
> particularly inefficient to use strings.
>   

Two conversions are slower than none, and use 2-4 times the space in 
string format.

> For anything else, storing in wire format is going to require checking
> format (of the stored data if the format is variable, and always of
> the requesting API) on all attribute accesses, and conversions on
> many, even most attribute accesses.
>   

One has to write the conversion code anyway; it is just a matter of 
where it is called.  Once converted, meta data could be retained in its 
natural format.

>  > One problem with storing messages in bytes format: it seems to me that 
>  > the choice of which of several legal email bytes formats
>
> None of them are very happy.  The email module needs to be able to
> both read and produce all of 7bit, 8bit, and binary, and they are in
> fact pretty well trivial to do.
>
> So the question to me is "what are the primary use cases for the email
> module, and how do they affect the choice of internal representation?"
> I can't claim special expertise on "how", I'll leave that up to
> Barry.  Here are some use cases I can think of.
>   

Yes this is a good question.


> 1.  Debugging programs using the email module.  Maybe that's a +1 for
>     internally storing textual data in string form.
>
> 2.  MUA #1: Composition.  Input will be strings and multimedia file
>     names, output will be bytes.  Will attributes of message objects
>     be manipulated?  Not in a conventional MUA, but an email-based MUA
>     might find uses for that.
>   

I'm not sure what an email-based MUA is.... seems to me even a 
conventional MUA is "email-based"???


> 3.  MUA #2: Reading.  Input will often be bytes (spool files, IMAP
>     data).  Could be strings, though, depending on the internal format
>     of folders.  Output will be strings and multimedia objects.  Lots
>     of string processing, especially generating folder directory
>     displays from message headers.
>
> 4.  Mailing list processor.  Message input will be bytes.
>     Configuration input, including heading and footer texts that may
>     be added are likely to be strings.  Header manipulation (adding
>     topics, sequence numbers, RFC 2369 headers) most conveniently done
>     with strings.  Output will be bytes.
>   

But the bulk of the message parts, received in wire format, may not need 
to be altered to be sent along in the same wire format.  Headers must be 
manipulated somehow, I'd think it would be convenient as strings too.  
Heading and footing texts are configured boilerplate, and could be 
cached in a variety of formats to avoid the need to convert them for 
each message, and could then be obtained from the cache in the 
appropriate format for this particular message, and prepended or 
appended as appropriate.

> 5.  Mailing list archiver.  Input will be bytes or message objects,
>     output will be strings (typically HTML documents or XML
>     fragments).
>   

An archiver could archive wire format, and do the conversions to *ML on 
the fly for those messages that might be accessed that way.  Depends on 
the expectation of the usage of the archiver... to retrieve the archived 
messages via email, wire format could be extremely efficient; to 
retrieve via HTTP, one should note that there is very little difference 
between .eml format (another name for wire format) and .mhtml format 
(which is a format IE and Opera will display natively, support in other 
browsers varies, mostly via addons and conversion utilities).  So I'm 
not at all sure that this use case requires string output, although some 
implementations might prefer it.

> 6.  Spam/virus detection.  Input may be bytes or message objects.
>     Lots of internal string processing; in most cases the text/* parts
>     need to be converted to strings before grepping; in some cases
>     even images or executables may be reconstituted to look for
>     malware signatures.  Output may be a flag or signal, or the
>     message itself may be edited (typically to provide headers
>     recording degree of spamminess, trace headers, maybe a body
>     heading; in some cases, a new message may be generated with the
>     suspected spam as a message/rfc822 MIME body part).
>   


So it seems to me that storing the data in the format provided, and 
converting it to native format when requested and caching that result, 
and then when generating wire format, if the needed format was not 
provided or cached, then converting as necessary, would be optimal to 
minimize conversion (time) costs.  This technique would also maximally 
preserve the original format for use cases 3 and 5, which, for use case 
3, at least, seems to be important to this list from past discussion.  
To minimize memory (space) costs, the caching could be avoided (causing 
reconversion costs), or, at the expense of not preserving the original 
format, once converted, retain only the native format of the item (which 
is generally the smallest, for binary objects, and which is most easily 
manipulated, but not necessarily smallest, for text objects).

So I'd design the internal format with meta data like

MIMEpart
    formatFlag
    metaData
    7bitData
    8bitData
    binaryData
    nativeText
    nativeBLOB

where the metaData would consist of a variety of pertinent items, 
obtained by decoding provided wireData or supplied along with provided 
nativeData.

Generate could use 7bitData, 8bitData, or binaryData directly if it 
exists, or cache it there if it didn't already exist.

binaryData would differ from nativeBLOB only by containing the 
appropriate MIMEheaders... perhaps as a space optimization, it would 
contain only the appropriate MIMEheaders, with the binaryData being 
placed in nativeBLOB directly (since this is not a costly conversion, 
just a choice of where to store the bytes).

It could also be possible that a complete, provided, wire format message 
would be retained as a single BLOB, and the appropriate format data 
items simply be offsets and lengths within that BLOB, although with 
cached metaData.
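
A very reduced sketch of the store-as-provided, convert-on-demand idea
described above.  The class name LazyPart and its methods are
hypothetical, it handles only "base64" and "binary" forms, and it
leaves out the metaData and MIME headers entirely.

```python
import base64


class LazyPart:
    """Keep whichever form was supplied; convert and cache on request."""

    def __init__(self, *, native=None, wire=None, cte=None):
        if (native is None) == (wire is None):
            raise ValueError("supply exactly one of native or wire")
        if wire is not None and cte is None:
            raise ValueError("wire data needs its transfer encoding")
        self._native = native  # decoded bytes, or None until requested
        self._wire = {}        # transfer encoding name -> encoded bytes
        if wire is not None:
            self._wire[cte] = wire

    def native(self):
        """Return the decoded data, decoding at most once."""
        if self._native is None:
            cte, data = next(iter(self._wire.items()))
            if cte == "base64":
                self._native = base64.decodebytes(data)
            else:
                self._native = data
        return self._native

    def wire(self, cte):
        """Return data in the requested encoding, reusing cached forms."""
        if cte not in self._wire:
            raw = self.native()
            self._wire[cte] = (
                base64.encodebytes(raw) if cte == "base64" else raw
            )
        return self._wire[cte]


received = LazyPart(wire=b"aGVsbG8=\n", cte="base64")
composed = LazyPart(native=b"\x00\x01")
```

A part that arrives as base64 and leaves as base64 is never converted,
which is the saving being argued for.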

Of course, there is already a design within the existing code, and the 
cost of wholesale redesign may be more than can be afforded.



From stephen at xemacs.org  Wed Oct  7 02:30:25 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Wed, 07 Oct 2009 09:30:25 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACB971D.9080706@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
Message-ID: <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>

Glenn Linderman writes:

 > Yes, I interpreted, possibly misinterpreted, Barry's comment about 
 > storing things as bytes, as that he was figuring to store them in wire 
 > format.

What that means is unclear, though.  Does a "header in wire format"
mean before or after MIME encoding?  Probably after, but that's pretty
useless for the purpose of editing the header.  Does it include the
tag (the part before the colon) or not?  Etc.

 > I would tend to agree with that, except that if something is 
 > received/provided in a particular format, it might want to stay in that 
 > format until such time it is needed in a different format... and then 
 > the appropriate set of conversions (current format => internal format => 
 > needed format) applied as needed, avoiding all conversions when it is 
 > already in the needed format.

If you mean that the email module will keep track of what form the
object is currently represented by, that will eventually result in
"UnicodeError: octet out of range: 161, ascii".

 > two conversions are slower than none, and use 2-4 times the space in 
 > string format.

Let's get this correct, *then* optimize, please.

 > One has to write the conversion code anyway; it is just a matter of 
 > where it is called.  Once converted, meta data could be retained in its 
 > natural format.

Meta data for what?  Why would you convert meta data?

 > > 2.  MUA #1: Composition.  Input will be strings and multimedia file
 > >     names, output will be bytes.  Will attributes of message objects
 > >     be manipulated?  Not in a conventional MUA, but an email-based MUA
 > >     might find uses for that.
 > 
 > I'm not sure what an email-based MUA is.... seems to me even a 
 > conventional MUA is "email-based"???

Only if it's written using the Python email module.

 > > 4.  Mailing list processor.  Message input will be bytes.
 > >     Configuration input, including heading and footer texts that may
 > >     be added are likely to be strings.  Header manipulation (adding
 > >     topics, sequence numbers, RFC 2369 headers) most conveniently done
 > >     with strings.  Output will be bytes.
 > >   
 > 
 > But the bulk of the message parts, received in wire format, may not need 
 > to be altered to be sent along in the same wire format.

That depends.  For example, multimedia parts may simply be discarded,
in which case it makes sense to not convert them.  However, most
Mailman lists do add a footer, and because of crappy Windows MUAs that
don't implement MIME correctly, it's preferred to add that by
concatenating as text.  That simply cannot be done correctly in wire
format for any character set except ISO 8859/1.

 > Heading and footing texts are configured boilerplate, and could be 
 > cached in a variety of formats to avoid the need to convert them for 
 > each message,

Premature optimization is the root of all error.

 > An archiver could archive wire format,

Are you suggesting that the email module should mandate that?  We have
a severe tail-dog inversion problem here.


From janssen at parc.com  Wed Oct  7 03:34:32 2009
From: janssen at parc.com (Bill Janssen)
Date: Tue, 6 Oct 2009 18:34:32 PDT
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <10506972.7161254576370614.JavaMail.root@boaz>
References: <10506972.7161254576370614.JavaMail.root@boaz>
Message-ID: <7054.1254879272@parc.com>

Timothy Farrell <tfarrell at owassobible.org> wrote:

> Back in June, David Murray posted the message below about fixing the
> email module.  I have an interest in helping with this due to a
> personal project I'm working on.  However, my ability to help is
> severely limited by my understanding of email and MIME RFCs.

Tim, familiarity with email and MIME RFCs would be a big help if you
want to help with the email module.  Even for writing test cases.

Bill

From v+python at g.nevcal.com  Wed Oct  7 04:52:39 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Tue, 06 Oct 2009 19:52:39 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>	<4ACB0DC9.7080307@g.nevcal.com>	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <4ACC0277.2060807@g.nevcal.com>

On approximately 10/6/2009 5:30 PM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> Glenn Linderman writes:
>
>  > Yes, I interpreted, possibly misinterpreted, Barry's comment about 
>  > storing things as bytes, as that he was figuring to store them in wire 
>  > format.
>
> What that means is unclear, though.  Does a "header in wire format"
> mean before or after MIME encoding?  Probably after, but that's pretty
> useless for the purpose of editing the header.  Does it include the
> tag (the part before the colon) or not?  Etc.
>
>  > I would tend to agree with that, except that if something is 
>  > received/provided in a particular format, it might want to stay in that 
>  > format until such time it is needed in a different format... and then 
>  > the appropriate set of conversions (current format => internal format => 
>  > needed format) applied as needed, avoiding all conversions when it is 
>  > already in the needed format.
>
> If you mean that the email module will keep track of what form the
> object is currently represented by, that will eventually result in
> "UnicodeError: octet out of range: 161, ascii".
>   

The above sentence does not communicate your meaning to me... or any 
meaning, actually.  Can you explain?
If conversions are avoided, then octets are unlikely to be out of 
range?  And the email module must be aware of the form of the data in 
order to manipulate it in any format other than wire format, but 
fortunately, wire format declares the format of the data (not to say 
there is not buggy wire format data -- but that is an issue best avoided 
by avoiding as many conversions as possible).

>  > two conversions are slower than none, and use 2-4 times the space in 
>  > string format.
>
> Let's get this correct, *then* optimize, please.
>   

That's a nice platitude... I could have used it on you when you said
> As for runtime economy, if conversion is done once at parse time and
> once at generate time it is not a big burden, not as compared to the
> overhead of the Python language itself.
but I didn't.  You can't design things totally ignoring the reality of 
time and space performance, and expect to get an efficient result.  I 
agree one can spend too much time on premature optimization issues, and 
I have that tendency, but if you totally ignore time and space issues, 
you wind up with Vista.


>  > One has to write the conversion code anyway; it is just a matter of 
>  > where it is called.  Once converted, meta data could be retained in its 
>  > natural format.
>
> Meta data for what?  Why would you convert meta data?
>   

Meta data for the email message... how many MIME parts, their 
Content-Types, etc.  This is small amounts of data, but reasonably 
likely to be referenced multiple times during the message parsing or 
creation and generation process.  So once it is converted from wire 
format, it should be kept in a useful format, as well as wire format.


>  > > 2.  MUA #1: Composition.  Input will be strings and multimedia file
>  > >     names, output will be bytes.  Will attributes of message objects
>  > >     be manipulated?  Not in a conventional MUA, but an email-based MUA
>  > >     might find uses for that.
>  > 
>  > I'm not sure what an email-based MUA is.... seems to me even a 
>  > conventional MUA is "email-based"???
>
> Only if it's written using the Python email module.
>   

Um.  Aren't we talking about use cases for the Python email module?  I 
was trying to interpret what you were saying in that light.  Sure, what 
a conventional (not written using the Python email module) MUA does, is 
mostly irrelevant, except so far as it shows use cases that might be 
applied to email-based (written using the Python email module) MUAs.


>  > > 4.  Mailing list processor.  Message input will be bytes.
>  > >     Configuration input, including heading and footer texts that may
>  > >     be added are likely to be strings.  Header manipulation (adding
>  > >     topics, sequence numbers, RFC 2369 headers) most conveniently done
>  > >     with strings.  Output will be bytes.
>  > >   
>  > 
>  > But the bulk of the message parts, received in wire format, may not need 
>  > to be altered to be sent along in the same wire format.
>
> That depends.  For example, multimedia parts may simply be discarded,
> in which case it makes sense to not convert them.  However, most
> Mailman lists do add a footer, and because of crappy Windows MUAs that
> don't implement MIME correctly, it's preferred to add that by
> concatenating as text.  That simply cannot be done correctly in wire
> format for any character set except ISO 8859/1.
>   

Huh?

First off, which "crappy Windows MUAs" don't implement MIME correctly, 
and what do they do wrong?  When I look at wire format emails, I'm 
mostly appalled by the stuff generated by Apple Mail.  I have seen a few 
doozies from Outlook 2000, but they seem to be fixed in newer versions.

Adding a header or trailer does require knowledge of the character set 
and encoding of the message.  Given that, you can decode to str, add the 
header or trailer and encode back to MIME.  So that's the inefficient 
proof of concept.
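
The decode, append, re-encode path described above might look roughly
like the following Python sketch.  It assumes a single-part text/plain
message and uses the standard library's EmailMessage API
(policy.default), which postdates this thread; append_footer is a
hypothetical helper name, not something proposed in the discussion.

```python
import email
from email import policy


def append_footer(raw_message: bytes, footer: str) -> bytes:
    """Decode a text/plain body to str, append a footer, re-encode.

    Hypothetical helper: handles only single-part text/plain messages.
    """
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    if msg.get_content_type() != "text/plain":
        raise ValueError("this sketch only handles single-part text/plain")

    # Remember the declared charset so the footer is encoded to match it.
    charset = msg.get_content_charset("us-ascii")

    # get_content() undoes the transfer encoding and the charset encoding.
    text = msg.get_content()

    # set_content() replaces the body and picks a transfer encoding again.
    # It raises UnicodeEncodeError if the footer does not fit the charset.
    msg.set_content(text + footer, charset=charset)
    return msg.as_bytes()
```

The transfer encoding of the result may differ from the input (for
example 8bit instead of quoted-printable), which is one reason this is
only the "inefficient proof of concept" and not a wire-format
preserving approach.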

In the identity or quopri encodings, it is possible to add similarly 
encoded headers and trailers correctly to text/plain parts through 
normal concatenation.  Adding headers to base64 encoding requires that 
the encoded header be an exact number of base64 lines, or at least a 
multiple of 3 characters and that you shuffle the line layout through 
the whole base64 body... it is not clear that this is worth the work.  
Adding trailers to base64 encoding requires decoding the final partial 
encoding, noticing how much room is left on that last line, and then 
encoding from there on... so it is not possible to cache an encoded 
base64 footer, although it would be possible to cache 3 of them, and 
only have to tweak the merge and choose the right one of the three and 
then reshuffle.  So since text/plain is seldom encoded in base64, and 
base64 is so complex to concatenate to in wire format, I'd think it 
would be a better choice to decode and reencode to concatenate headers 
or footers to base64 encoded MIME parts.... unless immense base64 
encoded MIME parts are expected to be common enough to develop the 
optimized logic.
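
The alignment constraint behind the base64 discussion above can be
shown with a small Python sketch.  It ignores line wrapping entirely
and only looks at the 3-byte grouping; naive_b64_append is a
hypothetical name used for illustration.

```python
import base64


def naive_b64_append(encoded_body: bytes, footer: bytes) -> bytes:
    """Concatenate two independently encoded base64 streams (no wrapping)."""
    return encoded_body + base64.b64encode(footer)


footer = b"-- footer\n"

aligned = b"abcdef"    # 6 bytes: a multiple of 3, so no '=' padding
unaligned = b"abcd"    # 4 bytes: the encoding ends in '==' padding

# Naive concatenation matches a full re-encode only in the aligned case.
aligned_ok = (
    naive_b64_append(base64.b64encode(aligned), footer)
    == base64.b64encode(aligned + footer)
)
unaligned_ok = (
    naive_b64_append(base64.b64encode(unaligned), footer)
    == base64.b64encode(unaligned + footer)
)
print(aligned_ok, unaligned_ok)
```

When the existing body is not a multiple of 3 bytes, the padding ends
up in the middle of the stream, which is why the final partial group
has to be decoded and merged before a footer can be appended.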

text/html is trickier, whether encoded or not.  You have to parse past 
any stuff that precedes <body>, and place the header after that, and 
then you have to find the </body> and place the trailer before that.  
And unless you run the HTML through a validity checker, you can't be 
sure that the trailer will even show up, much less actually at the 
bottom, due to the possibility of unclosed tags within the body.  To 
parse even quopri encoded HTML gets tricky, and basically impossible for 
base64 encoded HTML.  So the first text/html part likely will need to be 
decoded for adding headers and trailers, if it is an alternative to the 
text/plain part, or there is no text/plain part.

I've seen some systems add an additional MIME part to place a trailer 
in, and that can be pretty effective for MUAs that will show multiple 
parts in-line, but there are so many MUAs out there, that it is 
extremely difficult to make any certain declarations regarding what the 
user sees as a result.

And, ISO 8859/1 is an 8-bit character set, so would require encoding on 
a 7bit transfer.  But it is not unique; if you know how to do ISO 8859/1 
concatenation in wire format, then you can do the whole class of 
ASCII+128 more character sets in the same manner.  Not to mention that 
ASCII itself works fine in wire format.  And so does UTF-8.  It is just 
a matter of matching the character set and the encoding.
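
As a rough illustration of "matching the character set and the
encoding", the Python sketch below appends a footer to an
identity-encoded (8bit) body at the byte level.  The sample text and
the helper name append_raw are made up for the example.

```python
def append_raw(body: bytes, footer: str, footer_charset: str) -> bytes:
    """Append a footer to an identity-encoded body without decoding the body."""
    return body + footer.encode(footer_charset)


body = "Gr\u00fc\u00dfe\n".encode("utf-8")   # body declared as UTF-8
footer = "Fu\u00dfzeile\n"

# Footer encoded in the body's charset: the result is still valid UTF-8.
matched = append_raw(body, footer, "utf-8")
assert matched.decode("utf-8") == "Gr\u00fc\u00dfe\nFu\u00dfzeile\n"

# Footer encoded in a different charset: the result is no longer valid
# in the charset the message declares.
mismatched = append_raw(body, footer, "iso-8859-1")
try:
    mismatched.decode("utf-8")
except UnicodeDecodeError as exc:
    print("mismatched footer breaks the declared charset:", exc)
```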


>  > Heading and footing texts are configured boilerplate, and could be 
>  > cached in a variety of formats to avoid the need to convert them for 
>  > each message,
>
> Premature optimization is the root of all error.
>   

Yeah, yeah.  I said "could", not "must".  I was pushing back from your 
declaration that:

>     Configuration input, including heading and footer texts that may
>     be added are likely to be strings.  

Such configuration texts are likely to be provided as strings, but there is 
nothing to prevent them from being converted to other formats.  
Premature optimization may or may not be the root of all error, but 
discarding perfectly valid design possibilities based on how the input 
might be supplied seems a similar error.  I'm not declaring which design 
is best, just that there are alternatives.

>  > An archiver could archive wire format,
>
> Are you suggesting that the email module should mandate that?  We have
> a severe tail-dog inversion problem here.

Absolutely not.  I said "could", not "must".  The archiver can do what 
it wants.  The email library should provide access to the message data 
in all useful formats, so that the archiver can do what it wants.  The 
archiver needs to choose its design and optimizations appropriate for 
its expected use cases.  I was pushing back from your declaration that 
an archiver would always want string output.... you said:

> 5.  Mailing list archiver.  Input will be bytes or message objects,
>     output will be strings (typically HTML documents or XML
>     fragments).

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From stephen at xemacs.org  Wed Oct  7 12:33:42 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Wed, 07 Oct 2009 19:33:42 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACC0277.2060807@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
Message-ID: <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>

Glenn Linderman writes:

 > > If you mean that the email module will keep track of what form the
 > > object is currently represented by, that will eventually result in
 > > "UnicodeError: octet out of range: 161, ascii".
 > 
 > The above sentence does not communicate your meaning to me... or any 
 > meaning, actually.  Can you explain?

Yes, that Unicode error is one that took years for Mailman to work
around.  If we are going to be converting different objects at
different times, I'm sure we'll get to see it again in the future.  Oh,
joy.

 > If conversions are avoided, then octets are unlikely to be out of 
 > range?

Haven't looked in your spam bucket recently, I guess.  Spammers
regularly put 8 bit characters into headers (and into bodies in
messages without a Content-Type header), for one thing.
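
For readers who have not met this failure, a minimal Python 3 sketch of
it follows.  The header value is invented; the surrogateescape error
handler shown at the end is one tolerant option available in Python 3,
not something this thread has agreed on.

```python
# A raw header line with an undeclared 8-bit octet (0xA1 == 161).
raw_header = b"Subject: \xa1Hola!"

failure = None
try:
    raw_header.decode("ascii")
except UnicodeDecodeError as exc:
    failure = exc
    print("strict decode failed:", exc)

# One tolerant alternative: smuggle the bad octet through as a lone
# surrogate so the original bytes can be recovered exactly later on.
text = raw_header.decode("ascii", "surrogateescape")
assert text.encode("ascii", "surrogateescape") == raw_header
```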

 > And the email module must be aware of the form of the data in 
 > order to manipulate it in any format other than wire format, but 
 > fortunately, wire format declares the format of the data (not to say 
 > there is not buggy wire format data -- but that is an issue best avoided 
 > by avoiding as many conversions as possible).

"Best" I can't speak to; you obviously are willing to accept a much
higher error rate than I am.  "Robust" handling of buggy wire format
data means that the email module must do something sane with it before
giving it to the application.  Maybe it's reasonable to do that
lazily, and/or cache the result, but access to bogus data (that the
email module can determine is bogus or suspicious) must not be allowed
unless the client says "hit me with your best shot" explicitly.  Most
clients are simply not going to be prepared for the kind of crap I see
in /var/mail/turnbull every day.

 > I was pushing back from your declaration that an archiver would
 > always want string output

Please don't push back; we won't get anywhere.  Use cases are
*examples*, not complete specifications of all possible inputs and
outputs.  Use cases should be simple and clear cut.  If you want a
different use case, state it.  In fact in the real world, *all* of the
archivers I know of produce text formats on disk, either deleting
multimedia objects or saving them off and linking to them via URLs in
the text.  If you know of a different kind of archiver, add it as a
use case.

From phd at phd.pp.ru  Wed Oct  7 13:09:58 2009
From: phd at phd.pp.ru (Oleg Broytman)
Date: Wed, 7 Oct 2009 15:09:58 +0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <20091007110958.GG24702@phd.pp.ru>

On Wed, Oct 07, 2009 at 07:33:42PM +0900, Stephen J. Turnbull wrote:
> Haven't looked in your spam bucket recently, I guess.  Spammers
> regularly put 8 bit characters into headers

   Legitimate but stupid programs do this as well. Think of phpbb-like
forums written by programmers who never understood how non-ASCII text can
be put into the Subject field or filenames - they send amazingly crippled
emails.

Oleg.
-- 
     Oleg Broytman            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From stephen at xemacs.org  Wed Oct  7 14:19:27 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Wed, 07 Oct 2009 21:19:27 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <20091007110958.GG24702@phd.pp.ru>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20091007110958.GG24702@phd.pp.ru>
Message-ID: <874oqbs8fk.fsf@uwakimon.sk.tsukuba.ac.jp>

Oleg Broytman writes:
 > On Wed, Oct 07, 2009 at 07:33:42PM +0900, Stephen J. Turnbull wrote:
 > > Haven't looked in your spam bucket recently, I guess.  Spammers
 > > regularly put 8 bit characters into headers
 > 
 >    Legitimate but stupid programs do this as well.

Sure, but Glenn may not be subscribed to any of those.  *Everybody* is
subscribed to spam, though.

I'll-let-you-decide-what-kind-of-smiley-that-needs-ly y'rs,

From matt at mondoinfo.com  Wed Oct  7 18:23:24 2009
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Wed, 7 Oct 2009 11:23:24 -0500 (CDT)
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <1254929486.96.16481@mint-julep.mondoinfo.com>

[Stephen J. Turnbull]
> Yes, that Unicode error is one that took years for Mailman to work
> around.  If we are going to be converting different objects at
> different times, I'm sure we'll get to see it again in the future.

In my opinion, the email module should never raise an exception as a
result of working with a malformed message. Though it should
certainly make the information that a message was malformed available
for the calling program to check.

That is, I think that it's extremely unlikely that the calling
program wants to blow up as a result of a malformed message. Very
probably, it wants to make what sense of the message that it can. The
number of ways in which a message can be malformed is pretty large
and just how (and when, as has been mentioned) any particular error
will cause problems for the module is really a matter that's internal
to the module. The module's user shouldn't have to say, "Over here I
have to trap UnicodeErrors and over there I have to trap IndexErrors".
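
The "record it, don't raise it" behaviour argued for above can be
sketched in Python with the defects list that the standard library
parser attaches to message objects.  The malformed sample message is
invented for the example.

```python
import email
from email import errors

# Claims to be multipart but never declares a boundary parameter.
malformed = (
    "Subject: no boundary here\n"
    "Content-Type: multipart/mixed\n"
    "\n"
    "body text\n"
)

# The default parser does not raise; it records what it found instead.
msg = email.message_from_string(malformed)
for defect in msg.defects:
    print(type(defect).__name__)

has_boundary_defect = any(
    isinstance(d, errors.NoBoundaryInMultipartDefect) for d in msg.defects
)
```

The caller still gets a usable message object (the body is kept as a
plain payload) and can decide for itself how seriously to take each
recorded defect.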

Regards,
Matt


From phd at phd.pp.ru  Wed Oct  7 19:07:18 2009
From: phd at phd.pp.ru (Oleg Broytman)
Date: Wed, 7 Oct 2009 21:07:18 +0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <1254929486.96.16481@mint-julep.mondoinfo.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
Message-ID: <20091007170718.GA1901@phd.pp.ru>

On Wed, Oct 07, 2009 at 11:23:24AM -0500, Matthew Dixon Cowles wrote:
> In my opinion, the email module should never raise an exception as a
> result of working with a malformed message. Though it should
> certainly make the information that a message was malformed available
> for the calling program to check.

   I disagree. The email package is not a user agent, and exceptions are
*the* way to indicate there are problems.

> That is, I think that it's extremely unlikely that the calling
> program wants to blow up as a result of a malformed message.

   Then the calling program must catch all exceptions and process them in a
reasonable (for this particular application) way. But certainly email
package must not dictate what ways are reasonable - they are too
application-specific.

> Very
> probably, it wants to make what sense of the message that it can.

   Yes, if email can parse a message in some way, fine. You can help by
creating more intelligent parser(s). But if a parser stumbles upon an
unparseable block, it must raise an exception.
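
The "raise on anything unparseable" position can also be sketched in
Python.  The example below relies on the policy framework and its
raise_on_defect switch, which were added to the standard library after
this thread; the malformed sample message is invented.

```python
import email
from email import errors, policy

# Claims to be multipart but never declares a boundary parameter.
malformed = (
    "Subject: no boundary here\n"
    "Content-Type: multipart/mixed\n"
    "\n"
    "body text\n"
)

# Same parser, but defects are raised instead of being collected.
strict = policy.compat32.clone(raise_on_defect=True)

caught = None
try:
    email.message_from_string(malformed, policy=strict)
except errors.MessageDefect as exc:
    # Every parser defect derives from MessageDefect, so one except
    # clause covers them all; the application decides what to do next.
    caught = exc
    print("rejected:", type(exc).__name__)
```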

Oleg.
-- 
     Oleg Broytman            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From anthonybaxter at gmail.com  Wed Oct  7 16:38:38 2009
From: anthonybaxter at gmail.com (Anthony Baxter)
Date: Thu, 8 Oct 2009 01:38:38 +1100
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <874oqbs8fk.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20091007110958.GG24702@phd.pp.ru>
	<874oqbs8fk.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <e69d3ed20910070738n51937d05h58576364d6e1937f@mail.gmail.com>

On Wed, Oct 7, 2009 at 11:19 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:

> Oleg Broytman writes:
>  > On Wed, Oct 07, 2009 at 07:33:42PM +0900, Stephen J. Turnbull wrote:
>  > > Haven't looked in your spam bucket recently, I guess.  Spammers
>  > > regularly put 8 bit characters into headers
>  >
>  >    Legitimate but stupid programs do this as well.
>
> Sure, but Glenn may not be subscribed to any of those.  *Everybody* is
> subscribed to spam, though.
>
> I'll-let-you-decide-what-kind-of-smiley-that-needs-ly y'rs,


You'd be amazed how many MUAs shipped by major companies are broken. MS
Entourage, anyone?

noone-mention-the-nested-multiparts-with-the-same-boundary-tag-on-both-levels-ly
y'rs.

From v+python at g.nevcal.com  Wed Oct  7 19:34:05 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Wed, 07 Oct 2009 10:34:05 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>	<4ACB0DC9.7080307@g.nevcal.com>	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACB971D.9080706@g.nevcal.com>	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <4ACCD10D.4070308@g.nevcal.com>

On approximately 10/7/2009 3:33 AM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> Glenn Linderman writes:
>
>  > > If you mean that the email module will keep track of what form the
>  > > object is currently represented by, that will eventually result in
>  > > "UnicodeError: octet out of range: 161, ascii".
>  > 
>  > The above sentence does not communicate your meaning to me... or any 
>  > meaning, actually.  Can you explain?
>
> Yes, that Unicode error is one that took years for Mailman to work
> around.  If we are going to be converting different objects at
> different times, I'm sure we'll get to see it again in the future.  Oh,
> joy.
>   

Ah, a historical remark!  So that's why it was lost on me, I'm new to 
the Python world (but programming since 1975...)


>  > If conversions are avoided, then octets are unlikely to be out of 
>  > range?
>
> Haven't looked in your spam bucket recently, I guess.  Spammers
> regularly put 8 bit characters into headers (and into bodies in
> messages without a Content-Type header), for one thing.
>   

I'm aware of that, but if conversions are not done, octets are unlikely 
to be _reported_ to be out of range....


>  > And the email module must be aware of the form of the data in 
>  > order to manipulate it in any format other than wire format, but 
>  > fortunately, wire format declares the format of the data (not to say 
>  > there is not buggy wire format data -- but that is an issue best avoided 
>  > by avoiding as many conversions as possible).
>
> "Best" I can't speak to; you obviously are willing to accept a much
> higher error rate than I am.  "Robust" handling of buggy wire format
> data means that the email module must do something sane with it before
> giving it to the application.  Maybe it's reasonable to do that
> lazily, and/or cache the result, but access to bogus data (that the
> email module can determine is bogus or suspicious) must not be allowed
> unless the client says "hit me with your best shot" explicitly.  Most
> clients are simply not going to be prepared for the kind of crap I see
> in /var/mail/turnbull every day.
>   

Are you referring to most email clients, or most 
Python-email-library-using clients?  It seems like most email clients 
are being hit with the same stuff you are seeing... every day... and are 
handling it somehow... although anti-spam filters do eliminate some of 
it before the end user's MUA sees it, depending on the ISP, etc.

Is it your point of view, then, that incorrectly formed email should be 
mostly treated as SPAM?  Your paragraph above could be interpreted that 
way.  Oleg's point is also valid though, so it seems that isn't your 
point of view.

Your "hit me with your best shot" comment indicates that you want a 
failure code or exception when the data is bad, and then a way to "retry 
accepting errors"?


>  > I was pushing back from your declaration that an archiver would
>  > always want string output
>
> Please don't push back; we won't get anywhere.  Use cases are
> *examples*, not complete specifications of all possible inputs and
> outputs.  Use cases should be simple and clear cut.  If you want a
> different use case, state it.  In fact in the real world, *all* of the
> archivers I know of produce text formats on disk, either deleting
> multimedia objects or saving them off and linking to them via URLs in
> the text.  If you know of a different kind of archiver, add it as a
> use case.
>   

I misunderstood the purpose of your list.  Sure, everything in your list 
is a good example of real world uses.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From matt at mondoinfo.com  Wed Oct  7 22:05:35 2009
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Wed, 7 Oct 2009 15:05:35 -0500 (CDT)
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <20091007170718.GA1901@phd.pp.ru>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
Message-ID: <1254944602.12.16665@mint-julep.mondoinfo.com>

[me]
> In my opinion, the email module should never raise an exception as a
> result of working with a malformed message.

[Oleg Broytman]
> I disagree. The email package is not a user agent, and exceptions are
> *the* way to indicate there are problems.

We may have to agree to disagree. If the email package gives up
because a message is malformed, I don't know what exactly it's for.
It's certainly not for parsing what arrives in my mailbox.

> Then the calling program must catch all exceptions and process them
> in a reasonable (for this particular application) way.

Then the module's documentation would need to include a list of all
exceptions that it might raise and the times that it might raise
them. Otherwise the application developer is proceeding in the dark.

Regards,
Matt


From phd at phd.pp.ru  Wed Oct  7 22:28:13 2009
From: phd at phd.pp.ru (Oleg Broytman)
Date: Thu, 8 Oct 2009 00:28:13 +0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <1254944602.12.16665@mint-julep.mondoinfo.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<1254944602.12.16665@mint-julep.mondoinfo.com>
Message-ID: <20091007202813.GB6832@phd.pp.ru>

On Wed, Oct 07, 2009 at 03:05:35PM -0500, Matthew Dixon Cowles wrote:
> If the email package gives up
> because a message is malformed, I don't know what exactly it's for.
> It's certainly not for parsing what arrives in my mailbox.

   Then it is *your* task to enhance the code. A flow of patches with tests
would be the best contribution.

> > Then the calling program must catch all exceptions and process them
> > in a reasonable (for this particular application) way.
> 
> Then the module's documentation would need to include a list of all
> exceptions that it might raise and the times that it might raise
> them.

   You are also welcome to provide patches for documentation.

Oleg.
-- 
     Oleg Broytman            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From barry at python.org  Thu Oct  8 03:10:24 2009
From: barry at python.org (Barry Warsaw)
Date: Wed, 7 Oct 2009 21:10:24 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <10506972.7161254576370614.JavaMail.root@boaz>
	<8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org>
	<87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org>

On Oct 3, 2009, at 11:41 AM, Stephen J. Turnbull wrote:

> Barry Warsaw writes:
>
>> So the basic model is: accept strings or bytes at the edges,
>> process everything internally as bytes, output strings and bytes at
>> the edges.
>
> In a certain pedantic sense, that can't be right, because bytes alone
> can't represent strings.
>
> Practically, you are going to need to say how a bytes or bytearray is to
> be interpreted as a string, and that is going to be one big mess.
> (MIME?)
>
> Going the other way around you have no such problem, or rather the
> trivial embedding works fine, except that you have to do a range check
> at some point before you convert to bytes.

So, I've taken at least two abortive attempts at updating the email  
package to Python 3, once using bytes internally and another time  
using strings internally.  Neither one was completely satisfying (to  
say the least).  I've also heard convincing arguments from folks in  
the Python community in both camps: "using anything other than strings  
internally is insane; no, using anything other than bytes internally  
is insane."

As for the internal representational format, I'll amend my previous  
statement and say that I'll keep an open mind, but one thing that  
seems very clear is that we have to be able to accept strings and  
bytes at the incoming edges, and produce strings and bytes at the  
outgoing edges.  In a future message, Stephen outlines some excellent  
use cases, to which I'll follow up when I get there.  But I think he  
generally hits the nail on the head and proves that we'll have both  
types at the edges.  That makes for very interesting API design!
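
As an illustration of "both types at the edges", here is a minimal sketch
using entry points that exist in current Python 3 releases of the standard
library (email.message_from_string, email.message_from_bytes,
Message.as_string, Message.as_bytes).  These postdate this thread, so treat
them as one possible shape of such an API and not as the design proposed
here.  The sample message is made up.

```python
import email

# The same made-up message, once as text and once as bytes.
TEXT = "Subject: edges\n\nplain ASCII body\n"
RAW = TEXT.encode("ascii")

# Incoming edge: either type can be parsed into a Message.
from_text = email.message_from_string(TEXT)
from_bytes = email.message_from_bytes(RAW)

# Outgoing edge: either Message can be serialized as either type.
text_out = from_bytes.as_string()
bytes_out = from_text.as_bytes()
```

For a plain ASCII message like this one the two paths are interchangeable;
the hard cases discussed in this thread are the ones where they are not.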

There's "internal" and then there's the low-level representation that  
the model exposes.  Here I have more confidence that we need to make
things much more consistent.  The trick is to do that while still  
making things convenient.

For example, we currently represent header values as 8-bit strings or  
Header instances. The latter can contain triples of the individual  
chunks, e.g. (content, language, charset).  I think we need to represent
header values as instances in all cases because the type checking is  
error prone, but even then, it makes for difficult API choices.   
Still, if the fundamental atom of header values in the model is the  
Header, and we define both byte and string APIs for headers, then the  
internal representation matters less since only the email package  
implementers need to care.

But note that even in this limited case, neither bytes nor strings  
really works.  The internal representation is that triple (and in the  
current model an implicit triple where charset=us-ascii).  So  
internally the charset is carried along for the ride, as it must be.   
If the internal representation were just strings or bytes, we wouldn't  
know how to generate the other format, at least not idempotently (or  
as close as we can get).
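
To make the "charset is carried along for the ride" point concrete, here is
a small sketch using email.header as it exists in current Python 3.  The
header text is invented for illustration, and the exact whitespace handling
of the unencoded chunk is deliberately not relied upon.

```python
from email.header import Header, decode_header

# Build a header value from two chunks, each with its own charset.
header = Header("Hello", charset="us-ascii")
header.append("W\u00f6rld", charset="iso-8859-1")  # "Wörld"

# Wire form: the non-ASCII chunk becomes an RFC 2047 encoded-word.
wire = header.encode()

# Decoding yields (content, charset) pairs; without the charset we
# could not turn the bytes back into text.
decoded = []
for content, charset in decode_header(wire):
    if isinstance(content, bytes):
        content = content.decode(charset or "ascii")
    decoded.append((content, charset))
```

Each decoded chunk still knows which charset it came from, which is what
makes it possible to go from bytes to text and back again.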

Just to ramble a little longer, it's been argued that we should give  
up on idempotency, but I'm not convinced.  I think people want to see  
an email message they throw into the system come out the other end as  
closely as possible (well, /exactly/ for well-formed messages).

-Barry

-------------- next part --------------
A non-text attachment was scrubbed...
Name: PGP.sig
Type: application/pgp-signature
Size: 832 bytes
Desc: This is a digitally signed message part
URL: <http://mail.python.org/pipermail/email-sig/attachments/20091007/7fe35fba/attachment.pgp>

From barry at python.org  Thu Oct  8 03:17:47 2009
From: barry at python.org (Barry Warsaw)
Date: Wed, 7 Oct 2009 21:17:47 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <8510262.7231254589795083.JavaMail.root@boaz>
References: <8510262.7231254589795083.JavaMail.root@boaz>
Message-ID: <9B5501D3-C9CB-46EA-843D-B07BE7E9E288@python.org>

On Oct 3, 2009, at 1:09 PM, Timothy Farrell wrote:

> Forgive my ignorance...why does converting bytes to strings have to  
> be a mess?  Rather than having two Feedparsers, can't we just pass a  
> default encoding when instantiating a feedparser and have it read  
> from the MIME headers otherwise?  If no encoding is passed and one  
> can't be determined, simply output as bytes or try a default and  
> raise an exception if it fails.

A lot of work went into the parser the last (successful) time around  
to avoid exceptions as much as possible.  That's why Message objects  
have a .defects attribute.

I'm more okay with the APIs that are used to hand-craft or modify  
existing messages to throw exceptions when something bad happens, e.g.  
an unknown charset is used.  But the parser itself should never throw  
an exception.  The use case here is:

Our MTA has dropped a message on disk and it could be deliberately  
malformed spam.  We don't know that until we parse it though, so we  
must be able to construct a reasonable message tree from the raw bytes  
we read off disk.  The defects the parser encounters are in fact  
useful information that goes into a determination of ham/spam.

The key thing here is that clients of the email package are severely  
handicapped at handling any parsing errors.  Mailman for example can't  
do much except log the error and throw the message into a 'bad'  
bucket.  Whoop-de-doo!  Nobody can do anything about it! If we can at  
least give the system a Message object with defects, the system can  
reason about it and help the human decide what to do.
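
A minimal sketch of that behaviour with the parser in current Python 3: the
made-up message below claims to be multipart but never contains its
boundary.  Parsing still returns a Message, and the problem is recorded in
.defects instead of being raised.  The addresses and boundary string are
placeholders.

```python
import email

# Deliberately malformed: the declared boundary "XXX" never appears.
RAW = (
    b"From: spammer@example.invalid\n"
    b'Content-Type: multipart/mixed; boundary="XXX"\n'
    b"\n"
    b"This body never contains the promised boundary.\n"
)

# No exception here; the parser builds what it can.
msg = email.message_from_bytes(RAW)

# The caller can inspect what went wrong and decide what to do.
defect_names = [type(defect).__name__ for defect in msg.defects]
looks_suspicious = bool(msg.defects)
```

An application such as a spam filter could feed looks_suspicious, or the
individual defect names, into its scoring instead of discarding the message.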

The generator is probably in a similar situation.  If you hand it a  
Message object, it must generate something.  In the case of a message  
with defects, we can make compromises though, such as giving up on  
idempotency, fixing MIME boundaries, substituting legal/known  
charsets, etc.

-Barry

From barry at python.org  Thu Oct  8 03:25:16 2009
From: barry at python.org (Barry Warsaw)
Date: Wed, 7 Oct 2009 21:25:16 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACB0DC9.7080307@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
Message-ID: <1D2E61F6-B6DD-4D23-9CE3-25AAA4D713EE@python.org>

On Oct 6, 2009, at 5:28 AM, Glenn Linderman wrote:

> I observe that binary transport is more efficient than 7bit or 8bit.

A few principles that I think we should adopt as far as efficiency and  
performance go.

I am not concerned about performance.  Yes, we want to make things as  
fast as possible, but it's more important to be as right as possible.   
Look at some of the tricks that the parser has to jump through to  
properly handle MIME nesting.  Yuck, and not fast, but mostly right  
(it could be improved but I think we're darn close).

Memory footprint efficiency is very important, in some cases.  I don't  
particularly care about headers or some of the more compact MIME body  
formats (perhaps like text/*), but some are very problematic.  For  
example, the Twisted guys have told me that they can't use the email  
package because let's say you read a 10MB image/jpg MIME part.  You  
really can't store thousands of these in memory at a time!  So again  
that dictates that our APIs have to support external storage hook  
points, for parsing, generating, accessing MIME parts on disk or in a  
database, etc.  It's fine if by default we store everything in memory,  
but we have to at least give applications the ability to parse  
straight from the wire, store some parts on disk, and still return  
Message objects that are completely consistent.
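
The "parse straight from the wire" half of this can be sketched with the
incremental parser in current Python 3 (email.parser.BytesFeedParser).  The
helper name parse_in_chunks and the chunk size are assumptions made for the
example.  Note that this only covers incremental input: the resulting
Message still holds every part in memory, so the external storage hooks
asked for above are not shown.

```python
from email.parser import BytesFeedParser


def parse_in_chunks(chunks):
    """Feed an iterable of byte chunks (e.g. socket reads) to the parser."""
    parser = BytesFeedParser()
    for chunk in chunks:
        parser.feed(chunk)
    return parser.close()


RAW = b"Subject: chunked\n\nline one\nline two\n"

# Simulate data arriving in small, arbitrarily aligned pieces.
chunks = [RAW[i:i + 7] for i in range(0, len(RAW), 7)]
msg = parse_in_chunks(chunks)
```

The caller never has to assemble the whole message into one bytes object
before handing it to the parser.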

-Barry

From barry at python.org  Thu Oct  8 04:05:08 2009
From: barry at python.org (Barry Warsaw)
Date: Wed, 7 Oct 2009 22:05:08 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <8E7BBBBB-E9D7-43E9-B87E-B83209BFA298@python.org>

On Oct 6, 2009, at 10:18 AM, Stephen J. Turnbull wrote:

> In the following I use Python 3 terminology: strings are Python
> Unicode objects, and bytes are Python bytes objects.

Exactly.  8-bit strings are dead to us.

>> for the internal form makes it quick and easy to produce a complete
>> message when called for,
>
> Only for certain kinds of messages, such as automated forwards and
> signed MIME parts, and cron's messages.  For those, there are great
> advantages to spewing things verbatim as you got them off the wire or
> the disk.  But even there, as long as we use the natural embedding of
> bytes in Unicode (ie, interpret bytes as ISO 8859/1) it's easy and not
> particularly inefficient to use strings.
>
> For anything else, storing in wire format is going to require checking
> format (of the stored data if the format is variable, and always of
> the requesting API) on all attribute accesses, and conversions on
> many, even most attribute accesses.

I think that's going to be the case either way.  Some applications are  
going to want bytes, others strings, so there needs to be APIs for both.

> So the question to me is "what are the primary use cases for the email
> module, and how do they affect the choice of internal representation?"
> I can't claim special expertise on "how", I'll leave that up to
> Barry.  Here are some use cases I can think of.
>
> 1.  Debugging programs using the email module.  Maybe that's a +1 for
>    internally storing textual data in string form.
>
> 2.  MUA #1: Composition.  Input will be strings and multimedia file
>    names, output will be bytes.  Will attributes of message objects
>    be manipulated?  Not in a conventional MUA, but an email-based MUA
>    might find uses for that.
>
> 3.  MUA #2: Reading.  Input will often be bytes (spool files, IMAP
>    data).  Could be strings, though, depending on the internal format
>    of folders.  Output will be strings and multimedia objects.  Lots
>    of string processing, especially generating folder directory
>    displays from message headers.
>
> 4.  Mailing list processor.  Message input will be bytes.
>    Configuration input, including heading and footer texts that may
>    be added are likely to be strings.  Header manipulation (adding
>    topics, sequence numbers, RFC 2369 headers) most conveniently done
>    with strings.  Output will be bytes.
>
> 5.  Mailing list archiver.  Input will be bytes or message objects,
>    output will be strings (typically HTML documents or XML
>    fragments).
>
> 6.  Spam/virus detection.  Input may be bytes or message objects.
>    Lots of internal string processing; in most cases the text/* parts
>    need to be converted to strings before grepping; in some cases
>    even images or executables may be reconstituted to look for
>    malware signatures.  Output may be a flag or signal, or the
>    message itself may be edited (typically to provide headers
>    recording degree of spamminess, trace headers, maybe a body
>    heading; in some cases, a new message may be generated with the
>    suspected spam as a message/rfc822 MIME body part).

I think this is a very good list.  The key thing from an application's  
point of view is that sometimes messages are parsed and sometimes they  
are crafted.  When parsed, the raw input can come from a completely  
unknown and untrusted source such as the puking mouth of an MTA.   
Other times it comes from a big blob of string in a doctest.  When  
crafted, it's almost always a program building up a message tree from  
scratch, or possibly the manipulation of an existing message (e.g.  
MIME filter).

-Barry

From barry at python.org  Thu Oct  8 04:14:43 2009
From: barry at python.org (Barry Warsaw)
Date: Wed, 7 Oct 2009 22:14:43 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <07EF292E-B5D0-48BD-8F69-8CBD2B6A4486@python.org>

On Oct 6, 2009, at 8:30 PM, Stephen J. Turnbull wrote:

> What that means is unclear, though.  Does a "header in wire format"
> mean before or after MIME encoding?  Probably after, but that's pretty
> useless for the purpose of editing the header.  Does it include the
> tag (the part before the colon) or not?  Etc.

This is a great question.  As far as headers go, sometimes you want to  
reason about the entire header (field name + value) and sometimes you  
just care about one or the other.  Putting the field name in the  
Header instance means it's difficult to copy the header to other  
fields.  Not having the field name in the instance means that some  
calculations (such as line length) are trickier.

> That depends.  For example, multimedia parts may simply be discarded,
> in which case it makes sense to not convert them.  However, most
> Mailman lists do add a footer, and because of crappy Windows MUAs that
> don't implement MIME correctly, it's preferred to add that by
> concatenating as text.  That simply cannot be done correctly in wire
> format for any character set except ISO 8859/1.

Even then, doesn't it depend on the character set of the text you're  
appending to?  Aren't there for example some Japanese character sets  
that are incompatible with iso-8859-1?  Mailman punts and says if the  
character sets aren't identical, it cannot concatenate.

>> Heading and footing texts are configured boilerplate, and could be
>> cached in a variety of formats to avoid the need to convert them for
>> each message,
>
> Premature optimization is the root of all error.

I could not agree more.  Plus, according to Moore's law, computers  
will all be 256 times faster when we finish the email package redesign  
than when we started it <wink>.

>> An archiver could archive wire format,
>
> Are you suggesting that the email module should mandate that?  We have
> a severe tail-dog inversion problem here.

Right.  Remember that the email package is fundamental to all of this,  
so it must provide the services that client applications need.

-Barry

From barry at python.org  Thu Oct  8 04:25:51 2009
From: barry at python.org (Barry Warsaw)
Date: Wed, 7 Oct 2009 22:25:51 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <7054.1254879272@parc.com>
References: <10506972.7161254576370614.JavaMail.root@boaz>
	<7054.1254879272@parc.com>
Message-ID: <E73B1D0E-015A-4A65-9628-09975AC76106@python.org>

On Oct 6, 2009, at 9:34 PM, Bill Janssen wrote:

> Timothy Farrell <tfarrell at owassobible.org> wrote:
>
>> Back in June, David Murray posted the message below about fixing the
>> email module.  I have an interest in helping with this due to a
>> personal project I'm working on.  However, my ability to help is
>> severely limited by my understanding of email and MIME RFCs.
>
> Tim, familiarity with email and MIME RFCs would be a big help if you
> want to help with the email module.  Even for writing test cases.

Just be forewarned that you'll end up like James T. Kirk staring up at  
the neural neutralizer on the Tantalus Penal Colony.  You'll either be  
a mindless shell or in agonizing pain.  Or both.

going-bold-ly y'rs,
-Barry

From barry at python.org  Thu Oct  8 04:33:24 2009
From: barry at python.org (Barry Warsaw)
Date: Wed, 7 Oct 2009 22:33:24 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACC0277.2060807@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>	<4ACB0DC9.7080307@g.nevcal.com>	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
Message-ID: <8D91BDEF-F0CD-4FA2-8B55-5CF2E1291A8C@python.org>

On Oct 6, 2009, at 10:52 PM, Glenn Linderman wrote:

> text/html is trickier,

If by "trickier" you mean "impossible" then I'll agree. :)  Or maybe  
"insane" is more accurate.  Mailman will never try to parse text/html  
to concatenate a footer.  In fact, if it isn't text/plain and a  
matching character set, it punts to MIME attachment.  However...

> I've seen some systems add an additional MIME part to place a  
> trailer in, and that can be pretty effective for MUAs that will show  
> multiple parts in-line, but there are so many MUAs out there, that  
> it is extremely difficult to make any certain declarations regarding  
> what the user sees as a result.

It's actually easy to predict: they'll see crap that makes them  
unhappy.  The cases where I'm wrong about that are so rare as to  
probably not matter <wink>.

-Barry

From janssen at parc.com  Thu Oct  8 04:37:08 2009
From: janssen at parc.com (Bill Janssen)
Date: Wed, 7 Oct 2009 19:37:08 PDT
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <8E7BBBBB-E9D7-43E9-B87E-B83209BFA298@python.org>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<8E7BBBBB-E9D7-43E9-B87E-B83209BFA298@python.org>
Message-ID: <13025.1254969428@parc.com>

Barry Warsaw <barry at python.org> wrote:

> > 5.  Mailing list archiver.  Input will be bytes or message objects,
> >    output will be strings (typically HTML documents or XML
> >    fragments).

I use the email package to implement an email archiver, and I do bytes
in and bytes out.  I do threading (using header instances), and process
attachments separately, which requires that they come out of the message
in their native format, whatever that is -- I treat it as bytes.

I also maintain a Python IMAP server which uses the email package to
construct messages, and then deconstructs them to send out in response
to IMAP requests.

Bill

From barry at python.org  Thu Oct  8 04:40:56 2009
From: barry at python.org (Barry Warsaw)
Date: Wed, 7 Oct 2009 22:40:56 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>

On Oct 7, 2009, at 6:33 AM, Stephen J. Turnbull wrote:

> Haven't looked in your spam bucket recently, I guess.  Spammers
> regularly put 8 bit characters into headers (and into bodies in
> messages without a Content-Type header), for one thing.

Interesting story: Launchpad (which is open source now so there are no  
secrets) uses XMLRPC when Mailman holds a message for moderation,  
storing it in Launchpad's database for display to the list (team)  
owner.  Well, I was lazy, stupid, or both and didn't wrap the objects  
in a Binary over the wire, so we were getting tons of failures here.   
But none of them seemed to have any practical effect on user  
experience (read: we got zero bug reports for missing held messages).

I finally found the time to debug the problem, because the failures in  
themselves were cryptic and common enough to cause our operations  
people headaches.  So I cowboyed in some additional capture code and  
ran it for 24 hours.  Guess what I found?

We were essentially crapping out on /tons/ of messages with 8-bit in  
headers, and these messages were basically getting dropped on the  
floor.  Why no bug reports?  Because /every/ single captured message  
was spam.  How's that for a bug having unintended positive consequences?

-Barry

From barry at python.org  Thu Oct  8 04:45:35 2009
From: barry at python.org (Barry Warsaw)
Date: Wed, 7 Oct 2009 22:45:35 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <1254929486.96.16481@mint-julep.mondoinfo.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
Message-ID: <3E400B08-1834-4A1B-8970-3AD20BD23765@python.org>

On Oct 7, 2009, at 12:23 PM, Matthew Dixon Cowles wrote:

> In my opinion, the email module should never raise an exception as a
> result of working with a malformed message. Though it should
> certainly make the information that a message was malformed available
> for the calling program to check.
>
> That is, I think that it's extremely unlikely that the calling
> program wants to blow up as a result of a malformed message. Very
> probably, it wants to make what sense of the message that it can. The
> number of ways in which a message can be malformed is pretty large
> and just how (and when, as has been mentioned) any particular error
> will cause problems for the module is really a matter that's internal
> to the module. The module's user shouldn't have to say, "Over here I
> have to trap UnicodeErrors and over there I have to trap IndexErrors".

I've said it before: I completely agree with you, at least for parsing.   
The big problem in my experience with Mailman is that you're sort of  
too upside down in the application to do anything about parsing errors  
when they occur except log it and shunt it.  And that's just not very  
helpful.

However, when crafting messages from scratch, I think it /would/ be  
okay to raise exceptions when something is done wrong, because the  
application has more control over the data and is in a position to  
either handle the problem or for the bug to be fixed <wink>.  In this  
case, complaining early is much better than say failing in the  
generator.
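
For comparison, the message-building API in current Python 3
(email.message.EmailMessage with its default policy, which postdates this
thread) does complain early in at least some cases.  The sketch below shows
one such case, adding a second Subject header; it is an illustration of the
"raise when crafting" idea, not a claim about what was being designed here.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "first"

error = None
try:
    # A message may carry only one Subject header, so this assignment
    # is rejected immediately instead of failing later in the generator.
    msg["Subject"] = "second"
except ValueError as exc:
    error = str(exc)
```

The application sees the mistake at the line that caused it, while it still
has the context needed to fix the data or the bug.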

-Barry

From barry at python.org  Thu Oct  8 04:51:03 2009
From: barry at python.org (Barry Warsaw)
Date: Wed, 7 Oct 2009 22:51:03 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <20091007170718.GA1901@phd.pp.ru>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
Message-ID: <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>

On Oct 7, 2009, at 1:07 PM, Oleg Broytman wrote:

> On Wed, Oct 07, 2009 at 11:23:24AM -0500, Matthew Dixon Cowles wrote:
>> In my opinion, the email module should never raise an exception as a
>> result of working with a malformed message. Though it should
>> certainly make the information that a message was malformed available
>> for the calling program to check.
>
>   I disagree. email package is not a user agent, and exceptions are
> *the* way to indicate there are problems.

By keeping the various components clear in our mind, we can see that  
both statements are correct in a sense.  The parser and generator  
should never raise exceptions.  The model can and probably should.

>   Yes, if email parses a message in some way - ok. You can help by
> creating more intelligent parser(s). But if a parser stumbles upon an
> unparseable block - it must raise an exception.

No.  It really can't.   Let's say your MTA dropped a bunch of bytes in  
a file and in some low-level background process you read those bytes  
and turn them into Message trees.  Now your parser throws an  
exception: what can you possibly do about it except throw away this  
unparseable jumble of bytes and log the exception?

Much much better is to soldier on and produce a Message object that has  
the right format, but additional information, such as a set of defects  
it encountered.  This is what the current email package does and it  
has made Mailman's life infinitely better (when it all DTRT).  If you  
have a Message with defects, you can reason about it, show partial  
information, attempt a repair, etc.  With an exception, you're hosed.

-Barry

From barry at python.org  Thu Oct  8 04:52:42 2009
From: barry at python.org (Barry Warsaw)
Date: Wed, 7 Oct 2009 22:52:42 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <e69d3ed20910070738n51937d05h58576364d6e1937f@mail.gmail.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20091007110958.GG24702@phd.pp.ru>
	<874oqbs8fk.fsf@uwakimon.sk.tsukuba.ac.jp>
	<e69d3ed20910070738n51937d05h58576364d6e1937f@mail.gmail.com>
Message-ID: <2E08B903-E388-42AF-9386-26FA3B4E4270@python.org>

On Oct 7, 2009, at 10:38 AM, Anthony Baxter wrote:

> noone-mention-the-nested-multiparts-with-the-same-boundary-tag-on- 
> both-levels-ly

stfu.  you are evil.

-Barry

(For the humor impaired, i.e. not Anthony -> :)

From mark at msapiro.net  Thu Oct  8 05:31:41 2009
From: mark at msapiro.net (Mark Sapiro)
Date: Wed, 7 Oct 2009 20:31:41 -0700
Subject: [Email-SIG] header info in body of message. is this normal? EOM
In-Reply-To: <CC28F43ED4708D489ABCF68D06D7F5560300DDFE73@505DENALI.corp.vnw.com>
Message-ID: <PC192200910072031410421eac69fce@msapiro>


----- Original Message ---------------

Subject: [Email-SIG] header info in body of message.  is this normal? 
EOM
   From: Michael Lesauis <MichaelL at vulcan.com>
   Date: Tue, 18 Aug 2009 08:45:20 -0700
     To: "'email-sig at python.org'" <email-sig at python.org>


The first empty (not just whitespace, but empty) line in the message
terminates the headers.
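
A small sketch of that rule with the parser in current Python 3.  The
header names are made up; the point is only that anything after the first
empty line is body text, even if it looks like a header.

```python
import email

TEXT = (
    "Subject: demo\n"
    "X-Real: yes\n"
    "\n"                # first empty line: headers end here
    "X-Fake: no\n"      # looks like a header, but is part of the body
    "body text\n"
)

msg = email.message_from_string(TEXT)

real = msg["X-Real"]    # found among the headers
fake = msg["X-Fake"]    # None: it was never parsed as a header
body = msg.get_payload()
```

So header-like lines showing up in the body usually mean that something
upstream inserted an empty line before them.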

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


From mark at msapiro.net  Thu Oct  8 05:40:13 2009
From: mark at msapiro.net (Mark Sapiro)
Date: Wed, 7 Oct 2009 20:40:13 -0700
Subject: [Email-SIG] email.header.decode_header eats my spaces
In-Reply-To: <SRVR-DNS1xalsglr6b80001b810@SRVR-DNS1.metropcs.net>
Message-ID: <PC19220091007204013073440a144ab@msapiro>


----- Original Message ---------------

Subject: [Email-SIG]  email.header.decode_header eats my spaces
   From: 7073049749 at mymetropcs.com
   Date: 6 Sep 09 02:18:14 -0500
     To: email-sig at python.org


If you're talking about spaces between encoded words as in the space
between the ?= and the =? in

Subject: =?iso-8859-1?q?Hello?= =?iso-8859-1?q?World?=

it's supposed to. RFC 2047, section 6.2 says in part

   When displaying a particular header field that contains multiple
   'encoded-word's, any 'linear-white-space' that separates a pair of
   adjacent 'encoded-word's is ignored.  (This is to allow the use of
   multiple 'encoded-word's to represent long strings of unencoded text,
   without having to separate 'encoded-word's where spaces occur in the
   unencoded text.)
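
The behaviour can be checked with the Subject line quoted above.  This
sketch uses email.header from current Python 3; the variable names are only
for illustration.

```python
from email.header import decode_header, make_header

raw_subject = "=?iso-8859-1?q?Hello?= =?iso-8859-1?q?World?="

# The space between the two adjacent encoded-words is dropped on decoding,
# as RFC 2047 section 6.2 requires.
shown = str(make_header(decode_header(raw_subject)))

# To get a visible space, it has to be part of an encoded-word itself
# (an underscore stands for a space in the "q" encoding).
with_space = str(make_header(decode_header("=?iso-8859-1?q?Hello_World?=")))
```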


-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


From v+python at g.nevcal.com  Thu Oct  8 08:54:50 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Wed, 07 Oct 2009 23:54:50 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <13025.1254969428@parc.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>	<4ACB0DC9.7080307@g.nevcal.com>	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>	<8E7BBBBB-E9D7-43E9-B87E-B83209BFA298@python.org>
	<13025.1254969428@parc.com>
Message-ID: <4ACD8CBA.5090604@g.nevcal.com>

On approximately 10/7/2009 7:37 PM, came the following characters from 
the keyboard of Bill Janssen:
> Barry Warsaw <barry at python.org> wrote:
>
>   
>>> 5.  Mailing list archiver.  Input will be bytes or message objects,
>>>    output will be strings (typically HTML documents or XML
>>>    fragments).
>>>       
>
> I use the email package to implement an email archiver, and I do bytes
> in and bytes out.  I do threading (using header instances), and process
> attachments separately, which requires that they come out of the message
> in their native format, whatever that is -- I treat it as bytes.
>
> I also maintain a Python IMAP server which uses the email package to
> construct messages, and then deconstructs them to send out in response
> to IMAP requests.
>
> Bill
>   

OK, so there's another nice item for the use case list.  Thanks Bill for 
responding, I figured there had to be something like that out there.  
That's why I was pushing back on Stephen's cases as making overly
restrictive assumptions... but now that I understand his purpose, it 
is appropriate just to add the additional case.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From rdmurray at bitdance.com  Thu Oct  8 09:16:47 2009
From: rdmurray at bitdance.com (R. David Murray)
Date: Thu, 8 Oct 2009 03:16:47 -0400 (EDT)
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
Message-ID: <Pine.LNX.4.64.0910080250450.18193@kimball.webabinitio.net>

I'd like to try to summarize what I understand Barry to be saying (which,
in this case, also reflects my understanding of what is needed), and
see if I'm anywhere close to on target :)  In the following discussion,
'text' refers to unicode data, and bytes refers to, well, bytes.  (I
chose to use 'text' instead of 'string' to avoid confusion).

The email package consists of two major conceptual pieces: the API, and
the internal data model.  The API needs to have facilities for accepting
data in either text format or bytes format, and this data is used to
generate a model of the input message (a Message).  Likewise the API needs
to provide facilities for serializing a Message as either bytes or text.
The API also provides ways to build up a Message from pieces, or to
extract information from a Message in pieces, and to modify a Message,
and again input and output as both text and bytes must be supported.

The data model used by the email package is an "implementation detail",
and we should not spend effort at this stage trying to optimize it for
anything except memory requirements with respect to potentially large
sub-objects, and even there it is more a matter of providing ways to
deal with potentially large sub-objects than it is a true optimization.
In general, correctness and robustness are much more important than speed.

The data model will need to be a practical hybrid of the input data,
possibly transformed in some way in some cases, and various sorts of
meta-data.  The current email package already works this way.

An important characteristic of the model is that it be idempotent whenever
sensible; that is, if a given byte stream is used to create a Message
or subobject, serializing that Message or subobject as bytes should
return the original byte stream whenever sensible (ie: when the data
is not pathologically malformed).  Likewise if a text stream is used to
create a Message or subobject, serializing it as text should produce,
whenever sensible, the original text stream.  In particular, well-formed
(per RFC) message data should always be stored and produced
idempotently.
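
A minimal sketch of the round-trip property described above.  It assumes a
Python 3 release recent enough to provide email.message_from_bytes() and
Message.as_bytes(), neither of which existed when this was written, and the
sample message is made up for illustration.

```python
import email

# Hypothetical, well-formed sample message (LF line endings).
raw = (
    b"From: alice@example.com\n"
    b"To: bob@example.com\n"
    b"Subject: round trip\n"
    b"\n"
    b"Hello, Bob.\n"
)

# Parse the bytes into a Message, then serialize it again unmodified.
msg = email.message_from_bytes(raw)
round_tripped = msg.as_bytes()

# For well-formed input the serialization should equal the input bytes.
print(round_tripped == raw)
```

This only exercises the simplest case (short ASCII headers, no MIME
structure); it says nothing about how malformed input should behave.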

An important property of the API is that both the parser that transforms
an input stream into a Message and Message serialization should not raise
exceptions except in the face of errors that leave no way to produce a
valid Message or serialization.  Instead a defects list is maintained
and exposed through the API.  In the face of some defects it may not be
sensible to maintain idempotency.
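
As an illustration of the defects list, here is a sketch using the parser
as it exists in the standard library.  The input is a made-up message that
declares a multipart type but gives no boundary parameter; the parser is
expected to record a defect rather than raise.

```python
import email
import email.errors

# Hypothetical malformed input: multipart type without a boundary parameter.
broken = (
    "Content-Type: multipart/mixed\n"
    "\n"
    "--whatever\n"
    "some text\n"
)

# Parsing does not raise; problems are collected on the Message instead.
msg = email.message_from_string(broken)

# Each defect is an instance of a class from email.errors.
defect_names = [type(d).__name__ for d in msg.defects]
print(defect_names)
```

The caller decides what to do with the defects; the exact set of defects
reported for a given input may differ between Python releases.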

The APIs that manipulate the data model either for piecewise construction
or for transformations may raise exceptions, and in most cases _should_
raise exceptions when encountering invalid data or operations.


Also, as an additional note to those thinking about use cases, I'd
like to point out something I know well and which Barry reminded me
about recently:  parts of the email package (eg: MIME and RFC822-style
header parsing) are used or can be used by systems other than systems
handling email.  The particular cases I have run into myself are working
with non-email data files that follow RFC822 rules, and handling data
from NNTP (which, granted, is almost email...but only almost).  In
the former case you usually have text input and output, mediated
by the encoding of the file(s) on disk.  In the latter case you have
all the problems of email plus a few more.

Further, in the standard library the http package, urllib, the cgi
module, and pydoc are all clients of the email package.
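
A small sketch of the non-email case: parsing an RFC822-style data record
with the email package.  The field names and values below are invented for
the example; only message_from_string() and the Message mapping interface
are real.

```python
from email import message_from_string

# Hypothetical RFC822-style record; this is not an email message.
record = (
    "Name: example-package\n"
    "Version: 1.0\n"
    "\n"
    "Free-form description goes in the body.\n"
)

entry = message_from_string(record)

# Header lookup is case-insensitive; the body is available as the payload.
name = entry["Name"]
version = entry["version"]
description = entry.get_payload()
print(name, version, description)
```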

--David (RDM)

From v+python at g.nevcal.com  Thu Oct  8 09:29:41 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Thu, 08 Oct 2009 00:29:41 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
Message-ID: <4ACD94E5.5020808@g.nevcal.com>

On approximately 10/7/2009 7:40 PM, came the following characters from 
the keyboard of Barry Warsaw:
> On Oct 7, 2009, at 6:33 AM, Stephen J. Turnbull wrote:
>> Haven't looked in your spam bucket recently, I guess.  Spammers
>> regularly put 8 bit characters into headers (and into bodies in
>> messages without a Content-Type header), for one thing.
> Interesting story: Launchpad (which is open source now so there are no 
> secrets) uses XMLRPC when Mailman holds a message for moderation, 
> storing it in Launchpad's database for display to the list (team) 
> owner.  Well, I was lazy, stupid, or both and didn't wrap the objects 
> in a Binary over the wire, so we were getting tons of failures here.  
> But none of them seemed to have any practical effect on user 
> experience (read: we got zero bug reports for missing held messages).
>
> I finally found the time to debug the problem, because the failures in 
> themselves were cryptic and common enough to cause our operations 
> people headaches.  So I cowboyed in some additional capture code and 
> ran it for 24 hours.  Guess what I found?
>
> We were essentially crapping out on /tons/ of messages with 8-bit in 
> headers, and these messages were basically getting dropped on the 
> floor.  Why no bug reports?  Because /every/ single captured message 
> was spam.  How's that for a bug having unintended positive consequences? 

Great anecdote!  Spammers shooting themselves in the foot with their 
ignorance.  But still, much too much spam gets through.

Seems to me that when there is an error in an encoded base64 MIME part, 
such that it can't be base64 decoded, the options for the library are:

  * return an error, the data is likely meaningless
  * allow the bytes to be retrieved, undecoded

I suppose it might be possible to skip only those 4-character sequences 
that don't decode properly, and try to decode the rest of the data, if 
it is text.  But some way to flag that data were undecodable would be 
needed.
And if it is text, then it must then undergo charset decoding (below).
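
The first two library options above can be sketched together.  The helper
name decode_base64_part() and its return shape are hypothetical; the only
real calls are base64.b64decode() and the binascii.Error it raises.

```python
import base64
import binascii


def decode_base64_part(raw):
    """Return (decoded, raw, error).

    decoded is None and error is set when raw is not valid base64;
    raw is always handed back so the caller can still get at the bytes.
    """
    # MIME bodies are line-wrapped, so drop ASCII whitespace first.
    compact = b"".join(raw.split())
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        return base64.b64decode(compact, validate=True), raw, None
    except binascii.Error as exc:
        return None, raw, exc


good = base64.b64encode(b"attachment data") + b"\n"
bad = b"YXR0*WNobWVudCBkYXRh\n"   # one character corrupted in transit

print(decode_base64_part(good)[0])
print(decode_base64_part(bad)[0])
```

Whether to drop the part or pass the undecoded bytes along is left to the
application, as discussed below.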

The application options are to drop the attachment, or pass through the 
corrupted bytes, and let the next application try to make sense of it.

A quopri MIME part that can't be correctly decoded may still be mostly 
readable... so here it makes sense to return an error but also the data, 
decoded as best as possible.  Application choices are basically the 
same.  Once quopri decoded, then text parts must also face charset 
decoding (below).

Charset decoding: a charset should be specified, or is assumed to be 
ASCII by default.  If a text MIME part that isn't in the right character 
set gets decode errors, there are several possibilities:

  * return an error, and the decoded data, with error substitutions
  * allow the bytes to be retrieved
  * decode as Latin-1 (no errors possible, but probably results in mojibake)

The application options are to drop the attachment, or choose to pass 
through one of the three data values.
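
The three data values can be shown with plain bytes.decode() calls.  This
is only a sketch: the sample text and the "declared" charset are invented,
and nothing here is email-package API.

```python
# UTF-8 bytes that arrive labelled (wrongly) as ASCII.
raw = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'
declared = "ascii"

# Strict decoding reports the error.
try:
    strict_text = raw.decode(declared)
    error = None
except UnicodeDecodeError as exc:
    strict_text = None
    error = exc

# Option 1: decoded data with error substitutions (U+FFFD per bad byte).
substituted = raw.decode(declared, errors="replace")

# Option 2: the undecoded bytes are simply `raw`.

# Option 3: Latin-1 never fails, but gives mojibake for non-Latin-1 input.
latin1 = raw.decode("latin-1")

print(repr(substituted), repr(raw), repr(latin1))
```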

For headers, the choices are basically the same as for text MIME parts, 
but some headers that contain meta data (rather than just text like Subject:) 
may be critical to proper decoding of other data, and so errors in some 
headers can cause incorrect behaviour of other headers or of an 
associated MIME part.

And I agree that APIs to retrieve any MIME part as undecoded bytes are 
appropriate; and to retrieve it as decoded strings is appropriate for 
text MIME parts.  Not sure that non-text MIME parts need to support 
being returned as strings.

Headers could possibly be a quadruple instead of a triple, with the 4th 
item being the wire format if received? (If constructed, no wire format 
would be expected until it is generated.)  That would help with 
idempotency, as if a header contains non-ASCII characters, there are 
many choices of heuristic to encode that are all proper, so it is 
unlikely two different algorithms would preserve idempotency.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From phd at phd.pp.ru  Thu Oct  8 11:18:40 2009
From: phd at phd.pp.ru (Oleg Broytman)
Date: Thu, 8 Oct 2009 13:18:40 +0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
Message-ID: <20091008091840.GB28906@phd.pp.ru>

On Wed, Oct 07, 2009 at 10:51:03PM -0400, Barry Warsaw wrote:
> On Oct 7, 2009, at 1:07 PM, Oleg Broytman wrote:
>
>> On Wed, Oct 07, 2009 at 11:23:24AM -0500, Matthew Dixon Cowles wrote:
>>> In my opinion, the email module should never raise an exception as a
>>> result of working with a malformed message. Though it should
>>> certainly make the information that a message was malformed available
>>> for the calling program to check.
>>
>>   I disagree. email package is not a user agent, and exceptions are  
>> *the*
>> way to indicate there are problems.
>
> By keeping the various components clear in our mind, we can see that  
> both statements are correct in a sense.  The parser and generator should 
> never raise exceptions.  The model can and probably should.

   Are you going to parse any garbage and create a Message (probably an
empty Message) with one defect "cannot parse it at all"?

>> But if a parser stumbles upon an  
>> unparseable
>> block - it must raise an exception.
>
> No.  It really can't.   Let's say your MTA dropped a bunch of bytes in a 
> file and in some low-level background process you read those bytes and 
> turn them into Message trees.  Now your parser throws an exception: what 
> can you possibly do about it except throw away this unparseable jumble of 
> bytes and log the exception?

   I don't disagree with that. If a parser can parse an input in some way -
let's consider the input a malformed message and create a Message with
defects.
   What I disagree with is that if a parser cannot parse input garbage at
all it must raise an exception. And if a parser can raise an exception any
calling program must be prepared to catch such exceptions.

Oleg.
-- 
     Oleg Broytman            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From phd at phd.pp.ru  Thu Oct  8 11:22:32 2009
From: phd at phd.pp.ru (Oleg Broytman)
Date: Thu, 8 Oct 2009 13:22:32 +0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <20091008091840.GB28906@phd.pp.ru>
References: <4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
	<20091008091840.GB28906@phd.pp.ru>
Message-ID: <20091008092232.GC28906@phd.pp.ru>

On Thu, Oct 08, 2009 at 01:18:40PM +0400, Oleg Broytman wrote:
>    What I disagree with is that if a parser cannot parse input garbage at
> all it must raise an exception.

   Sorry for my bad wording.

   What I disagree with is that if a parser cannot parse input garbage at
all it must NOT raise an exception. My opinion is - it must raise an
exception.

Oleg.
-- 
     Oleg Broytman            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From stephen at xemacs.org  Thu Oct  8 12:46:50 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Thu, 08 Oct 2009 19:46:50 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org>
References: <10506972.7161254576370614.JavaMail.root@boaz>
	<8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org>
	<87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org>
Message-ID: <87ocoiqi1x.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:

 > I've also heard convincing arguments from folks in the Python
 > community in both camps: "using anything other than strings
 > internally is insane; no, using anything other than bytes
 > internally is insane."

They're both right, of course.  The problem is figuring out who is
right when. ;-)

 > For example, we currently represent header values as 8-bit strings or  
 > Header instances. The latter can contain triples of the individual  
 > chunks, e.g. (content, language, charset).  I think we need to represent  
 > header values as instances in all cases because the type checking is  
 > error prone, but even then, it makes for difficult API choices.   

Agreed on both the need and the difficulty.

 > Just to ramble a little longer, it's been argued that we should give  
 > up on idempotency, but I'm not convinced.

If we can't achieve ... ah, isn't "invertibility" what you mean here?
... "idempotency", then we're dropping information somewhere along the
line.  Also, there are part types (pgp-signed, I'm looking at you)
where it's absolutely essential that we be able to roundtrip the body
byte for byte.  So I'm -1 on giving up.


From stephen at xemacs.org  Thu Oct  8 13:25:41 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Thu, 08 Oct 2009 20:25:41 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <20091007170718.GA1901@phd.pp.ru>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
Message-ID: <87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp>

Oleg Broytman writes:
 > On Wed, Oct 07, 2009 at 11:23:24AM -0500, Matthew Dixon Cowles wrote:
 > > In my opinion, the email module should never raise an exception as a
 > > result of working with a malformed message. Though it should
 > > certainly make the information that a message was malformed available
 > > for the calling program to check.
 > 
 >    I disagree. email package is not a user agent, and exceptions are *the*
 > way to indicate there are problems.

Although practicality beats purity.

The email package has access to the wire format, and knows what to do
with most of it.  It should DTRT where that is possible, and punt
where not.  By "punt" I mean return a special object containing as
much of the meta data for an object as it could recover, along with
the data itself as a blob.

I would suggest that module utilities that require access to the
parsed form of data be designed as object methods.  The special
objects produced when broken wire format is encountered wouldn't have
those methods, and thus they'd fail the duck type test.  But that
makes sense: that "duck" can't quack anyway.

So this gives our (== Matt and me) desideratum that email never raises
(it's the Python runtime that will raise AttributeError), and also
Oleg's (in part, anyway): an exception *will* be raised.

I think (== hope) that this will sufficiently localize the issues that
even though only AttributeError would ever be raised, it will be
obvious what went wrong.

 >    Then the calling program must catch all exceptions

That is just unreasonable.  There are too many ways for things to go
wrong.  If you have just one exception for all problems, it's easy to
catch them all, but then the client doesn't know what went wrong, and
has to partially parse the unparsable itself.  That's nuts; the reason
for using the email module is to delegate that in the first place, and
besides, to the extent it's possible, the module has presumably done
that.

OTOH, a long list of precise exceptions is both a maintenance burden
on the email module and on client programmers.

 >    Yes, if email parses a message in some way - ok. You can help by creating
 > more intelligent parser(s). But if a parser stumbles upon an unparseable
 > block - it must raise an exception.

No, that's the last thing you want it to do.  Suppose you have

Content-Type: multipart/alternative

    Content-Type: text/plain

    Content-Type: text/html; body-parseable=no

Clearly you want (a) a vanilla email client to just grab the
text/plain part, and (b) a client written by somebody whose boss uses
BustedMUA[tm] to be able to try to parse the text/html part, using the
special rules that apply to the jumble produced by BustedMUA.
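
Case (a) can be sketched with the existing Message API.  The sample message
and the helper first_part_of_type() are made up for illustration, and the
boundary is an assumption, since the outline above omits it.

```python
import email

# Hypothetical multipart/alternative message with a mangled HTML part.
raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "plain body\n"
    "--XX\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>broken <b>html\n"
    "--XX--\n"
)

msg = email.message_from_string(raw)


def first_part_of_type(message, content_type):
    """Return the first subpart with the given content type, or None."""
    for part in message.walk():
        if part.get_content_type() == content_type:
            return part
    return None


# A vanilla client just takes the text/plain alternative.
plain = first_part_of_type(msg, "text/plain")
# A specialised client can still get at the other part's raw payload.
html = first_part_of_type(msg, "text/html")
print(plain.get_payload())
```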

In other cases, you might be able to find a valid part terminator, but
the header of that part was hosed.  So the whole part becomes a blob,
but the parser should resync at that point, and start parsing
following parts.

I can think of no input for which the parser should *ever* throw an
exception.  Utilities that depend on a particular object's parsed form
might have to do so, but even then it should be avoided if at all
possible.


From stephen at xemacs.org  Thu Oct  8 13:40:47 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Thu, 08 Oct 2009 20:40:47 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACCD10D.4070308@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACCD10D.4070308@g.nevcal.com>
Message-ID: <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>

Glenn Linderman writes:

 > >  > If conversions are avoided, then octets are unlikely to be out of 
 > >  > range?
 > >
 > > Haven't looked in your spam bucket recently, I guess.  Spammers
 > > regularly put 8 bit characters into headers (and into bodies in
 > > messages without a Content-Type header), for one thing.
 > 
 > I'm aware of that, but if conversions are not done, octets are unlikely 
 > to be _reported_ to be out of range....

Conversions will eventually be done.  "Best it were done quickly."

 > > Most clients are simply not going to be prepared for the kind of
 > > crap I see in /var/mail/turnbull every day.
 > 
 > Are you referring to most email clients, or most 
 > Python-email-library-using clients?

Sorry.  When I mean "MUA" I try to say "MUA".  By "client", I'm
referring to the higher level logic that is going to be calling the
email module.

 > Is it your point of view, then, that incorrectly formed email should be 
 > mostly treated as SPAM?

Heavens no!  Not by the email module, anyway!  The email module should
not know about spam (but see Barry's "we're having spam for Launchpad"
post: if you're that good, anything goes!), except maybe at a very
high level.

 > Your "hit me with your best shot" comment indicates that you want a
 > failure code or exception when the data is bad, and then a way to
 > "retry accepting errors"?

My current thinking is that the email module should return an object
representing a partial parse.  The way that you find out if it is
partial is to try to access some data that "should" be in the object.
If the parse succeeded, the accessor returns the data (which might be
empty).  If the parse did not succeed, you get an AttributeError.
(This is just a paraphrase of what I wrote in response to Oleg.)

From barry at python.org  Thu Oct  8 14:28:04 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 8 Oct 2009 08:28:04 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <Pine.LNX.4.64.0910080250450.18193@kimball.webabinitio.net>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
	<Pine.LNX.4.64.0910080250450.18193@kimball.webabinitio.net>
Message-ID: <15946EEF-1991-4F43-90E2-3D30715A15B7@python.org>

On Oct 8, 2009, at 3:16 AM, R. David Murray wrote:

> I'd like to try to summarize what I understand Barry to be saying  
> (which,
> in this case, also reflects my understanding of what is needed), and
> see if I'm anywhere close to on target :)

Spot on, IMO!  I can only quibble about one thing, though I think it's  
just in the phrasing of what you wrote (or the way I read it), not in  
your understanding.

> An important property of the API is that both the parser that  
> transforms
> an input stream into a Message and Message serialization should not  
> raise
> exceptions except in the face of errors that leave no way to produce a
> valid Message or serialization.

I'd say it differently, since we all know you can encounter errors  
leaving invalid Messages.  The parser and generator should only raise  
exceptions when their basic assumptions about the internal model  
(probably embodied as assertions) are broken.  In almost all cases, I  
think those would be "bugs" :).

It may be in fact that the best you can do is produce a Message object  
with no headers and a big massive body containing everything else, and  
a huge defects list.

-Barry

-------------- next part --------------
A non-text attachment was scrubbed...
Name: PGP.sig
Type: application/pgp-signature
Size: 832 bytes
Desc: This is a digitally signed message part
URL: <http://mail.python.org/pipermail/email-sig/attachments/20091008/2add8f13/attachment.pgp>

From phd at phd.pp.ru  Thu Oct  8 14:31:33 2009
From: phd at phd.pp.ru (Oleg Broytman)
Date: Thu, 8 Oct 2009 16:31:33 +0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <20091008123133.GA3059@phd.pp.ru>

On Thu, Oct 08, 2009 at 08:25:41PM +0900, Stephen J. Turnbull wrote:
> Oleg Broytman writes:
>  >    I disagree. email package is not a user agent, and exceptions are *the*
>  > way to indicate there are problems.
> 
> Although practicality beats purity.
> 
> The email package has access to the wire format, and knows what to do
> with most of it.  It should DTRT where that is possible, and punt
> where not.  By "punt" I mean return a special object containing as
> much of the meta data for an object as it could recover, along with
> the data itself as a blob.

   The special object is an instance of an exception class ;)

> I would suggest that module utilities that require access to the
> parsed form of data be designed as object methods.  The special
> objects produced when broken wire format is encountered wouldn't have
> those methods, and thus they'd fail the duck type test.  But that
> makes sense: that "duck" can't quack anyway.
> 
> So this gives our (== Matt and me) desideratum that email never raises
> (it's the Python runtime that will raise AttributeError), and also
> Oleg's (in part, anyway): an exception *will* be raised.
> 
> I think (== hope) that this will sufficiently localize the issues that
> even though only AttributeError would ever be raised, it will be
> obvious what went wrong.

   Not exactly. One can see an AttributeError, but what was the cause? why
a parser has created a broken object? AttributeError doesn't preserve
information from parser.

> I can think of no input for which the parser should *ever* throw an
> exception.

   Are you saying that even a random garbage would be parsed to a Message
of some kind? No headers, a single unparsed body?..

Oleg.
-- 
     Oleg Broytman            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From barry at python.org  Thu Oct  8 15:00:31 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 8 Oct 2009 09:00:31 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACD94E5.5020808@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
Message-ID: <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>

On Oct 8, 2009, at 3:29 AM, Glenn Linderman wrote:

> Great anecdote!  Spammers shooting themselves in the foot with their  
> ignorance.

Indeed.  It constantly surprises me that spam would be so malformed,  
but I guess it could make perverse sense if say, you were trying to  
DoS a spam filter.

> Seems to me that when there is an error in an encoded base64 MIME  
> part, such that it can't be base64 decoded, the options for the  
> library are:
>
>   * return an error, the data is likely meaningless
>   * allow the bytes to be retrieved, undecoded
>
> I suppose it might be possible to skip only those 4-character  
> sequences that don't decode properly, and try to decode the rest of  
> the data, if it is text.  But some way to flag that data were  
> undecodable would be needed.
> And if it is text, then it must then undergo charset decoding (below).

Note that while I'm adamant that the parser and generator not raise  
exceptions, what the model does is a different matter.  Ideally,  
accessing data from the model would never raise an exception either,  
but mutating the model could.  This is just basic Postel's Law.

> The application options are to drop the attachment, or pass through  
> the corrupted bytes, and let the next application try to make sense  
> of it.

Exactly, and it's not for the email package to say which is right.

Here's a use case: I've got a Message that was parsed from wire input  
and I want to mangle the Subject heading to add the list prefix.  I  
know exactly what charset the prefix is in because that's data I  
control.  When I ask for the original Subject value, I'm handed an  
instance that I can use to try to figure out how to add the prefix.

First thing I'll ask it is "are you a single chunk in my prefix  
charset (or compatible)?"  If so, I can probably just prepend my  
prefix onto the value.  If not, "are you composed of multiple valid  
chunks in different charsets?"  If so, I know that I need to encode my  
prefix, but I can still prepend it to the header value (hopefully  
using the same API, and I don't care that the implementation could not  
use string concatenation).

If not, then what?  Maybe I don't care if some of the chunk charsets  
aren't known because I can still use the right encode+prepend  
strategy.  But if the header is a gobbledegook of 8-bit bytes?  I'm  
pretty sure I want to be able to ask the API if that's the case rather  
than get an exception.  The thing I'm not so sure about is what  
happens if my application is naive enough to just ask for the  
header as a unicode and that conversion can't be made.  I /think/ it  
should raise an exception in that case.  But then when I ask for the  
header value as a mass of bytes, that should succeed and return me the  
raw input.
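
For the well-behaved branches of this use case, something close to it can
be sketched with the existing email.header functions.  The helper name
prefixed_subject() and the sample subject are made up; decode_header(),
Header and Header.append() are the real API.  It does not cover the
"gobbledegook" case.

```python
from email.header import Header, decode_header


def prefixed_subject(wire_subject, prefix, prefix_charset="us-ascii"):
    """Build a new Header: the prefix followed by the original chunks."""
    new = Header(prefix, prefix_charset)
    # decode_header() yields (chunk, charset) pairs; each chunk keeps its
    # own charset when appended, so no string concatenation is needed.
    for chunk, charset in decode_header(wire_subject):
        new.append(chunk, charset)
    return new


# Hypothetical Subject as it might arrive on the wire (RFC 2047 encoded).
subject = prefixed_subject("=?utf-8?q?caf=C3=A9_menu?=", "[list]")

text = str(subject)        # unicode view of the whole header
wire = subject.encode()    # re-encoded form suitable for the wire
print(text)
print(wire)
```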

> And I agree that APIs to retrieve any MIME part as undecoded bytes  
> are appropriate; and to retrieve it as decoded strings is appropriate  
> for text MIME parts.  Not sure that non-text MIME parts need to  
> support being returned as strings.

I hate to open another can of worms, but I've been thinking about this  
a lot too :).  It's been discussed on list before, so nothing new  
here.  I think the parser and MIME classes need to be hookable for  
decoding their contents.  For example, if you have a text/* it might  
well make sense to support bytes() and str()/unicode() on the part  
instance.  But if it's image/* str() makes no sense.  part.decode() or  
something similar makes sense, but this needs to be extensible because  
the email package will not know how to convert every content-type.  At  
best it will only know how to decode content-types that Python's  
stdlib knows about.

The problem is that if the bytes came off the wire, the parser  
currently can only attach the most basic MIME base class.  It doesn't  
know that an image/png should create a MIMEImagePNG instance there.   
This is different from hacking the model directly because the  
application can instantiate the right class.  So the parser either has  
to have a hookable way for an application to go from content-type to  
class, or the generic MIME base class needs to be hookable in  
its .decode() method.
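
One way the hook could look, as a sketch only.  The registry, the names
register_decoder() and decode_part(), and the "maintype/*" wildcard
convention are all hypothetical; the Message methods used are real.

```python
import email

# Hypothetical registry: content type (or "maintype/*") -> decoder callable.
_DECODERS = {}


def register_decoder(content_type, func):
    _DECODERS[content_type] = func


def decode_part(part):
    """Decode a part via a registered hook, else return its raw bytes."""
    raw = part.get_payload(decode=True)   # undo the transfer encoding
    func = _DECODERS.get(part.get_content_type())
    if func is None:
        func = _DECODERS.get(part.get_content_maintype() + "/*")
    if func is None:
        return raw                        # unknown type: bytes only
    return func(part, raw)


def _decode_text(part, raw):
    charset = part.get_content_charset() or "ascii"
    return raw.decode(charset, errors="replace")


register_decoder("text/*", _decode_text)

text_part = email.message_from_string(
    'Content-Type: text/plain; charset="utf-8"\n\nhello\n'
)
image_part = email.message_from_string(
    "Content-Type: image/png\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "iVBORw0KGgo=\n"
)
print(repr(decode_part(text_part)))
print(repr(decode_part(image_part)))
```

An application that knows how to handle image/png would register its own
decoder; without one the part stays a mass of bytes.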

> Headers could possibly be a quadruple instead of a triple, with the  
> 4th item being the wire format if received? (If constructed, no wire  
> format would be expected until it is generated.)  That would help  
> with idempotency, as if a header contains non-ASCII characters,  
> there are many choices of heuristic to encode that are all proper,  
> so it is unlikely two different algorithms would preserve idempotency.

I think not a quad.  I think other APIs should be used to extract the  
raw data, e.g.

 >>> # return a unicode or throw an exception
 >>> text = str(header)
 >>> # should always be okay even if gibberish
 >>> raw = bytes(header)

or /something/ like that.
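
A toy illustration of that split.  The class HeaderValue is hypothetical
and exists only to show the str()/bytes() behaviour being proposed.

```python
class HeaderValue:
    """Hypothetical header value that remembers its wire bytes."""

    def __init__(self, wire, charset="ascii"):
        self._wire = wire
        self._charset = charset

    def __bytes__(self):
        # Always succeeds, even if the content is gibberish.
        return self._wire

    def __str__(self):
        # Raises UnicodeDecodeError if the bytes don't fit the charset.
        return self._wire.decode(self._charset)


good = HeaderValue(b"hello world")
bad = HeaderValue(b"\xff\xfe junk")

print(str(good))
print(bytes(bad))
try:
    str(bad)
    conversion_failed = False
except UnicodeDecodeError:
    conversion_failed = True
print(conversion_failed)
```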

-Barry


From barry at python.org  Thu Oct  8 15:14:05 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 8 Oct 2009 09:14:05 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <20091008091840.GB28906@phd.pp.ru>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
	<20091008091840.GB28906@phd.pp.ru>
Message-ID: <0827D8A4-CC48-4B46-9C3E-EB6282D97BD8@python.org>

On Oct 8, 2009, at 5:18 AM, Oleg Broytman wrote:

>   Are you going to parse any garbage and create a Message (probably an
> empty Message) with one defect "cannot parse it at all"?

Yes, although the most pathological stream of bytes will probably  
produce a message with no headers and an undecodeable body of  
gibberish bytes, with a .defects list possibly one or two items long.

>   What I disagree with is that if a parser cannot parse input  
> garbage at
> all it must raise an exception. And if a parser can raise an  
> exception any
> calling program must be prepared to catch such exceptions.

Python 2.6.3 (r263:75183, Oct  4 2009, 19:57:34)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
 >>> from email import message_from_string
 >>> with open('/dev/urandom') as wire:
...   data = wire.read(1024)
...
 >>> msg = message_from_string(data)
 >>> # number of headers
... len(msg)
0
 >>> len(msg.get_payload())
1024
 >>> msg.defects
[]

This actually makes perfect sense.  A message with no headers and a  
mass of 1024 bytes in its payload is RFC valid!
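
A rough Python 3 counterpart of the session above, as a sketch only. It assumes email.message_from_bytes in place of message_from_string, and the helper name `summarize` and the fixed `garbage` input are illustrative, not from the original session. Later parsers may record a defect (for example a missing header/body separator) where 2.6 recorded none, so the defects list is returned rather than assumed empty.

```python
import email


def summarize(data):
    """Parse arbitrary bytes and report what the parser made of them."""
    msg = email.message_from_bytes(data)
    # decode=True returns the body as bytes, undoing any transfer encoding.
    body = msg.get_payload(decode=True)
    return len(msg), body, list(msg.defects)


# Hypothetical stand-in for /dev/urandom: 1024 high-bit octets, no line breaks.
garbage = bytes(range(128, 256)) * 8
header_count, body, defects = summarize(garbage)
print(header_count, len(body), defects)
```

The point of the sketch is only that parsing completes and the payload comes back byte for byte; which defects are listed depends on the Python version.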

-Barry



From barry at python.org  Thu Oct  8 15:20:26 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 8 Oct 2009 09:20:26 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87ocoiqi1x.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <10506972.7161254576370614.JavaMail.root@boaz>
	<8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org>
	<87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org>
	<87ocoiqi1x.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <34701D4C-2F91-4969-8A4A-9067402A1E70@python.org>

On Oct 8, 2009, at 6:46 AM, Stephen J. Turnbull wrote:

> Barry Warsaw writes:
>
>> I've also heard convincing arguments from folks in the Python
>> community in both camps: "using anything other than strings
>> internally is insane; no, using anything other than bytes
>> internally is insane."
>
> They're both right, of course.  The problem is figuring out who is
> right when. ;-)

Indeed!

>> Just to ramble a little longer, it's been argued that we should give
>> up on idempotency, but I'm not convinced.
>
> If we can't achieve ... ah, isn't "invertibility" what you mean here?
> ... "idempotency", then we're dropping information somewhere along the
> line.  Also, there are part types (pgp-signed, I'm looking at you)
> where it's absolutely essential that we be able to roundtrip the body
> byte for byte.  So I'm -1 on giving up.

Yeah, "idempotency" probably is not the right term, though I think  
historically that's what's been used.  Math geeks, what's the right  
term here? :)

I completely agree with you (of course :).  The way I look at it is  
that we lose this important principle only when the source data lacks  
complete information, i.e. is defective.  Although we can still invert  
in the face of some defects (and we should), I think we officially  
make no such guarantees unless the model is defect-free.

-Barry


From phd at phd.pp.ru  Thu Oct  8 15:22:37 2009
From: phd at phd.pp.ru (Oleg Broytman)
Date: Thu, 8 Oct 2009 17:22:37 +0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <0827D8A4-CC48-4B46-9C3E-EB6282D97BD8@python.org>
References: <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
	<20091008091840.GB28906@phd.pp.ru>
	<0827D8A4-CC48-4B46-9C3E-EB6282D97BD8@python.org>
Message-ID: <20091008132237.GB3059@phd.pp.ru>

On Thu, Oct 08, 2009 at 09:14:05AM -0400, Barry Warsaw wrote:
> On Oct 8, 2009, at 5:18 AM, Oleg Broytman wrote:
>>   Are you going to parse any garbage and create a Message (probably an
>> empty Message) with one defect "cannot parse it at all"?
>
> Yes, although the most pathological stream of bytes will probably  
> produce a message with no headers and an undecodeable body of gibberish 
> bytes, with a .defects list possibly one or two items long.

   Well, then...

Oleg.
-- 
     Oleg Broytman            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From barry at python.org  Thu Oct  8 15:23:42 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 8 Oct 2009 09:23:42 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <9B033421-3475-4827-8C4B-F3D0116FDEEB@python.org>

On Oct 8, 2009, at 7:25 AM, Stephen J. Turnbull wrote:

> The email package has access to the wire format, and knows what to do
> with most of it.  It should DTRT where that is possible, and punt
> where not.  By "punt" I mean return a special object containing as
> much of the meta data for an object as it could recover, along with
> the data itself as a blob.
>
> I would suggest that module utilities that require access to the
> parsed form of data be designed as object methods.  The special
> objects produced when broken wire format is encountered wouldn't have
> those methods, and thus they'd fail the duck type test.  But that
> makes sense: that "duck" can't quack anyway.

This is a very interesting idea that I think I like!

-Barry


From barry at python.org  Thu Oct  8 15:30:12 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 8 Oct 2009 09:30:12 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <20091008123133.GA3059@phd.pp.ru>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20091008123133.GA3059@phd.pp.ru>
Message-ID: <4FD82973-A650-408E-9CD2-FE3F4DF008A7@python.org>

On Oct 8, 2009, at 8:31 AM, Oleg Broytman wrote:

>   Not exactly. One can see an AttributeError, but what was the  
> cause? why
> a parser has created a broken object? AttributeError doesn't preserve
> information from parser.

But if you got the AttributeError, you'd still have the original  
object around to ask more detailed questions about.

On first blush, what I think I like about this is that it fits in with  
an interesting generic API design.  For example, if you have a message  
instance (and remember, parts-is-parts-is-messages) that you think is  
an image, you might just do something like:

 >>> image = msg.decoded_image

and then 'image' is the png that its Content-Type: image/png implies.   
If the data wasn't actually parseable as a png, this would raise an  
AttributeError and you'd then have to do:

 >>> bytes = msg.raw_bytes

to get the raw data, but you'd still have the msg object around to do  
that with.

The one possible problem is that Message may have to implement a  
__getattribute__() to handle this, since you can't know when the class  
is written whether the data its instances will contain will be valid  
or not.
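
A minimal sketch of that attribute-based idea, under stated assumptions: the class `Part`, its constructor, and the signature check are hypothetical and are not the email package's Message. The sketch uses __getattr__, which Python calls only when normal lookup fails, as one possible alternative to __getattribute__.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


class Part:
    """Hypothetical message part; not the email package's Message class."""

    def __init__(self, content_type, raw_bytes):
        self.content_type = content_type
        self.raw_bytes = raw_bytes

    def __getattr__(self, name):
        # Reached only for names that normal attribute lookup did not find,
        # so content_type and raw_bytes never pass through here.
        if name == "decoded_image":
            looks_like_png = self.raw_bytes.startswith(PNG_SIGNATURE)
            if self.content_type == "image/png" and looks_like_png:
                # A real implementation would decode; the sketch returns
                # the validated bytes unchanged.
                return self.raw_bytes
            raise AttributeError("payload is not a parseable image/png")
        raise AttributeError(name)


good = Part("image/png", PNG_SIGNATURE + b"...image data...")
bad = Part("image/png", b"not an image at all")
```

With these objects, `good.decoded_image` succeeds, `hasattr(bad, "decoded_image")` is false, and `bad.raw_bytes` is still available for inspection.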

>> I can think of no input for which the parser should *ever* throw an
>> exception.
>
>   Are you saying that even a random garbage would be parsed to a  
> Message
> of some kind? No headers, a single unparsed body?..

Sure, why not?  It's valid RFC 822 :)

-Barry


From tfarrell at owassobible.org  Thu Oct  8 15:27:35 2009
From: tfarrell at owassobible.org (Timothy Farrell)
Date: Thu, 08 Oct 2009 08:27:35 -0500
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <E73B1D0E-015A-4A65-9628-09975AC76106@python.org>
References: <10506972.7161254576370614.JavaMail.root@boaz>
	<7054.1254879272@parc.com>
	<E73B1D0E-015A-4A65-9628-09975AC76106@python.org>
Message-ID: <4ACDE8C7.2040309@owassobible.org>

Barry Warsaw wrote:
> On Oct 6, 2009, at 9:34 PM, Bill Janssen wrote:
>
>> Timothy Farrell <tfarrell at owassobible.org> wrote:
>>
>>> Back in June, David Murray posted the message below about fixing the
>>> email module.  I have an interest in helping with this due to a
>>> personal project I'm working on.  However, my ability to help is
>>> severely limited by my understanding of email and MIME RFCs.
>>
>> Tim, familiarity with email and MIME RFCs would be a big help if you
>> want to help with the email module.  Even for writing test cases.
>
> Just be forewarned that you'll end up like James T. Kirk staring up at 
> the neural neutralizer on the Tantalus Penal Colony.  You'll either be 
> a mindless shell or in agonizing pain.  Or both.
>
> going-bold-ly y'rs,
> -Barry
>
That's the impression I got when I first started wading through them.  
Maybe I should leave this to you experts.  I think I hear my mom calling.

-tim


From barry at python.org  Thu Oct  8 15:48:01 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 8 Oct 2009 09:48:01 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACDE8C7.2040309@owassobible.org>
References: <10506972.7161254576370614.JavaMail.root@boaz>
	<7054.1254879272@parc.com>
	<E73B1D0E-015A-4A65-9628-09975AC76106@python.org>
	<4ACDE8C7.2040309@owassobible.org>
Message-ID: <2E1FF440-5102-466F-BD98-BE6594223E97@python.org>

On Oct 8, 2009, at 9:27 AM, Timothy Farrell wrote:

>> Just be forewarned that you'll end up like James T. Kirk staring up  
>> at the neural neutralizer on the Tantalus Penal Colony.  You'll  
>> either be a mindless shell or in agonizing pain.  Or both.
>>
> That's the impression I got when I first started wading through  
> them.  Maybe I should leave this to you experts.  I think I hear my  
> mom calling.

Tim, pay no attention to that curmudgeon up there.  You should  
definitely take a look through at least the basic ones!

-Barry


From janssen at parc.com  Thu Oct  8 16:30:45 2009
From: janssen at parc.com (Bill Janssen)
Date: Thu, 8 Oct 2009 07:30:45 PDT
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <2E1FF440-5102-466F-BD98-BE6594223E97@python.org>
References: <10506972.7161254576370614.JavaMail.root@boaz>
	<7054.1254879272@parc.com>
	<E73B1D0E-015A-4A65-9628-09975AC76106@python.org>
	<4ACDE8C7.2040309@owassobible.org>
	<2E1FF440-5102-466F-BD98-BE6594223E97@python.org>
Message-ID: <98907.1255012245@parc.com>

Barry Warsaw <barry at python.org> wrote:

> On Oct 8, 2009, at 9:27 AM, Timothy Farrell wrote:
> 
> >> Just be forewarned that you'll end up like James T. Kirk staring up
> >> at the neural neutralizer on the Tantalus Penal Colony.  You'll
> >> either be a mindless shell or in agonizing pain.  Or both.
> >>
> > That's the impression I got when I first started wading through
> > them.  Maybe I should leave this to you experts.  I think I hear my
> > mom calling.
> 
> Tim, pay no attention to that curmudgeon up there.  You should
> definitely take a look through at least the basic ones!

Everyone should!

Bill

From stephen at xemacs.org  Thu Oct  8 17:31:43 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 09 Oct 2009 00:31:43 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <20091008123133.GA3059@phd.pp.ru>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20091008123133.GA3059@phd.pp.ru>
Message-ID: <87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp>

Oleg Broytman writes:

 > > where not.  By "punt" I mean return a special object containing as
 > > much of the meta data for an object as it could recover, along with
 > > the data itself as a blob.
 > 
 >    The special object is an instance of an exception class ;)

It could be, but it will be returned with return, not raise. ;)

 > > I think (== hope) that this will sufficiently localize the issues
 > > that even though only AttributeError would even be raised, it
 > > will be obvious what went wrong.
 > 
 >    Not exactly. One can see an AttributeError, but what was the
 > cause? why a parser has created a broken object? AttributeError
 > doesn't preserve information from parser.

Who said it wouldn't?  Granted, I didn't say it would, but in my

Content-Type: multipart/alternative
    Content-Type: text/plain
    Content-Type: text/html; parseable=no

example, I would expect the object returned to reflect that
structure.  In particular the object representing the second MIME part
would indeed possess a valid Header member.  I would also attach the
original data (which in the case of a missing separator might very
well overrun into other parts, etc), but it would *not* be accessible
via the usual methods (eg, definitely not from .flatten()).

So in fact it's not clear to me that you could ask for more
information than that.
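
The shape of such a "punt" object can be sketched as follows. Every name here (`ParsedPart`, `BrokenPart`, `render`) is hypothetical and chosen for illustration; nothing like it exists in the email package.

```python
class ParsedPart:
    """Hypothetical well-formed MIME part."""

    def __init__(self, headers, text):
        self.headers = headers
        self._text = text

    def get_text(self):
        return self._text


class BrokenPart:
    """Hypothetical punt object: headers recovered, body kept as a blob.

    It deliberately has no get_text(), so code that needs parsed content
    fails the duck-type test while headers and raw data stay reachable.
    """

    def __init__(self, headers, raw):
        self.headers = headers
        self.raw = raw


def render(part):
    """Show parsed text when available, otherwise describe the blob."""
    get_text = getattr(part, "get_text", None)
    if get_text is not None:
        return get_text()
    ctype = part.headers.get("Content-Type", "unknown")
    return "[unparseable %s part, %d raw bytes]" % (ctype, len(part.raw))


parts = [
    ParsedPart({"Content-Type": "text/plain"}, "hello"),
    BrokenPart({"Content-Type": "text/html"}, b"<html><bod"),
]
rendered = [render(p) for p in parts]
```

A caller that ignores the distinction gets an AttributeError on the broken part, but the object it holds still answers questions about headers and raw data.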

 > > I can think of no input for which the parser should *ever* throw an
 > > exception.
 > 
 >    Are you saying that even a random garbage would be parsed to a Message
 > of some kind? No headers, a single unparsed body?..

As long as it contains no NULs or high-bit-set octets, and is
separated into at least two parts, each less than 998 characters long,
by a CRLF, yes, I would definitely expect that an otherwise randomly
generated string would be parsed to a Message.
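
Those conditions can be written down as a rough check. This is a sketch with a hypothetical helper name; it uses the RFC 5322 wording of the line limit (at most 998 characters per line, excluding the CRLF) and does not attempt any other validation.

```python
MAX_LINE_OCTETS = 998  # RFC 5322 hard limit per line, excluding the CRLF


def looks_like_valid_wire_text(data):
    """Rough check of the constraints listed above (hypothetical helper)."""
    if b"\x00" in data:
        return False                      # NUL octets are not allowed
    if any(octet > 127 for octet in data):
        return False                      # high-bit-set octets are not allowed
    # Every CRLF-delimited line must fit within the length limit.
    return all(len(line) <= MAX_LINE_OCTETS for line in data.split(b"\r\n"))


ok = looks_like_valid_wire_text(b"a" * 512 + b"\r\n" + b"b" * 512)
too_long = looks_like_valid_wire_text(b"a" * 1024)
```

Here `ok` is true and `too_long` is false, since 1024 octets without a CRLF exceed the line limit.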

This Message should not be sendable because RFC 5322 requires the
presence of a From and a Date.  However, if you were implementing a
sendmail-compatible MTA or LDA, you might very well wish to accept
such a thing on stdin, parse it to a Message, and then default the
From and Date header fields appropriately, and add a Message-ID header
field.  I would, anyway, wouldn't you?
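
A minimal sketch of that defaulting step, using email.utils.formatdate and email.utils.make_msgid from the standard library. The function name, the sender address, and the "example.invalid" domain are assumptions made for illustration.

```python
import email
from email.utils import formatdate, make_msgid


def fill_required_headers(msg, default_sender):
    """Add From, Date and Message-ID only where the message lacks them."""
    if "From" not in msg:
        msg["From"] = default_sender
    if "Date" not in msg:
        msg["Date"] = formatdate(localtime=True)
    if "Message-ID" not in msg:
        # An explicit domain avoids a hostname lookup; the value is made up.
        msg["Message-ID"] = make_msgid(domain="example.invalid")
    return msg


# Header-less input, as might arrive on stdin of a sendmail-like program.
msg = email.message_from_string("just a body, no headers\n")
fill_required_headers(msg, "local-user@example.invalid")
```

Headers already present in the input are left untouched, so a message that arrives with its own From or Date keeps them.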

Ah, yes, that's another use case, isn't it?!

From stephen at xemacs.org  Thu Oct  8 17:09:54 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 09 Oct 2009 00:09:54 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <0827D8A4-CC48-4B46-9C3E-EB6282D97BD8@python.org>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
	<20091008091840.GB28906@phd.pp.ru>
	<0827D8A4-CC48-4B46-9C3E-EB6282D97BD8@python.org>
Message-ID: <87bpkhrkfx.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:

 >  >>> from email import message_from_string
 >  >>> with open('/dev/urandom') as wire:
 > ...   data = wire.read(1024)
 > ...

# insert A

 >  >>> msg = message_from_string(data)
 >  >>> # number of headers
 > ... len(msg)
 > 0
 >  >>> len(msg.get_payload())
 > 1024
 >  >>> msg.defects
 > []
 > 
 > This actually makes perfect sense.  A message with no headers and a  
 > mass of 1024 bytes in its payload is RFC valid!

If you insert at A

>>> data = "".join(chr(ord(ch) & 127) for ch in data)
>>> # optional with reasonably high probability:
>>> data = data[0:512] + "\r\n" + data[512:1024]

or similar.  Otherwise not. ;-)

From stephen at xemacs.org  Thu Oct  8 17:43:43 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 09 Oct 2009 00:43:43 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
Message-ID: <878wflrivk.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:
 > On Oct 8, 2009, at 3:29 AM, Glenn Linderman wrote:

 > > Headers could possibly be a quadruple instead of a triple, with the  
 > > 4th item being the wire format if received?

I think the whole input format (note, not necessarily wire!) should be
saved off on the top-level Message object (possibly in a file, per
Barry's comments about that).  Subobjects could then refer to
pieces of that as position ranges.
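
As a sketch of what "position ranges" could mean in practice, with hypothetical class names and an in-memory buffer standing in for the saved input:

```python
class SourceBuffer:
    """Hypothetical holder for the complete input exactly as received."""

    def __init__(self, data):
        self.data = data


class Span:
    """Hypothetical subobject that points into the buffer instead of copying."""

    def __init__(self, source, start, end):
        self.source = source
        self.start = start
        self.end = end

    def raw(self):
        # Slice lazily so the original octets are stored only once.
        return self.source.data[self.start:self.end]


source = SourceBuffer(b"Subject: hi\r\n\r\nbody text\r\n")
split = source.data.index(b"\r\n\r\n")
header_span = Span(source, 0, split)
body_span = Span(source, split + 4, len(source.data))
```

Because each span is just a pair of offsets, joining the pieces with the separator reproduces the input exactly, which is the roundtrip property discussed earlier in the thread.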

 > I think not a quad.  I think other APIs should be used to extract the  
 > raw data, e.g.
 > 
 >  >>> # return a unicode or throw an exception
 >  >>> text = str(header)
 >  >>> # should always be okay even if gibberish
 >  >>> raw = bytes(header)
 > 
 > or /something/ like that.

Does that work?  I would think (especially in parallel to text) you
want bytes(header) to be the wire format.  If so, you want it to raise
if it knows it contains gibberish.

And again, we have the problem of whether it should return with the
field name prepended or just the field body.

I have a feeling we should not try to decide what APIs we're going to
spell as __str__ and __bytes__ yet.
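
For concreteness, the variant Barry describes (text may raise, bytes never does) could look like the sketch below. The class `RawHeader` is hypothetical, and returning only the field body is an arbitrary choice; the alternative raised above would have __bytes__ raise as well when the content is known to be gibberish.

```python
class RawHeader:
    """Hypothetical header value that remembers the octets it came from."""

    def __init__(self, raw):
        self._raw = raw

    def __bytes__(self):
        # Always available, even when the octets are gibberish.
        return self._raw

    def __str__(self):
        # Strict ASCII decode: raises UnicodeDecodeError on 8-bit octets.
        return self._raw.decode("ascii")


clean = RawHeader(b"hello world")
dirty = RawHeader(b"\xff\xfe gibberish")
```

With these objects, `str(clean)` and `bytes(dirty)` both succeed, while `str(dirty)` raises UnicodeDecodeError.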

From janssen at parc.com  Thu Oct  8 17:35:49 2009
From: janssen at parc.com (Bill Janssen)
Date: Thu, 8 Oct 2009 08:35:49 PDT
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20091008123133.GA3059@phd.pp.ru>
	<87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <319.1255016149@parc.com>

I should point out that I also store lots of metadata in the registered
MIME format text/rfc822-headers (defined in RFC 1892), data that doesn't
necessarily conform to the specific set of headers mentioned in RFC822.
It would be nice if the header support in the email package would also
support reading and writing that format.

And MIME multipart is sometimes used in applications other than email.
It would be nice if the MIME parsing part of the email module could be
used for those purposes, as well -- basically without some of the
headers defined in 2822 and 2821.

I think of those two as lower-level standalone libraries used by the
higher-level email library.

Bill

From v+python at g.nevcal.com  Thu Oct  8 09:33:47 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Thu, 08 Oct 2009 00:33:47 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <Pine.LNX.4.64.0910080250450.18193@kimball.webabinitio.net>
References: <8510262.7231254589795083.JavaMail.root@boaz>	<4ACB0DC9.7080307@g.nevcal.com>	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACB971D.9080706@g.nevcal.com>	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACC0277.2060807@g.nevcal.com>	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>	<1254929486.96.16481@mint-julep.mondoinfo.com>	<20091007170718.GA1901@phd.pp.ru>	<5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
	<Pine.LNX.4.64.0910080250450.18193@kimball.webabinitio.net>
Message-ID: <4ACD95DB.4040800@g.nevcal.com>

On approximately 10/8/2009 12:16 AM, came the following characters from 
the keyboard of R. David Murray:
> I'd like to try to summarize what I understand Barry to be saying 

Good summary!  Deleted all but one point that I'd like to have clarified...

> The API also provides ways to build up a Message from pieces, or to
> extract information from a Message in pieces, and to modify a Message,
> and again input and output as both text and bytes must be supported.

And I agree that APIs to retrieve any MIME part as undecoded bytes are 
appropriate, as are APIs to retrieve it as decoded strings for 
text MIME parts.  Not sure that non-text MIME parts need to support 
being returned as strings.

So there must be APIs that support obtaining text and (same or 
different) APIs that support obtaining bytes for a given MIME part.  
However, I think it is proper that a MIME part that is not flagged as 
text/* might produce an error if asked for as text.
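
A sketch of that split using calls that exist on the current Message class (get_payload, get_content_maintype, get_content_charset). The two helper names are hypothetical, and "bytes" is taken here to mean the content with transfer encoding removed but no charset decoding, which is an assumed reading.

```python
import email


def part_as_bytes(part):
    """Return the part's content with any transfer encoding removed."""
    return part.get_payload(decode=True)


def part_as_text(part):
    """Return decoded text for text/* parts; refuse anything else."""
    if part.get_content_maintype() != "text":
        raise TypeError("%s is not a text part" % part.get_content_type())
    charset = part.get_content_charset("ascii")
    return part_as_bytes(part).decode(charset)


text_part = email.message_from_string(
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "aGVsbG8=\n"
)
binary_part = email.message_from_string(
    "Content-Type: application/octet-stream\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "AAEC\n"
)
```

Both parts can be fetched as bytes; only `text_part` can be fetched as text, and asking `binary_part` for text raises TypeError.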

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From phd at phd.pp.ru  Thu Oct  8 18:54:05 2009
From: phd at phd.pp.ru (Oleg Broytman)
Date: Thu, 8 Oct 2009 20:54:05 +0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20091008123133.GA3059@phd.pp.ru>
	<87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <20091008165405.GA12047@phd.pp.ru>

On Fri, Oct 09, 2009 at 12:31:43AM +0900, Stephen J. Turnbull wrote:
> Oleg Broytman writes:
>  > > I can think of no input for which the parser should *ever* throw an
>  > > exception.
>  > 
>  >    Are you saying that even a random garbage would be parsed to a Message
>  > of some kind? No headers, a single unparsed body?..
> 
> As long as it contains no NULs or high-bit-set octets, and is
> separated into at least two parts, each less than 998 characters long,
> by a CRLF

   After all, you can think of input that should make a parser raise an
exception, can't you?

> This Message should not be sendable because RFC 5322 requires the
> presence of a From and a Date.  However, if you were implementing a
> sendmail-compatible MTA or LDA, you might very well wish to accept
> such a thing on stdin, parse it to a Message, and then default the
> From and Date header fields appropriately, and add a Message-ID header
> field.  I would, anyway, wouldn't you?
> 
> Ah, yes, that's another use case, isn't it?!

   Absolutely. We're talking about parsing data, not necessarily from SMTP,
and even less necessarily sendable.

Oleg.
-- 
     Oleg Broytman            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From stephen at xemacs.org  Thu Oct  8 19:29:32 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 09 Oct 2009 02:29:32 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <34701D4C-2F91-4969-8A4A-9067402A1E70@python.org>
References: <10506972.7161254576370614.JavaMail.root@boaz>
	<8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org>
	<87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org>
	<87ocoiqi1x.fsf@uwakimon.sk.tsukuba.ac.jp>
	<34701D4C-2F91-4969-8A4A-9067402A1E70@python.org>
Message-ID: <87y6nlpzer.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:

 > Yeah, "idempotency" probably is not the right term, though I think  
 > historically that's what's been used.  Math geeks, what's the right  
 > term here? :)

"Invertibility" *is* the math term.  "Roundtrip" is more likely to make
sense to real people.

 > I completely agree with you (of course :).

Other way around, I'm sure.<wink>

What-about-the-curmudgeon-behind-the-curtain-ly y'rs,


From stephen at xemacs.org  Thu Oct  8 21:06:40 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 09 Oct 2009 04:06:40 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <319.1255016149@parc.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20091008123133.GA3059@phd.pp.ru>
	<87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp>
	<319.1255016149@parc.com>
Message-ID: <87vdippuwv.fsf@uwakimon.sk.tsukuba.ac.jp>

Bill Janssen writes:

 > I should point out that I also store lots of metadata in the registered
 > MIME format text/rfc822-headers (defined in RFC 1892), data that doesn't
 > necessarily conform to the specific set of headers mentioned in RFC822.
 > It would be nice if the header support in the email package would also
 > support reading and writing that format.

I'm not sure what you're saying here.  RFC 822 is inclusive.  More or
less, if it looks like a header, it is a header, and we need to parse
it at least into field name and field body, whether RFC 822 defines
more specific syntax for it or not.
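
As an illustration with the current parser, field names that RFC 822 never mentions still come back as name/body pairs. The field names and values in this sketch are made up.

```python
import email

raw = (
    "X-Document-Title: Quarterly report\n"
    "Keywords: finance, draft\n"
    "\n"
    "payload goes here\n"
)
msg = email.message_from_string(raw)

# items() returns (field name, field body) pairs in their original order.
fields = msg.items()
```

Nothing here depends on the fields being mail-related; anything in "name: body" form before the blank line is treated as a header.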

Is that all, or do you mean you want it to give that MIME format
special treatment, such as a method for converting a Message object
containing a parsed RFC 822 message to a Message object containing a
multipart/report message and a text/rfc822-headers subobject, ready to
have the text/plain and message/delivery-status parts filled in per
RFC 1892?

 > And MIME multipart is sometimes used in applications other than email.
 > It would be nice if the MIME parsing part of the email module could be
 > used for those purposes, as well -- basically without some of the
 > headers defined in 2822 and 2821.

Ditto, here.

I would expect that you could feed an HTTP stream containing headers
and content to the Message constructor and get something sensible
back.  Dunno what Barry thinks of that, though.


From stephen at xemacs.org  Thu Oct  8 21:31:36 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 09 Oct 2009 04:31:36 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <20091008165405.GA12047@phd.pp.ru>
References: <8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20091008123133.GA3059@phd.pp.ru>
	<87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20091008165405.GA12047@phd.pp.ru>
Message-ID: <87skdtptrb.fsf@uwakimon.sk.tsukuba.ac.jp>

Oleg Broytman writes:
 > On Fri, Oct 09, 2009 at 12:31:43AM +0900, Stephen J. Turnbull wrote:
 > > Oleg Broytman writes:
 > >  > > I can think of no input for which the parser should *ever* throw an
 > >  > > exception.
 > >  > 
 > >  >    Are you saying that even a random garbage would be parsed to a Message
 > >  > of some kind? No headers, a single unparsed body?..
 > > 
 > > As long as it contains no NULs or high-bit-set octets, and is
 > > separated into at least two parts, each less than 998 characters long,
 > > by a CRLF
 > 
 >    After all, you can think of input that should make a parser raise an
 > exception, can't you?

No, to throw an error on the example above would be a felony, life
sentence.  Throwing an error on something that had 8-bit octets in it
probably wouldn't be a crime, but I'd sue, and any jury in the land
would award treble damages.  Better try for a change of venue to
Moscow.<wink>


From barry at python.org  Thu Oct  8 21:49:58 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 8 Oct 2009 15:49:58 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87bpkhrkfx.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
	<20091008091840.GB28906@phd.pp.ru>
	<0827D8A4-CC48-4B46-9C3E-EB6282D97BD8@python.org>
	<87bpkhrkfx.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <81800AC4-C935-4EC8-890A-3FB499A2BB95@python.org>

On Oct 8, 2009, at 11:09 AM, Stephen J. Turnbull wrote:

> Barry Warsaw writes:
>
>>  >>> from email import message_from_string
>>  >>> with open('/dev/urandom') as wire:
>> ...   data = wire.read(1024)
>> ...
>
> # insert A
>
>>  >>> msg = message_from_string(data)
>>  >>> # number of headers
>> ... len(msg)
>> 0
>>  >>> len(msg.get_payload())
>> 1024
>>  >>> msg.defects
>> []
>>
>> This actually makes perfect sense.  A message with no headers and a
>> mass of 1024 bytes in its payload is RFC valid!
>
> If you insert at A
>
> >>> data = "".join(chr(ord(ch) & 127) for ch in data)
> >>> # optional with reasonably high probability:
> >>> data = data[0:512] + "\r\n" + data[512:1024]
>
> or similar.  Otherwise not. ;-)

Right!  That makes it legal.

What's interesting of course is that the parser can (and I submit,  
still should) handle the stream even without that.

-Barry



From barry at python.org  Thu Oct  8 21:52:02 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 8 Oct 2009 15:52:02 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <878wflrivk.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
	<878wflrivk.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <D11803DD-2992-4FCF-B38B-2457FAB06A73@python.org>

On Oct 8, 2009, at 11:43 AM, Stephen J. Turnbull wrote:

> I think the whole input format (note, not necessarily wire!) should be
> saved off on the top-level Message object (possibly in a file, per
> Barry's comments about that).  Subobjects could then refer to
> pieces of that as position ranges.

I haven't made up my mind about that (it's been suggested before).   
The tricky thing will be keeping that cache in sync with any other  
model changes through the approved API.  IOW, if I overwrite a  
message's payload, that input format should probably be blown away.
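
The invalidation rule can be sketched like this. `CachedMessage` and its methods are hypothetical, and dropping the whole cache on any payload write is only the simplest possible policy.

```python
class CachedMessage:
    """Hypothetical model object that keeps its original input until edited."""

    def __init__(self, original):
        self._original = original                    # input exactly as received
        # Everything after the first blank line is treated as the payload.
        _, _, self._payload = original.partition(b"\r\n\r\n")

    @property
    def payload(self):
        return self._payload

    @payload.setter
    def payload(self, value):
        self._payload = value
        self._original = None                        # cache is stale, drop it

    def original_input(self):
        """Return the saved input, or None once the model has been modified."""
        return self._original


message = CachedMessage(b"Subject: x\r\n\r\nold body")
before = message.original_input()
message.payload = b"new body"
after = message.original_input()
```

After the assignment, `after` is None: the saved input no longer describes the model, so serialization would have to regenerate the output from the model.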

> I have a feeling we should not try to decide what APIs we're going to
> spell as __str__ and __bytes__ yet.

Very good point.

-Barry


From barry at python.org  Thu Oct  8 21:53:00 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 8 Oct 2009 15:53:00 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <319.1255016149@parc.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20091008123133.GA3059@phd.pp.ru>
	<87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp>
	<319.1255016149@parc.com>
Message-ID: <175C60BB-0E64-40FE-9401-F70E23598506@python.org>

On Oct 8, 2009, at 11:35 AM, Bill Janssen wrote:

> I should point out that I also store lots of metadata in the  
> registered
> MIME format text/rfc822-headers (defined in RFC 1892), data that  
> doesn't
> necessarily conform to the specific set of headers mentioned in  
> RFC822.
> It would be nice if the header support in the email package would also
> support reading and writing that format.
>
> And MIME multipart is sometimes used in applications other than email.
> It would be nice if the MIME parsing part of the email module could be
> used for those purposes, as well -- basically without some of the
> headers defined in 2822 and 2821.
>
> I think of those two as lower-level standalone libraries used by the
> higher-level email library.

I agree with this use case.

-Barry


From barry at python.org  Thu Oct  8 21:54:10 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 8 Oct 2009 15:54:10 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACD95DB.4040800@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>	<4ACB0DC9.7080307@g.nevcal.com>	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACB971D.9080706@g.nevcal.com>	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACC0277.2060807@g.nevcal.com>	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>	<1254929486.96.16481@mint-julep.mondoinfo.com>	<20091007170718.GA1901@phd.pp.ru>	<5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
	<Pine.LNX.4.64.0910080250450.18193@kimball.webabinitio.net>
	<4ACD95DB.4040800@g.nevcal.com>
Message-ID: <440B5F4C-E210-46F0-B647-240CDF091F4D@python.org>

On Oct 8, 2009, at 3:33 AM, Glenn Linderman wrote:

> Not sure that non-text MIME parts need to support being returned as  
> strings.

I don't think they do.  But e.g. an image/* MIME part should support  
returning the decoded image data.

> So there must be APIs that support obtaining text and (same or  
> different) APIs that support obtaining bytes for a given MIME part.   
> However, I think it is proper that a MIME part that is not flagged  
> as text/* might produce an error if asked for as text.

+1
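
As a rough illustration of that split, here is a sketch on top of the
existing Message API.  The helper names get_bytes and get_text are made
up for this example, and TypeError is just one possible choice of error:

```python
from email import message_from_bytes


def get_bytes(part):
    """Decoded payload (base64/quoted-printable undone) for any part."""
    return part.get_payload(decode=True)


def get_text(part):
    """Payload as str; refuses parts that are not flagged as text/*."""
    if part.get_content_maintype() != "text":
        raise TypeError(f"{part.get_content_type()} part has no text form")
    charset = part.get_content_charset() or "us-ascii"
    return get_bytes(part).decode(charset)


image = message_from_bytes(
    b"Content-Type: image/png\n"
    b"Content-Transfer-Encoding: base64\n\n"
    b"iVBORw0KGgo=\n"
)
text = message_from_bytes(b"Content-Type: text/plain; charset=utf-8\n\nhello\n")
```

Here get_text(text) gives back the str, get_bytes(image) gives back the
decoded image data, and get_text(image) raises.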

-Barry


From barry at python.org  Thu Oct  8 21:54:50 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 8 Oct 2009 15:54:50 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87y6nlpzer.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <10506972.7161254576370614.JavaMail.root@boaz>
	<8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org>
	<87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org>
	<87ocoiqi1x.fsf@uwakimon.sk.tsukuba.ac.jp>
	<34701D4C-2F91-4969-8A4A-9067402A1E70@python.org>
	<87y6nlpzer.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <7B06E031-38F1-4047-A88B-E867B6738D52@python.org>

On Oct 8, 2009, at 1:29 PM, Stephen J. Turnbull wrote:

> Barry Warsaw writes:
>
>> Yeah, "idempotency" probably is not the right term, though I think
>> historically that's what's been used.  Math geeks, what's the right
>> term here? :)
>
> "Invertibility" *is* the math term.  "Roundtrip" is more likely to  
> make
> sense to real people.

Thanks.  +1 for roundtrip.

>> I completely agree with you (of course :).
>
> Other way around, I'm sure.<wink>
>
> What-about-the-curmudgeon-behind-the-curtain-ly y'rs,

:)

-Barry


From rdmurray at bitdance.com  Thu Oct  8 21:55:18 2009
From: rdmurray at bitdance.com (R. David Murray)
Date: Thu, 8 Oct 2009 15:55:18 -0400 (EDT)
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <81800AC4-C935-4EC8-890A-3FB499A2BB95@python.org>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
	<20091008091840.GB28906@phd.pp.ru>
	<0827D8A4-CC48-4B46-9C3E-EB6282D97BD8@python.org>
	<87bpkhrkfx.fsf@uwakimon.sk.tsukuba.ac.jp>
	<81800AC4-C935-4EC8-890A-3FB499A2BB95@python.org>
Message-ID: <Pine.LNX.4.64.0910081553260.18193@kimball.webabinitio.net>

On Thu, 8 Oct 2009 at 15:49, Barry Warsaw wrote:
> On Oct 8, 2009, at 11:09 AM, Stephen J. Turnbull wrote:
>
>> Barry Warsaw writes:
>> 
>> > > > > from email import message_from_string
>> > > > > with open('/dev/urandom') as wire:
>> > ...   data = wire.read(1024)
>> > ...
>> 
>> # insert A
>> 
>> > > > > msg = message_from_string(data)
>> > > > > # number of headers
>> > ... len(msg)
>> > 0
>> > > > > len(msg.get_payload())
>> > 1024
>> > > > > msg.defects
>> > []
>> > 
>> > This actually makes perfect sense.  A message with no headers and a
>> > mass of 1024 bytes in its payload is RFC valid!
>> 
>> If you insert at A
>> 
>> > > > wire = "".join(chr(ord(ch) & 127) for ch in wire)
>> > > > # optional with reasonably high probability:
>> > > > wire = wire[0:512] + "\r\n" + wire[512:1024]
>> 
>> or similar.  Otherwise not. ;-)
>
> Right!  That makes it legal.
>
> What's interesting of course is that the parser can (and I submit, still 
> should) handle the stream even without that.

But it should be recording a couple defects in that case, right?
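
For anyone replaying the quoted session on py3k, here is a sketch of the
"insert A" step in terms of bytes.  The quoted lines operate on `wire`
(the file object), where `data` is presumably what was meant; this
version assumes so, and uses os.urandom instead of /dev/urandom:

```python
import os

data = os.urandom(1024)

# "insert A": clear the high bit so every octet is 7-bit ...
data = bytes(b & 127 for b in data)
# ... and break the single 1024-octet line in two.
data = data[:512] + b"\r\n" + data[512:]
```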

--David

From barry at python.org  Thu Oct  8 21:57:17 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 8 Oct 2009 15:57:17 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87vdippuwv.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20091008123133.GA3059@phd.pp.ru>
	<87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp>
	<319.1255016149@parc.com>
	<87vdippuwv.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <6B772657-8600-41E7-9E29-8984D70121FD@python.org>

On Oct 8, 2009, at 3:06 PM, Stephen J. Turnbull wrote:

> Bill Janssen writes:
>
>> I should point out that I also store lots of metadata in the  
>> registered
>> MIME format text/rfc822-headers (defined in RFC 1892), data that  
>> doesn't
>> necessarily conform to the specific set of headers mentioned in  
>> RFC822.
>> It would be nice if the header support in the email package would  
>> also
>> support reading and writing that format.
>
> I'm not sure what you're saying here.  RFC 822 is inclusive.  More or
> less, if it looks like a header, it is a header, and we need to parse
> it at least into field name and field body, whether RFC 822 defines
> more specific syntax for it or not.

The way I read it was that certain RFC 5322 requirements should be  
relaxed in certain cases, e.g. line length limits.  If you're mutating  
the model, you wouldn't necessarily (ever? always?) throw an exception  
for long lines.

> Is that all, or do you mean you want it to give that MIME format
> special treatment, such as a method for converting a Message object
> containing a parsed RFC 822 message to a Message object containing a
> multipart/report message and a text/rfc822-headers subobject, ready to
> have the text/plain and message/delivery-status parts filled in per
> RFC 1892?
>
>> And MIME multipart is sometimes used in applications other than  
>> email.
>> It would be nice if the MIME parsing part of the email module could  
>> be
>> used for those purposes, as well -- basically without some of the
>> headers defined in 2822 and 2821.
>
> Ditto, here.
>
> I would expect that you could feed an HTTP stream containing headers
> and content to the Message constructor and get something sensible
> back.  Dunno what Barry thinks of that, though.

I think the Python community would expect the email package to support  
this and similar use cases.

-Barry


From barry at python.org  Thu Oct  8 22:00:16 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 8 Oct 2009 16:00:16 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <Pine.LNX.4.64.0910081553260.18193@kimball.webabinitio.net>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
	<20091008091840.GB28906@phd.pp.ru>
	<0827D8A4-CC48-4B46-9C3E-EB6282D97BD8@python.org>
	<87bpkhrkfx.fsf@uwakimon.sk.tsukuba.ac.jp>
	<81800AC4-C935-4EC8-890A-3FB499A2BB95@python.org>
	<Pine.LNX.4.64.0910081553260.18193@kimball.webabinitio.net>
Message-ID: <53530105-6707-4457-8A56-97907980A378@python.org>

On Oct 8, 2009, at 3:55 PM, R. David Murray wrote:

> On Thu, 8 Oct 2009 at 15:49, Barry Warsaw wrote:
>> On Oct 8, 2009, at 11:09 AM, Stephen J. Turnbull wrote:
>>
>>> Barry Warsaw writes:
>>> > > > > from email import message_from_string
>>> > > > > with open('/dev/urandom') as wire:
>>> > ...   data = wire.read(1024)
>>> > ...
>>> # insert A
>>> > > > > msg = message_from_string(data)
>>> > > > > # number of headers
>>> > ... len(msg)
>>> > 0
>>> > > > > len(msg.get_payload())
>>> > 1024
>>> > > > > msg.defects
>>> > []
>>> > This actually makes perfect sense.  A message with no headers and a
>>> > mass of 1024 bytes in its payload is RFC valid!
>>> If you insert at A
>>> > > > wire = "".join(chr(ord(ch) & 127) for ch in wire)
>>> > > > # optional with reasonably high probability:
>>> > > > wire = wire[0:512] + "\r\n" + wire[512:1024]
>>> or similar.  Otherwise not. ;-)
>>
>> Right!  That makes it legal.
>>
>> What's interesting of course is that the parser can (and I submit,  
>> still should) handle the stream even without that.
>
> But it should be recording a couple defects in that case, right?

Possibly so, although maybe on the header instances.  email currently  
doesn't support that, but it probably should.

Which makes for an interesting idea.  Let's say protocol PML defines  
their formats in terms of RFC 5322, but with a line length limit of  
10k and allows 8-bit.  email would parse that just fine but might drop  
a few defects onto some headers.  The wrapper around PML could then  
remove those defects since they aren't defects in that protocol.  And  
the generator would still DTRT, though it's possible you'd need  
subclasses of the email package to support that.  Yet another  
interesting API challenge then.
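
A sketch of what such a wrapper might look like with today's API.  PML
is the made-up protocol from above, parse_pml is a hypothetical name,
and since email has no line-length or 8-bit defect classes to filter
on, two existing multipart defect classes stand in for them here:

```python
from email import errors, message_from_bytes

# Assumption: the defect classes the hypothetical PML protocol tolerates.
PML_TOLERATED = (errors.NoBoundaryInMultipartDefect,
                 errors.MultipartInvariantViolationDefect)


def parse_pml(raw: bytes):
    """Parse with the stock parser, then drop defects PML does not mind."""
    msg = message_from_bytes(raw)
    for part in msg.walk():
        part.defects[:] = [d for d in part.defects
                           if not isinstance(d, PML_TOLERATED)]
    return msg


raw = b"Content-Type: multipart/mixed\n\nno boundary parameter here\n"
strict = message_from_bytes(raw)   # defects as email sees them
relaxed = parse_pml(raw)           # defects as PML sees them
```

This only covers defects on message objects; defects on header
instances would need the support discussed above.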

-Barry


From barry at python.org  Thu Oct  8 22:03:51 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 8 Oct 2009 16:03:51 -0400
Subject: [Email-SIG] Pycon 2010 sprint
Message-ID: <EED2D719-31AA-4F39-B3BC-FCF7EB74E6A2@python.org>

It's early still, but I'd like to get a sense of who might be  
interested in sprinting on email at Pycon 2010 in Atlanta.  I think  
the dates will be something around the week of 22-Feb-2010.  I'm sure  
I will have tension between wanting to sprint on email and wanting to  
sprint on Mailman (and possibly some Canonical stuff).  I think an  
email sprint would only work if there were critical mass.  Based on my  
past experience, I think we need at least three to five experts, with  
other interested hackers of course welcome.

No need to commit right now, but something to think about.  RDM,  
you'll probably be there right?  Who else?

-Barry


From rdmurray at bitdance.com  Thu Oct  8 22:18:00 2009
From: rdmurray at bitdance.com (R. David Murray)
Date: Thu, 8 Oct 2009 16:18:00 -0400 (EDT)
Subject: [Email-SIG] Pycon 2010 sprint
In-Reply-To: <EED2D719-31AA-4F39-B3BC-FCF7EB74E6A2@python.org>
References: <EED2D719-31AA-4F39-B3BC-FCF7EB74E6A2@python.org>
Message-ID: <Pine.LNX.4.64.0910081615110.18193@kimball.webabinitio.net>

On Thu, 8 Oct 2009 at 16:03, Barry Warsaw wrote:
> No need to commit right now, but something to think about.  RDM, you'll 
> probably be there right?  Who else?

I would expect to be, though I'll probably want to drop in on Core
as well.

--David (RDM)

From rdmurray at bitdance.com  Thu Oct  8 22:19:37 2009
From: rdmurray at bitdance.com (R. David Murray)
Date: Thu, 8 Oct 2009 16:19:37 -0400 (EDT)
Subject: [Email-SIG] Pycon 2010 sprint
In-Reply-To: <EED2D719-31AA-4F39-B3BC-FCF7EB74E6A2@python.org>
References: <EED2D719-31AA-4F39-B3BC-FCF7EB74E6A2@python.org>
Message-ID: <Pine.LNX.4.64.0910081618080.18193@kimball.webabinitio.net>

On Thu, 8 Oct 2009 at 16:03, Barry Warsaw wrote:
> sprinting on email at Pycon 2010 in Atlanta.  I think the dates will be 
> something around the week of 22-Feb-2010.  I'm sure I will have tension

The Sprint dates are listed as the 22nd to the 25th on the pycon
website.

--David (RDM)

From janssen at parc.com  Thu Oct  8 22:31:27 2009
From: janssen at parc.com (Bill Janssen)
Date: Thu, 8 Oct 2009 13:31:27 PDT
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87vdippuwv.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20091008123133.GA3059@phd.pp.ru>
	<87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp>
	<319.1255016149@parc.com>
	<87vdippuwv.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <15809.1255033887@parc.com>

Stephen J. Turnbull <stephen at xemacs.org> wrote:

> I'm not sure what you're saying here.  RFC 822 is inclusive.  More or
> less, if it looks like a header, it is a header, and we need to parse
> it at least into field name and field body, whether RFC 822 defines
> more specific syntax for it or not.

That's right.  I was just pointing out that there might be any
collection of headers, even collections without "From" or "Date".

Bill

From janssen at parc.com  Thu Oct  8 22:32:19 2009
From: janssen at parc.com (Bill Janssen)
Date: Thu, 8 Oct 2009 13:32:19 PDT
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <6B772657-8600-41E7-9E29-8984D70121FD@python.org>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<87my42qg96.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20091008123133.GA3059@phd.pp.ru>
	<87ab01rjfk.fsf@uwakimon.sk.tsukuba.ac.jp>
	<319.1255016149@parc.com>
	<87vdippuwv.fsf@uwakimon.sk.tsukuba.ac.jp>
	<6B772657-8600-41E7-9E29-8984D70121FD@python.org>
Message-ID: <15839.1255033939@parc.com>

Barry Warsaw <barry at python.org> wrote:

> On Oct 8, 2009, at 3:06 PM, Stephen J. Turnbull wrote:
> 
> > Bill Janssen writes:
> >
> >> I should point out that I also store lots of metadata in the
> >> registered
> >> MIME format text/rfc822-headers (defined in RFC 1892), data that
> >> doesn't
> >> necessarily conform to the specific set of headers mentioned in
> >> RFC822.
> >> It would be nice if the header support in the email package would
> >> also
> >> support reading and writing that format.
> >
> > I'm not sure what you're saying here.  RFC 822 is inclusive.  More or
> > less, if it looks like a header, it is a header, and we need to parse
> > it at least into field name and field body, whether RFC 822 defines
> > more specific syntax for it or not.
> 
> The way I read it was that certain RFC 5322 requirements should be
> relaxed in certain cases, e.g. line length limits.  If you're mutating
> the model, you wouldn't necessarily (ever? always?) throw an exception
> for long lines.

Yes, that's a good build.

Bill

From stephen at xemacs.org  Thu Oct  8 23:29:23 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 09 Oct 2009 06:29:23 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <440B5F4C-E210-46F0-B647-240CDF091F4D@python.org>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
	<Pine.LNX.4.64.0910080250450.18193@kimball.webabinitio.net>
	<4ACD95DB.4040800@g.nevcal.com>
	<440B5F4C-E210-46F0-B647-240CDF091F4D@python.org>
Message-ID: <87my41pob0.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:
 > On Oct 8, 2009, at 3:33 AM, Glenn Linderman wrote:
 > 
 > > Not sure that non-text MIME parts need to support being returned as  
 > > strings.
 > 
 > I don't think they do.

Most non-text media do support comments, though.  I don't know if
extracting comments is a reasonable response to a request for text
from an image, but we should provide a place to put any text that the
callbacks that do the actual work of decoding might return.

 > > However, I think it is proper that a MIME part that is not flagged  
 > > as text/* might produce an error if asked for as text.
 > 
 > +1

That doesn't preclude raising an error/returning a defect object in
many or most use cases, but there may be use cases where it would be
useful to allow a callback on a non-text object to return text.

From v+python at g.nevcal.com  Thu Oct  8 23:59:38 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Thu, 08 Oct 2009 14:59:38 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
Message-ID: <4ACE60CA.6010907@g.nevcal.com>

On approximately 10/8/2009 6:00 AM, came the following characters from 
the keyboard of Barry Warsaw:
> On Oct 8, 2009, at 3:29 AM, Glenn Linderman wrote:
>> The application options are to drop the attachment, or pass through 
>> the corrupted bytes, and let the next application try to make sense 
>> of it.
>
> Exactly, and it's not for the email package to say which is right.
>
> Here's a use case: I've got a Message that was parsed from wire input 
> and I want to mangle the Subject heading to add the list prefix.  I 
> know exactly what charset the prefix is in because that's data I 
> control.  When I ask for the original Subject value, I'm handed an 
> instance that I can use to try to figure out how to add the prefix.
>
> First thing I'll ask it is "are you a single chunk in my prefix 
> charset (or compatible)?"  If so, I can probably just prepend my 
> prefix onto the value.  If not, "are you composed of multiple valid 
> chunks in different charsets?"  If so, I know that I need to encode my 
> prefix, but I can still prepend it to the header value (hopefully 
> using the same API, and I don't care that the implementation could not 
> use string concatenation).
>
> If not, then what?  Maybe I don't care if some of the chunk charsets 
> aren't known because I can still use the right encode+prepend 
> strategy.  But if the header is a gobbledegook of 8-bit bytes?  I'm 
> pretty sure I want to be able to ask the API if that's the case rather 
> than get an exception.  The thing I'm not so sure about is what 
> happens if my application is just naive enough to just ask for the 
> header as a unicode and that conversion can't be made.  I /think/ it 
> should raise an exception in that case.  But then when I ask for the 
> header value as a mass of bytes, that should succeed and return me the 
> raw input. 

So for this use case, it is known that all headers are ASCII.  So the 
operation of prepending a list prefix should not care whether the 
Subject: value is valid or not... it can simply prepend the list prefix, 
followed by SP, to the existing, raw header that already exists.

The only remaining issue is line length limits, so maybe it has to use 
CR LF TAB instead of space, sometimes.

OK, so if the prefix is not ASCII, it gets separately encoded, including 
a trailing SP, and then prepended to the value followed by SP or CR LF 
TAB depending on the line length limit.

So to prepend into a text header, you shouldn't need to decode the 
undecodable... there should be a prepend (and possibly also an append) 
operation provided by the API, so that applications can tweak headers 
without decoding.  This allows useful behavior even if new methods of 
encoding are invented that are not yet understood by a particular 
version of the email library.
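
A minimal sketch of such a prepend operation, assuming the raw header
value is available as bytes.  prepend_prefix is a hypothetical name, not
an existing API, and folding for the line length limit is left out:

```python
from email.header import Header


def prepend_prefix(raw_value: bytes, prefix: str) -> bytes:
    """Put a list prefix in front of a raw header value without decoding it."""
    try:
        encoded = prefix.encode("ascii")
    except UnicodeEncodeError:
        # Non-ASCII prefix: encode it separately as an RFC 2047 encoded-word.
        encoded = Header(prefix, "utf-8").encode().encode("ascii")
    # Line-length folding (CR LF TAB) is deliberately not handled here.
    return encoded + b" " + raw_value


# The existing value uses a charset this code knows nothing about.
subject = b"=?x-not-yet-invented?q?whatever?="
tagged = prepend_prefix(subject, "[Email-SIG]")
```

The existing value is never inspected, so it does not matter whether it
is decodable.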

Asking for the header value (or whole header) in Unicode should decode 
the chunks that are understandable and decodable, and leave the chunks 
that are not understandable as 
ASCII-converted-to-Unicode-but-still-possibly-weirdly-encoded ... I 
think that is what the RFCs encourage.
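
Roughly, and only as a sketch: the function below (header_to_unicode is
a made-up name) decodes each RFC 2047 encoded-word it understands and
leaves the others exactly as they arrived.  It ignores the rule about
dropping whitespace between adjacent encoded-words:

```python
import binascii
import re

# One RFC 2047 encoded-word: =?charset?encoding?text?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


def _decode_word(match):
    charset, encoding, text = match.groups()
    try:
        if encoding.lower() == "b":
            raw = binascii.a2b_base64(text)
        else:
            raw = binascii.a2b_qp(text, header=True)
        return raw.decode(charset)
    except (LookupError, ValueError):
        # Unknown charset or undecodable bytes: keep the chunk as it was.
        return match.group(0)


def header_to_unicode(value: str) -> str:
    """Decode the understandable chunks, pass the rest through untouched."""
    return ENCODED_WORD.sub(_decode_word, value)
```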

Asking for a header as bytes should return the wire data, if it is 
available, or an encoding of real data as wire data (like generate would 
do).  There is no Unicode that cannot be encoded to wire format, IIUC, 
usually via a variety of heuristics once non-ASCII characters are 
included, that may produce a variety of differing results, all of which 
should decode back to the original data.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From v+python at g.nevcal.com  Fri Oct  9 00:39:23 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Thu, 08 Oct 2009 15:39:23 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
Message-ID: <4ACE6A1B.7060702@g.nevcal.com>

On approximately 10/8/2009 6:00 AM, came the following characters from 
the keyboard of Barry Warsaw:
> On Oct 8, 2009, at 3:29 AM, Glenn Linderman wrote:
>> And I agree that APIs to retrieve any MIME part as undecoded bytes are 
>> appropriate; and to retrieve it as decoded strings is appropriate for 
>> text MIME parts.  Not sure that non-text MIME parts need to support 
>> being returned as strings.
>
> I hate to open another can of worms, but I've been thinking about this 
> a lot too :).  It's been discussed on list before, so nothing new 
> here.  I think the parser and MIME classes need to be hookable for 
> decoding their contents.  For example, if you have a text/* it might 
> well make sense to support bytes() and str()/unicode() on the part 
> instance.  But if it's image/* str() makes no sense.  part.decode() or 
> something similar makes sense, but this needs to be extensible because 
> the email package will not know how to convert every content-type.  At 
> best it will only know how to decode content-types that Python's 
> stdlib knows about.

Seems like the following should be obtainable from a MIME part:

1) wire format.  Either what came in, in the parser case, or what would 
be generated.
2) internal headers from the MIME part
3) decoded BLOB.  This means that quopri and base64 are decoded, no more 
and no less.  This is bytes.  No headers, only payload.  For 
Content-Transfer-Encoding: binary, this is mostly a noop.
4) text/* parts should also be obtainable as str()/unicode(), payload 
only.  This is where charset decoding is done.

I think your talk in the next paragraph about hooks and other object 
types being produced is a generalization of 4, not 3, and generally no 
additional decoding needs to be done, just conversion to the right 
object type (or file, or file-like object).
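
For comparison, here is approximately how the four levels map onto the
existing Message API, as a sketch.  Note that as_bytes() regenerates the
wire format from the model; it does not hand back saved input:

```python
from email import message_from_bytes

wire = (b"Content-Type: text/plain; charset=utf-8\n"
        b"Content-Transfer-Encoding: base64\n"
        b"\n"
        b"Y2Fmw6k=\n")
part = message_from_bytes(wire)

wire_format = part.as_bytes()            # 1) wire format (regenerated)
headers = part.items()                   # 2) the part's own headers
blob = part.get_payload(decode=True)     # 3) base64/quopri undone, still bytes
charset = part.get_content_charset() or "us-ascii"
text = blob.decode(charset)              # 4) charset decoding, text/* only
```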

> The problem is that if the bytes came off the wire, the parser 
> currently can only attach the most basic MIME base class.  It doesn't 
> know that an image/png should create a MIMEImagePNG instance there.  
> This is different from hacking the model directly because the 
> application can instantiate the right class.  So the parser either has 
> to have a hookable way for an application to go from content-type to 
> class, or the generic MIME base class needs to be hookable in its 
> .decode() method. 

So either the email package can stop at 3, and 4 only for text/* parts, 
or it could learn more types (registered types, with well-defined 
corresponding objects could be potentially built-in to the email 
package), and/or it could become hookable for application types.  Of 
course, for disposition to files, storing the BLOB in a file of the 
right name is adequate... to avoid the file, I agree that converting to 
a useful object type is handy.  But maybe file-like objects would 
suffice, for most of the types.
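
One way the hookable variant could look, purely as a sketch.  DECODERS,
register and decode are hypothetical names; the fallback for
unregistered types is a file-like object over the BLOB:

```python
import io
from email import message_from_bytes

# Hypothetical registry: content-type -> callable(part, blob) -> object.
DECODERS = {}


def register(content_type):
    """Decorator an application uses to hook in a decoder for one type."""
    def wrap(func):
        DECODERS[content_type] = func
        return func
    return wrap


@register("text/plain")
def decode_text(part, blob):
    return blob.decode(part.get_content_charset() or "us-ascii")


def decode(part):
    """Dispatch on content-type; unknown types get a file-like object."""
    blob = part.get_payload(decode=True)
    hook = DECODERS.get(part.get_content_type())
    if hook is None:
        return io.BytesIO(blob)
    return hook(part, blob)


txt = message_from_bytes(b"Content-Type: text/plain; charset=utf-8\n\nhello\n")
png = message_from_bytes(
    b"Content-Type: image/png\n"
    b"Content-Transfer-Encoding: base64\n\n"
    b"iVBORw0KGgo=\n"
)
```

An application that knows about images would register its own decoder
for image/png and get an image object back instead of the BytesIO.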

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From barry at python.org  Fri Oct  9 00:39:56 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 8 Oct 2009 18:39:56 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACE60CA.6010907@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
	<4ACE60CA.6010907@g.nevcal.com>
Message-ID: <EF616A50-C6CE-47EF-BA55-185DE42EC459@python.org>

On Oct 8, 2009, at 5:59 PM, Glenn Linderman wrote:

> So to prepend into a text header, you shouldn't need to decode the  
> undecodable...

Except that you also have to collapse Re:'s and move them to the front  
of the string.

-Barry


From v+python at g.nevcal.com  Fri Oct  9 00:50:37 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Thu, 08 Oct 2009 15:50:37 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>	<4ACB0DC9.7080307@g.nevcal.com>	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACB971D.9080706@g.nevcal.com>	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACC0277.2060807@g.nevcal.com>	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACCD10D.4070308@g.nevcal.com>
	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <4ACE6CBD.2030805@g.nevcal.com>

On approximately 10/8/2009 4:40 AM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> Glenn Linderman writes:
>
>  > >  > If conversions are avoided, then octets are unlikely to be out of 
>  > >  > range?
>  > >
>  > > Haven't looked in your spam bucket recently, I guess.  Spammers
>  > > regularly put 8 bit characters into headers (and into bodies in
>  > > messages without a Content-Type header), for one thing.
>  > 
>  > I'm aware of that, but if conversions are not done, octets are unlikely 
>  > to be _reported_ to be out of range....
>
> Conversions will eventually be done.  "Best it were done quickly."
>   

Disagree.  Deferring the conversions defers failure issues to the point 
where the code (hopefully) somewhat understands the type of data being 
manipulated, and can then handle it appropriately.  Converting up front 
causes errors in things that may never be touched or needed, so the 
error detection and handling is wasteful.

>  > > Most clients are simply not going to be prepared for the kind of
>  > > crap I see in /var/mail/turnbull every day.
>  > 
>  > Are you referring to most email clients, or most 
>  > Python-email-library-using clients?
>
> Sorry.  When I mean "MUA" I try to say "MUA".  By "client", I'm
> referring to the higher level logic that is going to be calling the
> email module.
>   

Yeah, terminology between people that haven't discussed the topic before 
can slow communication.

So for headers, which are supposed to be ASCII, or encoded via RFC rules 
to ASCII (no 8-bit chars), then the discovery of an 8-bit char should 
produce a defect report, and the char is then simply converted to Unicode as if it 
were Latin-1 (since there is no other knowledge available that could 
produce a better conversion).  And if the result of that is not expected 
by the client (your definition), then the client should either notice 
the defect report and reject it based on that, or attempt to parse it, 
and reject it if it encounters unexpected syntax.  After all, this is, 
for that client, "raw user input" (albeit from a remote source) so fully 
error checking the input is appropriate.
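
A minimal Python sketch of that behaviour, assuming a hypothetical helper 
named decode_header_octets (it is not part of the email package): it 
never raises, records a defect for the first 8-bit octet it meets, and 
falls back to Latin-1 for the conversion.

```python
def decode_header_octets(raw: bytes):
    """Return (text, defects) for one raw header line; never raises."""
    defects = []
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError as exc:
        # Record the problem instead of raising, then fall back to Latin-1,
        # which maps every octet 0x00-0xFF to a code point and cannot fail.
        defects.append(
            "non-ASCII octet 0x%02x at offset %d" % (raw[exc.start], exc.start)
        )
        text = raw.decode("latin-1")
    return text, defects


# Hypothetical input: a Subject line containing one raw 8-bit octet.
text, defects = decode_header_octets(b"Subject: caf\xe9")
```

The client can then either reject the header because defects is 
non-empty, or go on to parse text and reject it on unexpected syntax.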

>  > Is it your point of view, then, that incorrectly formed email should be 
>  > mostly treated as SPAM?
>
> Heavens no!  Not by the email module, anyway!  The email module should
> not know about spam (but see Barry's "we're having spam for Launchpad"
> post: if you're that good, anything goes!), except maybe at a very
> high level.
>   

I didn't think you'd think that, but things you were saying seemed to be 
implying that.

>  > Your "hit me with your best shot" comment indicates that you want a
>  > failure code or exception when the data is bad, and then a way to
>  > "retry accepting errors"?
>
> My current thinking is that the email module should return an object
> representing a partial parse.  The way that you find out if it is
> partial is to try to access some data that "should" be in the object.
> If the parse succeeded, the accessor returns the data (which might be
> empty).  If the parse did not succeed, you get an AttributeError.
> (This is just a paraphrase of what I wrote in response to Oleg.)

yeah, or some error, anyway.

The problem with the APIs that are spelled __str__ and __bytes__ is that 
there is no way to return errors other than exceptions... the 
Python way.  Since the email library is trying to avoid raising 
exceptions in large blocks of its code, it is non-Pythonic (which is 
what Oleg is probably complaining about, in part).  But because it needs 
to avoid exceptions, and is therefore non-Pythonic, it may be 
inappropriate to spell very many of its APIs __str__ and __bytes__, 
because that is Pythonic, and requires exceptions.  Once you become 
non-Pythonic in one area, you may have to also be non-Pythonic in some 
other areas...

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From v+python at g.nevcal.com  Fri Oct  9 01:02:47 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Thu, 08 Oct 2009 16:02:47 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <EF616A50-C6CE-47EF-BA55-185DE42EC459@python.org>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
	<4ACE60CA.6010907@g.nevcal.com>
	<EF616A50-C6CE-47EF-BA55-185DE42EC459@python.org>
Message-ID: <4ACE6F97.6010605@g.nevcal.com>

On approximately 10/8/2009 3:39 PM, came the following characters from 
the keyboard of Barry Warsaw:
> On Oct 8, 2009, at 5:59 PM, Glenn Linderman wrote:
>
>> So to prepend into a text header, you shouldn't need to decode the 
>> undecodable...
>
> Except that you also have to collapse Re:'s and move them to the front 
> of the string. 

Well, that is a feature of some mailing list programs.  Those that want 
to do that, will have to decode and re-encode.

However, there are definitely mailing lists that don't do that.  Google 
Groups is one example that doesn't collapse, and always prepends the 
headers in front of Re:.  Seems like all the Python lists do the 
collapsing (I wonder why! :) )  Other lists don't do prepending (I think 
the RFCs recommend not prepending in Subject, actually).  Of the other 
lists I'm subscribed to that do prepend, some collapse and some don't.

I'm saying that there are use cases where prepending could be done 
without decoding; while you are positing use cases where that is 
insufficient, but you shouldn't have said "Except"... you should have 
said "There are also other use cases".

And when you collapse Re:, do you also collapse various 
language-specific spellings of Re: ???  that is a hard problem.

And don't forget removing the prior prepended text before adding the new 
prepended text.

Actually, as long as the prepended text is ASCII, all that work can be 
done on the encoded value.  When it is not ASCII, it may still be 
separated and recognizable.  Still that logic is more complex than 
decoding, handling as Unicode, and encoding.... when it works.  Just 
pointing out that there is more than one way to do things...

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From mark at msapiro.net  Fri Oct  9 01:20:23 2009
From: mark at msapiro.net (Mark Sapiro)
Date: Thu, 8 Oct 2009 16:20:23 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACE6F97.6010605@g.nevcal.com>
Message-ID: <PC19220091008162023032819903bdf@msapiro>

Glenn Linderman wrote:
>
>However, there are definitely mailing lists that don't do that.  Google 
>Groups is one example that doesn't collapse, and always prepends the 
>headers in front of Re:.  Seems like all the Python lists do the 
>collapsing (I wonder why! :) )  Other lists don't do prepending (I think 
>the RFCs recommend not prepending in Subject, actually), of the others 
>I'm subscribed to, that prepend, some collapse and some don't.


You seem to be forgetting the case where the encoded subject already
contains the prefix, or do you not care if the subject just continues
to grow with Re:'s and repeated prefixes?

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


From rdmurray at bitdance.com  Fri Oct  9 01:52:10 2009
From: rdmurray at bitdance.com (R. David Murray)
Date: Thu, 8 Oct 2009 19:52:10 -0400 (EDT)
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACD95DB.4040800@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
	<Pine.LNX.4.64.0910080250450.18193@kimball.webabinitio.net>
	<4ACD95DB.4040800@g.nevcal.com>
Message-ID: <Pine.LNX.4.64.0910081927150.18193@kimball.webabinitio.net>

On Thu, 8 Oct 2009 at 00:33, Glenn Linderman wrote:
> On approximately 10/8/2009 12:16 AM, came the following characters from the 
> keyboard of R. David Murray:
>>  I'd like to try to summarize what I understand Barry to be saying 
>
> Good summary!  Deleted all but one point that I'd like to have clarified...

Thanks.

I have revised my summary to take into account the feedback received.
Specifically:  I reworded it so that parsing/serialization never raise
errors, and that text methods for binary subparts may not make sense.
I added the proposal for the object/attribute error method of handling
errors in the query portion of the API.  I replaced 'idempotent' with
'invertible' (I didn't use 'roundtrip' because it isn't euphonious as an
adjective...I just couldn't bring myself to write 'roundtrippable';
however, if the consensus is that it is clearer I will use it.)

I've added a page to the email wiki[1] with this version of the summary,
as a 'design overview proposal'.  I'll also include the revised text here.
Additional comments welcome.

--David

PS: I also updated the release targets and dates on the wiki.

[1] http://wiki.python.org/moin/Email%20SIG

-----------------------------------------------------------------

The email package consists of two major conceptual pieces: the API, and
the internal data model.  The API needs to have facilities for accepting
data in either text format or bytes format, and this data is used to
generate a model of the input message (a Message).  Likewise the API needs
to provide facilities for serializing a Message as either bytes or text.
The API also provides ways to build up a Message from pieces, or to
extract information from a Message in pieces, and to modify a Message,
and again input and output as both text and bytes must be supported,
except that in some cases text output may not make sense (eg: binary
attachments).

The data model used by the email package is an "implementation detail",
and we should not spend effort at this stage trying to optimize it for
anything except memory requirements with respect to potentially large
sub-objects, and even there it is more a matter of providing ways to
deal with potentially large sub-objects than it is a true optimization.
In general correctness and robustness are much more important than speed.

The data model will need to be a practical hybrid of the input data,
possibly transformed in some way in some cases, and various sorts of
meta-data.  The current email package already works this way.

An important characteristic of the model is that it be invertible whenever
sensible; that is, if a given byte stream is used to create a Message or
subobject, serializing that Message or subobject as bytes should return
the original byte stream whenever sensible (ie: when the data is not
pathologically malformed).  Likewise if a text stream is used to create
a Message or subobject, serializing it as text should produce, whenever
sensible, the original text stream.  In particular, well-formed (per RFC)
message data should always come out of a round trip through the email
module in exactly the format it went in.
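
The round-trip property can be written as a one-line check.  The sketch
below uses the bytes entry points that later versions of the standard
library's email package provide (email.message_from_bytes and
Message.as_bytes); it only illustrates the property on one short,
well-formed message and is not a statement about the design proposed here.

```python
import email

# A short, well-formed message (sample data made up for this sketch).
raw = b"From: a@example.com\nSubject: hello\n\nbody text\n"

msg = email.message_from_bytes(raw)   # bytes in
round_tripped = msg.as_bytes()        # bytes out

# Invertibility: serializing the parsed message gives back the input.
is_invertible = (round_tripped == raw)
```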

An important property of the API is that both the parser that transforms
an input stream into a Message and Message serialization should not
raise exceptions.  Instead a defects list is maintained and exposed
through the API.  In the face of some defects it may not be sensible to
maintain invertibility.  In the worst case for parser input the resulting
Message object may have no headers, a binary blob body, and a defect list,
but a Message object will always be produced.
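
As a hedged illustration of "defects instead of exceptions", the sketch
below feeds a deliberately broken message to the parser as it exists in
current standard-library releases.  The input is made up for the example:
the header promises a multipart body, but no boundary line ever appears.

```python
import email
from email.message import Message

# The Content-Type promises a multipart body, but no boundary line follows.
raw = (
    'Content-Type: multipart/mixed; boundary="XXX"\n'
    "\n"
    "there is no boundary line anywhere in this body\n"
)

msg = email.message_from_string(raw)   # parsing does not raise

# The problems found while parsing are recorded on the object itself.
defect_names = [type(d).__name__ for d in msg.defects]
```

The caller still gets a Message back and can inspect msg.defects to
decide whether to process, quarantine, or reject it.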

The APIs that manipulate the data model either for piecewise construction
or for transformations may raise exceptions, and in most cases _should_
raise exceptions when encountering invalid data or operations.  APIs that
query the model should return as much information as possible without
throwing an exception.  (The current proposal to implement this is
to return objects that have defect lists, and/or raise exceptions when
methods of the object are called that would have worked if the input
data were valid, leaving the queryable object itself in the hands of the
application so that the application has the maximum possible information
available to try to handle the error if it wishes to do so.)

From steve at pearwood.info  Fri Oct  9 04:10:40 2009
From: steve at pearwood.info (Steven D'Aprano)
Date: Fri, 9 Oct 2009 13:10:40 +1100
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <20091008091840.GB28906@phd.pp.ru>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
	<20091008091840.GB28906@phd.pp.ru>
Message-ID: <200910091310.41133.steve@pearwood.info>

On Thu, 8 Oct 2009 08:18:40 pm Oleg Broytman wrote:
> > By keeping the various components clear in our mind, we can see
> > that both statements are correct in a sense.  The parser and
> > generator should never raise exceptions.  The model can and
> > probably should.
>
>    Are you going to parse any garbage and create a Message (probably
> an empty Message) with one defect "cannot parse it at all"?

So long as the raw garbage is available for the caller somehow, that 
seems like a reasonable approach to me. That lets an application 
display "Unparsable message" to the user, who can then ask to "View 
Source" (or equivalent) to get access to the raw bytes of the message.


-- 
Steven D'Aprano

From v+python at g.nevcal.com  Fri Oct  9 05:20:29 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Thu, 08 Oct 2009 20:20:29 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <PC19220091008162023032819903bdf@msapiro>
References: <PC19220091008162023032819903bdf@msapiro>
Message-ID: <4ACEABFD.6010309@g.nevcal.com>

On approximately 10/8/2009 4:20 PM, came the following characters from 
the keyboard of Mark Sapiro:
> Glenn Linderman wrote:
>   
>> However, there are definitely mailing lists that don't do that.  Google 
>> Groups is one example that doesn't collapse, and always prepends the 
>> headers in front of Re:.  Seems like all the Python lists do the 
>> collapsing (I wonder why! :) )  Other lists don't do prepending (I think 
>> the RFCs recommend not prepending in Subject, actually), of the others 
>> I'm subscribed to, that prepend, some collapse and some don't.
>>     
>
>
> You seem to be forgetting the case where the encoded subject already
> contains the prefix, or do you not care if the subject just continues
> to grow with Re:'s and repeated prefixes?
>   

Mark,

Please read the last two paragraphs of my message you replied to, two or 
three more times.  Here they are again for reference.

> And don't forget removing the prior prepended text before adding the 
> new prepended text.
>
> Actually, as long as the prepended text is ASCII, all that work can be 
> done on the encoded value.  When it is not ASCII, it may still be 
> separated and recognizable.  Still that logic is more complex than 
> decoding, handling as Unicode, and encoding.... when it works.  Just 
> pointing out that there is more than one way to do things... 

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From tkikuchi at is.kochi-u.ac.jp  Fri Oct  9 05:47:00 2009
From: tkikuchi at is.kochi-u.ac.jp (Tokio Kikuchi)
Date: Fri, 09 Oct 2009 12:47:00 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACEABFD.6010309@g.nevcal.com>
References: <PC19220091008162023032819903bdf@msapiro>
	<4ACEABFD.6010309@g.nevcal.com>
Message-ID: <4ACEB234.9030309@is.kochi-u.ac.jp>


>> Actually, as long as the prepended text is ASCII, all that work can be
>> done on the encoded value.  When it is not ASCII, it may still be
>> separated and recognizable.  Still that logic is more complex than
>> decoding, handling as Unicode, and encoding.... when it works.  Just
>> pointing out that there is more than one way to do things... 

Oh, really?

Base64 is 3 to 4 octets encoding and there is no way to prepend padding.


-- 
Tokio Kikuchi, tkikuchi at is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/

From stephen at xemacs.org  Fri Oct  9 06:27:56 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 09 Oct 2009 13:27:56 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACE6CBD.2030805@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACCD10D.4070308@g.nevcal.com>
	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACE6CBD.2030805@g.nevcal.com>
Message-ID: <87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp>

Glenn Linderman writes:

 > > Conversions will eventually be done.  "Best it were done quickly."
 > 
 > Disagree.  Deferring the conversions defers failure issues to the point 
 > where the code (hopefully) somewhat understands the type of data being 
 > manipulated, and can then handle it appropriately.  Converting up front 
 > causes errors in things that may never be touched or needed, so the 
 > error detection and handling is wasteful.

That's theory; my position is based on Mailman practice.  Don't believe
me, ask Barry.  I also spend most of my OSS time on the
internationalization of XEmacs, and the experience is similar there.
Best to convert everything as early as possible, or admit that you
don't know how.

 > So for headers, which are supposed to be ASCII, or encoded via RFC rules 
 > to ASCII (no 8-bit chars), then the discovery of an 8-bit char should 
 > produce a defect report, and the char is then simply converted to Unicode as if it 
 > were Latin-1 (since there is no other knowledge available that could 
 > produce a better conversion).

No, that is already corruption.  Most clients will assume that string
is valid as a header, because it's valid as a string.

 > And if the result of that is not expected by the client (your
 > definition), then the client should either notice the defect report
 > and reject it based on that, or attempt to parse it, and reject it
 > if it encounters unexpected syntax.  After all, this is, for that
 > client, "raw user input" (albeit from a remote source) so fully
 > error checking the input is appropriate.

No way.  That environment would suck to program in.  And it's
un-Pythonic: "Errors should never pass silently."

 > Python way.  Since the email library is trying to avoid raising 
 > exceptions in large blocks of its code, it is non-Pythonic

I disagree with that.  "Unless explicitly silenced."  The strategy
that Barry and I favor is to signal errors lazily.  So we *explicitly*
silence errors (at least of the Exception kind) when parsing.  If we
can't parse, we look for a part terminator, encapsulate the bad stuff
and move on to the rest of the input.  Later, at use time, *if* the
unparsable object is used, *then* the error will be raised, hopefully
with enough metainformation to figure out what to do about it.

I don't see what's un-Pythonic about that.
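
A minimal sketch of that lazy-signalling strategy, under stated
assumptions: all class and function names here (TextPart, UnparsablePart,
PartParseError, parse_parts) are hypothetical, and "parsing" is reduced to
UTF-8 decoding of pre-split chunks so the example stays short.

```python
class PartParseError(Exception):
    """Raised at use time; carries the metainformation about the failure."""

    def __init__(self, reason, raw):
        super().__init__(reason)
        self.reason = reason
        self.raw = raw


class TextPart:
    def __init__(self, text):
        self._text = text

    def get_payload(self):
        return self._text


class UnparsablePart:
    """Encapsulates the bad stuff so parsing can move on to the next part."""

    def __init__(self, raw, reason):
        self.raw = raw
        self.reason = reason

    def get_payload(self):
        # The error is signalled only if this part is actually used.
        raise PartParseError(self.reason, self.raw)


def parse_parts(chunks):
    """Never raises: errors are explicitly silenced and stored per part."""
    parts = []
    for chunk in chunks:
        try:
            parts.append(TextPart(chunk.decode("utf-8")))
        except UnicodeDecodeError as exc:
            parts.append(UnparsablePart(chunk, str(exc)))
    return parts
```

A caller that only touches the good parts never sees an exception; one
that touches the bad part gets PartParseError with the raw octets attached.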

From v+python at g.nevcal.com  Fri Oct  9 08:26:39 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Thu, 08 Oct 2009 23:26:39 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>	<4ACB0DC9.7080307@g.nevcal.com>	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACB971D.9080706@g.nevcal.com>	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACC0277.2060807@g.nevcal.com>	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACCD10D.4070308@g.nevcal.com>	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACE6CBD.2030805@g.nevcal.com>
	<87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <4ACED79F.6050602@g.nevcal.com>

On approximately 10/8/2009 9:27 PM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> Glenn Linderman writes:
>
>  > > Conversions will eventually be done.  "Best it were done quickly."
>  > 
>  > Disagree.  Deferring the conversions defers failure issues to the point 
>  > where the code (hopefully) somewhat understands the type of data being 
>  > manipulated, and can then handle it appropriately.  Converting up front 
>  > causes errors in things that may never be touched or needed, so the 
>  > error detection and handling is wasteful.
>
> That's theory; my position is based on Mailman practice.  Don't believe
> me, ask Barry.  I also spend most of my OSS time on the
> internationalization of XEmacs, and the experience is similar there.
> Best to convert everything as early as possible, or admit that you
> don't know how.
>   

Emacs is different than email.  Either you can read a file to edit it, 
or you can't.
The Postel principle for email says to try to do the best you can, for 
as much as you can.

>  > So for headers, which are supposed to be ASCII, or encoded via RFC rules 
>  > to ASCII (no 8-bit chars), then the discovery of an 8-bit char should 
>  > produce a defect report, and the char is then simply converted to Unicode as if it 
>  > were Latin-1 (since there is no other knowledge available that could 
>  > produce a better conversion).
>
> No, that is already corruption.  Most clients will assume that string
> is valid as a header, because it's valid as a string.
>   

Sure it is corruption.  That's why there is a defect report.  But the 
conversion technique is appropriate, per the Postel principle.

>  > And if the result of that is not expected by the client (your
>  > definition), then the client should either notice the defect report
>  > and reject it based on that, or attempt to parse it, and reject it
>  > if it encounters unexpected syntax.  After all, this is, for that
>  > client, "raw user input" (albeit from a remote source) so fully
>  > error checking the input is appropriate.
>
> No way.  That environment would suck to program in.  And it's
> un-Pythonic: "Errors should never pass silently."
>   

Then the Postel principle is un-Pythonic, and to be Pythonic any 
incorrect email should produce an error, and be unreadable. Again, I 
mentioned producing a defect report.  That is not passing an error silently.

It is still raw user input, and should still be checked for proper 
syntax by the client, even if the email is well-formed and conversion 
produces no defect report.  If you don't want to check proper syntax in 
your program inputs, I don't want to use your programs, they will be 
insecure.

>  > Python way.  Since the email library is trying to avoid raising 
>  > exceptions in large blocks of its code, it is non-Pythonic
>
> I disagree with that.  "Unless explicitly silenced."  The strategy
> that Barry and I favor is to signal errors lazily.  So we *explicitly*
> silence errors (at least of the Exception kind) when parsing.  If we
> can't parse, we look for a part terminator, encapsulate the bad stuff
> and move on to the rest of the input.  Later, at use time, *if* the
> unparsable object is used, *then* the error will be raised, hopefully
> with enough metainformation to figure out what to do about it.
>   

So there seem to be two techniques:

1) convert quickly, but don't raise errors... instead build metainformation 
structures that record the errors, and raise them later if the converted 
data is accessed.  Because some kinds of not-quite-perfect data have 
alternate handling techniques, either all techniques must be performed 
and cached, or *some processing must be deferred until the client can 
decide*.

2) Store the data, and convert only if the data is accessed.  When 
client accesses the data, the exceptions raised allow the client to 
choose an appropriate processing technique for handling the 
not-quite-perfect data, based on the context of the client, the 
importance of that data item, etc.  Only the result of that technique 
need be cached for future accesses.

With both techniques, the data is given to the email library, and the 
errors are not seen until later... potentially the exact same user 
experience.  But with technique 1, much effort is expended to 
convert data, parse data, and create error metainformation ready to 
return IF the data is accessed.  (yeah, don't say it, premature 
optimization -- I call it design, in this case)  With technique 2, little 
effort is required to store the data, create a state variable to 
indicate whether it has been converted and parsed, or not, and then IF 
(and only IF) the data is accessed, the conversion and parsing must be 
done on the first access, and instead of creating and storing 
metainformation about the errors, they could just be raised.
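
Technique 2 might be sketched as follows.  The class name LazyHeader and 
its method decode_with are hypothetical, and strict ASCII decoding stands 
in for "conversion and parsing"; the point is only that nothing is 
converted until first access, that the exception reaches the client at 
that moment, and that the client's chosen fallback is what gets cached.

```python
class LazyHeader:
    """Stores raw octets; converts only when the text is first accessed."""

    def __init__(self, raw: bytes):
        self.raw = raw
        self._text = None   # state variable: None means "not converted yet"

    @property
    def text(self):
        if self._text is None:
            # Strict conversion on first access; failures are simply raised.
            self._text = self.raw.decode("ascii")
        return self._text

    def decode_with(self, encoding, errors="strict"):
        """Let the client pick a fallback; cache only that result."""
        self._text = self.raw.decode(encoding, errors)
        return self._text


header = LazyHeader(b"caf\xe9")   # storing bad data costs nothing and is silent
try:
    value = header.text
except UnicodeDecodeError:
    # The client decides, in context, how to handle not-quite-perfect data.
    value = header.decode_with("latin-1")
```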

> I don't see what's un-Pythonic about that.
>   

The un-Pythonic thing is returning defect reports instead of raising 
errors.  There is no way for a simple assignment interface to return an 
error, because the API for simple assignment doesn't have an in-band 
signaling mechanism.  No "condition code" left around to be checked.  
And programmers often omit checking condition codes anyway, due to 
laziness and hubris "nothing will go wrong with THIS statement".  So the 
Pythonic way, AFAIU, is that errors are returned out-of-band via raised 
exceptions.

Perhaps this is why it is so hard to design a Pythonic interface to the 
Postel principle email handling... an out-of-band signalling system 
interrupts the flow of control, and the Postel principle wants to 
provide best-as-you-can data... and the easiest way to do Postel is to 
supply the not-quite-perfect data so the normal control flow can handle 
things, yet an out-of-band signal can't easily return to the normal 
control flow, and wrapping tiny try blocks around nearly every email API 
call is as annoying to the understanding of the control flow as putting 
all those if statements in the normal control flow to check "condition 
codes" (error codes, warning codes, defect reports, whatever you want to 
call them).

Stated another way, it is hard to process potentially not-quite-perfect 
data without writing complex code.  And because the email library wants 
to simplify the handling of email, it wants to limit the complexity of 
the client code.  But when dealing with not-quite-perfect data, there is 
a choice of different ways to handle it, and the email library doesn't 
know the best choice for any particular client application... if it did, 
then it could make the choices, and the client could be less complex.

The simplest client could be handed only perfectly structured, 100% 
accurately decodable email messages...  its logic would be (simply, and 
Pythonically):

while True:
    try:
        msg = getEmail()
    except Exception:
        logBadEmailReceived()
    else:
        processEmail(msg)

In order to allow defect reports to be useful, the client logic must be 
more complex; getEmail must be expanded to make decisions based on the 
content of the defect reports.  More try statements must be used, at a 
finer granularity, or more if statements to check defect reports.  The 
former is more Pythonic, the latter less, AFAIU.

Perhaps a given client knows how it wants to handle all types of 
not-quite-perfect data -- should the email library allow rules to be 
set, so that when a situation arises, it can handle it according to the 
rules?  This simplifies the client logic, at the cost of initialization 
setup, rules creation and caching, documenting the rules, adding the new 
APIs that don't seem to exist in today's email library.  While this 
could perhaps simplify many clients, it cannot simplify the email 
library... it still has to have the code for all the variant perfect and 
not-quite-perfect data handling techniques, plus the complexity of rule 
definition and usage.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From v+python at g.nevcal.com  Fri Oct  9 08:31:32 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Thu, 08 Oct 2009 23:31:32 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACEB234.9030309@is.kochi-u.ac.jp>
References: <PC19220091008162023032819903bdf@msapiro>
	<4ACEABFD.6010309@g.nevcal.com> <4ACEB234.9030309@is.kochi-u.ac.jp>
Message-ID: <4ACED8C4.5070906@g.nevcal.com>

On approximately 10/8/2009 8:47 PM, came the following characters from 
the keyboard of Tokio Kikuchi:
>>> Actually, as long as the prepended text is ASCII, all that work can be
>>> done on the encoded value.  When it is not ASCII, it may still be
>>> separated and recognizable.  Still that logic is more complex than
>>> decoding, handling as Unicode, and encoding.... when it works.  Just
>>> pointing out that there is more than one way to do things... 
>>>       
>
> Oh, really?
>
> Base64 is 3 to 4 octets encoding and there is no way to prepend padding.
>   

In header values, encoding is done using encoded-words.  A header value 
consists of a sequence of ASCII words and encoded-words.  While an 
encoded-word that uses base64 encoding cannot easily be adjusted to 
prepend data into that encoded-word, additional ASCII or encoded-words 
can be prepended in front of the other ASCII or encoded words within the 
header-value.

So, yes, really!
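
A small sketch of prepending without decoding, using email.header from 
the standard library only to check the result.  The list tag 
"[example-list]" is made up for the example; the encoded-word itself is 
left byte-for-byte untouched and is separated from the new text by a space.

```python
from email.header import decode_header, make_header

# An existing header value consisting of a single base64 encoded-word.
raw_value = "=?iso-2022-jp?b?GyRCRnxLXDhsGyhC?="

# Prepend plain ASCII in front of the encoded-word; nothing is decoded.
prefixed_value = "[example-list] " + raw_value


def to_text(value):
    """Decode a header value to str, only to verify what was produced."""
    return str(make_header(decode_header(value)))
```

Decoding prefixed_value afterwards gives the ASCII prefix followed by 
the same text that raw_value decodes to.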

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From tkikuchi at is.kochi-u.ac.jp  Fri Oct  9 10:38:03 2009
From: tkikuchi at is.kochi-u.ac.jp (Tokio Kikuchi)
Date: Fri, 09 Oct 2009 17:38:03 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACED8C4.5070906@g.nevcal.com>
References: <PC19220091008162023032819903bdf@msapiro>
	<4ACEABFD.6010309@g.nevcal.com> <4ACEB234.9030309@is.kochi-u.ac.jp>
	<4ACED8C4.5070906@g.nevcal.com>
Message-ID: <4ACEF66B.3000500@is.kochi-u.ac.jp>

Glenn Linderman wrote:
> On approximately 10/8/2009 8:47 PM, came the following characters from
> the keyboard of Tokio Kikuchi:
>>>> Actually, as long as the prepended text is ASCII, all that work can be
>>>> done on the encoded value.  When it is not ASCII, it may still be
>>>> separated and recognizable.  Still that logic is more complex than
>>>> decoding, handling as Unicode, and encoding.... when it works.  Just
>>>> pointing out that there is more than one way to do things...       
>>
>> Oh, really?
>>
>> Base64 is 3 to 4 octets encoding and there is no way to prepend padding.
>>   
> 
> In header values, encoding is done using encoded-words.  A header value
> consists of a sequence of ASCII words, and encoded-words.  While an
> encoded word, that uses base64 encoding cannot easily be adjusted to
> prepend data into that encoded-word, additional ASCII or encoded-words
> can be prepended in front of the other ASCII or encoded words within the
> header-value.
> 
> So, yes, really!
> 
The following two lines have equivalent header contents:

Re: [mmjp-users 123] =?iso-2022-jp?b?GyRCRnxLXDhsGyhC?=
Re: =?iso-2022-jp?b?W21tanAtdXNlcnMgMTIzXSAbJEJGfEtcOGwbKEI=?=

I'd like to see how you can extract the ASCII part without touching the
rest of the encoded word in the second example.

What we do in Mailman is treat both equally: we delete
[mmjp-users 123] from the subject and prefix it again with [mmjp-users 124]
(with the new sequential number).  Some MUAs encode subjects like the second
example, and this is beyond our control.  Therefore, we are forced to
decode the whole part of header content.
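
The two subjects above can be checked with the standard library's
email.header functions.  This is only a sketch of the decode, strip and
re-prefix idea, not Mailman's actual code: the helper names and the regular
expression are assumptions, and putting the new prefix in front of "Re:"
is an arbitrary choice made for the example.

```python
import re
from email.header import decode_header, make_header

SUBJECT_A = "Re: [mmjp-users 123] =?iso-2022-jp?b?GyRCRnxLXDhsGyhC?="
SUBJECT_B = "Re: =?iso-2022-jp?b?W21tanAtdXNlcnMgMTIzXSAbJEJGfEtcOGwbKEI=?="

# Hypothetical pattern for the list's numbered subject prefix.
PREFIX_RE = re.compile(r"\[mmjp-users \d+\]\s*")


def decode_subject(raw):
    """Decode every encoded-word and return the subject as one str."""
    return str(make_header(decode_header(raw)))


def renumber(raw, new_prefix):
    """Drop any old prefix from the decoded subject and add the new one."""
    text = PREFIX_RE.sub("", decode_subject(raw))
    return new_prefix + " " + text


# On the raw (still encoded) values the prefix is visible only in A.
found_in_raw_a = bool(PREFIX_RE.search(SUBJECT_A))
found_in_raw_b = bool(PREFIX_RE.search(SUBJECT_B))
```

Working on the decoded text treats both forms identically, which the
raw-value search cannot do for the second subject.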

-- 
Tokio Kikuchi, tkikuchi at is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/

From phd at phd.pp.ru  Fri Oct  9 12:54:33 2009
From: phd at phd.pp.ru (Oleg Broytman)
Date: Fri, 9 Oct 2009 14:54:33 +0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <200910091310.41133.steve@pearwood.info>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
	<20091008091840.GB28906@phd.pp.ru>
	<200910091310.41133.steve@pearwood.info>
Message-ID: <20091009105433.GA9096@phd.pp.ru>

On Fri, Oct 09, 2009 at 01:10:40PM +1100, Steven D'Aprano wrote:
> On Thu, 8 Oct 2009 08:18:40 pm Oleg Broytman wrote:
> >    Are you going to parse any garbage and create a Message (probably
> > an empty Message) with one defect "cannot parse it at all"?
> 
> So long as the raw garbage is available for the caller somehow, that 
> seems like a reasonable approach to me. That lets an application 
> display "Unparsable message" to the user, who can then ask to "View 
> Source" (or equivalent) to get access to the raw bytes of the message.

   I don't see any difference with "raise an exception; the calling
application catches the exceptions, displays or logs "Unparseable message",
and displays or logs the original garbage (that can be an attribute of the
exception instance)".
   The difference AFAIU could be between well-formed messages and complete
garbage. A not well-formed input will be parsed to a Message, and such
parsing requires a clever algorithm with resynchronizations (jumps from a
bad point to a recognized good point to restart parsing there). I don't
know if it's possible to create such a clever algorithm; and for complete
unparseable garbage I still prefer an exception.

Oleg.
-- 
     Oleg Broytman            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From barry at python.org  Fri Oct  9 13:56:12 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 9 Oct 2009 07:56:12 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87my41pob0.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
	<Pine.LNX.4.64.0910080250450.18193@kimball.webabinitio.net>
	<4ACD95DB.4040800@g.nevcal.com>
	<440B5F4C-E210-46F0-B647-240CDF091F4D@python.org>
	<87my41pob0.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <D00250D5-EED0-4D6A-BC8A-FA9D37FF58B7@python.org>

On Oct 8, 2009, at 5:29 PM, Stephen J. Turnbull wrote:

> Most non-text media do support comments, though.  I don't know if
> extracting comments is a reasonable response to a request for text
> from an image, but we should provide a place to put any text that the
> callbacks that do the actual work of decoding might return.

Are you talking about comments embedded in things like id3 tags and  
jpg comments?  If so, ISTM those are outside the scope of the email  
package.  Message objects can return decoded payloads, but I don't  
think it should provide the framework for looking inside those payloads.

>>> However, I think it is proper that a MIME part that is not flagged
>>> as text/* might produce an error if asked for as text.
>>
>> +1
>
> That doesn't preclude raising an error/returning a defect object in
> many or most use cases, but there may be use cases where it would be
> useful to allow a callback on a non-text object to return text.

I think we should re-cast the discussion in terms of returning raw and  
decoded payloads.  The email package can provide methods for returning  
raw payloads as bytes and decoded payloads in the natural type as  
described by Content-Type.  For the latter, we probably need a  
registration and plugin system to handle types that email doesn't know  
about by default, but it should also be used to handle types it does  
know about.  That way, an application could override e.g. decoding  
text/html content if it wanted to.

-Barry

-------------- next part --------------
A non-text attachment was scrubbed...
Name: PGP.sig
Type: application/pgp-signature
Size: 832 bytes
Desc: This is a digitally signed message part
URL: <http://mail.python.org/pipermail/email-sig/attachments/20091009/f218d684/attachment.pgp>

From barry at python.org  Fri Oct  9 14:05:44 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 9 Oct 2009 08:05:44 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACE6A1B.7060702@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
	<4ACE6A1B.7060702@g.nevcal.com>
Message-ID: <3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org>

On Oct 8, 2009, at 6:39 PM, Glenn Linderman wrote:

> 1) wire format.  Either what came in, in the parser case, or what  
> would be generated.
> 2) internal headers from the MIME part
> 3) decoded BLOB.  This means that quopri and base64 are decoded, no  
> more and no less.  This is bytes.  No headers, only payload.  For  
> Content-Transfer-Encoding: binary, this is mostly a noop.
> 4) text/* parts should also be obtainable as str()/unicode(),  
> payload only.  This is where charset decoding is done.
>
> I think your talk in the next paragraph about hooks and other object  
> types being produced is a generalization of 4, not 3, and generally  
> no additional decoding needs to be done, just conversion to the  
> right object type (or file, or file-like object).

I mostly agree with that.  I've always called #4 the "decoded payload"  
and #3 I've usually called the "raw payload".  Maybe we can bikeshed  
on better terms to help inform us about the API's method/attribute  
names.

Which brings up another point: right now Message objects have a  
single .get_payload() method that takes a flag to indicate whether it  
should be the decoded or raw payload.  That's bong.  These should be  
different interfaces.

>> The problem is that if the bytes came off the wire, the parser  
>> currently can only attach the most basic MIME base class.  It  
>> doesn't know that an image/png should create a MIMEImagePNG  
>> instance there.  This is different from hacking the model directly  
>> because the application can instantiate the right class.  So the  
>> parser either has to have a hookable way for an application to go  
>> from content-type to class, or the generic MIME base class needs to  
>> be hookable in its .decode() method.
>
> So either the email package can stop at 3, and 4 only for text/*  
> parts, or it could learn more types (registered types, with well- 
> defined corresponding objects could be potentially built-in to the  
> email package), and/or it could become hookable for application  
> types.  Of course, for disposition to files, storing the BLOB in a  
> file of the right name is adequate... to avoid the file, I agree  
> that converting to a useful object type is handy.  But maybe file- 
> like objects would suffice, for most of the types.

My own preference here is that email does support #4 with a  
registration system to handle returning concrete payload objects based  
on the Content-Type.

I also think that the email package probably should not implement  
"store-payloads-on-disk" by default, although it may provide some  
example implementations for simple applications (much the same way  
there's wsgiref for simple applications).  Still, that's different  
than say, storing attachments in a file named by the Content- 
Disposition header's filename parameter.  That latter is firmly in the  
domain of the application.

-Barry

From barry at python.org  Fri Oct  9 14:23:23 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 9 Oct 2009 08:23:23 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACE6CBD.2030805@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACCD10D.4070308@g.nevcal.com>
	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACE6CBD.2030805@g.nevcal.com>
Message-ID: <17899606-EE28-4800-A05D-95525AF90E3E@python.org>

On Oct 8, 2009, at 6:50 PM, Glenn Linderman wrote:

> On approximately 10/8/2009 4:40 AM, came the following characters  
> from the keyboard of Stephen J. Turnbull:
>> Glenn Linderman writes:
>>
>> > >  > If conversions are avoided, then octets are unlikely to be out of
>> > >  > range?
>> > >
>> > > Haven't looked in your spam bucket recently, I guess.  Spammers
>> > > regularly put 8 bit characters into headers (and into bodies in
>> > > messages without a Content-Type header), for one thing.
>> >
>> > I'm aware of that, but if conversions are not done, octets are unlikely
>> > to be _reported_ to be out of range....
>>
>> Conversions will eventually be done.  "Best it were done quickly."
>>
>
> Disagree.  Deferring the conversions defers failure issues to the  
> point where the code (hopefully) somewhat understands the type of  
> data being manipulated, and can then handle it appropriately.   
> Converting up front causes errors in things that may never be  
> touched or needed, so the error detection and handling is wasteful.

I'm with Stephen here.  Remember, we're saying the parser should never  
throw an exception, so any such conversion exception happens when you  
manipulate the model directly.  That /has/ to error early because  
otherwise it is impossible to debug.

> So for headers, which are supposed to be ASCII, or encoded via RFC  
> rules to ASCII (no 8-bit chars), then the discovery of an 8-bit char  
> should produce a defect report, but then simply be converted to  
> Unicode as if it were Latin-1 (since there is no other knowledge  
> available that could produce a better conversion).  And if the  
> result of that is not expected by the client (your definition), then  
> the client should either notice the defect report and reject it  
> based on that, or attempt to parse it, and reject it if it  
> encounters unexpected syntax.  After all, this is, for that client,  
> "raw user input" (albeit from a remote source) so fully error  
> checking the input is appropriate.

Sure, but I can also think of lots of other things the client might  
do, including blowing away the header value and substituting their  
own, doing the moral equivalent of a str.replace(), etc. etc.  It's  
not our job to decide.  It's our job to provide the highest fidelity  
information we can and the best APIs for clients to do what they want.

> The problem with the APIs that are spelled __str__ and __bytes__ is  
> that there is no other way to return errors other than  
> exceptions.... the Python way.  Since the email library is trying to  
> avoid raising exceptions in large blocks of its code, it is non- 
> Pythonic (which is what Oleg is probably complaining about, in  
> part).  But because it needs to avoid exceptions, and is therefore  
> non-Pythonic, it may be inappropriate to spell very many of its APIs  
> __str__ and __bytes__, because that is Pythonic, and requires  
> exceptions.  Once you become non-Pythonic in one area, you may have  
> to also be non-Pythonic in some other areas...

As was pointed out in a previous message, we shouldn't be too  
concerned with __str__ and __bytes__ right now.  We'll design non- 
magical APIs for everything and they'll do the right thing.  We'll  
then alias what seems appropriate as __str__ and __bytes__ and they'll  
be as Pythonic as makes sense.  When I say that, I'm thinking about  
the semantic differences Message objects currently have in their dict- 
like-plus API (which I still think makes perfect practical sense).

-Barry

From barry at python.org  Fri Oct  9 14:25:15 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 9 Oct 2009 08:25:15 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACE6F97.6010605@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
	<4ACE60CA.6010907@g.nevcal.com>
	<EF616A50-C6CE-47EF-BA55-185DE42EC459@python.org>
	<4ACE6F97.6010605@g.nevcal.com>
Message-ID: <3685E116-E57A-43EF-89E2-1FB03D878C0E@python.org>

On Oct 8, 2009, at 7:02 PM, Glenn Linderman wrote:

> Well, that is a feature of some mailing list programs.  Those that  
> want to do that, will have to decode and re-encode.
>
> However, there are definitely mailing lists that don't do that.   
> Google Groups is one example that doesn't collapse, and always  
> prepends the headers in front of Re:.  Seems like all the Python  
> lists do the collapsing (I wonder why! :) )  Other lists don't do  
> prepending (I think the RFCs recommend not prepending in Subject,  
> actually), of the others I'm subscribed to, that prepend, some  
> collapse and some don't.
>
> I'm saying that there are use cases where prepending could be done  
> without decoding; while you are positing use cases where that is  
> insufficient, but you shouldn't have said "Except"... you should  
> have said "There are also other use cases".
>
> And when you collapse Re:, do you also collapse various language- 
> specific spellings of Re: ???  that is a hard problem.

I don't disagree with any of that.  It's all firmly in the scope of  
the application, not the email package.  The email package just has to  
make it possible.

-Barry

From barry at python.org  Fri Oct  9 14:27:20 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 9 Oct 2009 08:27:20 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <Pine.LNX.4.64.0910081927150.18193@kimball.webabinitio.net>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1254929486.96.16481@mint-julep.mondoinfo.com>
	<20091007170718.GA1901@phd.pp.ru>
	<5D3CE953-D6A0-4A62-84C0-7B46E37B03CF@python.org>
	<Pine.LNX.4.64.0910080250450.18193@kimball.webabinitio.net>
	<4ACD95DB.4040800@g.nevcal.com>
	<Pine.LNX.4.64.0910081927150.18193@kimball.webabinitio.net>
Message-ID: <1ABDE764-850E-40E0-9491-01E9ECA78DC7@python.org>

On Oct 8, 2009, at 7:52 PM, R. David Murray wrote:

> [1] http://wiki.python.org/moin/Email%20SIG

Fantastic David, thanks!
-Barry

From barry at python.org  Fri Oct  9 15:21:17 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 9 Oct 2009 09:21:17 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACED79F.6050602@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACCD10D.4070308@g.nevcal.com>
	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACE6CBD.2030805@g.nevcal.com>
	<87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACED79F.6050602@g.nevcal.com>
Message-ID: <EF4BCED2-DA95-4A0E-9343-65FE59FA90CA@python.org>

On Oct 9, 2009, at 2:26 AM, Glenn Linderman wrote:

> Then the Postel principle is un-Pythonic, and to be Pythonic any  
> incorrect email should produce an error, and be unreadable. Again, I  
> mentioned producing a defect report.  That is not passing an error  
> silently.

There's no conflict between principles here if you keep clear in your  
mind the two different patterns we're talking about.  When parsing raw  
data, we soldier on in the face of errors as best we can, never  
raising exceptions, but recording defects.  When manipulating the  
model, we throw exceptions as early as possible because these are  
application errors and the client controls the application.
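
The two patterns can be sketched with the email package as it exists in
current Python 3 releases, which records parse problems on a defects list
and raises ValueError when a header value containing a line break is
assigned under the EmailMessage policy.  The helper names and the sample
input are assumptions made for illustration.

```python
import email
from email import errors
from email.message import EmailMessage

# A multipart Content-Type with no boundary parameter: malformed input.
BROKEN = "Content-Type: multipart/mixed\n\nnot really multipart\n"


def parse_leniently(text):
    """Parsing never raises here; problems are recorded on msg.defects."""
    return email.message_from_string(text)


def set_subject(msg, value):
    """Model manipulation: a bad value is an application error and raises."""
    msg["Subject"] = value


if __name__ == "__main__":
    parsed = parse_leniently(BROKEN)
    print([type(defect).__name__ for defect in parsed.defects])
    try:
        set_subject(EmailMessage(), "line one\nline two")
    except ValueError as exc:
        print("rejected:", exc)
```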

> The un-Pythonic thing is returning defect reports instead of raising  
> errors.  There is no way for a simple assignment interface to return  
> an error, because the API for simple assignment doesn't have an in- 
> band signaling mechanism.

This "assignment interface" falls under "manipulating the model".  It  
does reveal an important point though: the parser may not be able to  
use the same API that model manipulation uses.  It may need to use a  
lower-level (read: more permissive) interface to the model.  The  
current parser mostly works well though because the current model  
doesn't do any standards checking.  Wanna create a 10k Subject  
header?  Fine!  In practice this works well, so perhaps we need to  
think about how "RFC enforcement" can be overlaid on the model.

-Barry

From stephen at xemacs.org  Fri Oct  9 17:10:18 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sat, 10 Oct 2009 00:10:18 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACED79F.6050602@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACCD10D.4070308@g.nevcal.com>
	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACE6CBD.2030805@g.nevcal.com>
	<87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACED79F.6050602@g.nevcal.com>
Message-ID: <87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp>

Glenn Linderman writes:

 > Emacs is different than email.  Either you can read a file to edit it, 
 > or you can't.

*sigh* Emacs is as powerful a programming environment as Python, and
applications regularly deal with network streams (HTTP, NNTP, and SMTP
most commonly, but also raw X protocol and any kind of socket
supported by the platform).  So, yes, it's different from email,
because it's *far* more general.  That's precisely why I appreciate
Bill's concerns about non-email usage.

 > The Postel principle for email says to try to do the best you can,
 > for as much as you can.

Actually, it doesn't.  It says be lenient in what you accept, strict
in what you emit.  You accept it ... but you don't have to do
anything with it except preserve it verbatim for whoever wants it.

 > >  > produce a defect report, but then simply converted to Unicode as if it 
 > >  > were Latin-1 (since there is no other knowledge available that could 
 > >  > produce a better conversion).
 > >
 > > No, that is already corruption.  Most clients will assume that string
 > > is valid as a header, because it's valid as a string.
 > 
 > Sure it is corruption.  That's why there is a defect report.  But
 > the conversion technique is appropriate, per the Postel principle.

Actually, I would say you are emitting leniently, in violation of the
Postel principle.  You don't know what the client will do, they may
eat it in a single gulp without looking at it.  Thus you should avoid
converting anything that you don't know what it is (unless
specifically asked to do your best).

 > Again, I mentioned producing a defect report.  That is not passing
 > an error silently.

But if I access that Unicode object without looking at the defect
report, you *will* pass the error silently.  OTOH, if I look at the
defect report, I won't access the Unicode object.

 > It is still raw user input, and should still be checked for proper 
 > syntax by the client,

Nonsense.  The email module had better know a lot more about syntax
than the client.  If it doesn't, whack it with a 2x4 until it learns!

 > produces no defect report.  If you don't want to check proper syntax in 
 > your program inputs, I don't want to use your programs, they will be 
 > insecure.

So you're saying that every program that uses the email module should
reproduce 100% of the functionality of the email module's parser, or
it's insecure.  And you imply that's an excuse for passing corrupt
data to any client that asks for it.

I disagree.

 > So there seem to be two techniques:

Whatever gave you that idea?

 > 2) Store the data, and convert only if the data is accessed.

 > With technique 2, little effort is required to store the data,
 > create a state variable to indicate whether it has been converted

Why do that?  It's always "False" in technique 2.

 > and parsed, or not, and then IF (and only IF) the data is accessed,
 > the conversion and parsing must be done on the first access, and
 > instead of creating and storing metainformation about the errors,
 > they could just be raised.

No, they cannot just be raised.  If you just raise the error, then the
next time you try to access unparsed data, you'll hit the error
again.  If you use the same handler you did before, you're in an
infloop.  So you need a second handler to do things differently this
time or a flag ... but it's unclear to me that that flag can be a
boolean.  So you may as well store the defect list and information
about where to restart.
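
A minimal sketch of that point, assuming a hypothetical LazyHeader class
(not part of the email package): conversion is deferred to first access, but
a failure is stored as a defect next to the untouched raw bytes, so a second
access neither re-raises nor re-parses.

```python
class LazyHeader:
    """Hypothetical lazily converted header; raw bytes are always kept."""

    def __init__(self, raw):
        self.raw = raw          # wire bytes, preserved verbatim
        self.defects = []       # filled in on first access, never raised
        self._text = None
        self._parsed = False

    def text(self):
        """Return the ASCII text, or None if the header is defective."""
        if not self._parsed:
            try:
                self._text = self.raw.decode("ascii")
            except UnicodeDecodeError as exc:
                # Record where conversion stopped instead of raising.
                self.defects.append("non-ASCII octet at offset %d" % exc.start)
            self._parsed = True
        return self._text
```

Returning None for a defective header is one possible choice; the caller
that wants a best-effort conversion still has self.raw to work from.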

 > So the Pythonic way, AFAIU, is that errors are returned out-of-band
 > via raised exceptions.

Sure.  But what you're missing is that "Neither rain, nor snow, nor
dark of night may stop the Parser on her appointed rounds."  It is not
easy to write parsers, but I'll tell you one thing: it's orders of
magnitude harder to write a parser that starts in the middle and works
outward, than one that starts at the beginning and works forward to
the end.

So it's OK to write a lazy parser, but it must retain enough state so
that it can work forward until the end.  Because you don't know that
the client will not request the last character of the message, you
need to be able to try to get it, no matter what happened to the first
10GB of the message.  And if an exception occurs, it must be handled
by the parser itself; if not, you put the poor thing in the position
of starting over at the beginning (that way lies the madness of
infloops), or trying to start a parse in the middle and work out.


From v+python at g.nevcal.com  Fri Oct  9 20:01:01 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Fri, 09 Oct 2009 11:01:01 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <3685E116-E57A-43EF-89E2-1FB03D878C0E@python.org>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
	<4ACE60CA.6010907@g.nevcal.com>
	<EF616A50-C6CE-47EF-BA55-185DE42EC459@python.org>
	<4ACE6F97.6010605@g.nevcal.com>
	<3685E116-E57A-43EF-89E2-1FB03D878C0E@python.org>
Message-ID: <4ACF7A5D.6030906@g.nevcal.com>

On approximately 10/9/2009 5:25 AM, came the following characters from 
the keyboard of Barry Warsaw:
> On Oct 8, 2009, at 7:02 PM, Glenn Linderman wrote:
>
>> Well, that is a feature of some mailing list programs.  Those that 
>> want to do that, will have to decode and re-encode.
>>
>> However, there are definitely mailing lists that don't do that.  
>> Google Groups is one example that doesn't collapse, and always 
>> prepends the headers in front of Re:.  Seems like all the Python 
>> lists do the collapsing (I wonder why! :) )  Other lists don't do 
>> prepending (I think the RFCs recommend not prepending in Subject, 
>> actually), of the others I'm subscribed to, that prepend, some 
>> collapse and some don't.
>>
>> I'm saying that there are use cases where prepending could be done 
>> without decoding; while you are positing use cases where that is 
>> insufficient, but you shouldn't have said "Except"... you should have 
>> said "There are also other use cases".
>>
>> And when you collapse Re:, do you also collapse various 
>> language-specific spellings of Re: ???  that is a hard problem.
>
> I don't disagree with any of that.  It's all firmly in the scope of 
> the application, not the email package.  The email package just has to 
> make it possible. 

Yes.  So since the application has such latitude to make such decisions, 
it seems that the email package should do minimal parsing, analysis, and 
decoding of incoming messages until such time as the application chooses 
to request particular information.

So it seems there need to be APIs to retrieve and set (using your 
terminology from another reply) wire format header values, as well as 
decoded header values.
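
As a rough illustration of the two levels, the email package in current
Python 3 releases exposes both through its policy objects: the default
(compat32) policy hands back the header as it appeared on the wire, while
policy.default decodes encoded-words.  The helper names and the sample
message are assumptions made for illustration.

```python
import email
from email import policy

RAW = "Subject: =?utf-8?q?caf=C3=A9?=\n\nbody\n"


def wire_subject(text):
    """Header value as it appeared on the wire (compat32 policy)."""
    return email.message_from_string(text)["Subject"]


def decoded_subject(text):
    """Header value with encoded-words decoded to Unicode."""
    msg = email.message_from_string(text, policy=policy.default)
    return str(msg["Subject"])
```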

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From barry at python.org  Fri Oct  9 20:05:09 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 9 Oct 2009 14:05:09 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACF7A5D.6030906@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
	<4ACE60CA.6010907@g.nevcal.com>
	<EF616A50-C6CE-47EF-BA55-185DE42EC459@python.org>
	<4ACE6F97.6010605@g.nevcal.com>
	<3685E116-E57A-43EF-89E2-1FB03D878C0E@python.org>
	<4ACF7A5D.6030906@g.nevcal.com>
Message-ID: <DE221EA8-29AD-4E71-92FC-A924682C945A@python.org>

On Oct 9, 2009, at 2:01 PM, Glenn Linderman wrote:

> So it seems there need to be APIs to retrieve and set (using your  
> terminology from another reply) wire format header values, as well  
> as decoded header values.

Yes, I think everyone agrees that we need both low-level and higher  
level APIs.  (Or at least I hope so! :)

-Barry

From v+python at g.nevcal.com  Fri Oct  9 20:59:25 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Fri, 09 Oct 2009 11:59:25 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
	<4ACE6A1B.7060702@g.nevcal.com>
	<3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org>
Message-ID: <4ACF880D.5080305@g.nevcal.com>

On approximately 10/9/2009 5:05 AM, came the following characters from 
the keyboard of Barry Warsaw:
> On Oct 8, 2009, at 6:39 PM, Glenn Linderman wrote:
>> 1) wire format.  Either what came in, in the parser case, or what 
>> would be generated.
>> 2) internal headers from the MIME part
>> 3) decoded BLOB.  This means that quopri and base64 are decoded, no 
>> more and no less.  This is bytes.  No headers, only payload.  For 
>> Content-Transfer-Encoding: binary, this is mostly a noop.
>> 4) text/* parts should also be obtainable as str()/unicode(), payload 
>> only.  This is where charset decoding is done.
>>
>> I think your talk in the next paragraph about hooks and other object 
>> types being produced is a generalization of 4, not 3, and generally 
>> no additional decoding needs to be done, just conversion to the right 
>> object type (or file, or file-like object).
> I mostly agree with that.  I've always called #4 the "decoded payload" 
> and #3 I've usually called the "raw payload".  Maybe we can bikeshed 
> on better terms to help inform us about the API's method/attribute names.

It would be good though to have standardized terms for easier 
communication.  Maybe as they are chosen, they could be added to that 
Wiki RDM set up?

My only problem with "raw" and "decoded" payload, is that there are 3 
payload formats, not 2, so there needs to be a 3rd term, corresponding 
to #1, #3, and #4, above.  #2 is somewhat orthogonal from the payload.

To me, "raw" conjures up #1, not #3.

If Content-Transfer-Encoding is 7bit, 8bit, or binary, then #3 is the 
same as #1; it is just a terminology change.  Only for a 
Content-Transfer-Encoding of quoted-printable or base64 must work be 
done to convert from #1 to #3.

If Content-Type is text/*, then the transformation from #3 to #4 is more 
than a cast, but for many other formats, it is mostly a cast.
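
The three payload formats can be sketched against the existing
Message.get_payload() interface.  The three function names are hypothetical
and only stand in for whatever terms are eventually chosen for #1, #3 and
#4; the sample message is likewise made up.

```python
import base64
import email

TEXT = "gr\u00fc\u00dfe"
ENCODED = base64.b64encode(TEXT.encode("utf-8")).decode("ascii")
RAW = (
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n" + ENCODED + "\n"
)


def wire_payload(msg):
    """#1: the payload exactly as transferred (still base64 here)."""
    return msg.get_payload()


def decoded_blob(msg):
    """#3: Content-Transfer-Encoding undone, nothing more; always bytes."""
    return msg.get_payload(decode=True)


def decoded_text(msg):
    """#4: charset decoding applied on top of #3 (text/* parts only)."""
    charset = msg.get_content_charset() or "ascii"
    return decoded_blob(msg).decode(charset)
```

For a part sent as 7bit, 8bit or binary, wire_payload() and decoded_blob()
carry the same octets and differ only in type.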

> Which brings up another point: right now Message objects have a single 
> .get_payload() method that takes a flag to indicate whether it should 
> be the decoded or raw payload.  That's bong.  These should be 
> different interfaces.

Separate APIs would be clearer, but for compatibility, should 
.get_payload() be retained, with the flag?  Fortunately, there is only 
one result value in any case, so it is just a matter of what the type of 
that output value is, and how it should be handled.

Perhaps the flag parameter should be extended to allow retrieval of all 
three payload formats instead of only two?

.get_payload could be converted to call the appropriate specific APIs, 
should it be desired to invent separate APIs for each payload format.

>>> The problem is that if the bytes came off the wire, the parser 
>>> currently can only attach the most basic MIME base class.  It 
>>> doesn't know that an image/png should create a MIMEImagePNG instance 
>>> there.  This is different from hacking the model directly because 
>>> the application can instantiate the right class.  So the parser 
>>> either has to have a hookable way for an application to go from 
>>> content-type to class, or the generic MIME base class needs to be 
>>> hookable in its .decode() method.
>>
>> So either the email package can stop at 3, and 4 only for text/* 
>> parts, or it could learn more types (registered types, with 
>> well-defined corresponding objects could be potentially built-in to 
>> the email package), and/or it could become hookable for application 
>> types.  Of course, for disposition to files, storing the BLOB in a 
>> file of the right name is adequate... to avoid the file, I agree that 
>> converting to a useful object type is handy.  But maybe file-like 
>> objects would suffice, for most of the types.
>
> My own preference here is that email does support #4 with a 
> registration system to handle returning concrete payload objects based 
> on the Content-Type.

Sure, a registration system is fine.   It could work for any type that 
has a method that can be registered, that accepts a binary BLOB and 
returns an appropriate typed and functioning object that can manipulate 
that type.  That would mean that the application would have to make all 
the registration calls up front, instead of making the API calls when 
the objects are retrieved.  Basically, if the email package doesn't have 
a registration system that the application can use, the application has 
to invent its own, so this is work that could benefit all applications.

I suppose the default registration for text/* would be to convert from 
whatever to Unicode, and the default registration for all other 
Content-Type would be to pass back bytes().  Or maybe a few other common 
types, for which specific types are available, some specific image/* 
types, perhaps, that seems to have MIME types defined for them, although 
perhaps people may still prefer to register, say, a PIL type, for 
images, so I agree the email package should only provide default 
registrations.  On the other hand, I'm not sure how the registration 
system should work with threads, if different threads want different 
registrations...
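
A minimal sketch of what such a registration system might look like,
assuming a simple mapping from Content-Type to a converter function.  All
names here (register, convert, _converters) are hypothetical and not part
of the email package.

```python
# Hypothetical registry mapping Content-Type to a converter that turns
# transfer-decoded bytes into a typed object. Names are illustrative only.
_converters = {}

def register(content_type, converter):
    """Register converter(data, charset) for a type or for 'maintype/*'."""
    _converters[content_type.lower()] = converter

def convert(content_type, data, charset="ascii"):
    """Use the most specific converter; fall back to returning bytes."""
    content_type = content_type.lower()
    maintype = content_type.split("/", 1)[0]
    for key in (content_type, maintype + "/*"):
        if key in _converters:
            return _converters[key](data, charset)
    return data  # default for unregistered types: pass bytes through

# Default registration: text/* becomes str using the declared charset.
register("text/*", lambda data, charset: data.decode(charset, "replace"))
```

Because _converters is a module-level dict, every thread shares the same
registrations.  Keeping the registry on a parser or policy object instead
would be one way to let different threads register different converters.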


Actually, although it is not common practice to have encodings other 
than the RFC defined base64 and quoted-printable, a registration system 
for converting from #1 to #3, with appropriate defaults for base64, 
quoted-printable, binary, 7bit, 8bit, would be appropriate, and would 
provide a framework for allowing easy extensions to the encodings.  
Future mail RFCs may define some, but more likely, applications that 
wish to use email transports, where both ends are application 
controlled, might wish to define other encodings... the RFCs do allow 
for x-* encodings that are user defined.  If a registration system is 
created for #3 to #4 encodings, the same mechanism could likely be used 
for the registration system for #1 to #3 encodings, so there would be 
added flexibility at very little cost.
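
The same idea applied to transfer encodings could be as small as the
following sketch.  The table name, the function name, and the "x-reversed"
encoding are invented for illustration; only base64.b64decode and
quopri.decodestring are real standard library calls.

```python
import base64
import quopri

def _identity(data):
    return data

# Hypothetical table: Content-Transfer-Encoding name -> bytes decoder.
TRANSFER_DECODERS = {
    "7bit": _identity,
    "8bit": _identity,
    "binary": _identity,
    "base64": base64.b64decode,
    "quoted-printable": quopri.decodestring,
}

def transfer_decode(encoding, wire_body):
    """Decode wire_body; raises KeyError for an unregistered encoding."""
    return TRANSFER_DECODERS[encoding.lower()](wire_body)

# An application-defined x-* encoding is just one more entry.
TRANSFER_DECODERS["x-reversed"] = lambda data: data[::-1]
```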

> I also think that the email package probably should not implement 
> "store-payloads-on-disk" by default, although it may provide some 
> example implementations for simple applications (much the same way 
> there's wsgiref for simple applications).

Thinking about this, I agree that storing payloads on disk should not be 
the default action.  However, if an application wants to control its 
memory consumption, the receipt of a large email could negatively impact 
that desire.  It might be appropriate to place individual MIME parts on 
disk, as they are parsed, if the application indicates a threshold part 
size and/or threshold aggregate size, beyond which parts should be 
placed in cache.  Along with that, the temporary storage location in 
which to place them would have to be configured.
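
The standard library already has a building block for the per-part
threshold: tempfile.SpooledTemporaryFile keeps data in memory until it
grows past max_size, then moves it to a temporary file.  The wrapper
below is only a sketch; open_part_buffer and its parameter names are
made up, and an aggregate threshold would need extra bookkeeping that is
not shown.

```python
import tempfile

def open_part_buffer(threshold, spool_dir=None):
    """Return a file-like buffer that stays in memory up to `threshold`
    bytes and is moved to a temporary file in `spool_dir` beyond that."""
    return tempfile.SpooledTemporaryFile(max_size=threshold, dir=spool_dir)

buf = open_part_buffer(threshold=1024)
buf.write(b"x" * 4096)   # exceeds the threshold, so it is spooled to disk
buf.seek(0)
data = buf.read()
buf.close()
```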

>   Still, that's different than say, storing attachments in a file 
> named by the Content-Disposition header's filename parameter.  That 
> latter is firmly in the domain of the application.

I again agree that this should not be the default action, but I assume 
that an API should be provided such that an application could tell the 
email package to place the content in the header's filename parameter.  
If such an API doesn't already exist, it seems it would be a helpful 
extension, and if the part was already cached on disk because of the 
above thresholds, the email package could possibly use rename instead of 
file copy to achieve the goal.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From v+python at g.nevcal.com  Fri Oct  9 21:40:33 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Fri, 09 Oct 2009 12:40:33 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <17899606-EE28-4800-A05D-95525AF90E3E@python.org>
References: <8510262.7231254589795083.JavaMail.root@boaz>	<4ACB0DC9.7080307@g.nevcal.com>	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACB971D.9080706@g.nevcal.com>	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACC0277.2060807@g.nevcal.com>	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACCD10D.4070308@g.nevcal.com>
	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACE6CBD.2030805@g.nevcal.com>
	<17899606-EE28-4800-A05D-95525AF90E3E@python.org>
Message-ID: <4ACF91B1.2090303@g.nevcal.com>

On approximately 10/9/2009 5:23 AM, came the following characters from 
the keyboard of Barry Warsaw:
> On Oct 8, 2009, at 6:50 PM, Glenn Linderman wrote:
>
>> On approximately 10/8/2009 4:40 AM, came the following characters 
>> from the keyboard of Stephen J. Turnbull:
>>> Glenn Linderman writes:
>>>
>>> > >  > If conversions are avoided, then octets are unlikely to be
>>> > >  > out of range?
>>> > >
>>> > > Haven't looked in your spam bucket recently, I guess.  Spammers
>>> > > regularly put 8 bit characters into headers (and into bodies in
>>> > > messages without a Content-Type header), for one thing.
>>> >
>>> > I'm aware of that, but if conversions are not done, octets are
>>> > unlikely to be _reported_ to be out of range....
>>>
>>> Conversions will eventually be done.  "Best it were done quickly."
>>>
>>
>> Disagree.  Deferring the conversions defers failure issues to the 
>> point where the code (hopefully) somewhat understands the type of 
>> data being manipulated, and can then handle it appropriately.  
>> Converting up front causes errors in things that may never be touched 
>> or needed, so the error detection and handling is wasteful.
>
> I'm with Stephen here.  Remember, we're saying the parser should never 
> throw an exception, so any such conversion exception happens when you 
> manipulate the model directly.  That /has/ to error early because 
> otherwise it is impossible to debug.

I suspect we are talking with different terminology somehow, here.  At 
least it seems that way, between myself and Stephen.  So let me return 
to ground zero, and ask some very basic questions, to see what, if 
anything, I am missing in my understanding of Stephen's and perhaps 
your, terminology.

Let me speak in terms of parsing incoming wire-format messages, because 
the creation of a valid email from API calls should be straightforward.

I see the necessary job of the parser as being to receive chunks of the 
message, parse the headers into individual headers (based mostly on CR LF 
TAB detection), and find the end of the headers.  Then, in order to properly 
handle the body, it needs to find several specific headers, or supply 
defaults for them if lacking.  They include validation of the 
MIME-Version, determining the Content-Type, and 
Content-Transfer-Encoding.  Other headers do not need to be decoded at 
parse time, if I understand things, just parsed into buckets (a list to 
preserve order, with possibly an index of some sort for performance if 
necessary).  The 3 headers mentioned should be fully validated and 
decoded, so that parsing the body can proceed.  Parsing the body finds 
one or more MIME parts, and for each part, a list of its headers should 
be created.  Content-Type and Content-Transfer-Encoding should again be 
fully validated and decoded, so that parsing the body of each part can 
proceed recursively.  The leaf MIME parts should have their wire format 
data stored also.

Do you agree with that minimal functionality of message parsing?
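
To make the "parsed into buckets, not decoded" step concrete, here is a
sketch of the header-splitting part only.  split_headers is a
hypothetical helper, not email package code, and it assumes CR LF line
endings; a real parser would also record a defect for a line without a
colon rather than accept it silently as this one does.

```python
def split_headers(header_block):
    """Split a raw header block into an ordered list of (name, value).

    Values are kept as undecoded bytes; continuation lines (those that
    start with a space or TAB) are appended to the previous header.
    """
    headers = []
    for line in header_block.split(b"\r\n"):
        if not line:
            break  # empty line: end of headers
        if line[:1] in (b" ", b"\t") and headers:
            name, value = headers[-1]
            headers[-1] = (name, value + b"\r\n" + line)
        else:
            name, _, value = line.partition(b":")
            headers.append((name, value.lstrip(b" ")))
    return headers
```

Note that the 8-bit octet in a header value would pass through here
untouched; nothing is validated until someone asks for that header.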

If content boundaries cannot be found, then the parsing will fail, and a 
defect report generated for that part, and any higher-level parts that 
include it, because they will also be incomplete.  That is just a 
parse-error flag, in the tree of MIME parts, AFAICT.

I see the further validation and decoding of the MIME tree for the 
message to be all based on API calls by the application to manipulate 
the model, which should be able to raise exceptions as needed, and could 
have fully Pythonic interfaces.

If the client wishes to have all headers, header values, and charset 
decoding validated before doing model manipulations, then it should call 
email package APIs that are provided to do that individually, per MIME 
part, or recursively over the model (and which might raise exceptions).

If the client wishes to have all leaf MIME parts decoded from wire 
format to "raw payload" or "decoded payload", before manipulating the 
model, then it should call the email package APIs that are provided to 
do that individually, per MIME part, or recursively over the model (and 
which might raise exceptions).
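
For comparison, the email package in current Python 3 (an assumption
here, since it postdates this discussion) shows both behaviours side by
side: the default parse records defects without raising, and a client
that wants early failure can request it through a policy.  The broken
sample message is made up.

```python
import email
from email import errors, policy

# Hypothetical broken message: multipart with no boundary parameter.
BROKEN = b"Content-Type: multipart/mixed\r\n\r\nno boundaries here\r\n"

# Default (compat32) parsing does not raise; problems are recorded.
msg = email.message_from_bytes(BROKEN)
defect_names = [type(d).__name__ for d in msg.defects]

# A client that wants early failure can ask for it explicitly.
try:
    email.message_from_bytes(BROKEN, policy=policy.strict)
    raised = False
except errors.MessageDefect:
    raised = True
```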

Is there any other functionality that should be performed?  If so, why?  
It seems that Stephen is perhaps saying that the functionality in the 
above two paragraphs should be performed during parsing. Is that what is 
being said?  I can hardly believe it, if so.  Since there are multiple 
ways to interpret not-quite-perfect data, application guidance is 
required for those choices, and the creation of defect reports along the 
way would be a bookkeeping headache.

>> So for headers, which are supposed to be ASCII, or encoded via RFC 
>> rules to ASCII (no 8-bit chars), then the discovery of an 8-bit char 
>> should produce a defect report, but then simply be converted to 
>> Unicode as if it were Latin-1 (since there is no other knowledge 
>> available that could produce a better conversion).  And if the result 
>> of that is not expected by the client (your definition), then the 
>> client should either notice the defect report and reject it based on 
>> that, or attempt to parse it, and reject it if it encounters 
>> unexpected syntax.  After all, this is, for that client, "raw user 
>> input" (albeit from a remote source) so fully error checking the 
>> input is appropriate.
>
> Sure, but I can also think of lots of other things the client might 
> do, including blowing away the header value and substituting their 
> own, doing the moral equivalent of a str.replace(), etc. etc.  It's 
> not our job to decide.  It's our job to provide the highest fidelity 
> information we can and the best APIs for clients to do what they want.

Exactly.  So if the client is going to blow away the header value, no 
point to validate and decode it.

If the client is going to send it on, the client can choose to validate 
before sending, or just send what was received, whether or not it was 
valid.  This depends on the purpose and functionality of the client.


>> The problem with the APIs that are spelled __str__ and __bytes__ is 
>> that there is no other way to return errors other than exceptions.... 
>> the Python way.  Since the email library is trying to avoid raising 
>> exceptions in large blocks of its code, it is non-Pythonic (which is 
>> what Oleg is probably complaining about, in part).  But because it 
>> needs to avoid exceptions, and is therefore non-Pythonic, it may be 
>> inappropriate to spell very many of its APIs __str__ and __bytes__, 
>> because that is Pythonic, and requires exceptions.  Once you become 
>> non-Pythonic in one area, you may have to also be non-Pythonic in 
>> some other areas...
>
> As was pointed out in a previous message, we shouldn't be too 
> concerned with __str__ and __bytes__ right now.  We'll design 
> non-magical APIs for everything and they'll do the right thing.  We'll 
> then alias what seems appropriate as __str__ and __bytes__ and they'll 
> be as Pythonic as makes sense.  When I say that, I'm thinking about 
> the semantic differences Message objects currently have in their 
> dict-like-plus API (which I still think makes perfect practical sense). 

OK, it seems we all understand the limitations of the __str__, 
__bytes__, and assignment type APIs: they must either succeed, or raise 
exceptions.  Can we agree that clients should only use such APIs when 
success is assured, or raising exceptions is acceptable?  And that if a 
client complains about an exception in a case they thought success 
should have been assured, that it is not a bug if they misunderstood?  
Clearly the email package should document the conditions for which 
success can be assured, if there are any... and that it is fair game to 
raise exceptions if those conditions are not met.

-- 
Glenn -- http://nevcal.com/


From v+python at g.nevcal.com  Fri Oct  9 22:26:19 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Fri, 09 Oct 2009 13:26:19 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>	<4ACB0DC9.7080307@g.nevcal.com>	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACB971D.9080706@g.nevcal.com>	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACC0277.2060807@g.nevcal.com>	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACCD10D.4070308@g.nevcal.com>	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACE6CBD.2030805@g.nevcal.com>	<87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACED79F.6050602@g.nevcal.com>
	<87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <4ACF9C6B.4020508@g.nevcal.com>

On approximately 10/9/2009 8:10 AM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> Glenn Linderman writes:
>
>  > Emacs is different than email.  Either you can read a file to edit it, 
>  > or you can't.
>
> *sigh* Emacs is as powerful a programming environment as Python, and
> applications regularly deal with network streams (HTTP, NNTP, and SMTP
> most commonly, but also raw X protocol and any kind of socket
> supported by the platform).  So, yes, it's different from email,
> because it's *far* more general.  That's precisely why I appreciate
> Bill's concerns about non-email usage.
>   

OK, yes, Emacs is an operating system.  I am an Emacs user.  And yes, I 
know Emacs can read email (I used it to read and write email, but found 
it seriously lacking for the way I handle email, and annoying that the 
email buffers and edit buffers were all in the same buffer pool, and I 
quit using it for email).  And I know it can be programmed, and I've 
done a little of that, but I hate Lisp, so I mostly Google for the 
packages that do what I need, and don't try to create my own.

>  > The Postel principle for email says to try to do the best you can,
>  > for as much as you can.
>
> Actually, it doesn't.  It says be lenient in what you accept, strict
> in what you emit.  You accept it ... but you don't have to do
> anything with it except preserve it verbatim for whoever wants it.
>   

Yes, that is what it says, I agree.  But unless you do the best you can, 
for as much as you can, no one is going to want it, so they are 
basically the same.

>  > >  > produce a defect report, but then simply converted to Unicode as if it 
>  > >  > were Latin-1 (since there is no other knowledge available that could 
>  > >  > produce a better conversion).
>  > >
>  > > No, that is already corruption.  Most clients will assume that string
>  > > is valid as a header, because it's valid as a string.
>  > 
>  > Sure it is corruption.  That's why there is a defect report.  But
>  > the conversion technique is appropriate, per the Postel principle.
>
> Actually, I would say you are emitting leniently, in violation of the
> Postel principle.  

You can say that, but I don't have to believe it.  I'm talking about 
accepting; the message has arrived, it is here, the client is trying to 
look at it, and I'm talking about ways the client can look at 
not-quite-perfect data, knowing that it is not quite perfect, but still 
being able to see it.  I'm not at all talking about emitting data.  You 
seem to be calling the email package helping the client to accept 
not-quite-perfect data, as a form of emitting data.  It is not.

> You don't know what the client will do, they may
> eat it in a single gulp without looking at it.  Thus you should avoid
> converting anything that you don't know what it is (unless
> specifically asked to do your best).
>   


The email package cannot police the client... if it chooses to "eat it 
in a single gulp without looking at it" then it may get indigestion.  I 
never suggested that "converting to Unicode as if it were Latin-1" 
should be done without informing the client, or being requested by the 
client to do that via a special API call... I was only talking about an 
appropriate method of doing conversions in the presence of 
not-quite-perfect data input, so that the client, and possibly even a 
human, can try to make some sense out of the not-quite-perfect data.
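
A sketch of the kind of explicitly requested, best-effort call meant
here.  The function name and the shape of the defect report (a list of
strings) are assumptions for illustration only.

```python
def decode_header_bytes(raw):
    """Return (text, defects) for a raw header value.

    Pure ASCII decodes cleanly.  Anything else is decoded as Latin-1,
    which cannot fail, and a defect string is recorded so the caller
    knows the text is only a best-effort rendering.
    """
    try:
        return raw.decode("ascii"), []
    except UnicodeDecodeError as exc:
        defect = "non-ASCII octet at offset %d" % exc.start
        return raw.decode("latin-1"), [defect]
```

The client gets something a human can look at, together with the
information that it should not be trusted as a valid header.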


>  > Again, I mentioned producing a defect report.  That is not passing
>  > an error silently.
>
> But if I access that Unicode object without looking at the defect
> report, you *will* pass the error silently.  OTOH, if I look at the
> defect report, I won't access the Unicode object.
>   

If those are the only two choices you see, then you are not doing your 
whole job.

If you ignore defect reports, you are ignorant (blunt, but not intended 
to be offensive).
If you treat all defect reports as fatal errors, then you are not being 
lenient in what you accept (non-Postel).

>  > It is still raw user input, and should still be checked for proper 
>  > syntax by the client,
>
> Nonsense.  The email module had better know a lot more about syntax
> than the client.  If it doesn't, whack it with a 2x4 until it learns!
>   

I think we are talking at cross purposes here.  I find it quite 
difficult to follow where you cross the boundary between talking about 
one sort of email package client, and then switch to another type, or 
switch to the responsibilities of the email package.

A client which is an MUA is just going to present the best possible data 
to a human user, and is done.  A client which is an email archiver 
preserves the data for presenting via other MUAs. 

An application which is using email as a transport, has specific goals, 
which require specific content.  You were mentioning clients.  It is 
this sort of client I thought you were talking about, and to which I 
responded.  If such a client doesn't validate the syntax of that 
content, it isn't much of an application.  The email module does not, 
and cannot, understand the application domain; it can only validate that 
the message has proper (or improper) structure.  The transported content 
is fully the responsibility of the application to validate, parse, and 
manipulate.  The email module may detect if the transport caused garbling 
in the structure of the message, and may be able to warn the application 
about such garbling.  But that may not prevent the application from 
finding its content within even a garbled email, and so it may still be 
able to validate, parse, and manipulate that content.  Such clients may 
transfer content either in headers or in MIME parts... in any case, 
whatever client specific content is expected in those headers or MIME 
parts should be validated by the client.


>  > produces no defect report.  If you don't want to check proper syntax in 
>  > your program inputs, I don't want to use your programs, they will be 
>  > insecure.
>
> So you're saying that every program that uses the email module should
> reproduce 100% of the functionality of the email module's parser, or
> it's insecure.  And you imply that's an excuse for passing corrupt
> data to any client that asks for it.
>
> I disagree.
>   

I'm glad you disagree with what you thought I was saying, because that 
isn't what I was saying, and I also disagree with your paraphrase of 
what I was saying.  The email package should parse email.  Where it 
finds not-quite-perfect data, the client may get involved to choose a 
path for interpreting the not-quite-perfect data... or to reject the 
not-quite-perfect data.

Once the data from the email is discovered, then the client must operate 
on the data.  An MUA would simply display it to a human.  Other clients 
would attempt to interpret the content.  The interpretation of the 
content requires the client to parse, validate the syntax of, and 
manipulate the content.  An example would be a program that does 
appointments via email.  If it finds an appointment in a known format, 
it enters it into the calendar.  The email package knows nothing about 
appointments or calendars (of the sort that hold appointments). It 
cannot help, only the client can do that part of the job.


>  > So there seem to be two techniques:
>
> Whatever gave you that idea?
>   

I'm not sure what you are asking here.

>  > 2) Store the data, and convert only if the data is accessed.
>
>  > With technique 2, little effort is required to store the data,
>  > create a state variable to indicate whether it has been converted
>
> Why do that?  It's always "False" in technique 2.
>   

The first time it is always false.  Subsequent requests can leverage the 
work done by the first request, if results were created and cached.
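
In code, the state variable and cache amount to something like this
sketch.  LazyHeader is a hypothetical class, and decode_calls exists
only to make the caching visible.

```python
class LazyHeader:
    """Hypothetical header wrapper that decodes on first access only."""

    def __init__(self, raw):
        self.raw = raw            # bytes exactly as received
        self._decoded = None      # cached result of the first decode
        self.decode_calls = 0     # instrumentation for the example

    @property
    def value(self):
        if self._decoded is None:
            self.decode_calls += 1
            self._decoded = self.raw.decode("ascii")  # may raise
        return self._decoded
```

Only a successful decode is cached in this sketch; a failed one would be
attempted again, and fail the same way, on the next access.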

>  > and parsed, or not, and then IF (and only IF) the data is accessed,
>  > the conversion and parsing must be done on the first access, and
>  > instead of creating and storing metainformation about the errors,
>  > they could just be raised.
>
> No, they cannot just be raised.  If you just raise the error, then the
> next time you try to access unparsed data, you'll hit the error
> again.  If you use the same handler you did before, you're in an
> infloop.  So you need a second handler to do things differently this
> time or a flag ... but it's unclear to me that that flag can be a
> boolean.  So you may as well store the defect list and information
> about where to restart.
>   

 From the point of view of the email package, the errors can just be 
raised.  Then the client can make choices, and use other APIs or other 
parameters to the API to direct the email package to attempt a different 
technique to access the data.  If the technique is successful, then 
progress is made.  If unsuccessful, another error is raised by the 
different technique.  If there are more techniques, repeat.  When out of 
techniques, and no success, then the client needs to remember (possibly 
with the help of APIs of the email package) that it cannot interpret 
this data in a useful manner.  If it then continues to attempt to access 
the data using failed techniques, and goes into an infinite loop, then 
the client has a bug.
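
The client-side loop described above might look like the following
sketch, with charsets standing in for "techniques".  The function name
is made up.  Each technique is tried exactly once, so the loop always
terminates.

```python
def decode_with_fallbacks(raw, charsets):
    """Try each charset once, in order.

    Returns (text, charset) on the first success, or (None, None) when
    every technique has failed.
    """
    for charset in charsets:
        try:
            return raw.decode(charset), charset
        except (UnicodeDecodeError, LookupError):
            continue  # this technique failed; move on to the next one
    return None, None  # out of techniques: the caller must remember this
```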


>  > So the Pythonic way, AFAIU, is that errors are returned out-of-band
>  > via raised exceptions.
>
> Sure.  But what you're missing is that "Neither rain, nor snow, nor
> dark of night may stop the Parser on her appointed rounds."  

I haven't forgotten that, but clearly we haven't been communicating 
effectively.  That may be partly my fault, partly because I'm relatively 
new to Python and to the email package (having only experimented with it 
using Python 2.6, not coded inside it, to date), but I'm trying...  I'm 
hoping to write some email processing programs using the Python email 
package, and so I do have a strong interest in this topic.  I'm hoping I 
don't have to start from scratch and write my own email package, because 
Python's isn't functional enough, or doesn't perform well enough.  Being 
new to Python, I've chosen to focus on building my applications with 
Python 3, understanding that there are fewer fully functional pieces in 
that arena to date, and since email is one that has some rough edges 
because of the Unicode strings, I'm trying to participate where I can.

> It is not
> easy to write parsers, but I'll tell you one thing: it's orders of
> magnitude harder to write a parser that starts in the middle and works
> outward, than one that starts at the beginning and works forward to
> the end.
>   

Yes, I have learned that in my 34 years of programming.  I agree.

> So it's OK to write a lazy parser, but it must retain enough state so
> that it can work forward until the end.  Because you don't know that
> the client will not request the last character of the message, you
> need to be able to try to get it, no matter what happened to the first
> 10GB of the message.  And if an exception occurs, it must be handled
> by the parser itself; if not, you put the poor thing in the position
> of starting over at the beginning (that way lies the madness of
> infloops), or trying to start a parse in the middle and work out.
>   

Are you speaking about parsing the message into MIME parts, or parsing a 
particular MIME part contained within the message, or both?

-- 
Glenn -- http://nevcal.com/


From v+python at g.nevcal.com  Fri Oct  9 22:43:59 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Fri, 09 Oct 2009 13:43:59 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACEF66B.3000500@is.kochi-u.ac.jp>
References: <PC19220091008162023032819903bdf@msapiro>
	<4ACEABFD.6010309@g.nevcal.com> <4ACEB234.9030309@is.kochi-u.ac.jp>
	<4ACED8C4.5070906@g.nevcal.com> <4ACEF66B.3000500@is.kochi-u.ac.jp>
Message-ID: <4ACFA08F.9080307@g.nevcal.com>

On approximately 10/9/2009 1:38 AM, came the following characters from 
the keyboard of Tokio Kikuchi:
> Glenn Linderman wrote:
>   
>> On approximately 10/8/2009 8:47 PM, came the following characters from
>> the keyboard of Tokio Kikuchi:
>>     
>>>>> Actually, as long as the prepended text is ASCII, all that work can be
>>>>> done on the encoded value.  When it is not ASCII, it may still be
>>>>> separated and recognizable.  Still that logic is more complex than
>>>>> decoding, handling as Unicode, and encoding.... when it works.  Just
>>>>> pointing out that there is more than one way to do things...       
>>>>>           
>>> Oh, really?
>>>
>>> Base64 is 3 to 4 octets encoding and there is no way to prepend padding.
>>>   
>>>       
>> In header values, encoding is done using encoded-words.  A header value
>> consists of a sequence of ASCII words, and encoded-words.  While an
>> encoded word, that uses base64 encoding cannot easily be adjusted to
>> prepend data into that encoded-word, additional ASCII or encoded-words
>> can be prepended in front of the other ASCII or encoded words within the
>> header-value.
>>
>> So, yes, really!
>>
>>     
> Following two lines have equivalent header contents:
>
> Re: [mmjp-users 123] =?iso-2022-jp?b?GyRCRnxLXDhsGyhC?=
> Re: =?iso-2022-jp?b?W21tanAtdXNlcnMgMTIzXSAbJEJGfEtcOGwbKEI=?=
>
> I'd like to see how you can extract ascii part without touching rest of
> the encoded word in the second example.
>   

I can't, and I didn't say I could.

> What we do in mailman is that both are treated equally and delete
> [mmjp-users 123] from the subject and prefix again by [mmjp-users 124]
> (with new sequential number).  Some MUA encode subjects like the second
> example and this is beyond our control.  Therefore, we are forced to
> decode the whole part of header content.
>   

Yes, if the MUA has created the second encoding, decoding is required in 
order to replace the header prefix.
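
A sketch of that decode-then-replace approach, using
email.header.decode_header from the standard library (as found in
current Python 3, which is an assumption here).  The helper names are
made up; the two subject lines are the ones quoted above.

```python
import re
from email.header import decode_header

def decode_subject(value):
    """Decode RFC 2047 encoded-words in a header value to a str."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii")
        parts.append(chunk)
    return "".join(parts)

def renumber(value, listname, new_number):
    """Replace '[listname N]' in the decoded subject with a new number."""
    pattern = r"\[%s \d+\]" % re.escape(listname)
    return re.sub(pattern, "[%s %d]" % (listname, new_number),
                  decode_subject(value))

first = "Re: [mmjp-users 123] =?iso-2022-jp?b?GyRCRnxLXDhsGyhC?="
second = "Re: =?iso-2022-jp?b?W21tanAtdXNlcnMgMTIzXSAbJEJGfEtcOGwbKEI=?="
```

Both subjects decode to the same text, so renumber() treats them alike.
The result is a plain str and would still need to be re-encoded before
being written back into a message.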

If the MUA has created the first encoding, then decoding would not be 
required in order to replace the header prefix, but the logic to detect 
which case and handle them separately, results in more complexity in the 
application.

What I said, was that prefixing a header value with additional text 
didn't require decoding, and that is true.

What you are saying, is that you want to do more than prefix a header 
value with additional text.

What you are saying is that you would rather choose to keep the 
application logic simple, by assuming or requiring that the existing 
header value is able to be decoded.  If that is sufficient for your 
application, it is a reasonable choice.  What do you do with messages 
for which the header you wish to modify cannot be decoded?  Some options 
would be:

1) bounce the message

2) discard the message

3) determine if the header value is partially able to be decoded, and if 
the part that can be decoded contains the data you wish to modify, 
modify it, and simply preserve and pass-through the parts that could not 
be decoded.

4) if the header value cannot be at all decoded, or the parts that can 
be decoded do not contain the data you wish to modify, then you could 
possibly choose to simply prefix information into the header in that 
case, again preserving and passing through the parts that could not be 
decoded (or, in this case, the whole value).

Perhaps you can think of other alternatives besides these, feel free to 
suggest some.

Naturally, doing options 3 or 4 above requires more complex logic for 
the application than options 1 or 2.  The requirements of your 
application should determine the types of choices you make.

For example, if a new or non-standard charset appears, an application 
that requires the ability to decode the header, but hasn't been updated 
to understand the charset, will fail to decode it.  Yet, if it has logic 
like 3 or 4, it may be more successful, and would be a more robust 
application.
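
As a sketch of the fallback in option 4: try to decode, and if the
charset is unknown or the bytes are invalid, leave the original value
alone and only prepend ASCII.  add_prefix is a hypothetical helper;
decode_header is the standard library function as found in current
Python 3.

```python
from email.header import decode_header

def add_prefix(value, prefix):
    """Prefix a header value, decoding only when that is possible.

    Returns (new_value, fully_decoded).  If any encoded-word uses a
    charset this program cannot decode, the original value is passed
    through untouched behind the ASCII prefix.
    """
    try:
        parts = []
        for chunk, charset in decode_header(value):
            if isinstance(chunk, bytes):
                chunk = chunk.decode(charset or "ascii")
            parts.append(chunk)
    except (LookupError, UnicodeDecodeError):
        return prefix + " " + value, False
    return prefix + " " + "".join(parts), True
```

The second element of the result tells the caller which path was taken,
so it can decide whether to re-encode the value or send it on as is.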

-- 
Glenn -- http://nevcal.com/


From tkikuchi at is.kochi-u.ac.jp  Sat Oct 10 00:08:22 2009
From: tkikuchi at is.kochi-u.ac.jp (Tokio Kikuchi)
Date: Sat, 10 Oct 2009 07:08:22 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACFA08F.9080307@g.nevcal.com>
References: <PC19220091008162023032819903bdf@msapiro>
	<4ACEABFD.6010309@g.nevcal.com> <4ACEB234.9030309@is.kochi-u.ac.jp>
	<4ACED8C4.5070906@g.nevcal.com> <4ACEF66B.3000500@is.kochi-u.ac.jp>
	<4ACFA08F.9080307@g.nevcal.com>
Message-ID: <4ACFB456.6010106@is.kochi-u.ac.jp>

What you said in message-id: <4ACE6F97.6010605 at g.nevcal.com> was:

> When it is not ASCII, it may still be
> separated and recognizable.

and our discussion might conclude that this is true 'not really, but
only theoretically.'  Your suggestions 1)-4) are not acceptable to
Japanese users at all.

I couldn't resist writing because the discussion was important in
designing Mailman's subject prefixing and numbering.  I'll shut up my
mouth again because I'm so busy.

Sorry for disturbing,

-- 
Tokio Kikuchi tkikuchi at is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/
?780-8520 ?????????????

From rdmurray at bitdance.com  Sat Oct 10 01:20:54 2009
From: rdmurray at bitdance.com (R. David Murray)
Date: Fri, 9 Oct 2009 19:20:54 -0400 (EDT)
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACF9C6B.4020508@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACCD10D.4070308@g.nevcal.com>
	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACE6CBD.2030805@g.nevcal.com>
	<87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACED79F.6050602@g.nevcal.com>
	<87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACF9C6B.4020508@g.nevcal.com>
Message-ID: <Pine.LNX.4.64.0910091909000.18193@kimball.webabinitio.net>

On Fri, 9 Oct 2009 at 13:26, Glenn Linderman wrote:
> On approximately 10/9/2009 8:10 AM, came the following characters from the 
> keyboard of Stephen J. Turnbull:
>>  Glenn Linderman writes:
>> > > >  produce a defect report, but then simply converted to Unicode as if 
>> > > >  it were Latin-1 (since there is no other knowledge available that 
>> > > >  could produce a better conversion).
>> > > 
>> > >  No, that is already corruption.  Most clients will assume that string
>> > >  is valid as a header, because it's valid as a string.
>> > 
>> >  Sure it is corruption.  That's why there is a defect report.  But
>> >  the conversion technique is appropriate, per the Postel principle.
>>
>>  Actually, I would say you are emitting leniently, in violation of the
>>  Postel principle. 
>
> You can say that, but I don't have to believe it.  I'm talking about 
> accepting; the message has arrived, it is here, the client is trying to look 
> at it, and I'm talking about ways the client can look at not-quite-perfect 
> data, knowing that it is not quite perfect, but still being able to see it. 
> I'm not at all talking about emitting data.  You seem to be calling the email 
> package helping the client to accept not-quite-perfect data, as a form of 
> emitting data.  It is not.

IMO, the appropriate way for the email package to provide the API you
are talking about is to provide the client with a way to get at the raw
byte string, which I think everyone agrees on.  If the client wants to
decode it as if it were latin-1 to process it, it can then do that.

--David (RDM)

From v+python at g.nevcal.com  Sat Oct 10 02:54:20 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Fri, 09 Oct 2009 17:54:20 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <Pine.LNX.4.64.0910091909000.18193@kimball.webabinitio.net>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACCD10D.4070308@g.nevcal.com>
	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACE6CBD.2030805@g.nevcal.com>
	<87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACED79F.6050602@g.nevcal.com>
	<87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACF9C6B.4020508@g.nevcal.com>
	<Pine.LNX.4.64.0910091909000.18193@kimball.webabinitio.net>
Message-ID: <4ACFDB3C.5040307@g.nevcal.com>

On approximately 10/9/2009 4:20 PM, came the following characters from 
the keyboard of R. David Murray:
> On Fri, 9 Oct 2009 at 13:26, Glenn Linderman wrote:
>> On approximately 10/9/2009 8:10 AM, came the following characters 
>> from the keyboard of Stephen J. Turnbull:
>>>  Glenn Linderman writes:
>>> > > >  produce a defect report, but then simply converted to Unicode as if
>>> > > >  it were Latin-1 (since there is no other knowledge available that
>>> > > >  could produce a better conversion).
>>> > >
>>> > >  No, that is already corruption.  Most clients will assume that string
>>> > >  is valid as a header, because it's valid as a string.
>>> >
>>> >  Sure it is corruption.  That's why there is a defect report.  But
>>> >  the conversion technique is appropriate, per the Postel principle.
>>>
>>>  Actually, I would say you are emitting leniently, in violation of the
>>>  Postel principle. 
>>
>> You can say that, but I don't have to believe it.  I'm talking about 
>> accepting; the message has arrived, it is here, the client is trying 
>> to look at it, and I'm talking about ways the client can look at 
>> not-quite-perfect data, knowing that it is not quite perfect, but 
>> still being able to see it. I'm not at all talking about emitting 
>> data.  You seem to be calling the email package helping the client to 
>> accept not-quite-perfect data, as a form of emitting data.  It is not.
>
> IMO, the appropriate way for the email package to provide the API you
> are talking about is to provide the client with a way to get at the raw
> byte string, which I think everyone agrees on.  If the client wants to
> decode it as if it were latin-1 to process it, it can then do that. 

That certainly works, but it isn't very helpful... that forces the 
client application to reproduce the logic to parse the header value and 
decode the parts that can be decoded successfully, and that is exactly 
the sort of thing Stephen was complaining about when he thought I was 
suggesting that to be a requirement (but he was confused about what I 
was suggesting).

-- 
Glenn -- http://nevcal.com/


From v+python at g.nevcal.com  Sat Oct 10 03:12:38 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Fri, 09 Oct 2009 18:12:38 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACFB456.6010106@is.kochi-u.ac.jp>
References: <PC19220091008162023032819903bdf@msapiro>
	<4ACEABFD.6010309@g.nevcal.com> <4ACEB234.9030309@is.kochi-u.ac.jp>
	<4ACED8C4.5070906@g.nevcal.com> <4ACEF66B.3000500@is.kochi-u.ac.jp>
	<4ACFA08F.9080307@g.nevcal.com> <4ACFB456.6010106@is.kochi-u.ac.jp>
Message-ID: <4ACFDF86.8040104@g.nevcal.com>

On approximately 10/9/2009 3:08 PM, came the following characters from 
the keyboard of Tokio Kikuchi:
> What you said in message-id: <4ACE6F97.6010605 at g.nevcal.com> was:
>   
>> When it is not ASCII, it may still be
>> separated and recognizable.
>>     
> and our discussion might be concluded that this is true 'not really, but
> only theoretically.'  Your suggestions 1)-4) are not acceptable to
> Japanese users at all.
>   

There's something I don't understand here, and I hope you'll take a 
few moments to explain...

If a message with an encoded header arrives (like your number 2 sample) 
but it cannot be decoded, what action _is_ acceptable to Japanese 
users?  And what action is implemented in Mailman (if different)?

I can think of a 5th technique... don't modify the header, and send it 
through unchanged.  Now I think I've covered the gamut of possibilities, 
so if there is one I've missed, I'm extremely interested to learn (or be 
reminded) of it.


What I meant by "may still be separated and recognizable", is, in fact, 
somewhat theoretical.  Since I can't type Japanese, I'll just use a 
single accented non-ASCII character in my explanation, but here goes:

Message A arrives at Mailman for distribution.  No subject prefix or 
numbering.  Since it is Mailman doing it, Mailman could notice that the 
prefix is like  [abcdéfg 126] and must be encoded.  Mailman could encode 
the prefix as a separate encoded word from the rest of the subject 
value.  Let's assume that it does.

The rest cannot be guaranteed, because we have no control over the MUA 
of the person that replies.  But it might come back in the same 
manner... one encoded word with the prefix and then the rest of the 
subject line, possibly encoded, possibly not.  If it does, then if the 
first encoded word can be decoded, and the prefix and numbering 
recognized, then the modification to assign a new number can be done, 
whether or not the remaining part of the subject line can be decoded.

So that is what I meant, by the above.  It isn't a guarantee in any 
manner.  It could realistically happen, though, if an MUA simply adds 
"Re: " to the front of the stuff that it is passed (or an encoded word 
with an appropriate translation for "Re: ").
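
A minimal sketch of that renumbering idea in Python.  It assumes the
list prefix was sent as its own base64 encoded word, with the trailing
space inside the word (RFC 2047 ignores whitespace between adjacent
encoded words).  encode_word and renumber are hypothetical helper names
for illustration only; they are not Mailman or email package APIs, and
the 75-character encoded-word limit is ignored.

```python
import base64
import re

# Leading "Re: " run, then the first encoded word (base64 form only).
FIRST_WORD = re.compile(r"^((?:Re:\s*)*)=\?([^?\s]+)\?[bB]\?([^?\s]+)\?=")
# A list prefix such as "[listname 126]".
PREFIX = re.compile(r"^\[(.+) (\d+)\]\s*$")


def encode_word(text, charset="utf-8"):
    """Encode text as a single RFC 2047 base64 encoded word."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return "=?%s?b?%s?=" % (charset, payload)


def renumber(subject, new_number):
    """Replace the number in a separately encoded prefix, if there is one."""
    match = FIRST_WORD.match(subject)
    if match is None:
        return subject
    charset = match.group(2)
    try:
        word = base64.b64decode(match.group(3), validate=True).decode(charset)
    except (LookupError, ValueError):
        return subject  # first word not decodable: pass through unchanged
    prefix = PREFIX.match(word)
    if prefix is None:
        return subject
    new_word = encode_word("[%s %d] " % (prefix.group(1), new_number), charset)
    # Everything after the first encoded word is copied verbatim.
    return match.group(1) + new_word + subject[match.end():]


# The rest of the subject uses a charset we cannot decode; it is untouched.
original = encode_word("[abcd\u00e9fg 126] ") + " =?x-unknown?b?AAAA?="
print(renumber("Re: " + original, 127))
```

The point of the design is that only the first encoded word is ever
decoded; the remainder of the subject is treated as opaque text.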

MUAs or mailing list handlers that decode, process, and reencode, will 
probably not produce headers with that pattern, but more likely like the 
one you showed in example 2.  MUAs or mailing list handlers that attempt 
to retain what was sent (idempotency or invertibility), would be more 
likely to do what I describe, and are more robust when faced with new 
character sets that they don't understand how to decode.

> I couldn't resist writing because the discussion was important in
> designing Mailman's subject prefixing and numbering.  I'll shut up my
> mouth again because I'm so busy.
>
> Sorry for disturbing

Thanks for your contribution; I hope for one more, at least.

-- 
Glenn -- http://nevcal.com/


From rdmurray at bitdance.com  Sat Oct 10 03:25:56 2009
From: rdmurray at bitdance.com (R. David Murray)
Date: Fri, 9 Oct 2009 21:25:56 -0400 (EDT)
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACFDB3C.5040307@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACCD10D.4070308@g.nevcal.com>
	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACE6CBD.2030805@g.nevcal.com>
	<87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACED79F.6050602@g.nevcal.com>
	<87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACF9C6B.4020508@g.nevcal.com>
	<Pine.LNX.4.64.0910091909000.18193@kimball.webabinitio.net>
	<4ACFDB3C.5040307@g.nevcal.com>
Message-ID: <Pine.LNX.4.64.0910092101280.18193@kimball.webabinitio.net>

On Fri, 9 Oct 2009 at 17:54, Glenn Linderman wrote:
> On approximately 10/9/2009 4:20 PM, came the following characters from the 
> keyboard of R. David Murray:
>>  On Fri, 9 Oct 2009 at 13:26, Glenn Linderman wrote:
>> >  On approximately 10/9/2009 8:10 AM, came the following characters from 
>> >  the keyboard of Stephen J. Turnbull:
>> > >   Glenn Linderman writes:
>> > > > > >   produce a defect report, but then simply converted to Unicode as if
>> > > > > >   it were Latin-1 (since there is no other knowledge available that
>> > > > > >   could produce a better conversion).
>> > > > >
>> > > > >   No, that is already corruption.  Most clients will assume that string
>> > > > >   is valid as a header, because it's valid as a string.
>> > > >
>> > > >   Sure it is corruption.  That's why there is a defect report.  But
>> > > >   the conversion technique is appropriate, per the Postel principle.
>> > > 
>> > >   Actually, I would say you are emitting leniently, in violation of the
>> > >   Postel principle. 
>> > 
>> >  You can say that, but I don't have to believe it.  I'm talking about 
>> >  accepting; the message has arrived, it is here, the client is trying to 
>> >  look at it, and I'm talking about ways the client can look at 
>> >  not-quite-perfect data, knowing that it is not quite perfect, but still 
>> >  being able to see it. I'm not at all talking about emitting data.  You 
>> >  seem to be calling the email package helping the client to accept 
>> >  not-quite-perfect data, as a form of emitting data.  It is not.
>>
>>  IMO, the appropriate way for the email package to provide the API you
>>  are talking about is to provide the client with a way to get at the raw
>>  byte string, which I think everyone agrees on.  If the client wants to
>>  decode it as if it were latin-1 to process it, it can then do that. 
>
> That certainly works, but it isn't very helpful... that forces the client 
> application to reproduce the logic to parse the header value and decode the 
> parts that can be decoded successfully, and that is exactly the sort of thing 
> Stephen was complaining about when he thought I was suggesting that to be a 
> requirement (but he was confused about what I was suggesting).

I wasn't clear, sorry :).  The current API has a 'decode_header' function,
which doesn't do the byte-to-unicode decode (yeah, there's another naming
problem here...we have two types of decoding and only one word for both)
but instead returns (bytes, charset) tuples.  This piece of the API is
broken in python3, and I don't think it is the right API going forward,
but that _kind_ of API is what I meant by 'getting at the raw byte
string':  the byte string that failed the bytes-to-unicode decoding,
not the entire header (though there will also be a way to get that if
you need it, I presume.)
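
For illustration, here is a small sketch of that _kind_ of API, written
against email.header.decode_header as it behaves in later Python 3
releases (where it does run).  decode_chunks is a hypothetical wrapper
name, not part of the email package: it attempts the bytes-to-unicode
step for each chunk and, where that fails, still hands back the raw
bytes and the claimed charset.

```python
from email.header import decode_header


def decode_chunks(header_value):
    """Return (text, raw, charset) triples; text is None if decoding failed."""
    triples = []
    for chunk, charset in decode_header(header_value):
        if isinstance(chunk, str):
            # A header without encoded words comes back as str already.
            triples.append((chunk, None, charset))
            continue
        try:
            text = chunk.decode(charset or "ascii")
        except (UnicodeDecodeError, LookupError):
            text = None  # caller still has the raw bytes and claimed charset
        triples.append((text, chunk, charset))
    return triples


print(decode_chunks("=?iso-8859-1?q?p=F6stal?="))
print(decode_chunks("=?x-unknown?q?abc?="))
```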

--David (RDM)

From v+python at g.nevcal.com  Sat Oct 10 05:46:23 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Fri, 09 Oct 2009 20:46:23 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <Pine.LNX.4.64.0910092101280.18193@kimball.webabinitio.net>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACCD10D.4070308@g.nevcal.com>
	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACE6CBD.2030805@g.nevcal.com>
	<87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACED79F.6050602@g.nevcal.com>
	<87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACF9C6B.4020508@g.nevcal.com>
	<Pine.LNX.4.64.0910091909000.18193@kimball.webabinitio.net>
	<4ACFDB3C.5040307@g.nevcal.com>
	<Pine.LNX.4.64.0910092101280.18193@kimball.webabinitio.net>
Message-ID: <4AD0038F.4000705@g.nevcal.com>

On approximately 10/9/2009 6:25 PM, came the following characters from 
the keyboard of R. David Murray:
> On Fri, 9 Oct 2009 at 17:54, Glenn Linderman wrote:
>> On approximately 10/9/2009 4:20 PM, came the following characters 
>> from the keyboard of R. David Murray:
>>>  On Fri, 9 Oct 2009 at 13:26, Glenn Linderman wrote:
>>> >  On approximately 10/9/2009 8:10 AM, came the following characters from
>>> >  the keyboard of Stephen J. Turnbull:
>>> > >   Glenn Linderman writes:
>>> > > > > >   produce a defect report, but then simply converted to Unicode as if
>>> > > > > >   it were Latin-1 (since there is no other knowledge available that
>>> > > > > >   could produce a better conversion).
>>> > > > >
>>> > > > >   No, that is already corruption.  Most clients will assume that string
>>> > > > >   is valid as a header, because it's valid as a string.
>>> > > >
>>> > > >   Sure it is corruption.  That's why there is a defect report.  But
>>> > > >   the conversion technique is appropriate, per the Postel principle.
>>> > >
>>> > >   Actually, I would say you are emitting leniently, in violation of the
>>> > >   Postel principle.
>>> >
>>> >  You can say that, but I don't have to believe it.  I'm talking about
>>> >  accepting; the message has arrived, it is here, the client is trying to
>>> >  look at it, and I'm talking about ways the client can look at
>>> >  not-quite-perfect data, knowing that it is not quite perfect, but still
>>> >  being able to see it. I'm not at all talking about emitting data.  You
>>> >  seem to be calling the email package helping the client to accept
>>> >  not-quite-perfect data, as a form of emitting data.  It is not.
>>>
>>>  IMO, the appropriate way for the email package to provide the API you
>>>  are talking about is to provide the client with a way to get at the raw
>>>  byte string, which I think everyone agrees on.  If the client wants to
>>>  decode it as if it were latin-1 to process it, it can then do that.
>>
>> That certainly works, but it isn't very helpful... that forces the 
>> client application to reproduce the logic to parse the header value 
>> and decode the parts that can be decoded successfully, and that is 
>> exactly the sort of thing Stephen was complaining about when he 
>> thought I was suggesting that to be a requirement (but he was 
>> confused about what I was suggesting).
>
> I wasn't clear, sorry :).  The current API has a 'decode_header' 
> function,
> which doesn't do the byte-to-unicode decode (yeah, there's another naming
> problem here...we have two types of decoding and only one word for both)
> but instead returns (bytes, charset) tuples.  This piece of the API is
> broken in python3, and I don't think it is the right API going forward,
> but that _kind_ of API is what I meant by 'getting at the raw byte
> string':  the byte string that failed the bytes-to-unicode decoding,
> not the entire header (though there will also be a way to get that if
> you need it, I presume.) 

Yeah, that'd be better. 

Of course, when returning Unicode strings, there would be no particular 
need to identify the various charsets in which the header was 
transmitted, except for invertibility and error handling, unless the 
client wanted to track that for some reason. 
If the goal is to preserve invertibility, then maybe tuples like (str, 
charset, defect) would be better, where defect would be None for good 
data.  If defect were "non-ASCII", then you'd know the str was converted 
as if it were charset (Latin-1 in my book; if the email package had 
rules, or the API had parameters, for how to deal with non-ASCII stuff, 
some other charset could be specified, but if that fails it might still 
have to fall back to Latin-1).  If defect were "ASCII", then you'd know 
that the str looked like an encoded word but couldn't be decoded, 
because the charset wasn't recognized or the decoding via that charset 
failed, so the encoded word itself was supplied.

Correspondingly, a header value could be set by supplying such a list, 
even with defect values as described above, to permit invertibility and 
passing on what was obtained.  That way, if there are overriding local 
conventions (yep, such things used to be used, and maybe still are in 
some areas), the data would be preserved as well as possible, and the 
email package could support creation of messages according to the local 
conventions.

I'd hope that a separate tuple would be used for each encoded-word, or, 
if charset ASCII and defect None, then it would describe a run of ASCII 
between encoded words.  Yes, an encoded word can be encoded in ASCII for 
rare use (if the input word looks like an encoded word), so that would 
cause a sequence of charset ASCII, defect None tuples, but otherwise a 
plain ASCII header value would have a single entry in the list of tuples.
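
A rough Python sketch of the (str, charset, defect) idea, to make the
shape concrete.  Everything here is an assumption for illustration:
Chunk, header_chunks and the helper names are hypothetical, the regular
expression is a simplified encoded-word matcher, the charset reported
for an undecodable encoded word is taken to be "ascii", and
whitespace-only runs between words are simply dropped.

```python
import base64
import quopri
import re
from collections import namedtuple

Chunk = namedtuple("Chunk", "text charset defect")

# Simplified encoded-word matcher: printable ASCII only, no "?" inside fields.
ENCODED_WORD = re.compile(rb"=\?([!->@-~]+)\?([bBqQ])\?([!->@-~]*)\?=")


def _decode_word(match):
    charset = match.group(1).decode("ascii")
    try:
        if match.group(2).lower() == b"b":
            raw = base64.b64decode(match.group(3), validate=True)
        else:
            raw = quopri.decodestring(match.group(3), header=True)
        return Chunk(raw.decode(charset), charset, None)
    except (LookupError, ValueError):
        # Unknown charset or failed decode: supply the encoded word itself.
        return Chunk(match.group(0).decode("ascii"), "ascii", "ASCII")


def _plain_run(raw, chunks):
    if not raw.strip():
        return  # nothing but whitespace between words
    try:
        chunks.append(Chunk(raw.decode("ascii"), "ascii", None))
    except UnicodeDecodeError:
        # Raw 8-bit data: convert as if Latin-1 and flag it.
        chunks.append(Chunk(raw.decode("latin-1"), "latin-1", "non-ASCII"))


def header_chunks(raw):
    """Split a raw header value (bytes) into a list of Chunk tuples."""
    chunks, pos = [], 0
    for match in ENCODED_WORD.finditer(raw):
        _plain_run(raw[pos:match.start()], chunks)
        chunks.append(_decode_word(match))
        pos = match.end()
    _plain_run(raw[pos:], chunks)
    return chunks


for value in (b"Re: =?iso-8859-1?q?p=F6stal?=", b"=?x-nope?q?abc?=", b"caf\xe9"):
    print(header_chunks(value))
```

A client that ignores the defect field still gets displayable text,
which is exactly the risk discussed elsewhere in this thread; a client
that checks it can tell the three cases apart.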

-- 
Glenn -- http://nevcal.com/


From stephen at xemacs.org  Sat Oct 10 15:59:03 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sat, 10 Oct 2009 22:59:03 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACF9C6B.4020508@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACCD10D.4070308@g.nevcal.com>
	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACE6CBD.2030805@g.nevcal.com>
	<87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACED79F.6050602@g.nevcal.com>
	<87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACF9C6B.4020508@g.nevcal.com>
Message-ID: <87iqenl594.fsf@uwakimon.sk.tsukuba.ac.jp>

I'm running out of time to work on this (yeah, I know it's the
weekend, but my life is like that lately).  I think we're converging,
though, so I'd like to try and tie some of those ends together.

Glenn Linderman writes:
 > On approximately 10/9/2009 8:10 AM, came the following characters from 
 > the keyboard of Stephen J. Turnbull:

 > > Actually, I would say you are emitting leniently, in violation of the
 > > Postel principle.  
 > 
 > You can say that, but I don't have to believe it.  I'm talking about 
 > accepting; the message has arrived, it is here, the client is trying to 
 > look at it, and I'm talking about ways the client can look at 
 > not-quite-perfect data, knowing that it is not quite perfect, but still 
 > being able to see it.  I'm not at all talking about emitting data.

It would be indeed, if the corrupt data is stored in the place where
correctly decoded data normally is stored, and is accessible in the
same way.  But I gather that's not what you were talking about, my
mistake.

 > You seem to be calling the email package helping the client to
 > accept not-quite-perfect data, as a form of emitting data.  It is
 > not.

No, I was confused by the way you wrote.  Saving the data *somewhere*
is absolutely necessary; not losing data is the #1 commandment of
low-level mail processing.  Surely the email module is subject to that
commandment.  *Nobody* is talking about losing any data yet, except
Barry indirectly, when he says that some people are thinking of giving
up on invertibility (often called "idempotency"), and even he is quite
adamant that he's not going to give up on that.

So when you wrote about saving and converting to text form, without
mentioning the specific APIs, I assumed you meant the "mainline"
APIs for parsing and accessing parts of a correctly formatted message.

 > The email package cannot police the client... if it chooses to "eat it 
 > in a single gulp without looking at it" then it may get indigestion.  I 
 > never suggested that "converting to Unicode as if it were Latin-1" 
 > should be done without informing the client, or being requested by the 
 > client to do that via a special API call...

Well, maybe I misread it, but it certainly looked like that to me.  I
would not object to that special API call defaulting to ISO 8859/1.

 > If you ignore defect reports, you are ignorant (blunt, but not intended 
 > to be offensive).

What I worried about is that if defect reports are present, *but
displayable data is also present*, programmers *will* simply display
it, for example in producing a prototype program.  It will be
impossible to determine without very close analysis of that program
that an early version became a production version without adding
appropriate checks.  In practice, this bug will be discovered when
some end user's installation breaks.

It seems that you agree with this, and because the special API call is
necessary, it will be easy to identify whether proper care is being
taken or not.  Right?

 > >  > It is still raw user input, and should still be checked for proper 
 > >  > syntax by the client,
 > >
 > > Nonsense.  The email module had better know a lot more about syntax
 > > than the client.  If it doesn't, whack it with a 2x4 until it learns!
 > 
 > I think we are talking at cross purposes here.  I find it quite 
 > difficult to follow where you cross the boundary between talking about 
 > one sort of email package client, and then switch to another type, or 
 > switch to the responsibilities of the email package.

Excuse me?  The "raw user input" you referred to above is material
that the client software receives from the email package.  The email
package should give it to the client in the "normal" (convenient) way
only if it can certify that it conforms to the appropriate standard.

That standard should be specified in the API documentation.  Any more
detailed structure, of course, is the responsibility of the client.

 > An application which is using email as a transport, has specific goals, 
 > which require specific content.  You were mentioning clients.

I've already said that when I speak of an MUA, I write "MUA".  In
speaking of the calling program, which might even be a user running
the module via the Python interpreter, I write "client".  It's a very
convenient way to describe the user of an API, in contrast to the
provider of the API (the implementation).

 > If such a client doesn't validate the syntax of that content, it
 > isn't much of an application.

If that MUA or email application uses RFC 822 addresses, it should be
able to rely on the email module to parse those addresses correctly,
or provide a defect report.  One might even go so far as to suggest
that it be able to parse the (non-RFC, but very common) "+" notation
for separating the "mailbox" from "additional data" used for VERP and
challenge-response applications.  That would have to be documented,
but if so documented client applications like the MUA should be able
to rely on it (and you can bet many will).

Application domain syntax of course is not the email module's problem
whether it arrives by email or Pony Express, and I'm really confused
why you're going so far afield.

 > > No, they cannot just be raised.  If you just raise the error, then the
 > > next time you try to access unparsed data, you'll hit the error
 > > again.  If you use the same handler you did before, you're in an
 > > infloop.  So you need a second handler to do things differently this
 > > time or a flag ... but it's unclear to me that that flag can be a
 > > boolean.  So you may as well store the defect list and information
 > > about where to restart.
 > 
 >  From the point of view of the email package, the errors can just be 
 > raised.  Then the client can make choices, and use other APIs or other 
 > parameters to the API to direct the email package to attempt a different 
 > technique to access the data.

The problem is that by this point some of the state of the parse may
be lost.  We can't say "just raise", we need to say "interrupt the
parse, preserve state, and then raise".   Python does absolutely
nothing to help with the problem of preserving the state.  We also
need to determine just what state to preserve.
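
A minimal Python sketch of "interrupt the parse, preserve state, and
then raise", under the assumption that the state worth preserving is
the headers parsed so far plus the unparsed remainder.  The names
HeaderParseInterrupted and parse_headers, and the lenient flag, are
hypothetical and exist only for this illustration.

```python
class HeaderParseInterrupted(Exception):
    """Raised on a defect; carries enough state to resume the parse."""

    def __init__(self, defect, parsed, remaining):
        super().__init__(defect)
        self.defect = defect
        self.parsed = parsed        # headers successfully parsed so far
        self.remaining = remaining  # unparsed lines, offending line first


def parse_headers(lines, lenient=False, parsed=None):
    """Parse b"Name: value" lines into (name, value) pairs."""
    parsed = list(parsed or [])
    for index, line in enumerate(lines):
        try:
            text = line.decode("ascii")
        except UnicodeDecodeError:
            if not lenient:
                # Preserve state before raising, so the caller can restart.
                raise HeaderParseInterrupted(
                    "non-ASCII header", parsed, lines[index:])
            text = line.decode("latin-1")
        name, _, value = text.partition(":")
        parsed.append((name.strip(), value.strip()))
    return parsed


lines = [b"To: a@example.com", b"Subject: caf\xe9", b"X-Id: 1"]
try:
    headers = parse_headers(lines)
except HeaderParseInterrupted as stop:
    # The client chooses a different technique and resumes where it stopped.
    headers = parse_headers(stop.remaining, lenient=True, parsed=stop.parsed)
print(headers)
```

Because the restart point travels with the exception, the second
attempt does not hit the same error at the same place.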

 > Yes, I have learned that in my 34 years of programming.  I agree.
 > 
 > > So it's OK to write a lazy parser, but it must retain enough state so
 > > that it can work forward until the end. [...]
 > 
 > Are you speaking about parsing the message into MIME parts, or parsing a 
 > particular MIME part contained within the message, or both?

Both.  I *believe* (but it needs to be checked) that in a correctly
formed multipart MIME object (message or part), any internal structure
is context-free within the MIME boundaries.  If that is so, then
individual parts of the object can be stored in raw form and parsed
lazily.

Similarly, for any MIME or RFC 822 object, the object can be parsed
into header section and body section, and each can be stored and
parsed lazily, subject to the condition that the header section must
be sufficiently parsed to identify all headers that might affect
parsing the body part before the body part is parsed.  That
"condition" is the context.
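
A sketch of that lazy scheme in Python, under several simplifying
assumptions: LF line endings, a well-formed multipart with a closing
delimiter, and a naive split on the boundary string.  LazyMessage is a
hypothetical class for illustration (it needs Python 3.8 or later for
functools.cached_property); only BytesParser and the Message methods
come from the email package.

```python
from email.parser import BytesParser
from functools import cached_property


class LazyMessage:
    """Keep the raw bytes; parse headers and sub-parts only on first use."""

    def __init__(self, raw):
        self.raw = raw  # the original bytes are never thrown away
        head, sep, body = raw.partition(b"\n\n")
        self.raw_headers = head + sep
        self.raw_body = body

    @cached_property
    def headers(self):
        return BytesParser().parsebytes(self.raw_headers, headersonly=True)

    @cached_property
    def parts(self):
        # The headers are the context: they must be parsed first, because
        # Content-Type decides how the body is to be split.
        if self.headers.get_content_maintype() != "multipart":
            return []
        boundary = self.headers.get_boundary()
        if boundary is None:
            return []
        pieces = self.raw_body.split(b"--" + boundary.encode("ascii"))
        # pieces[0] is the preamble, pieces[-1] the closing "--" and epilogue.
        return [LazyMessage(piece[1:-1]) for piece in pieces[1:-1]]


raw = (b"Content-Type: multipart/mixed; boundary=XX\n\npreamble\n"
       b"--XX\nContent-Type: text/plain\n\nhello\n"
       b"--XX\nContent-Type: text/html\n\n<p>hi</p>\n--XX--\n")
message = LazyMessage(raw)
print([part.headers.get_content_type() for part in message.parts])
```

Each sub-part is stored in raw form and is itself a LazyMessage, so its
own headers and parts are not touched until somebody asks for them.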

From stephen at xemacs.org  Sat Oct 10 17:40:50 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sun, 11 Oct 2009 00:40:50 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACFDF86.8040104@g.nevcal.com>
References: <PC19220091008162023032819903bdf@msapiro>
	<4ACEABFD.6010309@g.nevcal.com> <4ACEB234.9030309@is.kochi-u.ac.jp>
	<4ACED8C4.5070906@g.nevcal.com> <4ACEF66B.3000500@is.kochi-u.ac.jp>
	<4ACFA08F.9080307@g.nevcal.com> <4ACFB456.6010106@is.kochi-u.ac.jp>
	<4ACFDF86.8040104@g.nevcal.com>
Message-ID: <87fx9rl0jh.fsf@uwakimon.sk.tsukuba.ac.jp>

Glenn Linderman writes:
 > On approximately 10/9/2009 3:08 PM, came the following characters from 
 > the keyboard of Tokio Kikuchi:

 > > Your suggestions 1)-4) are not acceptable to Japanese users at
 > > all.

 > If a message with an encoded header arrives (like your number 2 sample) 
 > but it cannot be decoded, what action _is_ acceptable to Japanese 
 > users?  And what action is implemented in Mailman (if different)?

I know a fair bit about Japanese (both the language and the users),
and I'm having difficulty understanding what Tokio means, given your
list of hypotheses.  I suspect he's basically rejecting the hypothesis
that it can't be decoded -- if it can't be decoded, then learn how to
do so!

 > I can think of a 5th technique... don't modify the header, and send
 > it through unchanged.  Now I think I've covered the gamut of
 > possibilities,

I agree.  However, I think we're way out of bounds here.  We already
know how to decode anything that RFC 2047 can throw at us in charsets
that Python can handle.  Anything that can't be decoded then is
seriously malformed from the point of view of the mailing list users.
So why are we discussing this?  We don't even know what our mainline
APIs are going to look like, why are we discussing forcibly operating
on broken input?

[[ Aside:

 > with an appropriate translation for "Re: ").

"Re" is a Latin abbreviation; there is no appropriate translation. ;-)
]]

 > MUAs or mailing list handlers that attempt to retain what was sent
 > (idempotency or invertibility), would be more likely to do what I
 > describe, and are more robust when faced with new character sets
 > that they don't understand how to decode.

Maybe they are, but the email module doesn't know or care about what
they do.  Let's stick within what the email module is supposed to
handle.


From v+python at g.nevcal.com  Sat Oct 10 22:01:46 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Sat, 10 Oct 2009 13:01:46 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87iqenl594.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>	<4ACB0DC9.7080307@g.nevcal.com>	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACB971D.9080706@g.nevcal.com>	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACC0277.2060807@g.nevcal.com>	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACCD10D.4070308@g.nevcal.com>	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACE6CBD.2030805@g.nevcal.com>	<87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACED79F.6050602@g.nevcal.com>	<87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACF9C6B.4020508@g.nevcal.com>
	<87iqenl594.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <4AD0E82A.5000603@g.nevcal.com>

On approximately 10/10/2009 6:59 AM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> I'm running out of time to work on this (yeah, I know it's the
> weekend, but my life is like that lately).  I think we're converging,
> though, so I'd like to try and tie some of those ends together.
>   

I think we are converging too... mostly terminology issues and 
assumptions were causing a few misunderstandings.

> Glenn Linderman writes:
>  > On approximately 10/9/2009 8:10 AM, came the following characters from 
>  > the keyboard of Stephen J. Turnbull:
>
>  > > Actually, I would say you are emitting leniently, in violation of the
>  > > Postel principle.  
>  > 
>  > You can say that, but I don't have to believe it.  I'm talking about 
>  > accepting; the message has arrived, it is here, the client is trying to 
>  > look at it, and I'm talking about ways the client can look at 
>  > not-quite-perfect data, knowing that it is not quite perfect, but still 
>  > being able to see it.  I'm not at all talking about emitting data.
>
> It would be indeed, if the corrupt data is stored in the place where
> correctly decoded data normally is stored, and is accessible in the
> same way.  But I gather that's not what you were talking about, my
> mistake.
>   

Well, the client tells us where to store it, and we can't prevent it 
from being the same place.  But accessible in the same way?  Not.  Some 
extra parameter or different API, would surely be required to get 
not-quite-perfect data.

>  > You seem to be calling the email package helping the client to
>  > accept not-quite-perfect data, as a form of emitting data.  It is
>  > not.
>
> No, I was confused by the way you wrote.  Saving the data *somewhere*
> is absolutely necessary; not losing data is the #1 commandment of
> low-level mail processing.  Surely the email module is subject to that
> commandment.  *Nobody* is talking about losing any data yet, except
> Barry indirectly when he says that some people think about giving up
> on invertibility (often called "idempotency"), and even he is quite
> adamant that he's not going to give up on that.
>
> So when you wrote about saving and converting to text form, without
> mentioning the specific APIs, I assumed you meant the "mainline"
> APIs for parsing and accessing parts of a correctly formatted message.
>   

Mostly, I hadn't bothered about APIs yet; I'm not yet very familiar with 
the existing ones, because neither nPOPuk nor SeaMonkey nor Thunderbird, 
the only email programs that I have looked at source code for, use the 
Python email package!  So while I'm reasonably familiar with the RFCs, 
and quite familiar with nPOPuk source, and have looked at a small 
fraction of the SeaMonkey/Thunderbird source code (and been amazed at 
how big it is), and have examined email from a large variety of sources 
comparing it to the RFCs to see where it goes wrong and why it doesn't 
display in SeaMonkey/Thunderbird the same way as in Outlook/Outlook 
Express (or other programs), and have found Outlook 2000 and Apple Mail 
to be quite creative in interpreting the RFCs, I'm new to the Python 
email package.

>  > The email package cannot police the client... if it chooses to "eat it 
>  > in a single gulp without looking at it" then it may get indigestion.  I 
>  > never suggested that "converting to Unicode as if it were Latin-1" 
>  > should be done without informing the client, or being requested by the 
>  > client to do that via a special API call...
>
> Well, maybe I misread it, but it certainly looked like that to me.  I
> would not object to that special API call defaulting to ISO 8859/1.
>
>  > If you ignore defect reports, you are ignorant (blunt, but not intended 
>  > to be offensive).
>
> What I worried about is that if defect reports are present, *but
> displayable data is also present*, programmers *will* simply display
> it, for example in producing a prototype program.  It will be
> impossible to determine without very close analysis of that program
> that an early version became a production version without adding
> appropriate checks.  In practice, this bug will be discovered when
> some end user's installation breaks.
>
> It seems that you agree with this, and because the special API call is
> necessary, it will be easy to identify whether proper care is being
> taken or not.  Right?
>   

Well, yes and no. 

I think that the email package should require that some special action 
needs to be taken by the client to request not-quite-perfect data, 
either a special parameter value, or different API, etc.  But there is 
nothing that says that some client might not pass that all the time, and 
ignore the defect reports.  Whether that is easy to identify or not, and 
whether the email package wants to require that the normal APIs be tried 
before the not-quite-perfect APIs are issues for discussion.

Ultimately, the email package cannot enforce that proper care is taken 
by the client; only code reviews of the client can encourage that.

>  > >  > It is still raw user input, and should still be checked for proper 
>  > >  > syntax by the client,
>  > >
>  > > Nonsense.  The email module had better know a lot more about syntax
>  > > than the client.  If it doesn't, whack it with a 2x4 until it learns!
>  > 
>  > I think we are talking at cross purposes here.  I find it quite 
>  > difficult to follow where you cross the boundary between talking about 
>  > one sort of email package client, and then switch to another type, or 
>  > switch to the responsibilities of the email package.
>
> Excuse me?  The "raw user input" you referred to above is material
> that the client software receives from the email package.  The email
> package should give it to the client in the "normal" (convenient) way
> only if it can certify that it conforms to the appropriate standard.
>   

Yes, agreed.  And a special way or ways to get various algorithms for 
attempting to interpret not-quite-perfect data, when the client thinks 
that might be useful.  Then the client has "tweaked" user input.

> That standard should be specified in the API documentation.  Any more
> detailed structure, of course, is the responsibility of the client.
>   

Right.  And it is the more detailed structure that I was referring to... 
Even if the structure of the email is incorrect, if the client can find 
its input among the various attempts to obtain data from the 
not-quite-perfect email message, and can validate and check its input, 
it may choose to process it even if the email message is imperfect... it 
should probably note somewhere that the email message from which the 
data was obtained was not perfect, but really, that is up to the client 
to figure out, based on its requirements.

>  > An application which is using email as a transport, has specific goals, 
>  > which require specific content.  You were mentioning clients.
>
> I've already said that when I speak of an MUA, I write "MUA".  In
> speaking of the calling program, which might even be a user running
> the module via the Python interpreter, I write "client".  It's a very
> convenient way to describe the user of an API, in contrast to the
> provider of the API (the implementation).
>   

Yep, so I think my "application" and your "client" are the same thing.  
I'm trying to use your term as I continue responding in these threads, 
it is reasonable.

>  > If such a client doesn't validate the syntax of that content, it
>  > isn't much of an application.
>
> If that MUA or email application uses RFC 822 addresses, it should be
> able to rely on the email module to parse those addresses correctly,
> or provide a defect report.  One might even go so far as to suggest
> that it be able to parse the (non-RFC, but very common) "+" notation
> for separating the "mailbox" from "additional data" used for VERP and
> challenge-response applications.  That would have to be documented,
> but if so documented client applications like the MUA should be able
> to rely on it (and you can bet many will).
>   

Hmm.  This is an interesting digression...

"+", according to the RFCs, is just another of the legal characters that 
can be found before the @ in an unquoted email address... the list is 
!#$%&'*+-/=?^_`{}|~ in addition to the alphanumerics.

How a particular email server interprets the "stuff before the @" is 
pretty much up to it... so as long as it does something appropriate, it 
can interpret all or a fraction of it as a mailbox name, or it could
intuit a mailbox name from the body content if it wants, or even from a
special header.  So yeah, particular interpretations of the address are
non-RFC stuff.
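
To make the point concrete, here is a minimal sketch of one such
site-specific interpretation.  email.utils.parseaddr is the real
standard-library call; split_plus and ATEXT_SPECIALS are hypothetical
names used only for this illustration, and the "+" convention itself is
a local policy, not something the RFCs define.

```python
from email.utils import parseaddr

# Characters allowed in an unquoted local part, in addition to ASCII
# letters and digits (the list quoted above).
ATEXT_SPECIALS = set("!#$%&'*+-/=?^_`{}|~")


def split_plus(address):
    """Split 'mailbox+extra@domain' into (mailbox, extra, domain).

    Hypothetical helper: treating '+' as a separator is a convention of
    some servers, not an RFC rule.
    """
    _, addr = parseaddr(address)
    local, _, domain = addr.rpartition("@")
    mailbox, _, extra = local.partition("+")
    return mailbox, extra, domain


result = split_plus("List <list+bounce-42@example.com>")
```

A different server could just as legitimately treat the whole local
part as the mailbox name, which is why this kind of splitting belongs
to the client or to a documented extension rather than to core parsing.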

> Application domain syntax of course is not the email module's problem
> whether it arrives by email or Pony Express, and I'm really confused
> why you're going so far afield.
>   

Just to point out that good data can be obtained from bad email 
messages, I think, and that that is a use case.

>  > > No, they cannot just be raised.  If you just raise the error, then the
>  > > next time you try to access unparsed data, you'll hit the error
>  > > again.  If you use the same handler you did before, you're in an
>  > > infloop.  So you need a second handler to do things differently this
>  > > time or a flag ... but it's unclear to me that that flag can be a
>  > > boolean.  So you may as well store the defect list and information
>  > > about where to restart.
>  > 
>  >  From the point of view of the email package, the errors can just be 
>  > raised.  Then the client can make choices, and use other APIs or other 
>  > parameters to the API to direct the email package to attempt a different 
>  > technique to access the data.
>
> The problem is that by this point some of the state of the parse may
> be lost.  We can't say "just raise", we need to say "interrupt the
> parse, preserve state, and then raise".   Python does absolutely
> nothing to help with the problem of preserving the state.  We also
> need to determine just what state to preserve.
>
>  > Yes, I have learned that in my 34 years of programming.  I agree.
>  > 
>  > > So it's OK to write a lazy parser, but it must retain enough state so
>  > > that it can work forward until the end. [...]
>  > 
>  > Are you speaking about parsing the message into MIME parts, or parsing a 
>  > particular MIME part contained within the message, or both?
>
> Both.  I *believe* (but it needs to be checked) that in a correctly
> formed multipart MIME object (message or part), any internal structure
> is context-free within the MIME boundaries.  If that is so, then
> individual parts of the object can be stored in raw form and parsed
> lazily.
>
> Similarly, for any MIME or RFC 822 object, the object can be parsed
> into header section and body section, and each can be stored and
> parsed lazily, subject to the condition that the header section must
> be sufficiently parsed to identify all headers that might affect
> parsing the body part before the body part is parsed.  That
> "condition" is the context.
>   

Neither of these context conditions apply to correctly formed MIME 
trees, but are the only context I'm aware of that can affect parsing of 
MIME parts, AFAIK (and I just reread most of the MIME RFCs in the last 
few days).

The only context for parsing MIME parts that I'm aware of is that, when
determining the end of a nested MIME part, the search for the ending
delimiter must include searching for any higher-level delimiter as
well... to handle the case where the inner delimiter got lost.  So one
should search for CR LF --, and then examine the stuff after the -- to
match first the innermost delimiter, and then the next outermost, etc.,
and on finding a match, consider it the end of all the parts nested
within the delimiter found, the inner ones being considered truncated,
since their own delimiter was not found.

Unexpected end-of-data should also mark all unterminated nested MIME 
parts as incomplete, of course.
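
A rough sketch of that search, assuming the input has already been
split into lines and the enclosing boundaries are kept on a stack
(outermost first).  The function name and return convention are
hypothetical; this is not how the current feedparser is written.

```python
def find_part_end(lines, boundary_stack):
    """Find the line that ends the innermost open MIME part.

    Returns (line_no, depth, truncated), where depth indexes the matched
    boundary in boundary_stack and truncated counts the nested parts
    whose own delimiter was never seen.  If the data runs out first,
    every open part is incomplete.
    """
    innermost = len(boundary_stack) - 1
    for line_no, line in enumerate(lines):
        if not line.startswith(b"--"):
            continue
        rest = line[2:].rstrip(b"\r\n \t")
        # Try the innermost delimiter first, then work outward.
        for depth in range(innermost, -1, -1):
            boundary = boundary_stack[depth]
            if rest in (boundary, boundary + b"--"):
                return line_no, depth, innermost - depth
    # Unexpected end-of-data: nothing on the stack was terminated.
    return None, None, len(boundary_stack)


stack = [b"outer", b"inner"]
lost_inner = find_part_end([b"text\r\n", b"--outer\r\n"], stack)
clean_inner = find_part_end([b"text\r\n", b"--inner--\r\n"], stack)
ran_out = find_part_end([b"text\r\n"], stack)
```

In the first call the inner delimiter never appears, so the match on
the outer boundary reports one truncated nested part.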


The only other cross-part context that I am aware of is Content-ID 
references.  That doesn't affect parsing, but rather semantic 
interpretation, after parsing, validation, and decoding is complete.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From v+python at g.nevcal.com  Sat Oct 10 22:20:02 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Sat, 10 Oct 2009 13:20:02 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87fx9rl0jh.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <PC19220091008162023032819903bdf@msapiro>	<4ACEABFD.6010309@g.nevcal.com>	<4ACEB234.9030309@is.kochi-u.ac.jp>	<4ACED8C4.5070906@g.nevcal.com>	<4ACEF66B.3000500@is.kochi-u.ac.jp>	<4ACFA08F.9080307@g.nevcal.com>	<4ACFB456.6010106@is.kochi-u.ac.jp>	<4ACFDF86.8040104@g.nevcal.com>
	<87fx9rl0jh.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <4AD0EC72.6040704@g.nevcal.com>

On approximately 10/10/2009 8:40 AM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> Glenn Linderman writes:
>  > On approximately 10/9/2009 3:08 PM, came the following characters from 
>  > the keyboard of Tokio Kikuchi:
>
>  > > Your suggestions 1)-4) are not acceptable to Japanese users at
>  > > all.
>
>  > If a message with an encoded header arrives (like your number 2 sample) 
>  > but it cannot be decoded, what action _is_ acceptable to Japanese 
>  > users?  And what action is implemented in Mailman (if different)?
>
> I know a fair bit about Japanese (both the language and the users),
> and I'm having difficulty understanding what Tokio means, given your
> list of hypotheses.  I suspect he's basically rejecting the hypothesis
> that it can't be decoded -- if it can't be decoded, then learn how to
> do so!
>
>  > I can think of a 5th technique... don't modify the header, and send
>  > it through unchanged.  Now I think I've covered the gamut of
>  > possibilities,
>
> I agree.  However, I think we're way out of bounds here.  We already
> know how to decode anything that RFC 2047 can throw at us in charsets
> that Python can handle.  Anything that can't be decoded then is
> seriously malformed from the point of view of the mailing list users.
> So why are we discussing this?  We don't even know what our mainline
> APIs are going to look like, why are we discussing forcibly operating
> on broken input?
>   

Use case generation.  If the only way to access header values is to 
successfully, fully, decode them, then some uses may be rendered 
impossible, or at least difficult, even by choice of APIs.


> [[ Aside:
>
>  > with an appropriate translation for "Re: ").
>
> "Re" is a Latin abbreviation; there is no appropriate translation. ;-)
>   

Nonetheless, I have seen both Re: and Fwd: translated to other languages 
(besides Latin or geek) :)
Communication with people whose MUAs do such translations tends to
accumulate an alternating

Re: XRe: Re: XRe: Re: subject line

because neither MUA will recognize the other translation.

> ]]
>
>  > MUAs or mailing list handlers that attempt to retain what was sent
>  > (idempotency or invertibility), would be more likely to do what I
>  > describe, and are more robust when faced with new character sets
>  > that they don't understand how to decode.
>
> Maybe they are, but the email module doesn't know or care about what
> they do.  Let's stick within what the email module is supposed to
> handle

Yep, this is just use case exploration.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From rdmurray at bitdance.com  Sat Oct 10 23:20:59 2009
From: rdmurray at bitdance.com (R. David Murray)
Date: Sat, 10 Oct 2009 17:20:59 -0400 (EDT)
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACF880D.5080305@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
	<4ACE6A1B.7060702@g.nevcal.com>
	<3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org>
	<4ACF880D.5080305@g.nevcal.com>
Message-ID: <Pine.LNX.4.64.0910101708490.18193@kimball.webabinitio.net>

On Fri, 9 Oct 2009 at 11:59, Glenn Linderman wrote:
> On approximately 10/9/2009 5:05 AM, came the following characters from the 
> keyboard of Barry Warsaw:
>>  On Oct 8, 2009, at 6:39 PM, Glenn Linderman wrote:
>> >  1) wire format.  Either what came in, in the parser case, or what would 
>> >  be generated.
>> >  2) internal headers from the MIME part
>> >  3) decoded BLOB.  This means that quopri and base64 are decoded, no more 
>> >  and no less.  This is bytes.  No headers, only payload.  For 
>> >  Content-Transfer-Encoding: binary, this is mostly a noop.
>> >  4) text/* parts should also be obtainable as str()/unicode(), payload 
>> >  only.  This is where charset decoding is done.
>> > 
>> >  I think your talk in the next paragraph about hooks and other object 
>> >  types being produced is a generalization of 4, not 3, and generally no 
>> >  additional decoding needs to be done, just conversion to the right 
>> >  object type (or file, or file-like object).
>>  I mostly agree with that.  I've always called #4 the "decoded payload" and
>>  #3 I've usually called the "raw payload".  Maybe we can bikeshed on better
>>  terms to help inform us about the API's method/attribute names.
>
> It would be good though to have standardized terms for easier communication. 
> Maybe as they are chosen, they could be added to that Wiki RDM set up?

I didn't set it up, Barry did.  I just started adding stuff ;)

> My only problem with "raw" and "decoded" payload, is that there are 3 payload 
> formats, not 2, so there needs to be a 3rd term, corresponding to #1, #3, and 
> #4, above.  #2 is somewhat orthogonal from the payload.
>
> To me, "raw" conjures up #1, not #3.

I think I understand why Barry uses it for #3: it's the 'raw data' that
went in to get transfer-encoded in the first place.  But clearly the
term is ambiguous.

I have set up two more documents on the wiki.  One is UseCases[1], and I've
tried to copy into it all of the use cases that have been mentioned in
this discussion, plus a few more.  Edits welcome.

The other is a Glossary[2].  I think most of it accurately reflects the
consensus here, but in it I'm proposing to use the term 'transfer-decoded'
for #3, and 'transfer-encoded' as an alternative to 'wire-format' just
for symmetry.  Comments and suggestions welcome.
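
To illustrate the three payload forms (#1, #3 and #4), here is a small
sketch using the email API as it exists in current Python 3, which is
an assumption relative to this thread.  The variable names simply echo
the proposed glossary terms; they are not API names.

```python
import email

WIRE = (
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"caf=E9\r\n"
)

msg = email.message_from_bytes(WIRE)

# Payload as it travelled: still quoted-printable, returned as str.
transfer_encoded = msg.get_payload()

# #3, transfer-decoded: quopri/base64 undone, nothing more.  Bytes.
transfer_decoded = msg.get_payload(decode=True)

# #4, charset-decoded text, using the charset the part declares.
text = transfer_decoded.decode(msg.get_content_charset())
```

The last step is the only one that can fail because of an unknown or
wrong charset, which is one reason to keep #3 and #4 as separate terms.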

Any other terms of art we should record?

--David

[1] http://wiki.python.org/moin/Email%20SIG/UseCases
[2] http://wiki.python.org/moin/Email%20SIG/Glossary

From v+python at g.nevcal.com  Sun Oct 11 00:58:38 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Sat, 10 Oct 2009 15:58:38 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <Pine.LNX.4.64.0910101708490.18193@kimball.webabinitio.net>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
	<4ACE6A1B.7060702@g.nevcal.com>
	<3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org>
	<4ACF880D.5080305@g.nevcal.com>
	<Pine.LNX.4.64.0910101708490.18193@kimball.webabinitio.net>
Message-ID: <4AD1119E.60409@g.nevcal.com>

On approximately 10/10/2009 2:20 PM, came the following characters from 
the keyboard of R. David Murray:
> On Fri, 9 Oct 2009 at 11:59, Glenn Linderman wrote:
>> On approximately 10/9/2009 5:05 AM, came the following characters 
>> from the keyboard of Barry Warsaw:
>>>  On Oct 8, 2009, at 6:39 PM, Glenn Linderman wrote:
>>> >  1) wire format.  Either what came in, in the parser case, or what
>>> >  would be generated.
>>> >  2) internal headers from the MIME part
>>> >  3) decoded BLOB.  This means that quopri and base64 are decoded,
>>> >  no more and no less.  This is bytes.  No headers, only payload.
>>> >  For Content-Transfer-Encoding: binary, this is mostly a noop.
>>> >  4) text/* parts should also be obtainable as str()/unicode(),
>>> >  payload only.  This is where charset decoding is done.
>>> >
>>> >  I think your talk in the next paragraph about hooks and other
>>> >  object types being produced is a generalization of 4, not 3, and
>>> >  generally no additional decoding needs to be done, just
>>> >  conversion to the right object type (or file, or file-like object).
>>>  I mostly agree with that.  I've always called #4 the "decoded
>>>  payload" and #3 I've usually called the "raw payload".  Maybe we can
>>>  bikeshed on better terms to help inform us about the API's
>>>  method/attribute names.
>>
>> It would be good though to have standardized terms for easier 
>> communication. Maybe as they are chosen, they could be added to that 
>> Wiki RDM set up?
>
> I didn't set it up, Barry did.  I just started adding stuff ;)

OK.  I seem to have an account there, so made some edits.

>> My only problem with "raw" and "decoded" payload, is that there are 3 
>> payload formats, not 2, so there needs to be a 3rd term, 
>> corresponding to #1, #3, and #4, above.  #2 is somewhat orthogonal 
>> from the payload.
>>
>> To me, "raw" conjures up #1, not #3.
>
> I think I understand why Barry uses it for #3: it's the 'raw data' that
> went in to get transfer-encoded in the first place.  But clearly the
> term is ambiguous.

I found it so.

> I have set up two more documents on the wiki.  One is UseCases[1], and
> I've tried to copy into it all of the use cases that have been
> mentioned in this discussion, plus a few more.  Edits welcome.

I hadn't seen UTF-16/-32/-BE/-LE mentioned in this discussion, but the 
MIME RFCs do mention use cases that require them, so I added it to 
RFC822 handling, but it might be better in HTTP handling?  Or maybe 
elsewhere?

> The other is a Glossary[2].  I think most of it accurately reflects the
> consensus here, but in it I'm proposing to use the term
> 'transfer-decoded' for #3, and 'transfer-encoded' as an alternative to
> 'wire-format' just for symmetry.  Comments and suggestions welcome.

I like the distinction you made that 'wire format' is "in the wild", not 
known to be RFC compliant, and 'transfer-encoded' be the generated type, 
and compliant.  I would think that if we get data as far as 
'transfer-decoded', that we've (mostly) proven that the received 'wire 
format' is compliant, or can be made compliant. (I switched conformant 
to compliant, not finding the former at dictionary.com, and not liking 
conformable which I found there, as it seems to imply able to be changed 
to conform, in my head, although not in the definition).

> Any other terms of art we should record?
>
> --David
>
> [1] http://wiki.python.org/moin/Email%20SIG/UseCases
> [2] http://wiki.python.org/moin/Email%20SIG/Glossary
>


-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From stephen at xemacs.org  Sun Oct 11 02:47:39 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sun, 11 Oct 2009 09:47:39 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4AD0EC72.6040704@g.nevcal.com>
References: <PC19220091008162023032819903bdf@msapiro>
	<4ACEABFD.6010309@g.nevcal.com> <4ACEB234.9030309@is.kochi-u.ac.jp>
	<4ACED8C4.5070906@g.nevcal.com> <4ACEF66B.3000500@is.kochi-u.ac.jp>
	<4ACFA08F.9080307@g.nevcal.com> <4ACFB456.6010106@is.kochi-u.ac.jp>
	<4ACFDF86.8040104@g.nevcal.com>
	<87fx9rl0jh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4AD0EC72.6040704@g.nevcal.com>
Message-ID: <878wfilpsk.fsf@uwakimon.sk.tsukuba.ac.jp>

Glenn Linderman writes:
 > On approximately 10/10/2009 8:40 AM, came the following characters from 
 > the keyboard of Stephen J. Turnbull:

 > > So why are we discussing this?  We don't even know what our mainline
 > > APIs are going to look like, why are we discussing forcibly operating
 > > on broken input?
 > 
 > Use case generation.  If the only way to access header values is to 
 > successfully, fully, decode them, then some uses may be rendered 
 > impossible, or at least difficult, even by choice of APIs.

Since invertibility is a requirement, "successfully fully decoding" a
header field is not a prerequisite to accessing it.
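
As a small illustration with today's Python 3 API (an assumption
relative to this thread): email.header.decode_header only undoes the
RFC 2047 transfer encoding and reports the declared charset, so the
bytes stay reachable even when the charset is one Python cannot decode.
The to_text helper and the charset name are made up for the example.

```python
from email.header import decode_header

# An encoded-word whose charset label Python has no codec for.
raw_value = "=?x-unknown-charset?B?aGVsbG8=?="

# Pairs of (bytes, charset); no charset decoding has been attempted.
chunks = decode_header(raw_value)


def to_text(chunk, charset):
    """Return (value, decoded_ok); fall back to the raw bytes."""
    try:
        return chunk.decode(charset or "ascii"), True
    except (LookupError, UnicodeDecodeError):
        return chunk, False


value, decoded_ok = to_text(*chunks[0])
```

Here the client still gets the bytes back along with a flag saying that
charset decoding failed, and the original header value is untouched.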

The question of "what should we do about broken mail" at this point
has three components:

(1) To what level do we (ie, the email module) promise to parse
    conforming wire format into useful objects?

(2) For nonconforming input, when is it OK to raise an error and
    return to the calling client rather than handle it ourselves?

(3) What is the API for accessing and/or mutating unparsed data, and
    requesting a reparse?

I don't think we should go any farther than that.

 > > "Re" is a Latin abbreviation; there is no appropriate translation. ;-)
 > >   
 > 
 > Nonetheless, I have seen both Re: and Fwd: translated to other languages 
 > (besides Latin or geek) :)

Sure.  This is an aspect of question (1): is this the responsibility
of the email module?

 > > Maybe they are, but the email module doesn't know or care about what
 > > they do.  Let's stick within what the email module is supposed to
 > > handle
 > 
 > Yep, this is just use case exploration.

But since by definition this is broken input, discussing what
applications are going to want to do with it is inappropriate, IMO.
We don't care if the app is going to prefix, suffix, or crucifix it.
We need to specify

(a) what object will hold the raw data we couldn't handle
(b) how a calling client can retrieve the raw data
(c) how the client can replace (or more generally mutate) that data
(d) how the client can request a reparse from us if it attempted to
    repair the breakage at a low level rather than parse it
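
Purely as a strawman for (a)-(d), and not a proposal for actual names:
every class and method below is hypothetical, and the "parser" is a
trivial stand-in so that the sketch runs on its own.

```python
class UnparsedField:
    """(a) Hypothetical holder for header bytes we could not parse."""

    def __init__(self, raw):
        self._raw = bytes(raw)
        self.defects = []

    def get_raw(self):
        """(b) Let the calling client retrieve the raw data."""
        return self._raw

    def set_raw(self, raw):
        """(c) Let the client replace the raw data after a repair."""
        self._raw = bytes(raw)

    def reparse(self):
        """(d) Try again; return (name, value), or None plus a defect."""
        try:
            text = self._raw.decode("ascii")
        except UnicodeDecodeError as exc:
            self.defects.append(str(exc))
            return None
        name, sep, value = text.partition(":")
        if not sep:
            self.defects.append("missing colon")
            return None
        return name.strip(), value.strip()


field = UnparsedField(b"Subject: caf\xe9")
first = field.reparse()                      # fails, records a defect
# The repair is done by the client with raw Python, not by us.
field.set_raw(field.get_raw().replace(b"\xe9", b"e"))
second = field.reparse()
```

The repair step uses plain bytes methods on the client side; the holder
only stores, hands out, accepts, and reparses the data.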

Manipulations of text or bytes are in principle not the responsibility
of the email module IMO; that will be done *by* the client *using* raw
Python, not methods provided by email.  I don't see how discussion of
*what* manipulations can be done with one hand up our nose is anything
but useless bikeshedding.

If we decide that the email module can usefully provide sufficiently
general facilities that would be convenient and hard to implement by
general client programmers (eg, the Mailman Developers collective
wisdom about foreign equivalents for "re" and "fwd" is surely greater
than that of the average American programmer), we will do it by
calling low-level methods to get and put the data, and raw Python to
manipulate it as text or bytes.

From stephen at xemacs.org  Sun Oct 11 05:23:36 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sun, 11 Oct 2009 12:23:36 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4AD0E82A.5000603@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACCD10D.4070308@g.nevcal.com>
	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACE6CBD.2030805@g.nevcal.com>
	<87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACED79F.6050602@g.nevcal.com>
	<87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACF9C6B.4020508@g.nevcal.com>
	<87iqenl594.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4AD0E82A.5000603@g.nevcal.com>
Message-ID: <877hv2likn.fsf@uwakimon.sk.tsukuba.ac.jp>

Glenn Linderman writes:
 > On approximately 10/10/2009 6:59 AM, came the following characters from 
 > the keyboard of Stephen J. Turnbull:
 > > Glenn Linderman writes:
 > >  > On approximately 10/9/2009 8:10 AM, came the following characters from 
 > >  > the keyboard of Stephen J. Turnbull:

 > > correctly decoded data normally is stored, and is accessible in the
 > > same way.  But I gather that's not what you were talking about, my
 > > mistake.
 > 
 > Well, the client tells us where to store it, and we can't prevent it 
 > from being the same place.

Huh?  No way!  We decide where our data is stored.  This isn't C where
you pass around arbitrary pointers for efficiency.  In particular,
strings (whether Unicode or bytes) are not mutable.  So the client can
keep a copy if it likes, but once it hands us raw message text as
bytes, after that we decide where we put parsed pieces and/or slices
of the unparsed original.

 > > So when you wrote about saving and converting to text form, without
 > > mentioning the specific APIs, I assumed you meant the "mainline"
 > > APIs for parsing and accessing parts of a correctly formatted message.
 > 
 > Mostly, I hadn't bothered about APIs yet;

You may not bother about APIs, but it sure looks like you do to me.
You can't talk about where to store stuff without touching the API.

 > I think that the email package should require that some special action 
 > needs to be taken by the client to request not-quite-perfect data, 
 > either a special parameter value, or different API, etc.

That's all I need to hear, until we're ready to write specs for that
API.  (Note that a special parameter value is part of the API in a
sense, if we specify and document what it means, so I tend to use API
for that, too, not just for whole functions.)

 > But there is nothing that says that some client might not pass that
 > all the time, and ignore the defect reports.  Whether that is easy
 > to identify or not, and whether the email package wants to require
 > that the normal APIs be tried before the not-quite-perfect APIs are
 > issues for discussion.

The answers are obvious to me: yes and no.  You can identify whether a
particular API has been used with standard text search tools like M-x
occur.  (For non-Emacsers, that is an Emacs command that finds all
occurrences of a particular string in the buffer.)  If a program wants
to call the quick & dirty APIs first, that's none of our business,
except that if parsing is being done lazily we should be careful to
update the defect list, so that the program can check them when it
wants to.
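
For reference, this is roughly how a client can check the defect list
with the email package in current Python 3 (an assumption relative to
this thread; the sample message is made up).  The parser does not
raise here, it records the problem on the message object.

```python
import email
from email import errors

# Claims to be multipart but declares no boundary parameter.
RAW = (
    b"Content-Type: multipart/mixed\r\n"
    b"\r\n"
    b"body text\r\n"
)

msg = email.message_from_bytes(RAW)

# The "quick & dirty" access still works and returns the raw body...
payload = msg.get_payload()

# ...and the defects are there for a client that chooses to look.
defect_types = [type(defect) for defect in msg.defects]
has_problem = errors.NoBoundaryInMultipartDefect in defect_types
```

A client that never reads msg.defects gets the payload anyway, which is
exactly the situation where searching the code for the API in use helps
a reviewer.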

 > Ultimately, the email package cannot enforce that proper care is taken 
 > by the client; only code reviews of the client can encourage that.

My point is not to enforce anything, not even code reviews.  But by
having separate APIs for parsed and unparsed data, code review can be
made easier and more accurate.

 > Yes, agreed.  And a special way or ways to get various algorithms for 
 > attempting to interpret not-quite-perfect data, when the client thinks 
 > that might be useful.

I don't think we should be talking about special ways (plural) or
"not-quite-perfect" data.  At this point in the design process, we
have *parsed* and *unparsed* data.  Heuristic algorithms for
recovering from unparsable input can be layered on top of these two
sets of APIs, when we have *real* use cases for them.  For example, I
don't think your use case of prepending a mailing list's topic or
serial number to an unparseable subject is realistic; in all lists I
know of such a message would be held for moderation, or even discarded
outright as spam.

And again:

 > Right.  And it is the more detailed structure that I was referring to... 

But why?  There is no need to discuss it at this point, and bringing
it up is confusing as all get-out.

 > How a particular email server interprets the "stuff before the @" is 
 > pretty much up to it... so as long as it does something appropriate, it 
 > can interpret all or a fraction of it as a mailbox name, or it could 
 > intuit a mailbox name from the body content if it wants, or even from a 
 > special header.  So yeah, particular interpretations of the address are 
 > non-RFC stuff.

Right.  To riff on the RFC vs. not theme ["Barry, pick up the bass
line, need more bottom here!"], I think we should pick a list of RFCs
we "promise" to implement as "defining" email; if we reserve any
structures as "too obscure for us to parse," we should say so (and
reference chapter and verse of the Holy RFC).  On the other hand, of
course as we discover common use cases for which precise
specifications can be given, we should be flexible and implement them.
But there should be no rush.

Which RFCs?

First of all, the STD 11 series (RFCs 733, 822, 2822, 5322).  Here we
have to worry about the standard's recommended format vs. the obsolete
format because of the Postel principle.  AFAIK, there is no reason not
to insist on *producing* strictly RFC 5322 conformant messages, but I
think we should implement both strict and lax parsers.  The lax parser
is for "daily use", the strict parser for validation.
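
One way the strict/lax split can look in practice, sketched with the
policy objects that later Python 3 versions provide (email.policy did
not exist when this was written, so treat it as an illustration rather
than the design under discussion).  The sample message is made up.

```python
import email
from email import errors, policy

# Claims to be multipart but declares no boundary parameter.
RAW = (
    b"Content-Type: multipart/mixed\r\n"
    b"\r\n"
    b"body text\r\n"
)

# Lax, for daily use: defects are recorded on the message object.
lax = email.message_from_bytes(RAW, policy=policy.default)
lax_defects = list(lax.defects)

# Strict, for validation: the first defect is raised as an exception.
try:
    email.message_from_bytes(RAW, policy=policy.strict)
    strict_error = None
except errors.MessageDefect as exc:
    strict_error = exc
```

Both calls run the same parser; the only difference is whether a defect
is stored or raised.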

Second, the basic MIME structure RFCs: 2045-2049, 2231.  (Some of
these have been at least partially superseded by now, I think.)

The mailing list header RFCs: 2369 and 2919.

Not RFCs, per se, but an auxiliary module should provide the
registered IANA data for the above RFCs.

Strictly speaking outside of the email module, but we make use of URLs
(RFC 3986 -- superseded?) and mimetypes data (this overlaps
substantially with the "registered IANA data").  We need to coordinate
with the responsible maintainers for those.

Ditto coordinating with modules that we share a lot of structure with,
the "not email but very similar" like HTTP (RFC 2616), and netnews
(NNTP = RFC 3977 and RFC 1036).

Which extensions?

Er, don't you think the above is enough for now?<wink>

 > Just to point out that good data can be obtained from bad email 
 > messages, I think, and that that is a use case.

But we already know that, and the basic idea of how to treat bad data
(send it to a locked room without any supper).  No need to rehash
that, AFAICS from your use case.

 > The only context for parsing MIME parts that I'm aware of is that when 
 > determining the end of a nested MIME part,

Indeed, but this is Postel principle stuff, not about parsing correct
syntax.  First we need to decide what to do with correct syntax, then
come up with belt and suspenders algorithms for broken mail.

 > The only other cross-part context that I am aware of is Content-ID 
 > references.  That doesn't affect parsing, but rather semantic 
 > interpretation, after parsing, validation, and decoding is complete.

I wasn't thinking of those, but that's a good point.  Those will need
to be kept in a mapping at a higher level of the representation,
probably top-level, I guess.
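
A minimal sketch of such a top-level mapping, assuming the existing
Message.walk() API; the helper name content_id_index and the sample
message are invented for illustration.  It is built after parsing and
touches only the semantic layer, not the parser.

```python
from email import message_from_string

RAW = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/related; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    '<img src="cid:logo@example.com">\n'
    "--XYZ\n"
    "Content-Type: image/png\n"
    "Content-ID: <logo@example.com>\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "iVBORw0KGgo=\n"
    "--XYZ--\n"
)

def content_id_index(msg):
    """Map each Content-ID (angle brackets stripped) to the part carrying it."""
    index = {}
    for part in msg.walk():
        cid = part.get("Content-ID")
        if cid:
            index[cid.strip().strip("<>")] = part
    return index

msg = message_from_string(RAW)
cid_index = content_id_index(msg)
print(sorted(cid_index))
```

A client resolving a cid: URL in the HTML part would then look the
identifier up in this mapping rather than re-walking the tree.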


From turnbull at sk.tsukuba.ac.jp  Sun Oct 11 05:52:54 2009
From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull)
Date: Sun, 11 Oct 2009 12:52:54 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <Pine.LNX.4.64.0910101708490.18193@kimball.webabinitio.net>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
	<4ACE6A1B.7060702@g.nevcal.com>
	<3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org>
	<4ACF880D.5080305@g.nevcal.com>
	<Pine.LNX.4.64.0910101708490.18193@kimball.webabinitio.net>
Message-ID: <8763amlh7t.fsf@uwakimon.sk.tsukuba.ac.jp>

R. David Murray writes:
 > I have set up two more documents on the wiki.  One is UseCases[1], [...].
 > The other is a Glossary[2].

Thank you, very much!

 > I think most of it accurately reflects the consensus here, but in
 > it I'm proposing to use the term 'transfer-decoded' for #3, and
 > 'transfer-encoded' as an alternative to 'wire-format' just for
 > symmetry.  Comments and suggestions welcome.

'Wire-format' means "you can cat it to the wire", ie, RFC-conforming
(in fact, it's the only meaning in the RFCs by definition), and for
email itself it's always bytes AFAIK (Mama don' 'low no XML roun'
here, Lord, Lord!).  That's not true of all our applications, though,
especially stuff like doctests.  There are also some RFCs we use such
as BASE64 (specifically relevant to transfer encodings) that are
defined in terms of characters, not bytes, so 'transfer-encoded' is
slightly different from 'wire-format'.
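
To make the distinction concrete, a small sketch using the standard
base64 module (the payload is arbitrary).  The same transfer-encoded
content can be held as ASCII bytes, ready for the wire, or as text, as
it would appear in a doctest; both decode to the identical payload.

```python
import base64

payload = bytes([0, 255, 16, 32])        # arbitrary binary content

# base64 is defined over a character alphabet; Python hands it back as ASCII bytes.
encoded_bytes = base64.b64encode(payload)
# The same transfer-encoded content as text, e.g. for a doctest or a str-based API.
encoded_text = encoded_bytes.decode("ascii")

# Both forms decode back to the identical payload.
decoded_from_bytes = base64.b64decode(encoded_bytes)
decoded_from_text = base64.b64decode(encoded_text)
print(type(encoded_bytes).__name__, type(encoded_text).__name__)
```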

I think in general that kind of comment should be applied directly to
the Glossary, but what deserves general discussion is "how pedantic do
we want to be?"  I think the distinction made here between 'wire-format'
and 'transfer-encoded' is useful *to us*, and in general lean toward
"high pedantry" (cf how much smoke and how little fire Glenn and I are
generating!)  WDOT?

From stephen at xemacs.org  Sun Oct 11 06:01:50 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sun, 11 Oct 2009 13:01:50 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4AD1119E.60409@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
	<4ACE6A1B.7060702@g.nevcal.com>
	<3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org>
	<4ACF880D.5080305@g.nevcal.com>
	<Pine.LNX.4.64.0910101708490.18193@kimball.webabinitio.net>
	<4AD1119E.60409@g.nevcal.com>
Message-ID: <874oq6lgsx.fsf@uwakimon.sk.tsukuba.ac.jp>

Glenn Linderman writes:
 > (I switched conformant to compliant,

Conformant is in common use.  You might be more comfortable with
conforming.

Richard Stallman points out that you comply with the law, but you
conform to a standard.  I think it's useful to make that semantic
distinction, cf. RFC 2119 MUST vs. SHOULD or MAY.

From v+python at g.nevcal.com  Sun Oct 11 06:37:48 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Sat, 10 Oct 2009 21:37:48 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <874oq6lgsx.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>	<4ACB0DC9.7080307@g.nevcal.com>	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACB971D.9080706@g.nevcal.com>	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACC0277.2060807@g.nevcal.com>	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>	<4ACD94E5.5020808@g.nevcal.com>	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>	<4ACE6A1B.7060702@g.nevcal.com>	<3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org>	<4ACF880D.5080305@g.nevcal.com>	<Pine.LNX.4.64.0910101708490.18193@kimball.webabinitio.net>	<4AD1119E.60409@g.nevcal.com>
	<874oq6lgsx.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <4AD1611C.6030406@g.nevcal.com>

On approximately 10/10/2009 9:01 PM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> Glenn Linderman writes:
>  > (I switched conformant to compliant,
>
> Conformant is in common use.  You might be more comfortable with
> conforming.
>
> Richard Stallman points out that you comply with the law, but you
> conform to a standard.  I think it's useful to make that semantic
> distinction, cf. RFC 2119 MUST vs. SHOULD or MAY.
>   

conformant is not in the dictionaries I've consulted.  Conforming is 
mostly a verb, not an adjective.

Richard Stallman is a great programmer, but conformable and compliant 
are synonyms.  I don't like the word conformable, but if you appreciate 
his distinction, then we should use the word conformable even though I 
don't like it.  But we shouldn't use the letter sequence conformant, 
because although I know what you mean by it, it appears not to be a 
word, and English is hard enough for ESL folks when they can find the 
words in the dictionary.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From rdmurray at bitdance.com  Sun Oct 11 07:12:27 2009
From: rdmurray at bitdance.com (R. David Murray)
Date: Sun, 11 Oct 2009 01:12:27 -0400 (EDT)
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <874oq6lgsx.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
	<4ACE6A1B.7060702@g.nevcal.com>
	<3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org>
	<4ACF880D.5080305@g.nevcal.com>
	<Pine.LNX.4.64.0910101708490.18193@kimball.webabinitio.net>
	<4AD1119E.60409@g.nevcal.com>
	<874oq6lgsx.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <Pine.LNX.4.64.0910110109551.18193@kimball.webabinitio.net>

On Sun, 11 Oct 2009 at 13:01, Stephen J. Turnbull wrote:
> Glenn Linderman writes:
> > (I switched conformant to compliant,
>
> Conformant is in common use.  You might be more comfortable with
> conforming.
>
> Richard Stallman points out that you comply with the law, but you
> conform to a standard.  I think it's useful to make that semantic
> distinction, cf. RFC 2119 MUST vs. SHOULD or MAY.

Indeed.  My regular dictionary doesn't have it, but WordWeb does:

http://www.wordwebonline.com/en/CONFORMANT

Seems to be a 'term of art' in computing rather than a regular
English word, and the most appropriate word in the context in
which I used it.  But perhaps it should be added to the Glossary
itself :)

--David

From v+python at g.nevcal.com  Sun Oct 11 07:15:49 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Sat, 10 Oct 2009 22:15:49 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <877hv2likn.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>	<4ACB0DC9.7080307@g.nevcal.com>	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACB971D.9080706@g.nevcal.com>	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACC0277.2060807@g.nevcal.com>	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACCD10D.4070308@g.nevcal.com>	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACE6CBD.2030805@g.nevcal.com>	<87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACED79F.6050602@g.nevcal.com>	<87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp>	<4ACF9C6B.4020508@g.nevcal.com>	<87iqenl594.fsf@uwakimon.sk.tsukuba.ac.jp>	<4AD0E82A.5000603@g.nevcal.com>
	<877hv2likn.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <4AD16A05.8020302@g.nevcal.com>

On approximately 10/10/2009 8:23 PM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> Glenn Linderman writes:
>  > On approximately 10/10/2009 6:59 AM, came the following characters from 
>  > the keyboard of Stephen J. Turnbull:
>  > > Glenn Linderman writes:
>  > >  > On approximately 10/9/2009 8:10 AM, came the following characters from 
>  > >  > the keyboard of Stephen J. Turnbull:
>
>  > > correctly decoded data normally is stored, and is accessible in the
>  > > same way.  But I gather that's not what you were talking about, my
>  > > mistake.
>  > 
>  > Well, the client tells us where to store it, and we can't prevent it 
>  > from being the same place.
>
> Huh?  No way!  We decide where our data is stored.  This isn't C where
> you pass around arbitrary pointers for efficiency.  In particular,
> strings (whether Unicode or bytes) are not mutable.  So the client can
> keep a copy if it likes, but once it hands us raw message text as
> bytes, after that we decide where we put parsed pieces and/or slices
> of the unparsed original.
>   

Yes, the email package can figure out where to store its copy, and the 
client figures out where to store its copy.

We're getting better at communicating, but not 100% there yet :)  I was 
thinking of the case where the client asks the email package for data, 
and stores it in its variable; you seem to be thinking of the case where 
the client gives the email package data.

>  > > So when you wrote about saving and converting to text form, without
>  > > mentioning the specific APIs, I assumed you meant the "mainline"
>  > > APIs for parsing and accessing parts of a correctly formatted message.
>  > 
>  > Mostly, I hadn't bothered about APIs yet;
>
> You may not bother about APIs, but it sure looks like you do to me.
> You can't talk about where to store stuff without touching the API.
>   

Well, I'm sure there will be APIs; the names and parameters are what I 
haven't bothered about yet, much, except where the discussion seemed to 
require it.

>  > I think that the email package should require that some special action 
>  > needs to be taken by the client to request not-quite-perfect data, 
>  > either a special parameter value, or different API, etc.
>
> That's all I need to hear, until we're ready to write specs for that
> API.  (Note that a special parameter value is part of the API in a
> sense, if we specify and document what it means, so I tend to use API
> for that, too, not just for whole functions.)
>   

Yes, I was just trying to be clear that it could be either case.

>  > But there is nothing that says that some client might not pass that
>  > all the time, and ignore the defect reports.  Whether that is easy
>  > to identify or not, and whether the email package wants to require
>  > that the normal APIs be tried before the not-quite-perfect APIs are
>  > issues for discussion.
>
> The answers are obvious to me: yes and no.  You can identify whether a
> particular API has been used with standard text search tools like M-x
> occur.  (For non-Emacsers, that is an Emacs command that finds all
> occurrences of a particular string in the buffer.)  If a program wants
> to call the quick & dirty APIs first, that's none of our business,
> except that if parsing is being done lazily we should be careful to
> update the defect list, so that the program can check them when it
> wants to.
>
>  > Ultimately, the email package cannot enforce that proper care is taken 
>  > by the client; only code reviews of the client can encourage that.
>
> My point is not to enforce anything, not even code reviews.  But by
> having separate APIs for parsed and unparsed data, code review can be
> made easier and more accurate.
>   

You have to analyze the control flow as well, not just search for 
existence of the API.  In normal code, that should be straightforward, 
but there is no guarantee that the client doesn't use spaghetti code, or 
even obfuscated code, where the analysis would be hard.  The API call 
could exist, but never be invoked; the API call could take parameters 
that never have particular values of interest at run-time.  Hence, it 
may or may not be easy to search the client code and figure it out.  But 
I agree with your stated point: we can't enforce anything about the 
client code, unless we write it ourselves, or have some sort of authority 
over it.  I intend to write a client, so I'll have control over that 
one, and don't plan to obfuscate it.


>  > Yes, agreed.  And a special way or ways to get various algorithms for 
>  > attempting to interpret not-quite-perfect data, when the client thinks 
>  > that might be useful.
>
> I don't think we should be talking about special ways (plural) or
> "not-quite-perfect" data.  At this point in the design process, we
> have *parsed* and *unparsed* data.  Heuristic algorithms for
> recovering from unparsable input can be layered on top of these two
> sets of APIs, when we have *real* use cases for them.  For example, I
> don't think your use case of prepending a mailing list's topic or
> serial number to an unparseable subject is realistic; in all lists I
> know of such a message would be held for moderation, or even discarded
> outright as spam.
>   

So if the subject is unparseable, what is the moderator to do?  He can't 
read the subject if it is unparseable.  Perhaps he can read the body, but 
it might be in the same unparseable charset.  Let's say he can read the 
body, and the message seems to be valid for the list, and he marks it to 
be forwarded to list members.  Now what is the mailing list to do?  It 
still can't parse the subject.

And if there is no moderator, it still may not be spam, just a mailing 
list manager that doesn't understand a valid charset, likely because it 
predates the definition of the charset.
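
To sketch what that situation looks like with the existing
email.header.decode_header function: the encoded-word below uses a
made-up charset label (x-future-charset), and tag_subject is a
hypothetical helper.  The bytes and the label are still recoverable
even though they cannot be turned into text, and a list tag can be
prefixed without decoding anything.

```python
from email.header import decode_header

# Hypothetical Subject using a charset label this Python installation does not know.
raw_subject = "=?x-future-charset?B?aGVsbG8=?="

def tag_subject(raw, tag):
    """Prefix a list tag without touching (or decoding) the original value."""
    return "[%s] %s" % (tag, raw)

chunks = decode_header(raw_subject)      # list of (bytes, charset-label) pairs
data, charset = chunks[0]
try:
    text = data.decode(charset)
except LookupError:
    text = None                          # charset unknown: cannot render as text

tagged = tag_subject(raw_subject, "mylist")
print(chunks, text, tagged)
```

Whether a list manager should do this at all is the open question; the
sketch only shows that the raw value survives for a client that wants
it.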

> And again:
>
>  > Right.  And it is the more detailed structure that I was referring to... 
>
> But why?  There is no need to discuss it at this point, and bringing
> it up is confusing as all get-out.
>   

The more we understand/discuss about how different clients can function, 
the better we can design the email package.  We'll still not likely 
cover all the possibilities, but we don't want to have tunnel vision and 
declare that because Mailman works this way, all mailing list 
managers work this way, or that because we haven't discussed that some 
client might do something this way, it won't.  So I have no problem 
bringing clients into the discussion, to make sure that we don't 
preclude their reasonable behaviors as use cases.


>  > How a particular email server interprets the "stuff before the @" is 
>  > pretty much up to it... so as long as it does something appropriate, it 
>  > can interpret all or a fraction of it as a mailbox name, or it could 
>  > intuit a mailbox name from the body content if it wants, or even from a 
>  > special header.  So yeah, particular interpretations of the address are 
>  > non-RFC stuff.
>
> Right.  To riff on the RFC vs. not theme ["Barry, pick up the bass
> line, need more bottom here!"], I think we should pick a list of RFCs
> we "promise" to implement as "defining" email; if we reserve any
> structures as "too obscure for us to parse," we should say so (and
> reference chapter and verse of the Holy RFC).  On the other hand, of
> course as we discover common use cases for which precise
> specifications can be given, we should be flexible and implement them.
> But there should be no rush.
>
> Which RFCs?
>
> First of all, the STD 11 series (RFCs 733, 822, 2822, 5322).  Here we
> have to worry about the standard's recommended format vs. the obsolete
> format because of the Postel principle.  AFAIK, there is no reason not
> to insist on *producing* strictly RFC 5322 conformant messages, but I
> think we should implement both strict and lax parsers.  The lax parser
> is for "daily use", the strict parser for validation.
>
> Second, the basic MIME structure RFCs: 2045-2049, 2231.  (Some of
> these have been at least partially superseded by now, I think.)
>
> The mailing list header RFCs: 2369 and 2919.
>
> Not RFCs, per se, but an auxiliary module should provide the
> registered IANA data for the above RFCs.
>
> Strictly speaking outside of the email module, but we make use of URLs
> (RFC 3986 -- superseded?) and mimetypes data (this overlaps
> substantially with the "registered IANA data").  We need to coordinate
> with the responsible maintainers for those.
>
> Ditto coordinating with modules that we share a lot of structure with,
> the "not email but very similar" like HTTP (RFC 2616), and netnews
> (NNTP = RFC 3977 and RFC 1036).
>
> Which extensions?
>
> Er, don't you think the above is enough for now?<wink>
>   

It's a good list, yes.

>  > Just to point out that good data can be obtained from bad email 
>  > messages, I think, and that that is a use case.
>
> But we already know that, and the basic idea of how to treat bad data
> (send it to a locked room without any supper).  No need to rehash
> that, AFAICS from your use case.
>   

Locked room is the first pass; unlocking it belongs to the heuristics, 
for determined clients.

The use case wasn't at http://wiki.python.org/moin/Email%20SIG/UseCases 
so I've added it there, as "Handling pathological data #2"

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From v+python at g.nevcal.com  Sun Oct 11 07:49:25 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Sat, 10 Oct 2009 22:49:25 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <878wfilpsk.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <PC19220091008162023032819903bdf@msapiro>	<4ACEABFD.6010309@g.nevcal.com>	<4ACEB234.9030309@is.kochi-u.ac.jp>	<4ACED8C4.5070906@g.nevcal.com>	<4ACEF66B.3000500@is.kochi-u.ac.jp>	<4ACFA08F.9080307@g.nevcal.com>	<4ACFB456.6010106@is.kochi-u.ac.jp>	<4ACFDF86.8040104@g.nevcal.com>	<87fx9rl0jh.fsf@uwakimon.sk.tsukuba.ac.jp>	<4AD0EC72.6040704@g.nevcal.com>
	<878wfilpsk.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <4AD171E5.40307@g.nevcal.com>

On approximately 10/10/2009 5:47 PM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> Glenn Linderman writes:
>  > On approximately 10/10/2009 8:40 AM, came the following characters from 
>  > the keyboard of Stephen J. Turnbull:
>
>  > > So why are we discussing this?  We don't even know what our mainline
>  > > APIs are going to look like, why are we discussing forcibly operating
>  > > on broken input?
>  > 
>  > Use case generation.  If the only way to access header values is to 
>  > successfully, fully, decode them, then some uses may be rendered 
>  > impossible, or at least difficult, even by choice of APIs.
>
> Since invertibility is a requirement, "successfully fully decoding" a
> header field is not a prerequisite to accessing it.
>
> The question of "what should we do about broken mail" at this point
> has three components:
>
> (1) To what level do we (ie, the email module) promise to parse
>     conforming wire format into useful objects?
>
> (2) For nonconforming input, when is it OK to raise an error and
>     return to the calling client rather than handle it ourselves?
>
> (3) What is the API for accessing and/or mutating unparsed data, and
>     requesting a reparse?
>
> I don't think we should go any farther than that.
>   

I agree with your three components; but I think the answer to (3) 
requires discussion/speculation of what clients might want to do when 
faced with errors, otherwise the API won't likely help them much, 
without reimplementing email package logic.  It is easy to design 
"sufficient", but unhelpful, APIs.  So I've been willing to discuss such 
things.  Maybe at too much length, and maybe with insufficient clarity 
that that is what I'm discussing, for which I apologize.  But I don't 
think that not discussing it helps to answer (3).
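
For the sake of having something concrete to argue about, here is one
purely hypothetical shape an answer to (3) could take.  Nothing here is
an existing email package API: RawField, replace_raw, reparse and
parse_ascii_subject are all invented names, and the "parser" is a
stand-in that only accepts ASCII.

```python
class RawField:
    """Hypothetical holder for a header field the parser could not handle."""

    def __init__(self, raw, parser):
        self.raw = raw            # original bytes, kept so output stays invertible
        self._parser = parser     # callable: bytes -> parsed value, raises ValueError
        self.value = None
        self.defects = []
        self.reparse()

    def replace_raw(self, new_raw):
        """Let the client substitute repaired bytes, then try parsing again."""
        self.raw = new_raw
        self.reparse()

    def reparse(self):
        """Re-run the parser on the current raw bytes and refresh the defect list."""
        try:
            self.value = self._parser(self.raw)
            self.defects = []
        except ValueError as exc:
            self.value = None
            self.defects = [exc]


def parse_ascii_subject(raw):
    # Stand-in parser for the sketch; UnicodeDecodeError is a ValueError subclass.
    return raw.decode("ascii")


field = RawField(b"caf\xe9 menu", parse_ascii_subject)
broken_first = field.value is None and len(field.defects) == 1
field.replace_raw(field.raw.replace(b"\xe9", b"e"))
print(broken_first, field.value)
```

The repair itself (the bytes.replace call) is plain Python done by the
client; the object only holds the raw data, reports defects, and
reparses on request.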

>  > > "Re" is a Latin abbreviation; there is no appropriate translation. ;-)
>  > >   
>  > 
>  > Nonetheless, I have seen both Re: and Fwd: translated to other languages 
>  > (besides Latin or geek) :)
>
> Sure.  This is an aspect of question (1): is this the responsibility
> of the email module?
>   

I don't think the old RFCs even discuss the use of Re: and Fwd:, nor 
whether they should be collapsed or translated, or even used at all.  
Just checked: RFC 822 had an example that showed Re:, but RFC 2822 does 
discuss it a bit, and suggests not adding duplicate Re:.  Fwd: is not 
mentioned at all, in those two RFCs.  So no, adding and collapsing 
Re:/Fwd: is not the responsibility of the email package.  But making it 
easy to do so, might be, as it is a common client operation.  Lots of 
email style guides discuss it.
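
As an example of how small the client-side operation is, here is a
sketch that collapses repeated "Re:" prefixes into a single one when
building a reply subject.  The helper name reply_subject is made up, it
handles only the literal "Re:" (not translations or "Fwd:"), and it
assumes the subject has already been decoded to text.

```python
import re

# One or more leading "Re:" markers, in any letter case, with optional spacing.
_RE_PREFIXES = re.compile(r"^(?:\s*re\s*:\s*)+", re.IGNORECASE)

def reply_subject(subject):
    """Return the subject for a reply, with exactly one leading 'Re: '."""
    stripped = _RE_PREFIXES.sub("", subject)
    return "Re: " + stripped

collapsed = reply_subject("Re: RE: re: fixing the current email module")
fresh = reply_subject("fixing the current email module")
untouched_word = reply_subject("Regarding: lunch")
print(collapsed, fresh, untouched_word, sep="\n")
```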

>  > > Maybe they are, but the email module doesn't know or care about what
>  > > they do.  Let's stick within what the email module is supposed to
>  > > handle
>  > 
>  > Yep, this is just use case exploration.
>
> But since by definition this is broken input, discussing what
> applications are going to want to do with it is inappropriate, IMO.
> We don't care if the app is going to prefix, suffix, or crucifix it.
> We need to specify
>
> (a) what object will hold the raw data we couldn't handle
> (b) how a calling client can retrieve the raw data
> (c) how the client can replace (or more generally mutate) that data
> (d) how the client can request a reparse from us if it attempted to
>     repair the breakage at a low level rather than parse it
>
> Manipulations of text or bytes are in principle not the responsibility
> of the email module IMO; that will be done *by* the client *using* raw
> Python, not methods provided by email.  I don't see how discussion of
> *what* manipulations can be done with one hand up our nose is anything
> but useless bikeshedding.
>
> If we decide that the email module can usefully provide sufficiently
> general facilities that would be convenient and hard to implement by
> general client programmers (eg, the Mailman Developers collective
> wisdom about foreign equivalents for "re" and "fwd" is surely greater
> than that of the average American programmer), we will do it by
> calling low-level methods to get and put the data, and raw Python to
> manipulate it as text or bytes

Except it may be perfectly valid input using a standard that post-dates 
the application.  Doing something reasonable with it is appropriate.  
The email RFCs go to great lengths to make new features work reasonably 
in old clients that have limited understanding; with fallback 
interpretations for unknown MIME subtypes and even MIME types, and 
ensuring that some type of reasonable interpretation might be done.  The 
RFCs define ways that new MIME types and subtypes might be defined, and 
new charsets; it seems reasonable to attempt to accommodate the 
possibility that such may actually be defined in the future.

If we don't discuss some of the possibilities, we'll never learn enough 
to "decide that the email module can usefully provide sufficiently 
general facilities that would be convenient and hard to implement by 
general client programmers" :)

To me, "hard" would mean that they would have to rewrite portions of 
logic that already exists in the email package, and then tweak it 
slightly to compensate for not-quite-perfect data, or maybe I should 
switch to saying "not-quite-perfect-or-possibly-later-standardized data" :)

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From v+python at g.nevcal.com  Sun Oct 11 07:51:50 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Sat, 10 Oct 2009 22:51:50 -0700
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <Pine.LNX.4.64.0910110109551.18193@kimball.webabinitio.net>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
	<4ACE6A1B.7060702@g.nevcal.com>
	<3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org>
	<4ACF880D.5080305@g.nevcal.com>
	<Pine.LNX.4.64.0910101708490.18193@kimball.webabinitio.net>
	<4AD1119E.60409@g.nevcal.com>
	<874oq6lgsx.fsf@uwakimon.sk.tsukuba.ac.jp>
	<Pine.LNX.4.64.0910110109551.18193@kimball.webabinitio.net>
Message-ID: <4AD17276.10205@g.nevcal.com>

On approximately 10/10/2009 10:12 PM, came the following characters from 
the keyboard of R. David Murray:
> But perhaps it should be added to the Glossary itself :) 

That would, to me, make it more acceptable for use.  Like I said, I knew 
what was meant, but tried several printed and internet dictionaries, and 
didn't find it.  Didn't try wordwebonline, as you might suppose!

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From stephen at xemacs.org  Sun Oct 11 10:25:39 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sun, 11 Oct 2009 17:25:39 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4AD171E5.40307@g.nevcal.com>
References: <PC19220091008162023032819903bdf@msapiro>
	<4ACEABFD.6010309@g.nevcal.com> <4ACEB234.9030309@is.kochi-u.ac.jp>
	<4ACED8C4.5070906@g.nevcal.com> <4ACEF66B.3000500@is.kochi-u.ac.jp>
	<4ACFA08F.9080307@g.nevcal.com> <4ACFB456.6010106@is.kochi-u.ac.jp>
	<4ACFDF86.8040104@g.nevcal.com>
	<87fx9rl0jh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4AD0EC72.6040704@g.nevcal.com>
	<878wfilpsk.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4AD171E5.40307@g.nevcal.com>
Message-ID: <87zl7yjq0s.fsf@uwakimon.sk.tsukuba.ac.jp>

Glenn Linderman writes:

 > > (3) What is the API for accessing and/or mutating unparsed data, and
 > >     requesting a reparse?
 > >
 > > I don't think we should go any farther than that.
 > 
 > I agree with your three components; but I think the answer to (3) 
 > requires discussion/speculation of what clients might want to do when 
 > faced with errors,

I could be wrong, but I don't think it does.  We don't need to implement
YAGNIs.

 > otherwise the API won't likely help them much, without
 > reimplementing email package logic.

(1) That's why I propose parsing as much as possible, but no more.
    The parts that are in the email package will not only be implemented
    and available, but they will already have been done.  What hasn't
    been done yet, the email module doesn't know how to do anyway.

(2) DRY simply doesn't apply.  The logic for dealing with erroneous
    data is not the same as dealing with conforming data.  If it were,
    we would have succeeded in the first place.

 > Except it may be perfectly valid input using a standard that post-dates 
 > the application.  Doing something reasonable with it is appropriate.  

I have no idea what you're thinking of.  If it's a standard we
implement, we'll handle it.  If it isn't, it's not our problem.

Discussing "possibilities" is out of the realm of "useful" already.
Useful is "Existing client X does Y, and Z does it too.  We can do Y
for them, faster, better, cheaper."


From stephen at xemacs.org  Sun Oct 11 10:42:07 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sun, 11 Oct 2009 17:42:07 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4AD1611C.6030406@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
	<4ACE6A1B.7060702@g.nevcal.com>
	<3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org>
	<4ACF880D.5080305@g.nevcal.com>
	<Pine.LNX.4.64.0910101708490.18193@kimball.webabinitio.net>
	<4AD1119E.60409@g.nevcal.com>
	<874oq6lgsx.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4AD1611C.6030406@g.nevcal.com>
Message-ID: <87y6nijp9c.fsf@uwakimon.sk.tsukuba.ac.jp>

Glenn Linderman writes:

 > conformant is not in the dictionaries I've consulted.

Try these (top 3 Google results for "conformant"):

conformant- WordWeb dictionary definition
    (computing) conforming to a particular specification or standard "In
    this paper we present a new approach to conformant planning". Nearest
    ...
    www.wordwebonline.com/en/CONFORMANT - Cached - Similar - 
conformant - Definition from the Merriam-Webster Online Dictionary
    conformant can be found at Merriam-WebsterUnabridged.com. Click here
    to start your free trial! Click here to search for another word in the
    Merriam-Webster ...
    www.merriam-webster.com/dictionary/conformant - Cached - Similar - 
Conformance
    The notion of TEI conformance is intended as an aid in describing the
    format and contents of a particular document or set of documents. ...
    www.tei-c.org/Guidelines/P4/html/CF.html - Cached - Similar - 

A quick look at some of the results shows that the word "conformant" is
typically used in a section called "conformance", which defines what
criteria are used to determine if an application is following the
standard or not.  OTOH, the fact that the top three results are
dictionary definitions suggests an awful lot of people are looking up
the word in dictionaries....

 > Conforming is mostly a verb, not an adjective.

Googling gives "Results 1 - 10 of about 3,680,000 for conforming
application," but "Results 1 - 10 of about 324,000 for conformant
application."  Looks like "conforming" is the preferred adjectival
form.

 > but conformable and compliant are synonyms.

When used to mean "submissive."  "Conformable" won't do.

 > English is hard enough for ESL folks when they can find the 
 > words in the dictionary.

Compliant does seem to be the winner.  "Results 1 - 10 of about
13,900,000 for compliant application."  Conformant or conforming is
better IMHO but much less popular.  Tie goes to the lusers, as usual.

From stephen at xemacs.org  Sun Oct 11 11:11:19 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sun, 11 Oct 2009 18:11:19 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4AD16A05.8020302@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACCD10D.4070308@g.nevcal.com>
	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACE6CBD.2030805@g.nevcal.com>
	<87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACED79F.6050602@g.nevcal.com>
	<87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACF9C6B.4020508@g.nevcal.com>
	<87iqenl594.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4AD0E82A.5000603@g.nevcal.com>
	<877hv2likn.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4AD16A05.8020302@g.nevcal.com>
Message-ID: <87ws32jnwo.fsf@uwakimon.sk.tsukuba.ac.jp>

Glenn Linderman writes:
 > On approximately 10/10/2009 8:23 PM, came the following characters from 
 > the keyboard of Stephen J. Turnbull:

 > > I don't think your use case of prepending a mailing list's topic or
 > > serial number to an unparseable subject is realistic; in all lists I
 > > know of such a message would be held for moderation, or even discarded
 > > outright as spam.
 > 
 > So if the subject is unparseable, what is the moderator to do?

That's her problem, not ours.  I can think of a number of things she
can do, starting with bouncing the mail back to sender with a note
that it was broken, please fix.  If the moderator is me, I might load
the mail into XEmacs and see if Gnus can grok it.  Etc.

If and when we discover there are "best practices" for this situation,
we should help automate them.  Until then, "it broke -- here are all
the pieces" is what we should say, IMO.

 > The more we understand/discuss about how different clients can function, 
 > the better we can design the email package.

Sure, but about this level of discussion ... "Although never is often
better than *right* now" applies, I think.


From barry at python.org  Mon Oct 12 22:18:32 2009
From: barry at python.org (Barry Warsaw)
Date: Mon, 12 Oct 2009 16:18:32 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <Pine.LNX.4.64.0910091909000.18193@kimball.webabinitio.net>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACCD10D.4070308@g.nevcal.com>
	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACE6CBD.2030805@g.nevcal.com>
	<87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACED79F.6050602@g.nevcal.com>
	<87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACF9C6B.4020508@g.nevcal.com>
	<Pine.LNX.4.64.0910091909000.18193@kimball.webabinitio.net>
Message-ID: <CD39A272-BBDE-4CEF-8E9C-3873BE4BAFA2@python.org>

On Oct 9, 2009, at 7:20 PM, R. David Murray wrote:

> IMO, the appropriate way for the email package to provide the API you
> are talking about is to provide the client with a way to get at the
> raw byte string, which I think everyone agrees on.  If the client wants
> to decode it as if it were latin-1 to process it, it can then do that.

I agree.  I'm running out of time to participate in this lengthy  
thread, but I just wanted to say that of the 3 accessors (raw,  
transport-decoded, fully-decoded) I'm not sure transport-decoded is  
all that interesting.  I wouldn't support it directly in the API.  I  
think the library's clients are mostly going to be interested in raw
or fully-decoded values, and there will be plenty of library utilities  
to get from raw to transport-decoded if they really want it.
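
To make the three forms concrete, here is a minimal sketch in plain
Python.  It assumes a hypothetical part whose text is UTF-8 carried as
base64; transfer_decode and fully_decode are illustrative names, not
functions of the email package.

```python
import base64

# Hypothetical part body: UTF-8 text carried with a base64 transfer encoding.
raw = base64.b64encode("café".encode("utf-8"))  # raw: bytes as stored in the message


def transfer_decode(raw_bytes):
    """Undo the content-transfer-encoding only; the result is still bytes."""
    return base64.b64decode(raw_bytes)


def fully_decode(raw_bytes, charset="utf-8"):
    """Undo the transfer encoding, then the charset; the result is str."""
    return transfer_decode(raw_bytes).decode(charset)
```

A client that wants to do its own charset handling stops at
transfer_decode(raw); most clients would either keep raw untouched or
call fully_decode(raw).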

-Barry

From barry at python.org  Mon Oct 12 22:19:34 2009
From: barry at python.org (Barry Warsaw)
Date: Mon, 12 Oct 2009 16:19:34 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACFDB3C.5040307@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACCD10D.4070308@g.nevcal.com>
	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACE6CBD.2030805@g.nevcal.com>
	<87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACED79F.6050602@g.nevcal.com>
	<87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACF9C6B.4020508@g.nevcal.com>
	<Pine.LNX.4.64.0910091909000.18193@kimball.webabinitio.net>
	<4ACFDB3C.5040307@g.nevcal.com>
Message-ID: <1D493DD5-7EE0-486B-BD8D-24FC5BB0B7A0@python.org>

On Oct 9, 2009, at 8:54 PM, Glenn Linderman wrote:

> That certainly works, but it isn't very helpful... that forces the  
> client application to reproduce the logic to parse the header value  
> and decode the parts that can be decoded successfully, and that is  
> exactly the sort of thing Stephen was complaining about when he  
> thought I was suggesting that to be a requirement (but he was  
> confused about what I was suggesting).

There are/will be utilities in the email package to make this easy.  I
don't think there's a ton of benefit to be had by supporting
transport-decoded directly in the Message or Header API.

-Barry

From barry at python.org  Mon Oct 12 22:30:28 2009
From: barry at python.org (Barry Warsaw)
Date: Mon, 12 Oct 2009 16:30:28 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87iqenl594.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACCD10D.4070308@g.nevcal.com>
	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACE6CBD.2030805@g.nevcal.com>
	<87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACED79F.6050602@g.nevcal.com>
	<87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACF9C6B.4020508@g.nevcal.com>
	<87iqenl594.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <2F987CAA-8FC6-406E-A825-E12B97659A60@python.org>

On Oct 10, 2009, at 9:59 AM, Stephen J. Turnbull wrote:

> Both.  I *believe* (but it needs to be checked) that in a correctly
> formed multipart MIME object (message or part), any internal structure
> is context-free within the MIME boundaries.  If that is so, then
> individual parts of the object can be stored in raw form and parsed
> lazily.

I too /think/ that's correct.  There are some MIME content-types that
cause parts to be related (e.g. multipart/alternative and
multipart/related), but those are all operating at a higher level.

In practice it probably makes sense to parse all the headers right  
away.  Content-Type has the most bearing on parsing the rest of the  
stuff, so by that time you already need to parse parameters to e.g.  
get the boundary.  Early on I claimed that headers were so manageable  
in practice that we could implement an ordered-dictionary with  
duplicates as a simple list, with linear searching and nobody would  
notice.  I think nobody has noticed ;).

Lazy parsing of the body does make sense.  You only need to parse  
enough to find end boundaries, or recurse into parsing an embedded  
part.  This is how the parser currently works anyway.
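
The "simple list with linear searching" idea can be sketched roughly as
follows.  HeaderList and its methods are hypothetical names chosen for
illustration; this is not the storage class the package actually uses.

```python
class HeaderList:
    """Ordered header store that allows duplicates (illustrative only)."""

    def __init__(self):
        self._items = []  # list of (name, value) pairs, in arrival order

    def add(self, name, value):
        self._items.append((name, value))

    def get(self, name, default=None):
        """Return the first value for name, matching case-insensitively."""
        wanted = name.lower()
        for key, value in self._items:  # linear scan over all headers
            if key.lower() == wanted:
                return value
        return default

    def get_all(self, name):
        """Return every value for name, preserving order."""
        wanted = name.lower()
        return [value for key, value in self._items if key.lower() == wanted]


headers = HeaderList()
headers.add("Received", "from a.example by b.example")
headers.add("Received", "from b.example by c.example")
headers.add("Subject", "hello")
```

With the few dozen headers a typical message carries, the scan in get()
is unlikely to show up in a profile.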

-Barry

From barry at python.org  Mon Oct 12 22:41:48 2009
From: barry at python.org (Barry Warsaw)
Date: Mon, 12 Oct 2009 16:41:48 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <4ACF880D.5080305@g.nevcal.com>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
	<4ACE6A1B.7060702@g.nevcal.com>
	<3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org>
	<4ACF880D.5080305@g.nevcal.com>
Message-ID: <87C79F21-2E8D-460E-992D-AE6050F0C394@python.org>

On Oct 9, 2009, at 2:59 PM, Glenn Linderman wrote:

> It would be good though to have standardized terms for easier  
> communication.  Maybe as they are chosen, they could be added to  
> that Wiki RDM set up?

I like raw, transfer-decoded, decoded (or maybe fully-decoded).  As  
I've mentioned before, I don't think the Message or Header APIs need  
to directly support transfer-decoded.

> Separate APIs would be clearer, but for compatibility,  
> should .get_payload() be retained, with the flag?

No.  It was a mistake that should be taken out back and shot.

I would propose a radical suggestion: we treat backward compatibility  
the way Python 3 did.  Nice to keep, but we can throw it over the side  
in order to fix the warts.  We'll worry about migration strategy later.

Aside: I would really like to have a much more @property based API  
where appropriate.  E.g. Message.get_content_type() would be  
Message.content_type.  And in this case we'd probably have  
message.payload_bytes or some such.  Decoding may require additional  
parameters so it will probably be a method.
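
A rough sketch of how that might look, written as a wrapper around the
existing Message class.  MessageView and decode_payload are hypothetical
names; content_type and payload_bytes are the names floated above, and
none of this is an agreed design.

```python
from email import message_from_bytes


class MessageView:
    """Hypothetical property-based wrapper around an existing Message."""

    def __init__(self, msg):
        self._msg = msg

    @property
    def content_type(self):
        return self._msg.get_content_type()

    @property
    def payload_bytes(self):
        # Transfer-decoded bytes of a non-multipart message.
        return self._msg.get_payload(decode=True)

    def decode_payload(self, charset=None, errors="strict"):
        # A method rather than a property, because it takes parameters.
        codec = charset or self._msg.get_content_charset() or "us-ascii"
        return self.payload_bytes.decode(codec, errors)


view = MessageView(message_from_bytes(
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: base64\n"
    b"\n"
    b"aGVsbG8=\n"
))
```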

> Sure, a registration system is fine.   It could work for any type  
> that has a method that can be registered, that accepts a binary BLOB  
> and returns an appropriate typed and functioning object that can  
> manipulate that type.  That would mean that the application would  
> have to make all the registration calls up front, instead of making  
> the API calls when the objects are retrieved.  Basically, if the  
> email package doesn't have a registration system that the  
> application can use, the application has to invent its own, so this  
> is work that could benefit all applications.

I'm sure there will be lots of default content-types registered, and  
there ought to be a "default" or fallback converter that can be  
overridden.  It should also be possible for third party extensions to  
add additional converters.  Models for this would be timezone  
additions for datetime, and codecs.
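
One possible shape for such a registry, as a sketch only.  The function
names and the two registered converters are assumptions made for the
example; the fallback simply hands the bytes back unchanged.

```python
import json

_converters = {}  # content type -> callable taking bytes, returning an object


def register_converter(content_type, func):
    """Register (or override) the converter for one content type."""
    _converters[content_type.lower()] = func


def fallback_converter(data):
    """Default for unregistered types: return the bytes unchanged."""
    return data


def convert(content_type, data):
    func = _converters.get(content_type.lower(), fallback_converter)
    return func(data)


# Defaults the package might ship; third parties would add their own.
register_converter("text/plain", lambda data: data.decode("utf-8"))
register_converter("application/json",
                   lambda data: json.loads(data.decode("utf-8")))
```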

> Actually, although it is not common practice to have encodings other  
> than the RFC defined base64 and quoted-printable, a registration  
> system for converting from #1 to #3, with appropriate defaults for  
> base64, quoted-printable, binary, 7bit, 8bit, would be appropriate,

That makes sense.
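
As a sketch, the same registration idea applied to transfer encodings
could be a plain mapping with the five standard names filled in.
transfer_decoders and transfer_decode are hypothetical names; the
base64 and quopri calls are the standard library ones.

```python
import base64
import quopri


def _identity(data):
    return data


# Content-Transfer-Encoding name -> callable taking and returning bytes.
transfer_decoders = {
    "base64": base64.b64decode,
    "quoted-printable": quopri.decodestring,
    "7bit": _identity,
    "8bit": _identity,
    "binary": _identity,
}


def transfer_decode(data, cte="7bit"):
    try:
        decoder = transfer_decoders[cte.lower()]
    except KeyError:
        raise LookupError("no decoder registered for %r" % cte) from None
    return decoder(data)
```

An application with a nonstandard encoding would add one more entry to
transfer_decoders before parsing.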

-Barry

From barry at python.org  Mon Oct 12 22:45:09 2009
From: barry at python.org (Barry Warsaw)
Date: Mon, 12 Oct 2009 16:45:09 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <Pine.LNX.4.64.0910101708490.18193@kimball.webabinitio.net>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
	<4ACE6A1B.7060702@g.nevcal.com>
	<3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org>
	<4ACF880D.5080305@g.nevcal.com>
	<Pine.LNX.4.64.0910101708490.18193@kimball.webabinitio.net>
Message-ID: <C43179FF-72C4-4829-9E7B-37EECD8DD16F@python.org>

On Oct 10, 2009, at 5:20 PM, R. David Murray wrote:

> The other is a Glossary[2].  I think most of it accurately reflects
> the consensus here, but in it I'm proposing to use the term
> 'transfer-decoded' for #3, and 'transfer-encoded' as an alternative
> to 'wire-format' just for symmetry.  Comments and suggestions welcome.

wire-format is potentially misleading because the RFCs define
line-endings as CRLF, but we accept system native line-endings, and
sometimes output them too.

ready-for-another-can-of-worms-yum!-ly y'rs,
-Barry

From barry at python.org  Mon Oct 12 22:47:31 2009
From: barry at python.org (Barry Warsaw)
Date: Mon, 12 Oct 2009 16:47:31 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <878wfilpsk.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <PC19220091008162023032819903bdf@msapiro>
	<4ACEABFD.6010309@g.nevcal.com> <4ACEB234.9030309@is.kochi-u.ac.jp>
	<4ACED8C4.5070906@g.nevcal.com> <4ACEF66B.3000500@is.kochi-u.ac.jp>
	<4ACFA08F.9080307@g.nevcal.com> <4ACFB456.6010106@is.kochi-u.ac.jp>
	<4ACFDF86.8040104@g.nevcal.com>
	<87fx9rl0jh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4AD0EC72.6040704@g.nevcal.com>
	<878wfilpsk.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <19D59CF5-9027-4544-AE8A-207AAA26F6D3@python.org>

On Oct 10, 2009, at 8:47 PM, Stephen J. Turnbull wrote:

> The question of "what should we do about broken mail" at this point
> has three components:
>
> (1) To what level do we (ie, the email module) promise to parse
>    conforming wire format into useful objects?
>
> (2) For nonconforming input, when is it OK to raise an error and
>    return to the calling client rather than handle it ourselves?
>
> (3) What is the API for accessing and/or mutating unparsed data, and
>    requesting a reparse?
>
> I don't think we should go any farther than that.

Agreed!

-Barry

From barry at python.org  Mon Oct 12 22:54:17 2009
From: barry at python.org (Barry Warsaw)
Date: Mon, 12 Oct 2009 16:54:17 -0400
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <877hv2likn.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACCD10D.4070308@g.nevcal.com>
	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACE6CBD.2030805@g.nevcal.com>
	<87eipdp4xf.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACED79F.6050602@g.nevcal.com>
	<87ws34li1x.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACF9C6B.4020508@g.nevcal.com>
	<87iqenl594.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4AD0E82A.5000603@g.nevcal.com>
	<877hv2likn.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <82DD0453-E0D5-4853-9678-27D7E0FEB9CC@python.org>

On Oct 10, 2009, at 11:23 PM, Stephen J. Turnbull wrote:

> Right.  To riff on the RFC vs. not theme ["Barry, pick up the bass
> line, need more bottom here!"], I think we should pick a list of RFCs
> we "promise" to implement as "defining" email; if we reserve any
> structures as "too obscure for us to parse," we should say so (and
> reference chapter and verse of the Holy RFC).  On the other hand, of
> course as we discover common use cases for which precise
> specifications can be given, we should be flexible and implement them.
> But there should be no rush.

Although of course Rush is the most awesomest band EVAR.  But I'm  
slappin' and poppin' to your groove here my bruthah.

> Which RFCs?
>
> First of all, the STD 11 series (RFCs 733, 822, 2822, 5322).  Here we
> have to worry about the standard's recommended format vs. the obsolete
> format because of the Postel principle.  AFAIK, there is no reason not
> to insist on *producing* strictly RFC 5322 conformant messages, but I
> think we should implement both strict and lax parsers.  The lax parser
> is for "daily use", the strict parser for validation.
>
> Second, the basic MIME structure RFCs: 2045-2049, 2231.  (Some of
> these have been at least partially superseded by now, I think.)
>
> The mailing list header RFCs: 2369 and 2919.

Yep, yep, and yep.
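
For illustration, the strict/lax split can be sketched with the
email.policy framework found in current Python 3 releases, which
postdates this thread and is not necessarily the design being discussed
here.  The BROKEN message is a made-up example of nonconforming input.

```python
import email
from email import policy
from email.errors import MessageDefect

# A multipart message that never declares its boundary parameter.
BROKEN = (
    "MIME-Version: 1.0\n"
    "Content-Type: multipart/mixed\n"
    "\n"
    "body that cannot be split into parts\n"
)

# Lax ("daily use"): parsing succeeds and problems are recorded as defects.
lax = email.message_from_string(BROKEN, policy=policy.default)
lax_defects = [type(defect).__name__ for defect in lax.defects]

# Strict ("validation"): the same input raises on the first defect.
try:
    email.message_from_string(BROKEN, policy=policy.strict)
    strict_error = None
except MessageDefect as exc:
    strict_error = type(exc).__name__
```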

> Not RFCs, per se, but an auxiliary module should provide the
> registered IANA data for the above RFCs.
>
> Strictly speaking outside of the email module, but we make use of URLs
> (RFC 3986 -- superseded?) and mimetypes data (this overlaps
> substantially with the "registered IANA data").  We need to coordinate
> with the responsible maintainers for those.
>
> Ditto coordinating with modules that we share a lot of structure with,
> the "not email but very similar" like HTTP (RFC 2616), and netnews
> (NNTP = 3397 and RFC 1036).
>
> Which extensions?
>
> Er, don't you think the above is enough for now?<wink>

Surely is, at least until that U$1M grant from the PSF comes through  
<wink>.  Oh wait, we blew that on lunch at Pycon 2009.

-Barry

From stephen at xemacs.org  Tue Oct 13 06:07:00 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 13 Oct 2009 13:07:00 +0900
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87C79F21-2E8D-460E-992D-AE6050F0C394@python.org>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
	<4ACE6A1B.7060702@g.nevcal.com>
	<3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org>
	<4ACF880D.5080305@g.nevcal.com>
	<87C79F21-2E8D-460E-992D-AE6050F0C394@python.org>
Message-ID: <871vl8hr8b.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:

 > I would propose a radical suggestion: we treat backward compatibility  
 > the way Python 3 did.  Nice to keep, but we can throw it over the side  
 > in order to fix the warts.  We'll worry about migration strategy later.

+1

 > Aside: I would really like to have a much more @property based API  
 > where appropriate.

+1

 > E.g. Message.get_content_type() would be Message.content_type.  And
 > in this case we'd probably have message.payload_bytes or some such.
 > Decoding may require additional parameters so it will probably be a
 > method.

Maybe, but in general those parameters can be deduced from the
metadata.  If we can use those defaults often enough, then the
default-decoded version can be a property too.

We would have to provide alternatives, though.  I've seen Shift JIS
encoded Japanese labelled "ISO-2022-JP", and apparently many Japanese
MUAs actually decode that to Japanese!  Not suggesting that we should
do the same, but probably the generic function that is used to decode
should be exposed as a method so that clients who encounter such
nonsense can deal with it, and override any of the metadata.
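
A minimal sketch of such a method, written here as a free function
around the existing Message API.  decode_text is a hypothetical name,
and the sample message is constructed to mimic the mislabelled Shift JIS
case; it is not taken from real traffic.

```python
import base64
from email import message_from_bytes


def decode_text(part, charset=None, errors="strict"):
    """Decode a text part, letting the caller override the declared charset."""
    data = part.get_payload(decode=True)  # transfer-decoded bytes
    codec = charset or part.get_content_charset() or "us-ascii"
    return data.decode(codec, errors)


# Three kanji encoded as Shift JIS but labelled ISO-2022-JP.
kanji = "\u65e5\u672c\u8a9e"
mislabelled = message_from_bytes(
    b"Content-Type: text/plain; charset=ISO-2022-JP\n"
    b"Content-Transfer-Encoding: base64\n"
    b"\n" + base64.b64encode(kanji.encode("shift_jis")) + b"\n"
)

# The client overrides the metadata it knows to be wrong.
text = decode_text(mislabelled, charset="shift_jis")
```

Calling decode_text(mislabelled) with no override uses the declared
charset, which is the default a property could expose.
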

From andrewm at object-craft.com.au  Mon Oct 19 06:39:06 2009
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 19 Oct 2009 15:39:06 +1100
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org>
References: <10506972.7161254576370614.JavaMail.root@boaz>
	<8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org>
	<87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org>
Message-ID: <20091019043906.542AE59C086@longblack.object-craft.com.au>

>Just to ramble a little longer, it's been argued that we should give  
>up on idempotency, but I'm not convinced.  I think people want to see  
>an email message they throw into the system come out the other end as  
>closely as possible (well, /exactly/ for well-formed messages).

I, for one, would be disappointed if we lost idempotency. If people want
a use-case, think of SpamBayes, where we read the message, do our best
to analyse it, then insert a header or two. If this mangled messages,
the email module would be nearly useless to SB.
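
The SpamBayes-style use can be sketched as follows, using the parser and
generator as they exist in current Python 3.  The message text and the
X-Example-Score header name are made up for the example.

```python
import email

ORIGINAL = (
    "From: sender@example.com\n"
    "Subject: quarterly report\n"
    "\n"
    "Body text that must survive untouched.\n"
)

msg = email.message_from_string(ORIGINAL)
msg["X-Example-Score"] = "0.02"  # hypothetical header added by the filter
tagged = msg.as_string()

# The only difference we want: one new header line before the blank line.
expected = ORIGINAL.replace("\n\n", "\nX-Example-Score: 0.02\n\n", 1)
```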

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Mon Oct 19 06:50:26 2009
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 19 Oct 2009 15:50:26 +1100
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACCD10D.4070308@g.nevcal.com>
	<87ljjmqfk0.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <20091019045027.12F3E59C086@longblack.object-craft.com.au>

> > Your "hit me with your best shot" comment indicates that you want a
> > failure code or exception when the data is bad, and then a way to
> > "retry accepting errors"?
>
>My current thinking is that the email module should return an object
>representing a partial parse.  The way that you find out if it is
>partial is to try to access some data that "should" be in the object.
>If the parse succeeded, the accessor returns the data (which might be
>empty).  If the parse did not succeed, you get an AttributeError.
>(This is just a paraphrase of what I wrote in response to Oleg.)

I agree - try to extract as much intelligence as we can from the malformed
message, and hold the unparseable bits in a "bad chunk" object. If
possible, when reserialising the message, emit the bad chunk verbatim,
or possibly with minor fixes to keep the containing MIME structure legal
if we have to. But I'd rather see "garbage-in and same garbage-out",
than "garbage-in and even worse garbage out".
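
A toy sketch of the "bad chunk" idea, restricted to header lines.
BadChunk, ParsedHeader and parse_header_lines are hypothetical names,
and the parsing rule is far cruder than a real header parser.

```python
class BadChunk:
    """Holds bytes the parser could not interpret, unchanged."""

    def __init__(self, raw, reason):
        self.raw = raw
        self.reason = reason

    def serialize(self):
        return self.raw  # garbage in, same garbage out


class ParsedHeader:
    def __init__(self, name, value):
        self.name = name
        self.value = value

    def serialize(self):
        return self.name + b": " + self.value + b"\n"


def parse_header_lines(lines):
    """Stand-in parser: a line is 'Name: value' or it becomes a BadChunk."""
    chunks = []
    for line in lines:
        name, sep, value = line.rstrip(b"\r\n").partition(b":")
        if sep and name and b" " not in name:
            chunks.append(ParsedHeader(name, value.strip()))
        else:
            chunks.append(BadChunk(line, "no field name"))
    return chunks


source = [b"Subject: hello\n", b"this line is not a header\n",
          b"To: a@example.com\n"]
chunks = parse_header_lines(source)
output = b"".join(chunk.serialize() for chunk in chunks)
```

Because BadChunk has no value attribute, asking for chunks[1].value
raises AttributeError, which matches the accessor behaviour quoted
above.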

Maybe the parsing should be lazy where possible: don't recurse deeper
into the structure if all we're doing is looking at a top level header,
for instance.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Mon Oct 19 07:05:10 2009
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 19 Oct 2009 16:05:10 +1100
Subject: [Email-SIG] fixing the current email module
In-Reply-To: <C43179FF-72C4-4829-9E7B-37EECD8DD16F@python.org>
References: <8510262.7231254589795083.JavaMail.root@boaz>
	<4ACB0DC9.7080307@g.nevcal.com>
	<8763astxlw.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACB971D.9080706@g.nevcal.com>
	<87k4z8rqou.fsf@uwakimon.sk.tsukuba.ac.jp>
	<4ACC0277.2060807@g.nevcal.com>
	<87ab03sdbt.fsf@uwakimon.sk.tsukuba.ac.jp>
	<643CAE16-EFB2-4D8A-83B4-44888DAAED74@python.org>
	<4ACD94E5.5020808@g.nevcal.com>
	<4A1C4B7E-57DF-4C1C-83BD-C20B47781CB3@python.org>
	<4ACE6A1B.7060702@g.nevcal.com>
	<3DF8BB7E-7C60-444A-8D5D-C74F58606184@python.org>
	<4ACF880D.5080305@g.nevcal.com>
	<Pine.LNX.4.64.0910101708490.18193@kimball.webabinitio.net>
	<C43179FF-72C4-4829-9E7B-37EECD8DD16F@python.org>
Message-ID: <20091019050510.C238259C086@longblack.object-craft.com.au>

>wire-format is potentially misleading because the RFCs define
>line-endings as CRLF, but we accept system native line-endings, and
>sometimes output them too.

And, in some contexts, when forwarding e-mail it is important that we
emit exactly the line endings we received, without trying to be "helpful"
and "fix" them.  But, in the case of text content inserted into a message,
I think we should convert the system line endings into CRLF (possibly with
some way to override this - a "literal" mode).
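
As a sketch of that conversion for inserted text, with a flag standing
in for the "literal" mode.  to_crlf and its literal parameter are
hypothetical names.

```python
def to_crlf(text, literal=False):
    """Return text with every line ending as CRLF, unless literal is set."""
    if literal:
        return text  # caller takes responsibility for the line endings
    # Collapse CRLF and bare CR to LF first so existing CRLFs are not doubled.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", "\r\n")
```

Forwarded content that must keep the line endings it arrived with would
go through with literal=True, or bypass this step entirely.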

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From rdmurray at bitdance.com  Thu Oct 22 00:58:42 2009
From: rdmurray at bitdance.com (R. David Murray)
Date: Wed, 21 Oct 2009 18:58:42 -0400 (EDT)
Subject: [Email-SIG] invertability and idempotence
In-Reply-To: <20091019043906.542AE59C086@longblack.object-craft.com.au>
References: <10506972.7161254576370614.JavaMail.root@boaz>
	<8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org>
	<87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org>
	<20091019043906.542AE59C086@longblack.object-craft.com.au>
Message-ID: <Pine.LNX.4.64.0910211839010.18193@kimball.webabinitio.net>

On Mon, 19 Oct 2009 at 15:39, Andrew McNamara wrote:
>> Just to ramble a little longer, it's been argued that we should give
>> up on idempotency, but I'm not convinced.  I think people want to see
>> an email message they throw into the system come out the other end as
>> closely as possible (well, /exactly/ for well-formed messages).
>
> I, for one, would be disappointed if we lost idempotency. If people want
> a use-case, think of SpamBayes, where we read the message, do our best
> to analyse it, then insert a header or two. If this mangled messages,
> the email module would be nearly useless to SB.

You are referring here to invertibility, rather than idempotence.

But it turns out that idempotence does have a meaning in the context
of the email module, so I think I need to remove 'deprecated' from
my glossary[1] entry for it, and explain what it means in the context
of the email module.

For background, see issue 7119[2].

Here's what I propose: _invertibility_ applies to the data path
into the parser and out of the generator.  That is:

     generate(parse(msg)) == msg

should be true whenever possible.

On the other hand, when _constructing_ a message, sometimes not all data
is filled in (in the example above, it is the MIME boundary marker).
In that case, it is important (I think, please discuss :) that generating
the message maintain _idempotency_: once you have generated the message,
then if you have not further mutated the message, generating the message
again should produce the _same_ output.  That is:

     generate(msg) == generate(msg)

even though the state of msg may change after the _first_ generate call.
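
Both properties can be checked against the current Python 3 email
package with a short sketch.  parse and generate are thin illustrative
wrappers, and the sample messages are made up.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def parse(text):
    return email.message_from_string(text)


def generate(msg):
    return msg.as_string()


# Round trip: well-formed text in, the same text out.
wire = "From: a@example.com\nSubject: hi\n\nbody\n"
round_trip = generate(parse(wire))

# Constructed message: no boundary exists until the first generate call.
built = MIMEMultipart()
built.attach(MIMEText("hello"))
boundary_before = built.get_boundary()
first = generate(built)
second = generate(built)
boundary_after = built.get_boundary()
```

Here boundary_before is None and boundary_after is not, so the state of
built did change, yet first and second are equal.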

--David

[1] http://wiki.python.org/moin/Email%20SIG/Glossary
[2] http://bugs.python.org/issue7119

From andrewm at object-craft.com.au  Thu Oct 22 06:58:24 2009
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 22 Oct 2009 15:58:24 +1100
Subject: [Email-SIG] invertability and idempotence
In-Reply-To: <Pine.LNX.4.64.0910211839010.18193@kimball.webabinitio.net>
References: <10506972.7161254576370614.JavaMail.root@boaz>
	<8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org>
	<87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org>
	<20091019043906.542AE59C086@longblack.object-craft.com.au>
	<Pine.LNX.4.64.0910211839010.18193@kimball.webabinitio.net>
Message-ID: <20091022045824.ABEBD600111@longblack.object-craft.com.au>

>You are referring here to invertibility, rather than idempotence.

The discussion had referred to idempotency up until that point, and I
didn't want to introduce new terminology. But referring to this:

>    generate(parse(msg)) == msg

as "idempotency" is perfectly valid in my opinion (as in, applying an
operation multiple times produces the same result). 

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From stephen at xemacs.org  Thu Oct 22 10:00:13 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Thu, 22 Oct 2009 17:00:13 +0900
Subject: [Email-SIG] invertability and idempotence
In-Reply-To: <20091022045824.ABEBD600111@longblack.object-craft.com.au>
References: <10506972.7161254576370614.JavaMail.root@boaz>
	<8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org>
	<87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org>
	<20091019043906.542AE59C086@longblack.object-craft.com.au>
	<Pine.LNX.4.64.0910211839010.18193@kimball.webabinitio.net>
	<20091022045824.ABEBD600111@longblack.object-craft.com.au>
Message-ID: <87eiovdfjm.fsf@uwakimon.sk.tsukuba.ac.jp>

Andrew McNamara writes:

 > The discussion had referred to idempotency up until that point, and I
 > didn't want to introduce new terminology. But referring to this:
 > 
 > >    generate(parse(msg)) == msg
 > 
 > as "idempotency" is perfectly valid in my opinion (as in, applying an
 > operation multiple times produces the same result). 

That would be generate(generate(msg)) == generate(msg) or
parse(parse(email)) == parse(email).  The input and output of
these functions are of *different types*, they cannot possibly be
idempotent.

I'm +1 on changing to use "invertible", -0 on continuing to use
"idempotent" (since it's the traditional idiom), and -1 on using
"idempotent" to mean "is deterministic", ie, generate(msg) ==
generate(msg).

If msg changes state in an irrelevant way, it would be nice to produce
the same output from generate.  But that is not "idempotency".

And we would need to specify precisely what irrelevant means.  For
example, if a client of the Message class decides to specify the MIME
boundary explicitly, then the output of generate has to change IMO.
OTOH, many MIME implementations put the time of day or the generating
process into the MIME boundary.  This is unnecessary (boundaries need
to be unique only message-wide, and the email package can adjust the
boundary to not conflict with message content, eg, Emacs/Gnus uses
something like "-=-=-=-=-" by default), and I would hope that email
avoids such practices when possible.
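
A small sketch of the explicit-boundary case, using MIMEMultipart from
the current Python 3 package.  The build helper and the boundary strings
are chosen only for illustration.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build(boundary):
    """Build a one-part multipart message with a caller-chosen boundary."""
    msg = MIMEMultipart(boundary=boundary)
    msg.attach(MIMEText("hello"))
    return msg


a = build("-=-=-=-=-").as_string()
b = build("-=-=-=-=-").as_string()
c = build("=-=-=-=-=").as_string()
```

Two messages built the same way give the same output (a equals b), and
changing the boundary changes the output (a differs from c).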


From andrewm at object-craft.com.au  Thu Oct 22 11:42:43 2009
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 22 Oct 2009 20:42:43 +1100
Subject: [Email-SIG] invertability and idempotence
In-Reply-To: <87eiovdfjm.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <10506972.7161254576370614.JavaMail.root@boaz>
	<8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org>
	<87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org>
	<20091019043906.542AE59C086@longblack.object-craft.com.au>
	<Pine.LNX.4.64.0910211839010.18193@kimball.webabinitio.net>
	<20091022045824.ABEBD600111@longblack.object-craft.com.au>
	<87eiovdfjm.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <20091022094243.41A25600111@longblack.object-craft.com.au>

> > didn't want to introduce new terminology. But referring to this:
> > 
> > >    generate(parse(msg)) == msg
> > 
> > as "idempotency" is perfectly valid in my opinion (as in, applying an
> > operation multiple times produces the same result). 
>
>That would be generate(generate(msg)) == generate(msg) or
>parse(parse(email)) == parse(email).  The input and output of
>these functions are of *different types*, they cannot possibly be
>idempotent.

You're splitting hairs - the operation "generate(parse(X))" is idempotent,
and that's what I was referring to. 

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From barry at python.org  Thu Oct 22 13:36:12 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 22 Oct 2009 07:36:12 -0400
Subject: [Email-SIG] invertability and idempotence
In-Reply-To: <Pine.LNX.4.64.0910211839010.18193@kimball.webabinitio.net>
References: <10506972.7161254576370614.JavaMail.root@boaz>
	<8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org>
	<87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org>
	<20091019043906.542AE59C086@longblack.object-craft.com.au>
	<Pine.LNX.4.64.0910211839010.18193@kimball.webabinitio.net>
Message-ID: <ACA7B267-EB0E-49FE-A417-4150D7006B5A@python.org>

On Oct 21, 2009, at 6:58 PM, R. David Murray wrote:

> But it turns out that idempotence does have a meaning in the context
> of the email module, so I think I need to remove 'deprecated' from
> my glossary[1] entry for it, and explain what it means in the context
> of the email module.

I think you're onto something here.

> For background, see issue 7119[2].
>
> Here's what I propose: _invertability_ applies to the data path
> into the parser and out of the generator.  That is:
>
>    generate(parse(msg)) == msg
>
> should be true whenever possible.

Agreed, where 'msg' in this context means the message text or bytes.
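
A small sketch of that round trip using the stdlib email package as it
exists today (not necessarily as it behaved when this thread was
written). The sample message is made up, and a short, well-formed
message like this is the easy case; nothing here shows that the
property holds for arbitrary input.

```python
import email

# Hypothetical sample message: short ASCII headers, single text body.
raw = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: round trip\n"
    "\n"
    "Body text.\n"
)

parsed = email.message_from_string(raw)   # the "parse" step
regenerated = parsed.as_string()          # the "generate" step
print(regenerated == raw)
```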

> On the other hand, when _constructing_ a message, sometimes not all data
> is filled in (in the example above, it is the MIME boundary marker).
> In that case, it is important (I think, please discuss :) that generating
> the message maintain _idempotency_: once you have generated the message,
> then if you have not further mutated the message, generating the message
> again should produce the _same_ output.  That is:
>
>    generate(msg) == generate(msg)
>
> even though the state of msg may change after the _first_ generate call.

"Idempotent" means: "multiple applications of the operation do not
change the result".  So here, where the operation is to take a message
object and generate a stream of text or bytes, this should absolutely
return the same stream if the object is not mutated between calls.  I
think it's fair, though, that if the model is manipulated in any way, we
make no guarantees of idempotency, though we should strive for minimal
differences.
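
As a rough illustration of the property being asked for, the sketch
below builds a multipart message without specifying a boundary and
generates it twice. It reflects what current CPython does, where the
first generate call stores the chosen boundary on the message; the
variable names are made up for this example.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

msg = MIMEMultipart()
msg["Subject"] = "boundary demo"
msg.attach(MIMEText("first part"))
msg.attach(MIMEText("second part"))

boundary_before = msg.get_boundary()   # nothing chosen yet
first = msg.as_string()                # first generate picks a boundary
boundary_after = msg.get_boundary()    # msg state has now changed
second = msg.as_string()               # no mutation by us in between

print(first == second)
```

The state of msg does change during the first call, but only in the
direction of filling in the missing boundary, so the second call has
nothing left to invent.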

-Barry

From stephen at xemacs.org  Thu Oct 22 20:09:43 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 23 Oct 2009 03:09:43 +0900
Subject: [Email-SIG] invertability and idempotence
In-Reply-To: <20091022094243.41A25600111@longblack.object-craft.com.au>
References: <10506972.7161254576370614.JavaMail.root@boaz>
	<8A41B92B-6D7F-4A85-BA64-B5C5C861805A@python.org>
	<87zl88h4cj.fsf@uwakimon.sk.tsukuba.ac.jp>
	<1685F0AB-8B57-445A-BE03-3782E07DB8FD@python.org>
	<20091019043906.542AE59C086@longblack.object-craft.com.au>
	<Pine.LNX.4.64.0910211839010.18193@kimball.webabinitio.net>
	<20091022045824.ABEBD600111@longblack.object-craft.com.au>
	<87eiovdfjm.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20091022094243.41A25600111@longblack.object-craft.com.au>
Message-ID: <874oprcnbs.fsf@uwakimon.sk.tsukuba.ac.jp>

Andrew McNamara writes:

 > > > didn't want to introduce new terminology. But referring to this:
 > > > 
 > > > >    generate(parse(msg)) == msg
 > > > 
 > > > as "idempotency" is perfectly valid in my opinion (as in, applying an
 > > > operation multiple times produces the same result). 
 > >
 > >That would be generate(generate(msg)) == generate(msg) or
 > >parse(parse(email)) == parse(email).  The input and output of
 > >these functions are of *different types*, they cannot possibly be
 > >idempotent.
 > 
 > You're splitting hairs - the operation "generate(parse(X))" is
 > idempotent, and that's what I was referring to.

Yes and no.  The equation above does imply idempotency, but it is a
much stronger statement: generate(parse()) is the identity.  That
stronger statement could be useful in practice, but it could also be
expensive to implement.  That tension could engender flamewars if the
requirement is expressed by the word "idempotency" but the intent is
"identity".

For example, suppose that for MIME multipart messages, generate() uses
"$%$%$%$%$%$" as the separator as long as no component contains that
string.  Then generate(parse(msg)) will be *equivalent* but not
*identical* to msg for most messages received from non-Python-email-
using MUAs.  generate(parse()) is idempotent, though.  I don't think
the folks who ask for "idempotency" would be satisfied with that!
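
A toy sketch of that distinction. The roundtrip function below is a
hypothetical stand-in for generate(parse(text)), not the email
package: it just rewrites whatever boundary it finds to the fixed one,
and it ignores the "unless a component contains that string" check
for brevity.

```python
import re

FIXED = "$%$%$%$%$%$"


def roundtrip(text):
    """Toy stand-in for generate(parse(text)): normalize the boundary."""
    match = re.search(r'boundary="([^"]+)"', text)
    if match is None:
        return text
    return text.replace(match.group(1), FIXED)


# Hypothetical message as another MUA might have written it.
original = (
    'Content-Type: multipart/mixed; boundary="XYZ123"\n'
    "\n"
    "--XYZ123\n"
    "\n"
    "hello\n"
    "--XYZ123--\n"
)

once = roundtrip(original)
twice = roundtrip(once)

print(once == original)   # identity would require this to hold
print(twice == once)      # idempotence only requires this
```

Here the second comparison holds and the first does not: the function
is idempotent without being the identity.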

As I said earlier, if we're going to use the word "idempotent" to mean
"invertible", that's established practice, so we footnote the
Humpty-Dumpty-ism, and I can live with that.  But if we're going to
try to be more accurate, let's be fully accurate.