From barry at python.org  Thu Apr  2 14:54:08 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 2 Apr 2009 07:54:08 -0500
Subject: [Email-SIG] Plans for email 6.0
Message-ID: <0DC79F9C-450F-484F-BBB0-28B69EB879F9@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello everyone.

Today's the last day of Pycon 2009 sprints and I'm eager to return  
home and see my family.  Chris Withers and I had a good day sprinting  
on the email package before he had to jet out, and although we only  
closed one bug in Python 2.7 (this is where Chris's mantra "backport,  
backport" begins :) we had a lot of good discussions about how and  
where to fix outstanding problems in email.

I have lots of ideas on how to improve the email package.  I plan on  
creating a bit of space on the Python wiki to consolidate my thoughts  
and to coordinate implementation.  I'm hoping some of you will be  
interested enough to help with design, testing, use cases, and coding.

We have a few older pages in the wiki covering the email package:

http://wiki.python.org/moin/EmailSigSprint
http://wiki.python.org/moin/EmailSprint

Some of this we've accomplished.  Here's a rambling of some of my  
thoughts on things we should do.

* Turn all header values into Header instances.  It's difficult and  
error prone to have to manage both strings and Headers as values, so  
they should always be Header instances.  We should add a registry of  
Header subclasses, based on the lower cased header name, for allowing  
higher level semantic folding of header strings.

* Implement a Message subclass registry for parsing.  This would allow  
the parser to create custom subclasses based on the Content-Type found  
while parsing the message.

* Bytes and string interfaces.  This is the trickiest one.  I think  
that internally, header names and values, and payloads should all be  
represented as bytes.  But APIs should accept bytes and strings,  
converting to bytes on input, and provide APIs to extract information  
as either bytes or strings.  I've thought about a few ways to do this  
cleanly, but haven't found anything I particularly like yet.  Remember  
that email in Py2 is horribly broken in its discrimination between
bytes and strings, but Py3 forces us to make a choice (which is a good  
thing).

* Clean up the API.  Where possible, simple attribute access should be  
the norm.  Let's get rid of dumb API decisions (like str(msg)  
including the Unix-From).  Let's fix the whole  
get_payload(decode=True) debacle.  Let's fix stuff like needing to  
specify unicode encodings twice in the same call.  Etc.

* Add an external storage API so that messages with huge binary  
payloads don't need to be fully stored in memory.

* Let's target Python 3.1 (coming very soon) if possible, or Python  
3.2 if not.  We should back port email 6.0 to Python 2.x, though we'll  
have to decide how far back we should go (my suggestion: no earlier  
than Python 2.5).

* Fix the myriad of bugs in the tracker!
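
A minimal sketch of the first bullet (a registry of header classes keyed
by the lower-cased header name) might look like the following.  Every
name in it (BaseHeader, register_header, header_for, AddressHeader) is
hypothetical and is not part of the email package; only
email.utils.getaddresses is an existing stdlib call.  It also assumes
the "bytes inside, str or bytes on input" idea from the third bullet.

```python
from email.utils import getaddresses


class BaseHeader:
    """Hypothetical base class: stores the header value as bytes."""

    def __init__(self, name, value):
        self.name = name
        # Accept str or bytes on input, keep bytes internally.
        if isinstance(value, str):
            value = value.encode("ascii")
        self.raw = value

    def as_string(self):
        return self.raw.decode("ascii")


# Hypothetical registry: lower-cased header name -> header class.
_HEADER_CLASSES = {}


def register_header(name):
    """Class decorator that registers a header class under a name."""
    def decorator(cls):
        _HEADER_CLASSES[name.lower()] = cls
        return cls
    return decorator


@register_header("To")
class AddressHeader(BaseHeader):
    """Header with address-list semantics."""

    def addresses(self):
        return getaddresses([self.as_string()])


def header_for(name, value):
    """Look up the class by lower-cased name, else use BaseHeader."""
    cls = _HEADER_CLASSES.get(name.lower(), BaseHeader)
    return cls(name, value)
```

With this, header_for("TO", "Email SIG <email-sig@python.org>") would
return an AddressHeader, while an unregistered name such as "X-Foo"
falls back to BaseHeader.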

That's it for now.  I'll figure out a place in the wiki for this and  
we can start capturing our thoughts there.  One thing I've heard  
pretty consistently is that while the email package has its problems,  
it's one of the best email packages available for any language.  Let's  
make it rock.

Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Darwin)

iQCVAwUBSdS1cHEjvBPtnXfVAQL7egQAk4LQpdfruSdW3R+Egz7dqAWfbftBnQio
dGdyZT/X8cyjGVO9wwcwo2u2c7+JPElpnvBnYZc9oMSFErfUvgumXZo3mEORaGpm
hj/+s0vG8c79SzA9Jz5wB1sBj50c7xN1L7kDCR3Ncwhz4vJSkO8nLvOqaJiccuF8
7s76zNewnO8=
=Dayc
-----END PGP SIGNATURE-----

From tonynelson at georgeanelson.com  Thu Apr  2 16:16:43 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Thu, 2 Apr 2009 10:16:43 -0400
Subject: [Email-SIG] Plans for email 6.0
In-Reply-To: <0DC79F9C-450F-484F-BBB0-28B69EB879F9@python.org>
References: <0DC79F9C-450F-484F-BBB0-28B69EB879F9@python.org>
Message-ID: <p04330100c5fa73bc5e05@[192.168.123.162]>

At 07:54 -0500 2009/04/02, Barry Warsaw wrote:
 ...
>...Here's a rambling of some of my thoughts on things we should do.
 ...

>* Bytes and string interfaces.  This is the trickiest one.  I think
>that internally, header names and values, and payloads should all be
>represented as bytes.  But APIs should accept bytes and strings,
>converting to bytes on input, and provide APIs to extract information
>as either bytes or strings.  I've thought about a few ways to do this
>cleanly, but haven't found anything I particularly like yet.  Remember
>that in email in Py2 is horribly broken in its discrimination between
>bytes and strings, but Py3 forces us to make a choice (which is a good
>thing).

AIUI, this or something like it must be done soon, as the email package is
broken on 3.x now.


>* Clean up the API.  Where possible, simple attribute access should be
>the norm.  Let's get rid of dumb API decisions (like str(msg)
>including the Unix-From).  Let's fix the whole
>get_payload(decode=True) debacle.  Let's fix stuff like needing to
>specify unicode encodings twice in the same call.  Etc.

Sounds good.  I'd like __setitem__ (msg[hdr] = foo) to act more like a
mapping, and not just append new header fields, with .replace_header() and
.add_header() folded together as .set_header().


>* Add an external storage API so that messages with huge binary
>payloads don't need to be fully stored in memory.
>
>* Let's target Python 3.1 (coming very soon) if possible, or Python
>3.2 if not.  We should back port email 6.0 to Python 2.x, though we'll
>have to decide how far back we should go (my suggestion: no earlier
>than Python 2.5).

Python 3.1 should have a working email package, and a simple way for users
needing more to get a better replacement (which they'd install as a
site-package).  I think that a sane split between bytes and string (or
string and Unicode on 2.x) is most needed.


>* Fix the myriad of bugs in the tracker!

Sure, I'm game!  We 2.x users would benefit.  Again, a place for users to
get an "official" current package is needed, as 2.7 is a ways off.
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From barry at python.org  Sun Apr  5 19:26:52 2009
From: barry at python.org (Barry Warsaw)
Date: Sun, 5 Apr 2009 13:26:52 -0400
Subject: [Email-SIG] Email 6.0
Message-ID: <74AB269B-B7E8-4706-B066-E2AA662EF3DB@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I've started a branch for the email package version 6.0.0.  Given that  
we have until May 2nd to solidify this thing for Python 3.1, I  
honestly don't think we'll make it.  I would rather concentrate on  
getting this right, and usable as a standalone package, then work  
toward getting the new version into Python 3.2 and backported to 2.7.

I'm working on a branch in Bazaar, at lp:~barry/python/email6

% bzr branch lp:~barry/python/email6

This is a branch of the Py3k trunk.  I'm starting by refactoring the  
huge test_email.py file into smaller separate tests, then fixing things
as I go.  After the tests are working I plan on starting to fix the  
API and other problems we've talked about.  For now, let's coordinate  
on this branch.  IOW, if you'd like to contribute (and I hope you do!)  
please branch the above and let us know about it here.  I'll keep the  
above branch as (for now) the master copy.

Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Darwin)

iQCVAwUBSdjp3HEjvBPtnXfVAQJcMgQApgnYaX34Au1AhFgOdbRlxbgxN7kRcB/N
F+LK0IsPsrk8nqUoTpCcsNyZA/ErNUqeNctikZprdOz28xPnndrwaFNDHsWwbMfn
NbzacfTP/2R106wOwNANc68dj7jfco7R6fp8Qa3i4vo1S59SiDuyQy7zMstiql/T
nUhCIwijS/Q=
=NLTE
-----END PGP SIGNATURE-----

From barry at python.org  Sun Apr  5 19:30:24 2009
From: barry at python.org (Barry Warsaw)
Date: Sun, 5 Apr 2009 13:30:24 -0400
Subject: [Email-SIG] Plans for email 6.0
In-Reply-To: <p04330100c5fa73bc5e05@[192.168.123.162]>
References: <0DC79F9C-450F-484F-BBB0-28B69EB879F9@python.org>
	<p04330100c5fa73bc5e05@[192.168.123.162]>
Message-ID: <88F420F3-4CA1-4BE6-B696-3CDB066B314B@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Apr 2, 2009, at 10:16 AM, Tony Nelson wrote:

>> * Bytes and string interfaces.  This is the trickiest one.  I think
>> that internally, header names and values, and payloads should all be
>> represented as bytes.  But APIs should accept bytes and strings,
>> converting to bytes on input, and provide APIs to extract information
>> as either bytes or strings.  I've thought about a few ways to do this
>> cleanly, but haven't found anything I particularly like yet.   
>> Remember
>> that in email in Py2 is horribly broken in its discrimination between
>> bytes and strings, but Py3 forces us to make a choice (which is a  
>> good
>> thing).
>
> AIUI, this or something like it must be done soon, as the email  
> package is
> broken on 3.x now.

Indeed.

>> * Clean up the API.  Where possible, simple attribute access should  
>> be
>> the norm.  Let's get rid of dumb API decisions (like str(msg)
>> including the Unix-From).  Let's fix the whole
>> get_payload(decode=True) debacle.  Let's fix stuff like needing to
>> specify unicode encodings twice in the same call.  Etc.
>
> Sounds good.  I'd like __setitem__ (msg[hdr] = foo) to act more like a
> mapping, and not just append new header fields,  
> with .replace_header() and
> .add_header() folded together as .set_header().

Is there a reason for this?  This is one part of the API that I've  
found where practicality beats purity.

>> * Add an external storage API so that messages with huge binary
>> payloads don't need to be fully stored in memory.
>>
>> * Let's target Python 3.1 (coming very soon) if possible, or Python
>> 3.2 if not.  We should back port email 6.0 to Python 2.x, though  
>> we'll
>> have to decide how far back we should go (my suggestion: no earlier
>> than Python 2.5).
>
> Python 3.1 should have a working email package, and a simple way for  
> users
> needing more to get a better replacement (which they'd install as a
> site-package).  I think that a sane split between bytes and string (or
> string and Unicode on 2.x) is most needed.

Unfortunately, it's a /very/ tricky problem.  This pervades every  
aspect of the package.  I'm slowly byte-ifying the internals as I  
refactor the tests.  That's the first step IMO, but it doesn't make  
for a very convenient API.

>> * Fix the myriad of bugs in the tracker!
>
> Sure, I'm game!  We 2.x users would benefit.  Again, a place for  
> users to
> get an "official" current package is needed, as 2.7 is a ways off.

We will definitely make standalone packages available on the  
Cheeseshop for Python 2.x and 3.x.  The question of what goes into 3.1  
is still up in the air I think.

Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Darwin)

iQCVAwUBSdjqsHEjvBPtnXfVAQJZSwP/fABeQG7Q1c4LOZhwCZBcb41Gh4ybZVoK
tZFM2Q1UTdq0bvaEG5xKMkGPHd1S/+AovrwtC4qTIL531p/RJZp3KaDvucGLfWJ3
w61Mk75Zj6yTEbg2GtJwKiY1Zj7oYZgod0NEQ6vgaBAchLAWrnwsE52ap3w+9K7M
wzmppfl/r/I=
=sxwD
-----END PGP SIGNATURE-----

From tonynelson at georgeanelson.com  Sun Apr  5 21:04:54 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Sun, 5 Apr 2009 15:04:54 -0400
Subject: [Email-SIG] Plans for email 6.0
In-Reply-To: <88F420F3-4CA1-4BE6-B696-3CDB066B314B@python.org>
References: <0DC79F9C-450F-484F-BBB0-28B69EB879F9@python.org>
	<p04330100c5fa73bc5e05@[192.168.123.162]>
	<88F420F3-4CA1-4BE6-B696-3CDB066B314B@python.org>
Message-ID: <p0433010dc5feaa29e031@[192.168.123.162]>

Traffic!

At 13:30 -0400 04/05/2009, Barry Warsaw wrote:
>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>On Apr 2, 2009, at 10:16 AM, Tony Nelson wrote:

>>>* Clean up the API. Where possible, simple attribute access should be
>>>the norm. Let's get rid of dumb API decisions (like str(msg) including
>>>the Unix-From). Let's fix the whole get_payload(decode=True) debacle.
>>>Let's fix stuff like needing to specify unicode encodings twice in the
>>>same call. Etc.
>>
>>Sounds good. I'd like __setitem__ (msg[hdr] = foo) to act more like a
>>mapping, and not just append new header fields, with .replace_header()
>>and .add_header() folded together as .set_header().
>
>Is there a reason for this?  This is one part of the API that I've
>found where practicality beats purity.

What part of saying:

    msg["Subject"] = "new subject line"

and getting a second Subject: header field is practical?  For those times
when you really want more than one instance of a header field:

    msg.append_header("Subject", "new subject line")

In general, users of the email package must currently be familiar with all
the mail RFCs in order to properly use the package to create or manipulate
any but the simplest messages, and having "[]" mean "append" isn't helping.
Your suggestion that header fields should always be represented as Header
objects is urgently needed.  Those Header objects will need to be smart
about the header field they represent, and apply all the various encodings
etc. as necessary.
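
The append behaviour discussed above can be reproduced with the stdlib
email.message.Message class under its default (compat32) policy.  The
set_header() helper below is only a sketch of the proposed semantics; it
is a hypothetical free function, not an existing Message method.

```python
from email.message import Message

msg = Message()
msg["Subject"] = "old subject line"
msg["Subject"] = "new subject line"   # appends a second field

# Both Subject fields are now present, in insertion order.
all_subjects = msg.get_all("Subject")


def set_header(message, name, value):
    """Hypothetical helper with mapping-like 'replace' semantics."""
    del message[name]        # removes every field with this name, if any
    message[name] = value    # then adds exactly one


set_header(msg, "Subject", "only subject line")
```

Note that the existing replace_header() only replaces the first matching
field, so deleting and re-adding is the simpler way to end up with
exactly one.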


 ...
>>>* Let's target Python 3.1 (coming very soon) if possible, or Python 3.2
>>>if not. We should back port email 6.0 to Python 2.x, though we'll have
>>>to decide how far back we should go (my suggestion: no earlier than
>>>Python 2.5).
>>
>>Python 3.1 should have a working email package, and a simple way for
>>users needing more to get a better replacement (which they'd install as a
>>site-package). I think that a sane split between bytes and string (or
>>string and Unicode on 2.x) is most needed.
>
>Unfortunately, it's a /very/ tricky problem.

I assume you mean "working email package", not "a simple way for users ...
to get a better replacement".

>This pervades every
>aspect of the package.  I'm slowly byte-ifying the internals as I
>refactor the tests.  That's the first step IMO, but it doesn't make
>for a very convenient API.

So it goes.  It may make more sense as you get farther along.  What parts
of that work can you farm out?  Do you need a RFC-compliant header parser?
I could write one in a few days, I think.


>>> * Fix the myriad of bugs in the tracker!
>>
>>Sure, I'm game! We 2.x users would benefit. Again, a place for users to
>>get an "official" current package is needed, as 2.7 is a ways off.
>
>We will definitely make standalone packages available on the
>Cheeseshop for Python 2.x and 3.x.  The question of what goes into 3.1
>is still up in the air I think.

Well, I think that the bugs I've worked on so far should go into 2.6, 2.7,
and 3.1 (unless 3.1 makes a lot of progress and renders some of the bugs
obsolete).

    [issue5610] email feedparser.py CRLFLF bug: $ vs \Z
    [issue5638] test_httpservers fails CGI tests if --enable-shared
    [issue1555570] email parser incorrectly breaks headers with a CRLF
        at 8192
    [issue3169] email/header.py doesn't handle Base64 headers that have
        been insufficiently padded.
    [issue4487] Add utf8 alias for email charsets
    [issue1079] decode_header does not follow RFC 2047

(There's some argument on the last one, where R. David Murray doesn't want
any header that might not conform to the RFCs to be decoded, and I want any
header that might conform to be decoded -- I cite Postel's law in another
issue, and I think it applies here as well.  A full header parser and
Header implementation would solve the problem properly, but only for Python
3.2 or later.)
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From tonynelson at georgeanelson.com  Sun Apr  5 22:01:26 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Sun, 5 Apr 2009 16:01:26 -0400
Subject: [Email-SIG] Email 6.0
In-Reply-To: <74AB269B-B7E8-4706-B066-E2AA662EF3DB@python.org>
References: <74AB269B-B7E8-4706-B066-E2AA662EF3DB@python.org>
Message-ID: <p0433010cc5feaa1adc96@[192.168.123.162]>

At 13:26 -0400 04/05/2009, Barry Warsaw wrote:
>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>I've started a branch for the email package version 6.0.0.  Given that  
>we have until May 2nd to solidify this thing for Python 3.1, I  
>honestly don't think we'll make it.  I would rather concentrate on  
>getting this right, and usable as a standalone package, then work  
>toward getting the new version into Python 3.2 and backported to 2.7.

Let's also fix some existing bugs, for 2.6, 2.7, and possibly 3.1 if
it can get healthy enough to use.


>I'm working on a branch in Bazaar, at lp:~barry/python/email6
>
>% bzr branch lp:~barry/python/email6

I'm not able to check out that branch:

    bzr ERROR: Not a branch: "bzr+ssh"//bazaar.lanchpad.net//~barry/python/email6/".

Probably it is because it has not been pushed.


>This is a branch of the Py3k trunk.  I'm starting by refactoring the  
>huge test_email.py file into smaller separate tests, then fixing thing  
>as I go.  After the tests are working I plan on starting to fix the  
>API and other problems we've talked about.  For now, let's coordinate  
>on this branch.  IOW, if you'd like to contribute (and I hope you do!)  
>please branch the above and let us know about it here.  I'll keep the  
>above branch as (for now) the master copy.

As I'm new to bzr and launchpad, I'm not sure what all that means. 
Does it mean that I should create a branch at my own launchpad account,
based on a checkout of lp:~barry/python/email6?
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From stephen at xemacs.org  Tue Apr  7 07:22:19 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 07 Apr 2009 14:22:19 +0900
Subject: [Email-SIG] Plans for email 6.0
In-Reply-To: <p0433010dc5feaa29e031@[192.168.123.162]>
References: <0DC79F9C-450F-484F-BBB0-28B69EB879F9@python.org>
	<p04330100c5fa73bc5e05@[192.168.123.162]>
	<88F420F3-4CA1-4BE6-B696-3CDB066B314B@python.org>
	<p0433010dc5feaa29e031@[192.168.123.162]>
Message-ID: <8763hht4vo.fsf@xemacs.org>

Tony Nelson writes:

 > In general, users of the email package must currently be familiar with all
 > the mail RFCs in order to properly use the package to create or manipulate
 > any but the simplest messages,

IMHO, that's a problem with the mail RFCs, not with the email
package.  Internet messaging is inherently complex because of the
backward and Microsoft compatibility requirements.

 > and having "[]" mean "append" isn't helping.

That's probably true, but that's because in Python mapping semantics
are invariably replace rather than append in this circumstance.  It
has nothing to do with the RFCs per se.

 > Your suggestion that header fields should always be represented as
 > Header objects is urgently needed.  Those Header objects will need
 > to be smart about the header field they represent, and apply all
 > the various encodings etc. as necessary.

That's not a good idea.  Header methods should be strict about what
encodings are allowed, but all too often the decisions between
quoted-printable and base64 transfer encodings, and among various
possible text encodings (Japanese alone has 4 major ones in *daily*
use, with different ones typically used in the header and body! and
Chinese isn't much better) are dependent on content or receiver and/or
sender.

It's reasonable for email to have "recommendations", perhaps
implemented as defaults, for each situation, but programmers should be
reminded that the text they provide to the Header class etc is
being munged as it gets inserted into the message.  For simple
situations, of course it makes sense to provide a high-level
interface, such as a string:contents dictionary for headers.

headers = { "From" : [("Stephen J. Turnbull", "stephen at xemacs.org")],
            "To" : [("Email SIG", "email-sig at python.org"),
                    ("da FLUFL", "barry at python.org")],
            "Subject" : "Don't DO that!",
            "Summary" : "This could go on forever but doesn't." }

body = """I just wanted you to know that I
don't think it's a good idea.

Just-yer-neighborhood-busybody-ly y'rs
"""

ready_for_sendmail = email.format_simple_message (headers, body)

And that would be encoded in some lowest-common-denominator charset
like ASCII, ISO-8859-15, ISO-8859-1, or UTF-8 with the earliest
feasible one used, and some heuristic like minimum encoded size or
fraction of non-ASCII used to determine content-transfer-encoding.

But it should be implemented by .format_simple_message, not Header,
IMHO.
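
One way such a helper could be sketched, assuming the later stdlib
classes email.message.EmailMessage and email.utils.formataddr are
available: format_simple_message is the hypothetical name used above,
pick_charset is a made-up helper, and the candidate charset list is only
an example of "earliest feasible one".  The choice of
content-transfer-encoding is left to set_content()'s own heuristic here
rather than implemented by hand.

```python
from email.message import EmailMessage
from email.utils import formataddr


def pick_charset(text, candidates=("us-ascii", "iso-8859-1", "utf-8")):
    """Return the earliest candidate charset that can encode text."""
    for charset in candidates:
        try:
            text.encode(charset)
        except UnicodeEncodeError:
            continue
        return charset
    raise ValueError("no candidate charset can encode the text")


def format_simple_message(headers, body):
    """Build a single-part text message and return it as bytes.

    headers maps a header name to either a string or a list of
    (display name, address) pairs.
    """
    msg = EmailMessage()
    for name, value in headers.items():
        if isinstance(value, list):
            value = ", ".join(formataddr(pair) for pair in value)
        msg[name] = value
    # The default policy may still RFC 2047-encode non-ASCII header text.
    msg.set_content(body, charset=pick_charset(body))
    return msg.as_bytes()
```

The charset and encoding decisions live in the helper, so a Header class
could stay strict about what it accepts.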


From v+python at g.nevcal.com  Tue Apr  7 07:44:22 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Mon, 06 Apr 2009 22:44:22 -0700
Subject: [Email-SIG] Plans for email 6.0
In-Reply-To: <8763hht4vo.fsf@xemacs.org>
References: <0DC79F9C-450F-484F-BBB0-28B69EB879F9@python.org>	<p04330100c5fa73bc5e05@[192.168.123.162]>	<88F420F3-4CA1-4BE6-B696-3CDB066B314B@python.org>	<p0433010dc5feaa29e031@[192.168.123.162]>
	<8763hht4vo.fsf@xemacs.org>
Message-ID: <49DAE836.3030107@g.nevcal.com>

On approximately 4/6/2009 10:22 PM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> IMHO, that's a problem with the mail RFCs, not with the email
> package.  Internet messaging is inherently complex because of the
> backward and Microsoft compatibility requirements.

I agree that Internet messaging, particularly some of the character 
encodings, is inherently complex due to backward compatibility requirements.

I'm not surprised that you mention Microsoft issues, as I've found quite 
a few cases of messages from Microsoft email clients that do not conform 
to the RFCs.  Apple Mail violates a number of them, also, especially 
with MIME constructions.  But I've never attempted to track the 
Microsoft violations of the RFCs... do you have or know of a list of such?

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From stephen at xemacs.org  Tue Apr  7 13:42:44 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 07 Apr 2009 20:42:44 +0900
Subject: [Email-SIG] Plans for email 6.0
In-Reply-To: <49DAE836.3030107@g.nevcal.com>
References: <0DC79F9C-450F-484F-BBB0-28B69EB879F9@python.org>
	<p04330100c5fa73bc5e05@[192.168.123.162]>
	<88F420F3-4CA1-4BE6-B696-3CDB066B314B@python.org>
	<p0433010dc5feaa29e031@[192.168.123.162]>
	<8763hht4vo.fsf@xemacs.org> <49DAE836.3030107@g.nevcal.com>
Message-ID: <87vdpgsn9n.fsf@xemacs.org>

Glenn Linderman writes:

 > I'm not surprised that you mention Microsoft issues, as I've found
 > quite a few cases of messages from Microsoft email clients that do
 > not conform to the RFCs.  Apple Mail violates a number of them,
 > also, especially with MIME constructions.  But I've never attempted
 > to track the Microsoft violations of the RFCs... do you have or
 > know of a list of such?

No, I don't.  For me it's not been worth keeping one, but if email is
going to be the world-beating email library, it might be worth keeping
one.  I mean, just how many people would fall in love with Mailman if
there were a "select your broken MUA here" in the personal user's
page, and selecting actually got you a personalized message that
didn't display Sender in the From field in Outlook Express? :-)


From tonynelson at georgeanelson.com  Thu Apr  9 17:05:38 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Thu, 9 Apr 2009 11:05:38 -0400
Subject: [Email-SIG] email package Bytes vs Unicode (was Re: [Python-Dev]
 Dropping bytes "support" in json)
In-Reply-To: <grkodk$j4p$1@ger.gmane.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<grkodk$j4p$1@ger.gmane.org>
Message-ID: <p04330100c603badeb135@[192.168.123.162]>

(email-sig added)

At 08:07 -0400 04/09/2009, Steve Holden wrote:
>Barry Warsaw wrote:
 ...
>> This is an interesting question, and something I'm struggling with for
>> the email package for 3.x.  It turns out to be pretty convenient to have
>> both a bytes and a string API, both for input and output, but I think
>> email really wants to be represented internally as bytes.  Maybe.  Or
>> maybe just for content bodies and not headers, or maybe both.  Anyway,
>> aside from that decision, I haven't come up with an elegant way to allow
>> /output/ in both bytes and strings (input is I think theoretically
>> easier by sniffing the arguments).
>>
>The real problem I came across in storing email in a relational database
>was the inability to store messages as Unicode. Some messages have a
>body in one encoding and an attachment in another, so the only ways to
>store the messages are either as a monolithic bytes string that gets
>parsed when the individual components are required or as a sequence of
>components in the database's preferred encoding (if you want to keep the
>original encoding most relational databases won't be able to help unless
>you store the components as bytes).
 ...

I found it confusing myself, and did it wrong for a while.  Now, I
understand that messages come over the wire as bytes, either 7-bit US-ASCII
or 8-bit whatever, and are parsed at the receiver.  I think of the database
as a wire to the future, and store the data as bytes (a BLOB), letting the
future receiver parse them as it did the first time, when I cleaned the
message.  Data I care to query is extracted into fields (in UTF-8, what I
usually use for char fields).  I have no need to store messages as Unicode,
and they aren't Unicode anyway.  I have no need ever to flatten a message
to Unicode, only to US-ASCII or, for messages (spam) that are corrupt, raw
8-bit data.

If you need the data from the message, by all means extract it and store it
in whatever form is useful to the purpose of the database.  If you need the
entire message, store it intact in the database, as the bytes it is.  Email
isn't Unicode any more than a JPEG or other image types (often payloads in
a message) are Unicode.
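
A small sketch of that approach, using sqlite3 only for illustration
(the table layout, sample message and variable names are assumptions,
not anything from a real system): the whole message is kept as a BLOB,
and only the field to be queried is decoded and stored as text.

```python
import sqlite3
from email import message_from_bytes
from email.header import decode_header, make_header

# A made-up message as it might arrive over the wire, 8-bit body included.
raw = (b"From: sender@example.org\r\n"
       b"Subject: =?iso-8859-1?q?caf=E9?=\r\n"
       b"\r\n"
       b"body bytes \xe9\r\n")

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE mail (subject TEXT, raw BLOB)")

# Extract the queryable field as text; keep the message itself as bytes.
msg = message_from_bytes(raw)
subject = str(make_header(decode_header(msg["Subject"])))
db.execute("INSERT INTO mail VALUES (?, ?)", (subject, raw))

# The "future receiver" gets the identical bytes back and re-parses them.
stored_subject, stored_raw = db.execute(
    "SELECT subject, raw FROM mail").fetchone()
reparsed = message_from_bytes(stored_raw)
```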
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From barry at python.org  Fri Apr 10 04:26:22 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 9 Apr 2009 22:26:22 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <grkodk$j4p$1@ger.gmane.org>
References: <loom.20090408T110540-221@post.gmane.org>	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<grkodk$j4p$1@ger.gmane.org>
Message-ID: <1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org>

On Apr 9, 2009, at 8:07 AM, Steve Holden wrote:

> The real problem I came across in storing email in a relational  
> database
> was the inability to store messages as Unicode. Some messages have a
> body in one encoding and an attachment in another, so the only ways to
> store the messages are either as a monolithic bytes string that gets
> parsed when the individual components are required or as a sequence of
> components in the database's preferred encoding (if you want to keep  
> the
> original encoding most relational databases won't be able to help  
> unless
> you store the components as bytes).
>
> All in all, as you might expect from a system that's been growing up
> since 1970 or so, it can be quite intractable.

There are really two ways to look at an email message.  It's either an  
unstructured blob of bytes, or it's a structured tree of objects.   
Those objects have headers and payload.  The payload can be of any  
type, though I think it generally breaks down into "strings" for
text/* types and bytes for anything else (not counting multiparts).

The email package isn't a perfect mapping to this, which is something  
I want to improve.  That aside, I think storing a message in a  
database means storing some or all of the headers separately from the  
byte stream (or text?) of its payload.  That's for non-multipart  
types.  It would be more complicated to represent a message tree of  
course.

It does seem to make sense to think about headers as text header names  
and text header values.  Of course, header values can contain almost  
anything and there's an encoding to bring it back to 7-bit ASCII, but  
again, you really have two views of a header value.  Which you want  
really depends on your application.

Maybe you just care about the text of both the header name and value.   
In that case, I think you want the values as unicodes, and probably  
the headers as unicodes containing only ASCII.  So your table would be  
strings in both cases.  OTOH, maybe your application cares about the  
raw underlying encoded data, in which case the header names are  
probably still strings of ASCII-ish unicodes and the values are  
bytes.  It's this distinction (and I think the competing use cases)  
that make a true Python 3.x API for email more complicated.

Thinking about this stuff makes me nostalgic for the sloppy happy days  
of Python 2.x

-Barry

-------------- next part --------------
A non-text attachment was scrubbed...
Name: PGP.sig
Type: application/pgp-signature
Size: 304 bytes
Desc: This is a digitally signed message part
URL: <http://mail.python.org/pipermail/email-sig/attachments/20090409/cdf11303/attachment.pgp>

From barry at python.org  Fri Apr 10 04:38:11 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 9 Apr 2009 22:38:11 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
Message-ID: <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>

On Apr 9, 2009, at 11:55 AM, Daniel Stutzbach wrote:

> On Thu, Apr 9, 2009 at 6:01 AM, Barry Warsaw <barry at python.org> wrote:
> Anyway, aside from that decision, I haven't come up with an elegant  
> way to allow /output/ in both bytes and strings (input is I think  
> theoretically easier by sniffing the arguments).
>
> Won't this work? (assuming dumps() always returns a string)
>
> def dumpb(obj, encoding='utf-8', *args, **kw):
>     s = dumps(obj, *args, **kw)
>     return s.encode(encoding)

So, what I'm really asking is this.  Let's say you agree that there  
are use cases for accessing a header value as either the raw encoded  
bytes or the decoded unicode.  What should this return:

 >>> message['Subject']

The raw bytes or the decoded unicode?

Okay, so you've picked one.  Now how do you spell the other way?

The Message class probably has these explicit methods:

 >>> Message.get_header_bytes('Subject')
 >>> Message.get_header_string('Subject')

(or better names... it's late and I'm tired ;).  One of those maps to  
message['Subject'] but which is the more obvious choice?

Now, setting headers.  Sometimes you have some unicode thing and  
sometimes you have some bytes.  You need to end up with bytes in the  
ASCII range and you'd like to leave the header value unencoded if so.   
But in both cases, you might have bytes or characters outside that  
range, so you need an explicit encoding, defaulting to utf-8 probably.

 >>> Message.set_header('Subject', 'Some text', encoding='utf-8')
 >>> Message.set_header('Subject', b'Some bytes')

One of those maps to

 >>> message['Subject'] = ???

I'm open to any suggestions here!
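
One way to picture the two spellings is the minimal sketch below.  It
assumes a hypothetical SketchMessage class (not part of the email
package) that keeps every header value as ASCII-only wire bytes, leaves
pure-ASCII values unencoded, and otherwise falls back to RFC 2047
encoded words through the existing email.header.Header class.  The
method names are the ones floated above; everything else is
illustrative only.

```python
from email.header import Header, decode_header, make_header


class SketchMessage:
    """Hypothetical message that stores header values as wire-format bytes."""

    def __init__(self):
        self._headers = {}  # lower-cased header name -> ASCII-only bytes

    def set_header(self, name, value, encoding="utf-8"):
        if isinstance(value, bytes):
            # Decode bytes first so both input types share one code path.
            value = value.decode(encoding)
        try:
            raw = value.encode("ascii")  # plain ASCII stays unencoded
        except UnicodeEncodeError:
            # Non-ASCII text becomes an RFC 2047 encoded word.
            raw = Header(value, charset=encoding).encode().encode("ascii")
        self._headers[name.lower()] = raw

    def get_header_bytes(self, name):
        return self._headers[name.lower()]

    def get_header_string(self, name):
        raw = self._headers[name.lower()].decode("ascii")
        return str(make_header(decode_header(raw)))


msg = SketchMessage()
msg.set_header("Subject", "Gr\u00fc\u00dfe", encoding="utf-8")
wire = msg.get_header_bytes("Subject")   # ASCII-only encoded word
text = msg.get_header_string("Subject")  # decoded back to the original text
```

Whichever accessor message['Subject'] ends up mapping to, the other
one stays reachable under an explicit name.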
-Barry

-------------- next part --------------
A non-text attachment was scrubbed...
Name: PGP.sig
Type: application/pgp-signature
Size: 304 bytes
Desc: This is a digitally signed message part
URL: <http://mail.python.org/pipermail/email-sig/attachments/20090409/952033a1/attachment.pgp>

From barry at python.org  Fri Apr 10 04:40:30 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 9 Apr 2009 22:40:30 -0400
Subject: [Email-SIG] [Python-Dev] email package Bytes vs Unicode (was
	Re: Dropping bytes "support" in json)
In-Reply-To: <grl78j$7sl$1@ger.gmane.org>
References: <loom.20090408T110540-221@post.gmane.org>	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>	<loom.20090409T043042-835@post.gmane.org>	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>	<grkodk$j4p$1@ger.gmane.org>
	<p04330100c603badeb135@[192.168.123.162]>
	<grl78j$7sl$1@ger.gmane.org>
Message-ID: <657BFEEA-04E3-418F-86C0-D2F80C75DB96@python.org>

On Apr 9, 2009, at 12:20 PM, Steve Holden wrote:

> PostgreSQL strongly encourages you to store text as encoded columns.
> Because emails lack an encoding it turns out this is a most  
> inconvenient
> storage type for it. Sadly BLOBs are such a pain in PostgreSQL that  
> it's
> easier to store the messages in external files and just use the
> relational database to index those files to retrieve content, so  
> that's
> what I ended up doing.

That's not insane for other reasons.  Do you really want to store 10MB  
of mp3 data in your database?

Which of course reminds me that I want to add an interface, probably  
to the parser and message class, to allow an application to store  
message payloads in other than memory.  Parsing and holding onto  
messages with huge payloads can kill some applications, when you might  
not care too much about the actual payload content.
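
As a rough sketch of what "other than memory" could mean, the standard
library's tempfile.SpooledTemporaryFile already keeps small data in
memory and spills to disk past a threshold.  The store_payload helper
and its max_in_memory parameter are hypothetical names, not an existing
or planned email API.

```python
import tempfile


def store_payload(chunks, max_in_memory=64 * 1024):
    """Write payload chunks to a file object that spills to disk when large."""
    spool = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    for chunk in chunks:
        spool.write(chunk)
    spool.seek(0)  # rewind so the caller can read from the start
    return spool


# 100 KiB of payload with a 4 KiB in-memory limit.
payload = store_payload([b"x" * 1024] * 100, max_in_memory=4096)
first = payload.read(4)
```

A parser hook along these lines would let an application that only
cares about headers avoid holding a large payload in memory.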

Barry

-------------- next part --------------
A non-text attachment was scrubbed...
Name: PGP.sig
Type: application/pgp-signature
Size: 304 bytes
Desc: This is a digitally signed message part
URL: <http://mail.python.org/pipermail/email-sig/attachments/20090409/c79c2fa9/attachment.pgp>

From barry at python.org  Fri Apr 10 05:03:35 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 9 Apr 2009 23:03:35 -0400
Subject: [Email-SIG] [Python-Dev] the email module, text,
	and bytes (was Re:  Dropping bytes "support" in json)
In-Reply-To: <20090410031151.12555.724184150.divmod.xquotient.7482@weber.divmod.com>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<grkodk$j4p$1@ger.gmane.org>
	<1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org>
	<20090410031151.12555.724184150.divmod.xquotient.7482@weber.divmod.com>
Message-ID: <ACC56383-7F1B-4CB0-908F-E75E1390AE51@python.org>

On Apr 9, 2009, at 11:11 PM, glyph at divmod.com wrote:

> I think this is a problematic way to model bytes vs. text; it gives  
> text a special relationship to bytes which should be avoided.
>
> IMHO the right way to think about domains like this is a multi-level  
> representation.  The "low level" representation is always bytes,  
> whether your MIME type is text/whatever or application/x-i-dont-know.

This is a really good point, and I really should be clearer when  
describing my current thinking (sleep would help :).

> The thing that's "special" about text is that it's a "high level"  
> representation that the standard library can know about.  But the  
> 'email' package ought to support being extended to support other  
> types just as well.  For example, I want to ask for image/png  
> content as PIL.Image objects, not bags of bytes.  Of course this  
> presupposes some way for PIL itself to get at some bytes, but then  
> you need the email module itself to get at the bytes to convert to  
> text in much the same way.  There also needs to be layering at the  
> level of bytes->base64->some different bytes->PIL->Image.  There are  
> mail clients that will base64-encode unusual encodings so you have  
> to do that same layering for text sometimes.
>
> I'm also being somewhat handwavy with talk of "low" and "high" level  
> representations; of course there are actually multiple levels beyond  
> that.  I might want text/x-python content to show up as an AST, but  
> the intermediate DOM-parsing representation really wants to operate  
> on characters.  Similarly for a DOM and text/html content.  (Modulo  
> the usual encoding-detection weirdness present in parsers.)

When I was talking about supporting text/* content types as strings, I  
was definitely thinking about using basically the same plug-in or  
higher level or whatever API to do that as you might use to get PIL  
images from an image/gif.

> So, as long as there's a crisp definition of what layer of the MIME  
> stack one is operating on, I don't think that there's really any  
> ambiguity at all about what type you should be getting.

In that case, we really need the
bytes-in-bytes-out-bytes-in-the-chewy-center API first, and build things
on top of that.
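
The layering could be sketched roughly as follows, assuming a
hypothetical CONVERTERS registry keyed by content type.  The bottom
layer is bytes in, bytes out (undoing the transfer encoding); a
converter, if one is registered, turns those bytes into a higher-level
object.  None of these names exist in the email package.

```python
import base64

# Hypothetical registry: content type -> function that turns decoded
# bytes into a higher-level object.
CONVERTERS = {}


def converter(content_type):
    def register(func):
        CONVERTERS[content_type] = func
        return func
    return register


@converter("text/plain")
def to_text(data, params):
    return data.decode(params.get("charset", "us-ascii"))


def decode_transfer(raw, cte):
    """Layer 1: undo the Content-Transfer-Encoding; bytes in, bytes out."""
    if cte == "base64":
        return base64.b64decode(raw)
    return raw  # 7bit, 8bit and binary need no transformation


def get_content(raw, content_type, cte="7bit", **params):
    """Layer 2: convert if a converter is registered, else return bytes."""
    data = decode_transfer(raw, cte)
    func = CONVERTERS.get(content_type)
    return func(data, params) if func else data


wire = base64.b64encode("h\u00e9llo".encode("utf-8"))
as_text = get_content(wire, "text/plain", cte="base64", charset="utf-8")
as_bytes = get_content(b"\x89PNG", "image/png")  # no converter registered
```

An image/png converter returning PIL images would plug into the same
registry without the bottom layer knowing about it.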

-Barry

From barry at python.org  Fri Apr 10 05:05:37 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 9 Apr 2009 23:05:37 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <20090410025203.GA199@panix.com>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<20090410025203.GA199@panix.com>
Message-ID: <663162E3-D2EB-4417-93D0-4764BC94646C@python.org>

On Apr 9, 2009, at 10:52 PM, Aahz wrote:

> On Thu, Apr 09, 2009, Barry Warsaw wrote:
>>
>> So, what I'm really asking is this.  Let's say you agree that there  
>> are
>> use cases for accessing a header value as either the raw encoded  
>> bytes or
>> the decoded unicode.  What should this return:
>>
>>>>> message['Subject']
>>
>> The raw bytes or the decoded unicode?
>
> Let's make that the raw bytes by default -- we can add a parameter to
> Message() to specify that the default where possible is unicode for
> returned values, if that isn't too painful.

I don't know whether the parameter thing will work or not, but you're  
probably right that we need to get the bytes-everywhere API first.

-Barry

From barry at python.org  Fri Apr 10 05:23:40 2009
From: barry at python.org (Barry Warsaw)
Date: Thu, 9 Apr 2009 23:23:40 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <49DEBB21.70305@gmail.com>
References: <loom.20090408T110540-221@post.gmane.org>	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>	<loom.20090409T043042-835@post.gmane.org>	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>	<20090410025203.GA199@panix.com>
	<663162E3-D2EB-4417-93D0-4764BC94646C@python.org>
	<49DEBB21.70305@gmail.com>
Message-ID: <0047AD0A-7B5B-4703-96D6-BD26B9752E7D@python.org>

On Apr 9, 2009, at 11:21 PM, Nick Coghlan wrote:

> Barry Warsaw wrote:
>> I don't know whether the parameter thing will work or not, but you're
>> probably right that we need to get the bytes-everywhere API first.
>
> Given that json is a wire protocol, that sounds like the right  
> approach
> for json as well. Once bytes-everywhere works, then a text API can be
> built on top of it, but it is difficult to build a bytes API on top  
> of a
> text one.

Agreed!

> So I guess the IO library *is* the right model: bytes at the bottom of
> the stack, with text as a wrapper around it (mediated by codecs).

Yes, that's a very interesting (and proven?) model.  I don't quite see
how we could apply that to email and json, but it seems like there's a
good idea there. ;)
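
For reference, this is what the model looks like in the io library
itself: a bytes object at the bottom and a text wrapper on top, with a
codec in between.  It only illustrates the existing io layering; how
(or whether) email or json would mirror it is left open.

```python
import io

raw = io.BytesIO()  # bytes at the bottom of the stack
# Text layer on top; newline="" turns off newline translation.
text = io.TextIOWrapper(raw, encoding="utf-8", newline="")

text.write("Subject: caf\u00e9\r\n")
text.flush()  # push the encoded bytes down to the bytes layer

wire = raw.getvalue()

# Reading goes the other way: wrap bytes, get text out.
reader = io.TextIOWrapper(io.BytesIO(wire), encoding="utf-8", newline="")
decoded = reader.read()
```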

-Barry

From tonynelson at georgeanelson.com  Fri Apr 10 05:41:58 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Thu, 9 Apr 2009 23:41:58 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
Message-ID: <p04330101c6046b191e4a@[192.168.123.162]>

At 22:38 -0400 04/09/2009, Barry Warsaw wrote:
 ...
>So, what I'm really asking is this.  Let's say you agree that there
>are use cases for accessing a header value as either the raw encoded
>bytes or the decoded unicode.  What should this return:
>
> >>> message['Subject']
>
>The raw bytes or the decoded unicode?

That's an easy one:  Subject: is an unstructured header, so it must be
text, thus Unicode.  We're looking at a high-level representation of an
email message, with parsed header fields and a MIME message tree.


>Okay, so you've picked one.  Now how do you spell the other way?

message.get_header_bytes('Subject')

Oh, I see that's what you picked.

>The Message class probably has these explicit methods:
>
> >>> Message.get_header_bytes('Subject')
> >>> Message.get_header_string('Subject')
>
>(or better names... it's late and I'm tired ;).  One of those maps to
>message['Subject'] but which is the more obvious choice?

Structured header fields are more of a problem.  Any header with addresses
should return a list of addresses.  I think the default return type should
depend on the data type.  To get an explicit bytes or string or list of
addresses, be explicit; otherwise, for convenience, return the appropriate
type for the particular header field name.


>Now, setting headers.  Sometimes you have some unicode thing and
>sometimes you have some bytes.  You need to end up with bytes in the
>ASCII range and you'd like to leave the header value unencoded if so.
>But in both cases, you might have bytes or characters outside that
>range, so you need an explicit encoding, defaulting to utf-8 probably.

Never for header fields.  The default is always RFC 2047, unless it isn't,
say for params.

The Message class should create an object of the appropriate subclass of
Header based on the name (or use the existing object, see other
discussion), and that should inspect its argument and DTRT or complain.

>
> >>> Message.set_header('Subject', 'Some text', encoding='utf-8')
> >>> Message.set_header('Subject', b'Some bytes')
>
>One of those maps to
>
> >>> message['Subject'] = ???

The expected data type should depend on the header field.  For Subject:, it
should be bytes to be parsed or verbatim text.  For To:, it should be a
list of addresses or bytes or text to be parsed.

The email package should be pythonic, and not require deep understanding of
dozens of RFCs to use properly.  Users don't need to know about the raw
bytes; that's the whole point of MIME and any email package.  It should be
easy to set header fields with their natural data types, and doing it with
bad data should produce an error.  This may require a bit more care in the
message parser, to always produce a parsed message with defects.
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From tonynelson at georgeanelson.com  Fri Apr 10 05:59:54 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Thu, 9 Apr 2009 23:59:54 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<grkodk$j4p$1@ger.gmane.org>
	<1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org>
Message-ID: <p04330100c6046a4bedc6@[192.168.123.162]>

At 22:26 -0400 04/09/2009, Barry Warsaw wrote:

>There are really two ways to look at an email message.  It's either an
>unstructured blob of bytes, or it's a structured tree of objects.
>Those objects have headers and payload.  The payload can be of any
>type, though I think it generally breaks down into "strings" for text/
>* types and bytes for anything else (not counting multiparts).
>
>The email package isn't a perfect mapping to this, which is something
>I want to improve.  That aside, I think storing a message in a
>database means storing some or all of the headers separately from the
>byte stream (or text?) of its payload.  That's for non-multipart
>types.  It would be more complicated to represent a message tree of
>course.

Storing an email message in a database does mean storing some of the header
fields as database fields, but the set of email header fields is open, so
any "unused" fields in a message must be stored elsewhere.  It isn't useful
to just have a bag of name/value pairs in a table.  General message MIME
payload trees don't map well to a database either, unless one wants to get
very relational.  Sometimes the database needs to represent the entire
email message, header fields and MIME tree, but only if it is an email
program and usually not even then.  Usually, the database has a specific
purpose, and can be designed for the data it cares about; it may choose to
keep the original message as bytes.


>It does seem to make sense to think about headers as text header names
>and text header values.  Of course, header values can contain almost
>anything and there's an encoding to bring it back to 7-bit ASCII, but
>again, you really have two views of a header value.  Which you want
>really depends on your application.

I think of header fields as having text-like names (the set of allowed
characters is more than just text, though defined headers don't make use of
that), but the data is either bytes or it should be parsed into something
appropriate:  text for unstructured fields like Subject:, a list of
addresses for address fields like To:.  Many of the structured header
fields have a reasonable mapping to text; certainly this is true for address
header fields.  Content-Type header fields are barely text, they can be so
convolutedly structured, but I suppose one could flatten one of them to
text instead of bytes if the user wanted.  It's not very useful, though,
except for debugging (either by the programmer or the recipient who wants
to know what was cleaned from the message).


>Maybe you just care about the text of both the header name and value.
>In that case, I think you want the values as unicodes, and probably
>the headers as unicodes containing only ASCII.  So your table would be
>strings in both cases.  OTOH, maybe your application cares about the
>raw underlying encoded data, in which case the header names are
>probably still strings of ASCII-ish unicodes and the values are
>bytes.  It's this distinction (and I think the competing use cases)
>that make a true Python 3.x API for email more complicated.

If a database stores the Subject: header field, it would be as text.  The
various recipient address fields are a one message to many names and
addresses mapping, and need a related table of name/address fields, with
each field being text.  The original message (or whatever part of it one
preserves) should be bytes.  I don't think this complicates the email
package API; rather, it just shows where generality is needed.
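
A minimal sketch of that layout, using sqlite3 purely for illustration:
the subject is stored as text, recipients go into a related table, and
the original message is kept as bytes.  The table and column names and
the store helper are assumptions made for the example.

```python
import sqlite3
from email.utils import getaddresses

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE message (
        id      INTEGER PRIMARY KEY,
        subject TEXT,            -- decoded, unstructured header
        raw     BLOB NOT NULL    -- original message, byte for byte
    );
    CREATE TABLE recipient (
        message_id INTEGER NOT NULL REFERENCES message(id),
        field      TEXT NOT NULL,   -- 'to', 'cc', ...
        name       TEXT,
        address    TEXT NOT NULL
    );
""")


def store(raw, subject, to_value):
    """Insert one message and one recipient row per address in To:."""
    cur = conn.execute(
        "INSERT INTO message (subject, raw) VALUES (?, ?)", (subject, raw))
    rows = [(cur.lastrowid, "to", name, addr)
            for name, addr in getaddresses([to_value])]
    conn.executemany("INSERT INTO recipient VALUES (?, ?, ?, ?)", rows)
    return cur.lastrowid


original = b"To: Ann <ann@example.com>, bob@example.com\r\n\r\nbody\r\n"
msg_id = store(original, "Hello", "Ann <ann@example.com>, bob@example.com")
```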


>Thinking about this stuff makes me nostalgic for the sloppy happy days
>of Python 2.x

You now have the opportunity to finally unsnarl that mess.  It is not an
insurmountable opportunity.
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From turnbull at sk.tsukuba.ac.jp  Fri Apr 10 07:22:04 2009
From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull)
Date: Fri, 10 Apr 2009 14:22:04 +0900
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<grkodk$j4p$1@ger.gmane.org>
	<1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org>
Message-ID: <87zlepf5hf.fsf@xemacs.org>

Barry Warsaw writes:

 > There are really two ways to look at an email message.  It's either an  
 > unstructured blob of bytes, or it's a structured tree of objects.

Indeed!

 > Those objects have headers and payload.  The payload can be of any  
 > type, though I think it generally breaks down into "strings" for text/ 
 > * types and bytes for anything else (not counting multiparts).

*sigh*  Why are you back-tracking?

The payload should be of an appropriate *object* type.  Atomic object
types will have their content stored as string or bytes [nb I use
Python 3 terminology throughout].  Composite types (multipart/*) won't
need string or bytes attributes AFAICS.

Start by implementing the application/octet-stream and
text/plain;charset=utf-8 object types, of course.

 > It does seem to make sense to think about headers as text header names  
 > and text header values.

I disagree.  IMHO, structured header types should have object values,
and something like

message['to'] = "Barry 'da FLUFL' Warsaw <barry at python.org>"

should be smart enough to detect that it's a string and attempt to
(flexibly) parse it into a fullname and a mailbox adding escapes, etc.
Whether these should be structured objects or they can be strings or
bytes, I'm not sure (probably bytes, not strings, though -- see next
example).  OTOH

message['to'] = b'''"Barry 'da.FLUFL' Warsaw" <barry at python.org>'''

should assume that the client knows what they are doing, and should
parse it strictly (and I mean "be a real bastard", eg, raise an
exception on any non-ASCII octet), merely dividing it into fullname
and mailbox, and caching the bytes for later insertion in a
wire-format message.

 > In that case, I think you want the values as unicodes, and probably  
 > the headers as unicodes containing only ASCII.  So your table would be  
 > strings in both cases.  OTOH, maybe your application cares about the  
 > raw underlying encoded data, in which case the header names are  
 > probably still strings of ASCII-ish unicodes and the values are  
 > bytes.  It's this distinction (and I think the competing use cases)  
 > that make a true Python 3.x API for email more complicated.

I don't see why you can't have the email API be specific, with
message['to'] always returning a structured_header object (or maybe
even more specifically an address_header object), and methods like

message['to'].build_header_as_text()

which returns

"""To: "Barry 'da.FLUFL' Warsaw" <barry at python.org>"""

and

message['to'].build_header_in_wire_format()

which returns

b"""To: "Barry 'da.FLUFL' Warsaw" <barry at python.org>"""

Then have email.textview.Message and email.wireview.Message which
provide a simple interface where message['to'] would invoke
.build_header_as_text() and .build_header_in_wire_format()
respectively.
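
A rough sketch of such a header object, built on the existing
email.utils.parseaddr and formataddr helpers.  The AddressHeader class
is hypothetical, handles a single address, and only deals with
ASCII-only display names; the two method names are the ones proposed
above.

```python
from email.utils import formataddr, parseaddr


class AddressHeader:
    """Hypothetical structured header holding one fullname/mailbox pair."""

    def __init__(self, name, value):
        self.name = name
        if isinstance(value, bytes):
            # Strict path: raises UnicodeDecodeError on any non-ASCII octet.
            value = value.decode("ascii")
        self.fullname, self.mailbox = parseaddr(value)

    def build_header_as_text(self):
        pair = (self.fullname, self.mailbox)
        return "%s: %s" % (self.name, formataddr(pair))

    def build_header_in_wire_format(self):
        return self.build_header_as_text().encode("ascii")


to = AddressHeader("To", b'"Barry \'da.FLUFL\' Warsaw" <barry@example.org>')
as_text = to.build_header_as_text()
as_wire = to.build_header_in_wire_format()
```

The textview and wireview Message classes would then differ only in
which of the two build methods message['to'] calls.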

 > Thinking about this stuff makes me nostalgic for the sloppy happy days  
 > of Python 2.x

Er, yeah.

Nostalgic-for-the-BITNET-days-where-everything-was-Just-EBCDIC-ly y'rs,

From janssen at parc.com  Fri Apr 10 18:35:44 2009
From: janssen at parc.com (Bill Janssen)
Date: Fri, 10 Apr 2009 09:35:44 PDT
Subject: [Email-SIG] [Python-Dev] the email module, text,
	and bytes (was Re: Dropping bytes "support" in json)
In-Reply-To: <ACC56383-7F1B-4CB0-908F-E75E1390AE51@python.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<grkodk$j4p$1@ger.gmane.org>
	<1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org>
	<20090410031151.12555.724184150.divmod.xquotient.7482@weber.divmod.com>
	<ACC56383-7F1B-4CB0-908F-E75E1390AE51@python.org>
Message-ID: <92023.1239381344@parc.com>

Barry Warsaw <barry at python.org> wrote:

> In that case, we really need the
> bytes-in-bytes-out-bytes-in-the-chewy-
> center API first, and build things on top of that.

Yep.

Bill

From barry at python.org  Fri Apr 10 18:56:09 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 10 Apr 2009 12:56:09 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>
Message-ID: <F40AE8EC-08CC-4634-AA82-264587552F47@python.org>

On Apr 10, 2009, at 1:19 AM, glyph at divmod.com wrote:

> On 02:38 am, barry at python.org wrote:
>> So, what I'm really asking is this.  Let's say you agree that there  
>> are use cases for accessing a header value as either the raw  
>> encoded bytes or the decoded unicode.  What should this return:
>>
>> >>> message['Subject']
>>
>> The raw bytes or the decoded unicode?
>
> My personal preference would be to just get deprecate this API, and  
> get rid of it, replacing it with a slightly more explicit one.
>
>   message.headers['Subject']
>   message.bytes_headers['Subject']

This is pretty darn clever Glyph.  Stop that! :)

I'm not 100% sure I like the name .bytes_headers or that .headers  
should be the decoded header (rather than have .headers return the  
bytes thingie and say .decoded_headers return the decoded thingies),  
but I do like the general approach.

>> Now, setting headers.  Sometimes you have some unicode thing and  
>> sometimes you have some bytes.  You need to end up with bytes in  
>> the ASCII range and you'd like to leave the header value unencoded  
>> if so. But in both cases, you might have bytes or characters  
>> outside that range, so you need an explicit encoding, defaulting to  
>> utf-8 probably.
>
>   message.headers['Subject'] = 'Some text'
>
> should be equivalent to
>
>   message.headers['Subject'] = Header('Some text')

Yes, absolutely.  I think we're all in general agreement that header  
values should be instances of Header, or subclasses thereof.

> My preference would be that
>
>   message.headers['Subject'] = b'Some Bytes'
>
> would simply raise an exception.  If you've got some bytes, you  
> should instead do
>
>   message.bytes_headers['Subject'] = b'Some Bytes'
>
> or
>
>   message.headers['Subject'] = Header(bytes=b'Some Bytes',  
> encoding='utf-8')
>
> Explicit is better than implicit, right?

Yes.

Again, I really like the general idea, if I might quibble about some  
of the details.  Thanks for a great suggestion.
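
To make the two-view idea concrete, here is a minimal sketch with one
shared store and two mapping-like views.  TwoViewMessage, TextHeaders
and BytesHeaders are hypothetical names; the text view relies on the
existing email.header.Header, which leaves ASCII text unencoded and
falls back to a UTF-8 encoded word otherwise.

```python
from email.header import Header, decode_header, make_header


class TextHeaders:
    """Hypothetical text view: str or Header in, str out."""

    def __init__(self, store):
        self._store = store

    def __setitem__(self, name, value):
        if isinstance(value, bytes):
            raise TypeError("use bytes_headers for bytes values")
        header = value if isinstance(value, Header) else Header(value)
        self._store[name.lower()] = header.encode().encode("ascii")

    def __getitem__(self, name):
        raw = self._store[name.lower()].decode("ascii")
        return str(make_header(decode_header(raw)))


class BytesHeaders:
    """Hypothetical bytes view: wire bytes in, wire bytes out."""

    def __init__(self, store):
        self._store = store

    def __setitem__(self, name, value):
        if not isinstance(value, bytes):
            raise TypeError("use headers for text values")
        self._store[name.lower()] = value

    def __getitem__(self, name):
        return self._store[name.lower()]


class TwoViewMessage:
    def __init__(self):
        store = {}  # shared by both views: lower-cased name -> wire bytes
        self.headers = TextHeaders(store)
        self.bytes_headers = BytesHeaders(store)


message = TwoViewMessage()
message.headers["Subject"] = "Some text"
plain = message.bytes_headers["Subject"]
message.bytes_headers["Subject"] = b"=?utf-8?q?caf=C3=A9?="
decoded = message.headers["Subject"]
```

Swapping which view is called .headers is then only a naming decision;
the storage underneath is bytes either way.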

-Barry

From barry at python.org  Fri Apr 10 19:08:26 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 10 Apr 2009 13:08:26 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <p04330101c6046b191e4a@[192.168.123.162]>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<p04330101c6046b191e4a@[192.168.123.162]>
Message-ID: <595A42B2-0D3B-4886-960B-F16D50D0CC5A@python.org>

On Apr 9, 2009, at 11:41 PM, Tony Nelson wrote:

> At 22:38 -0400 04/09/2009, Barry Warsaw wrote:
> ...
>> So, what I'm really asking is this.  Let's say you agree that there
>> are use cases for accessing a header value as either the raw encoded
>> bytes or the decoded unicode.  What should this return:
>>
>>>>> message['Subject']
>>
>> The raw bytes or the decoded unicode?
>
> That's an easy one:  Subject: is an unstructured header, so it must be
> text, thus Unicode.  We're looking at a high-level representation of  
> an
> email message, with parsed header fields and a MIME message tree.

I'm liking Glyph's suggestion here.  We'll probably have to support  
the message['Subject'] API for backward compatibility, but in that  
case it really should be a bytes API.

>> (or better names... it's late and I'm tired ;).  One of those maps to
>> message['Subject'] but which is the more obvious choice?
>
> Structured header fields are more of a problem.  Any header with  
> addresses
> should return a list of addresses.  I think the default return type  
> should
> depend on the data type.  To get an explicit bytes or string or list  
> of
> addresses, be explicit; otherwise, for convenience, return the  
> appropriate
> type for the particular header field name.

Yes, structured headers are trickier.  In a separate message, James
Knight makes some excellent points, which I agree with.  However the
email package obviously cannot support every type of structured header
possible.  It must support this through extensibility.

The obvious way is through inheritance (i.e. subclasses of Header),  
but in my experience, using inheritance of the Message class really  
doesn't work very well.  You need to pass around factories to parsing  
functions and your application tends to have its own hierarchy of  
subclasses for whatever extra things it needs.  ISTM that subclassing  
is simply not the right pattern to support extensibility in the  
Message objects or Header objects.  Yes, this leads me to think that  
all the MIME* subclasses are essentially /wrong/.

Having said all that, the email package must support structured  
headers.  Look at the insanity which is the current folding whitespace  
splitting and the impossibility of the current code to do the right  
thing for say Subject headers and Received headers, and you begin to  
see why it must be possible to extend this stuff.
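
One non-inheritance shape for that extensibility is a plain registry of
parser functions keyed by the lower-cased header name, sketched below.
HEADER_PARSERS, register_header and parse_header are hypothetical
names; only email.utils.getaddresses is an existing function.

```python
from email.utils import getaddresses

# Hypothetical registry: lower-cased header name -> parser function.
HEADER_PARSERS = {}


def register_header(name, parser):
    HEADER_PARSERS[name.lower()] = parser


def parse_header(name, value):
    """Return a structured value if a parser is registered, else the text."""
    parser = HEADER_PARSERS.get(name.lower())
    return parser(value) if parser else value


# Address headers become lists of (display name, address) pairs.
register_header("To", lambda value: getaddresses([value]))
register_header("Cc", lambda value: getaddresses([value]))

recipients = parse_header("to", "Ann <ann@example.com>, bob@example.com")
subject = parse_header("Subject", "hello")  # no parser: returned unchanged
```

An application can register its own parsers without subclassing
anything or passing factories into the parsing functions.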

>> Now, setting headers.  Sometimes you have some unicode thing and
>> sometimes you have some bytes.  You need to end up with bytes in the
>> ASCII range and you'd like to leave the header value unencoded if so.
>> But in both cases, you might have bytes or characters outside that
>> range, so you need an explicit encoding, defaulting to utf-8  
>> probably.
>
> Never for header fields.  The default is always RFC 2047, unless it  
> isn't,
> say for params.
>
> The Message class should create an object of the appropriate  
> subclass of
> Header based on the name (or use the existing object, see other
> discussion), and that should inspect its argument and DTRT or  
> complain.

>>>>> Message.set_header('Subject', 'Some text', encoding='utf-8')
>>>>> Message.set_header('Subject', b'Some bytes')
>>
>> One of those maps to
>>
>>>>> message['Subject'] = ???
>
> The expected data type should depend on the header field.  For  
> Subject:, it
> should be bytes to be parsed or verbatim text.  For To:, it should  
> be a
> list of addresses or bytes or text to be parsed.

At a higher level, yes.  At the low level, it has to be bytes.

> The email package should be pythonic, and not require deep  
> understanding of
> dozens of RFCs to use properly.  Users don't need to know about the  
> raw
> bytes; that's the whole point of MIME and any email package.  It  
> should be
> easy to set header fields with their natural data types, and doing  
> it with
> bad data should produce an error.  This may require a bit more care  
> in the
> message parser, to always produce a parsed message with defects.

I agree that we should have some higher level APIs that make it easy  
to compose email messages, and probably easy-ish to parse a byte  
stream into an email message tree.  But we can't build those without  
the lower level raw support.  I'm also convinced that this lower level  
will be the domain of those crazy enough to have the RFCs tattooed to  
the back of their eyelids.

-Barry

From barry at python.org  Fri Apr 10 19:12:48 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 10 Apr 2009 13:12:48 -0400
Subject: [Email-SIG] [Python-Dev]   Dropping bytes "support" in json
In-Reply-To: <p04330100c6046a4bedc6@[192.168.123.162]>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<grkodk$j4p$1@ger.gmane.org>
	<1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org>
	<p04330100c6046a4bedc6@[192.168.123.162]>
Message-ID: <50EC006F-CF96-45F4-AD71-73B9DE7E510E@python.org>

On Apr 9, 2009, at 11:59 PM, Tony Nelson wrote:

>> Thinking about this stuff makes me nostalgic for the sloppy happy  
>> days
>> of Python 2.x
>
> You now have the opportunity to finally unsnarl that mess.  It is  
> not an
> insurmountable opportunity.

No, it's just a full time job <wink>.  Now where did I put that hack- 
drink-coffee-twitter clone?

-Barry

From barry at python.org  Fri Apr 10 19:21:45 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 10 Apr 2009 13:21:45 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <87zlepf5hf.fsf@xemacs.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<grkodk$j4p$1@ger.gmane.org>
	<1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org>
	<87zlepf5hf.fsf@xemacs.org>
Message-ID: <67879F1D-B386-4B9B-8203-86DB977BD7FF@python.org>

On Apr 10, 2009, at 1:22 AM, Stephen J. Turnbull wrote:

>> Those objects have headers and payload.  The payload can be of any
>> type, though I think it generally breaks down into "strings" for  
>> text/
>> * types and bytes for anything else (not counting multiparts).
>
> *sigh*  Why are you back-tracking?

I'm not.  Sleep deprivation only makes it seem like that.

> The payload should be of an appropriate *object* type.  Atomic object
> types will have their content stored as string or bytes [nb I use
> Python 3 terminology throughout].  Composite types (multipart/*) won't
> need string or bytes attributes AFAICS.

Yes, agreed.

> Start by implementing the application/octet-stream and
> text/plain;charset=utf-8 object types, of course.

Yes.  See my lament about using inheritance for this.

>> It does seem to make sense to think about headers as text header  
>> names
>> and text header values.
>
> I disagree.  IMHO, structured header types should have object values,
> and something like

While I agree, there's still a need for a higher level API that makes  
it easy to do the simple things.

> message['to'] = "Barry 'da FLUFL' Warsaw <barry at python.org>"
>
> should be smart enough to detect that it's a string and attempt to
> (flexibly) parse it into a fullname and a mailbox adding escapes, etc.
> Whether these should be structured objects or they can be strings or
> bytes, I'm not sure (probably bytes, not strings, though -- see next
> example).  OTOH
>
> message['to'] = b'''"Barry 'da.FLUFL' Warsaw" <barry at python.org>'''
>
> should assume that the client knows what they are doing, and should
> parse it strictly (and I mean "be a real bastard", eg, raise an
> exception on any non-ASCII octet), merely dividing it into fullname
> and mailbox, and caching the bytes for later insertion in a
> wire-format message.

I agree that the Message class needs to be strict.  A parser needs to  
be lenient; see the .defects attribute introduced in the current email  
package.  Oh, and this reminds me that we still haven't talked about  
idempotency.  That's an important principle in the current email  
package, but do we need to give up on that?
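
A minimal sketch of the strict-versus-lenient split being discussed,
assuming a hypothetical helper named coerce_address() (not part of the
email package).  A str value is tidied up before parsing, while a bytes
value is taken as-is and rejected on any non-ASCII octet.  The only real
library call used is email.utils.parseaddr().

```python
from email.utils import parseaddr


def coerce_address(value):
    """Return (fullname, mailbox) from a str or bytes header value.

    Hypothetical helper: str input is parsed leniently, bytes input
    strictly.
    """
    if isinstance(value, bytes):
        # Strict path: any non-ASCII octet raises UnicodeDecodeError.
        text = value.decode('ascii')
    elif isinstance(value, str):
        # Lenient path: collapse stray whitespace before parsing.
        text = ' '.join(value.split())
    else:
        raise TypeError('expected str or bytes, got %r' % type(value))
    fullname, mailbox = parseaddr(text)
    if not mailbox:
        raise ValueError('no mailbox found in %r' % (value,))
    return fullname, mailbox
```

Whether the strict path should raise or merely register a defect is
exactly the open question in this thread; the sketch picks raising only
to keep the contrast visible.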

>> In that case, I think you want the values as unicodes, and probably
>> the headers as unicodes containing only ASCII.  So your table would  
>> be
>> strings in both cases.  OTOH, maybe your application cares about the
>> raw underlying encoded data, in which case the header names are
>> probably still strings of ASCII-ish unicodes and the values are
>> bytes.  It's this distinction (and I think the competing use cases)
>> that make a true Python 3.x API for email more complicated.
>
> I don't see why you can't have the email API be specific, with
> message['to'] always returning a structured_header object (or maybe
> even more specifically an address_header object), and methods like
>
> message['to'].build_header_as_text()
>
> which returns
>
> """To: "Barry 'da.FLUFL' Warsaw" <barry at python.org>"""
>
> and
>
> message['to'].build_header_in_wire_format()
>
> which returns
>
> b"""To: "Barry 'da.FLUFL' Warsaw" <barry at python.org>"""
>
> Then have email.textview.Message and email.wireview.Message which
> provide a simple interface where message['to'] would invoke
> .build_header_as_text() and .build_header_in_wire_format()
> respectively.

This seems similar to Glyph's basic idea, but with a different spelling.
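
For concreteness, a rough sketch of the two-view idea, assuming one shared
structured model and two thin wrappers.  The class names here
(AddressHeader, MessageModel, TextView, WireView) are invented for
illustration; only the two build_header_*() method names come from the
quoted proposal.

```python
class AddressHeader:
    """Hypothetical structured value for a To-style header."""

    def __init__(self, name, fullname, mailbox):
        self.name = name
        self.fullname = fullname
        self.mailbox = mailbox

    def build_header_as_text(self):
        return '%s: "%s" <%s>' % (self.name, self.fullname, self.mailbox)

    def build_header_in_wire_format(self):
        # A real implementation would RFC 2047-encode non-ASCII display
        # names; this sketch only handles ASCII and raises otherwise.
        return self.build_header_as_text().encode('ascii')


class MessageModel:
    """Holds structured headers keyed by lower-cased name."""

    def __init__(self):
        self.headers = {}

    def add(self, header):
        self.headers[header.name.lower()] = header


class TextView:
    """message['to'] returns str."""

    def __init__(self, model):
        self._model = model

    def __getitem__(self, name):
        return self._model.headers[name.lower()].build_header_as_text()


class WireView:
    """message['to'] returns bytes."""

    def __init__(self, model):
        self._model = model

    def __getitem__(self, name):
        return self._model.headers[name.lower()].build_header_in_wire_format()
```

Both views read the same model, so the choice between text and bytes is
made once, when the view is created, rather than on every header access.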

>> Thinking about this stuff makes me nostalgic for the sloppy happy  
>> days
>> of Python 2.x
>
> Er, yeah.
>
> Nostalgic-for-the-BITNET-days-where-everything-was-Just-EBCDIC-ly  
> y'rs,

Can I have my uucp address back now?
-Barry

From v+python at g.nevcal.com  Fri Apr 10 20:00:54 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Fri, 10 Apr 2009 11:00:54 -0700
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <F40AE8EC-08CC-4634-AA82-264587552F47@python.org>
References: <loom.20090408T110540-221@post.gmane.org>	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>	<loom.20090409T043042-835@post.gmane.org>	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>	<20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>
	<F40AE8EC-08CC-4634-AA82-264587552F47@python.org>
Message-ID: <49DF8956.5050501@g.nevcal.com>

On approximately 4/10/2009 9:56 AM, came the following characters from 
the keyboard of Barry Warsaw:
> On Apr 10, 2009, at 1:19 AM, glyph at divmod.com wrote:
>> On 02:38 am, barry at python.org wrote:
>>> So, what I'm really asking is this.  Let's say you agree that there 
>>> are use cases for accessing a header value as either the raw encoded 
>>> bytes or the decoded unicode.  What should this return:
>>>
>>> >>> message['Subject']
>>>
>>> The raw bytes or the decoded unicode?
>>
>> My personal preference would be to just deprecate this API, and 
>> get rid of it, replacing it with a slightly more explicit one.
>>
>>   message.headers['Subject']
>>   message.bytes_headers['Subject']
>
> This is pretty darn clever Glyph.  Stop that! :)
>
> I'm not 100% sure I like the name .bytes_headers or that .headers 
> should be the decoded header (rather than have .headers return the 
> bytes thingie and say .decoded_headers return the decoded thingies), 
> but I do like the general approach.

If one name has to be longer than the other, it should be the bytes 
version.  Real user code is more likely to want to use the text version, 
and hopefully there will be more of that type of code than 
implementations using bytes.

Of course, one could use message.header and message.bythdr and they'd be 
the same length.


-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From barry at python.org  Fri Apr 10 20:55:23 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 10 Apr 2009 14:55:23 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <49DF8956.5050501@g.nevcal.com>
References: <loom.20090408T110540-221@post.gmane.org>	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>	<loom.20090409T043042-835@post.gmane.org>	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>	<20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>
	<F40AE8EC-08CC-4634-AA82-264587552F47@python.org>
	<49DF8956.5050501@g.nevcal.com>
Message-ID: <71E1EA03-6E24-4A28-A47A-4EA2D501CC6D@python.org>

On Apr 10, 2009, at 2:00 PM, Glenn Linderman wrote:

> If one name has to be longer than the other, it should be the bytes  
> version.  Real user code is more likely to want to use the text  
> version, and hopefully there will be more of that type of code than  
> implementations using bytes.

I'm not sure we know that yet, actually.  Nothing written for Python 2  
counts, and email is too broken in 3 for any sane person to be writing  
such code for Python 3.

> Of course, one could use message.header and message.bythdr and  
> they'd be the same length.

I was trying to figure out what a 'thdr' was that we'd want to index  
'by' it. :)

-Barry

From barry at python.org  Fri Apr 10 20:55:56 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 10 Apr 2009 14:55:56 -0400
Subject: [Email-SIG] [Python-Dev]   Dropping bytes "support" in json
In-Reply-To: <49DF8A95.4010700@voidspace.org.uk>
References: <loom.20090408T110540-221@post.gmane.org>	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>	<loom.20090409T043042-835@post.gmane.org>	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>	<20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>	<F40AE8EC-08CC-4634-AA82-264587552F47@python.org>
	<49DF8956.5050501@g.nevcal.com> <49DF8A95.4010700@voidspace.org.uk>
Message-ID: <BBE1C6FA-6DA7-4E61-ABB7-7276AA998872@python.org>

On Apr 10, 2009, at 2:06 PM, Michael Foord wrote:

> Shouldn't headers always be text?

/me weeps

From stephen at xemacs.org  Fri Apr 10 21:04:22 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sat, 11 Apr 2009 04:04:22 +0900
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <67879F1D-B386-4B9B-8203-86DB977BD7FF@python.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<grkodk$j4p$1@ger.gmane.org>
	<1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org>
	<87zlepf5hf.fsf@xemacs.org>
	<67879F1D-B386-4B9B-8203-86DB977BD7FF@python.org>
Message-ID: <87prfkfhzd.fsf@xemacs.org>

Shouldn't this thread move lock stock and .signature to email-sig?

Barry Warsaw writes:

 > >> It does seem to make sense to think about headers as text header
 > >> names and text header values.
 > >
 > > I disagree.  IMHO, structured header types should have object values,
 > > and something like
 > 
 > While I agree, there's still a need for a higher level API that makes  
 > it easy to do the simple things.

Sure.  I'm suggesting that the way to determine whether something is
simple or not is by whether it falls out naturally from correct
structure.  Ie, no operations that only a Cirque du Soleil juggler can
perform are allowed.

 > I agree that the Message class needs to be strict.  A parser needs to  
 > be lenient;

Not always.  The Postel Principle only applies to stuph coming in off
the wire.  But we're *also* going to be parsing pseudo-email
components that are being handed to us by applications (eg, the
perennial control-character-in-the-unremovable-address Mailman bug).
Our parser should Just Say No to that crap.

 > see the .defects attribute introduced in the current email  
 > package.  Oh, and this reminds me that we still haven't talked about  
 > idempotency.  That's an important principle in the current email  
 > package, but do we need to give up on that?

"Idempotency"?  I'm not sure what that means in the context of the
email package ... multiplication by zero?<wink>  Do you mean that
.parse().to_wire() should be idempotent?  Yes, I think that's a good
idea, and it shouldn't be too hard to implement by (optionally?)
caching the whole original message or individual components (headers
with all whitespace including folding cached verbatim, etc).  I think
caching has to be done, since stuff like "did the original fold with a
leading tab or a leading space, and at what column" and so on seems
kind of pointless to encode as attributes on Header objects.
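
A minimal sketch of the caching approach, assuming a hypothetical
CachedHeader class that keeps the verbatim wire bytes next to the parsed
value and falls back to regenerating them only after the value changes.
None of these names exist in the email package, and the parsing is
deliberately naive (ASCII only, header block only).

```python
class CachedHeader:
    """Hypothetical header that remembers its original wire bytes."""

    def __init__(self, name, value, raw=None):
        self.name = name
        self._value = value
        self._raw = raw  # verbatim bytes including folding, or None

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new):
        self._value = new
        self._raw = None  # the cache is stale once the value changes

    def to_wire(self):
        if self._raw is not None:
            return self._raw
        return ('%s: %s\r\n' % (self.name, self._value)).encode('ascii')


def parse_headers(data):
    """Split a header block (bytes) into CachedHeader objects."""
    chunks = []
    for line in data.splitlines(keepends=True):
        if line[:1] in (b' ', b'\t') and chunks:
            chunks[-1] += line  # continuation of the previous header
        else:
            chunks.append(line)
    headers = []
    for raw in chunks:
        name, _, rest = raw.partition(b':')
        # Unfold for the logical value: drop CRLF, keep the whitespace.
        value = rest.replace(b'\r\n', b'').strip()
        headers.append(
            CachedHeader(name.decode('ascii'), value.decode('ascii'), raw))
    return headers


def to_wire(headers):
    return b''.join(h.to_wire() for h in headers)
```

Untouched headers come back byte for byte, fold positions included; only
a header whose value was reassigned is regenerated.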

[Description of MessageTextView and MessageWireView elided.]

 > This seems similar to Glyph's basic idea, but with a different spelling.

Yes.  I don't much care which way it's done, and Glyph's style of
spelling is more explicit.  But I was thinking in terms of the number
of people who are surely going to sing "Mama don' 'low no Unicodes
roun' here" and squeal "codec WTF?! outta mah face, man!"

From stephen at xemacs.org  Fri Apr 10 21:06:59 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sat, 11 Apr 2009 04:06:59 +0900
Subject: [Email-SIG] [Python-Dev] the email module, text,
	and bytes (was Re: Dropping bytes "support" in json)
In-Reply-To: <92023.1239381344@parc.com>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<grkodk$j4p$1@ger.gmane.org>
	<1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org>
	<20090410031151.12555.724184150.divmod.xquotient.7482@weber.divmod.com>
	<ACC56383-7F1B-4CB0-908F-E75E1390AE51@python.org>
	<92023.1239381344@parc.com>
Message-ID: <87ocv4fhv0.fsf@xemacs.org>

Bill Janssen writes:
 > Barry Warsaw <barry at python.org> wrote:
 > 
 > > In that case, we really need the
 > > bytes-in-bytes-out-bytes-in-the-chewy-
 > > center API first, and build things on top of that.
 > 
 > Yep.

Uh, I hate to rain on a parade, but isn't that how we arrived at the
*current* email package?

From barry at python.org  Fri Apr 10 21:04:01 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 10 Apr 2009 15:04:01 -0400
Subject: [Email-SIG] [Python-Dev] the email module, text,
	and bytes (was Re: Dropping bytes "support" in json)
In-Reply-To: <87ocv4fhv0.fsf@xemacs.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<grkodk$j4p$1@ger.gmane.org>
	<1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org>
	<20090410031151.12555.724184150.divmod.xquotient.7482@weber.divmod.com>
	<ACC56383-7F1B-4CB0-908F-E75E1390AE51@python.org>
	<92023.1239381344@parc.com> <87ocv4fhv0.fsf@xemacs.org>
Message-ID: <F87C1713-27A1-4D2B-BA42-1AC70B77073C@python.org>

On Apr 10, 2009, at 3:06 PM, Stephen J. Turnbull wrote:

> Bill Janssen writes:
>> Barry Warsaw <barry at python.org> wrote:
>>
>>> In that case, we really need the
>>> bytes-in-bytes-out-bytes-in-the-chewy-
>>> center API first, and build things on top of that.
>>
>> Yep.
>
> Uh, I hate to rain on a parade, but isn't that how we arrived at the
> *current* email package?

Not really.  We got here because <ahem>we</ahem> were too damn sloppy  
about the distinction.

I'm going to remove python-dev from subsequent follow ups.  Please  
join us at email-sig for further discussion.

Barry

From mark at msapiro.net  Fri Apr 10 21:34:41 2009
From: mark at msapiro.net (Mark Sapiro)
Date: Fri, 10 Apr 2009 12:34:41 -0700
Subject: [Email-SIG] Dropping bytes "support" in json
In-Reply-To: <87prfkfhzd.fsf@xemacs.org>
Message-ID: <PC187020090410123441067149807088@msapiro>

Stephen J. Turnbull wrote:

>Shouldn't this thread move lock stock and .signature to email-sig?

I'm doing my part :)


>"Idempotency"?  I'm not sure what that means in the context of the
>email package ... multiplication by zero?<wink>  Do you mean that
>.parse().to_wire() should be idempotent?  Yes, I think that's a good
>idea, and it shouldn't be too hard to implement by (optionally?)
>caching the whole original message or individual components (headers
>with all whitespace including folding cached verbatim, etc).  I think
>caching has to be done, since stuff like "did the original fold with a
>leading tab or a leading space, and at what column" and so on seems
>kind of pointless to encode as attributes on Header objects.


My response here is probably OT, but RFC 822 is the only RFC that talks
about folding by *inserting* whitespace. Both RFC 2822 and RFC 5322
say folding is done by inserting <CRLF> ahead of *existing* whitespace
and unfolding is done by removing the <CRLF> (only). Thus, the
question of whether folding was with <tab> or <space> should not arise.

Of course, in terms of trying to reconstruct the original on_the_wire
message exactly, the question of where the folding occurred is still
relevant. But if we're doing the right thing, the question of what
character should follow the <CRLF> is not.
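
To make the RFC 2822/5322 rule concrete, here is a small sketch in which
folding only inserts CRLF ahead of whitespace that is already there, and
unfolding only removes that CRLF.  The function names and the simple
greedy strategy are assumptions for illustration, not what the email
package does.

```python
def unfold(raw):
    """Remove each CRLF that is immediately followed by space or tab."""
    return raw.replace('\r\n ', ' ').replace('\r\n\t', '\t')


def fold(line, maxlen=78):
    """Insert CRLF before existing whitespace; never add whitespace."""
    out = []
    while len(line) > maxlen:
        # Last space or tab that keeps this segment within maxlen.
        # Searching from index 1 avoids cutting at the leading
        # whitespace of a continuation line.
        cut = max(line.rfind(' ', 1, maxlen + 1),
                  line.rfind('\t', 1, maxlen + 1))
        if cut <= 0:
            break  # no whitespace to fold at; leave the line long
        out.append(line[:cut])
        line = line[cut:]
    out.append(line)
    return '\r\n'.join(out)
```

Because fold() never invents a character, unfold(fold(x)) returns x for
any x that contains no CRLF of its own, whether the original whitespace
was a space or a tab.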

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


From barry at python.org  Fri Apr 10 21:39:37 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 10 Apr 2009 15:39:37 -0400
Subject: [Email-SIG] Dropping bytes "support" in json
In-Reply-To: <PC187020090410123441067149807088@msapiro>
References: <PC187020090410123441067149807088@msapiro>
Message-ID: <E135AFD0-48D5-4332-925A-230B16C65F06@python.org>

On Apr 10, 2009, at 3:34 PM, Mark Sapiro wrote:

> My response here is probably OT, but RFC 822 is the only RFC that  
> talks
> about folding by *inserting* whitespace. Both RFC 2822 and RFC 5322
> say folding is done by inserting <CRLF> ahead of *existing* whitespace
> and unfolding is done by removing the <CRLF> (only). Thus, the
> question of whether folding was with <tab> or <space> should not  
> arise.
>
> Of course, in terms of trying to reconstruct the original on_the_wire
> message exactly, the question of where the folding occurred is still
> relevant. But if we're doing the right thing, the question of what
> character should follow the <CRLF> is not.

+1

I /think/ the email package in Python 3.0 DTRT here, or well, at least  
does better than the one in 2.6.

Barry

From barry at python.org  Fri Apr 10 23:57:17 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 10 Apr 2009 17:57:17 -0400
Subject: [Email-SIG] Append behavior of __setitem__
Message-ID: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org>

So I'm just starting to read RFC 5322 and I'm starting by skimming  
over Appendix A (differences between RFC 5322 and 2822).  I see this:

26.  No multiple occurrences of fields (except resent and received).*

Which I find very interesting, and possibly relevant to the discussion  
about changing the semantics of Message.__setitem__() to not append to  
the list of headers, as well as some of the other semantics of message  
headers (e.g. get_all()).
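
For reference, the append behavior in question looks like this with the
existing email.message.Message class.  The set_unique() helper is
hypothetical; it only illustrates what replace semantics could look like
on top of the current API.

```python
from email.message import Message

msg = Message()
msg['Subject'] = 'first'
msg['Subject'] = 'second'   # appends a second header, does not replace


def set_unique(message, name, value):
    """Hypothetical replace-style setter: at most one occurrence."""
    del message[name]       # removes all occurrences; no-op if missing
    message[name] = value


appended = msg.get_all('Subject')   # both values are present here
set_unique(msg, 'Subject', 'third')
replaced = msg.get_all('Subject')   # only the new value remains
```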

thinking-out-loud-ly y'rs,
-Barry

From tonynelson at georgeanelson.com  Sat Apr 11 00:39:02 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Fri, 10 Apr 2009 18:39:02 -0400
Subject: [Email-SIG] Append behavior of __setitem__
In-Reply-To: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org>
References: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org>
Message-ID: <p04330103c6057aadf5a1@[192.168.123.162]>

At 17:57 -0400 04/10/2009, Barry Warsaw wrote:
 ...
>So I'm just starting to read RFC 5322 and I'm starting by skimming
>over Appendix A (differences between RFC 5322 and 2822).

Oh, bother!

>I see this:
>
>26.  No multiple occurrences of fields (except resent and received).*
>
>Which I find very interesting, and possibly relevant to the discussion
>about changing the semantics of Message.__setitem__() to not append to
>the list of headers, as well as some of the other semantics of message
>headers (e.g. get_all()).

Thank you for mentioning this.  Darn it.
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From tonynelson at georgeanelson.com  Sat Apr 11 00:46:44 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Fri, 10 Apr 2009 18:46:44 -0400
Subject: [Email-SIG] Append behavior of __setitem__
In-Reply-To: <p04330103c6057aadf5a1@[192.168.123.162]>
References: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org>
	<p04330103c6057aadf5a1@[192.168.123.162]>
Message-ID: <p04330104c6057b642092@[192.168.123.162]>

(Fired too fast.)

At 18:39 -0400 04/10/2009, Tony Nelson wrote:
>At 17:57 -0400 04/10/2009, Barry Warsaw wrote:
> ...
>>So I'm just starting to read RFC 5322 and I'm starting by skimming
>>over Appendix A (differences between RFC 5322 and 2822).
>
>Oh, bother!

Appendix B?

 ...
>Thank you for mentioning this.  Darn it.

I note that there is also RFC 5321, "Simple Mail Transfer Protocol", which
obsoletes RFC 2821 and updates RFC 1123, "Requirements for Internet Hosts --
Application and Support".
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From barry at python.org  Sat Apr 11 00:55:34 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 10 Apr 2009 18:55:34 -0400
Subject: [Email-SIG] Append behavior of __setitem__
In-Reply-To: <p04330104c6057b642092@[192.168.123.162]>
References: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org>
	<p04330103c6057aadf5a1@[192.168.123.162]>
	<p04330104c6057b642092@[192.168.123.162]>
Message-ID: <D492DC43-25F9-4E49-A1A2-D3C3C4A11548@python.org>

On Apr 10, 2009, at 6:46 PM, Tony Nelson wrote:

> (Fired too fast.)
>
> At 18:39 -0400 04/10/2009, Tony Nelson wrote:
>> At 17:57 -0400 04/10/2009, Barry Warsaw wrote:
>> ...
>>> So I'm just starting to read RFC 5322 and I'm starting by skimming
>>> over Appendix A (differences between RFC 5322 and 2822).
>>
>> Oh, bother!
>
> Appendix B?

Oops, yep!

> ...
>> Thank you for mentioning this.  Darn it.
>
> I note that there is also RFC 5321, "Simple Mail Transfer Protocol",  
> which
> obsoletes RFC 2821 and updates RFC 1123, "Requirements for Internet  
> Hosts -- Application and Support".

Yeah.  We'll let the smtplib.py people worry about that one <wink>.
-Barry

From stephen at xemacs.org  Sat Apr 11 09:43:56 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sat, 11 Apr 2009 16:43:56 +0900
Subject: [Email-SIG]  Append behavior of __setitem__
In-Reply-To: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org>
References: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org>
Message-ID: <87ocv3tz2b.fsf@xemacs.org>

Barry Warsaw writes:

 > So I'm just starting to read RFC 5322 and I'm starting by skimming  
 > over Appendix A (differences between RFC 5322 and 2822).

I know Barry's a big supporter of the Postel Principle.  As a
guideline[1], how far back should we be lenient?  RFC 822 (no leading "2"
;-)?

Footnotes: 
[1]  Presumably over time we'll accrete definitely non-conforming
practices that we need to accept and do something sane with (eg, we
can't just raise ArmageddonException because we get a header with
8-bit characters in it).  But I think we also should have a plan for
formerly acceptable syntax that has been restricted in more recent
RFCs, etc.


From tonynelson at georgeanelson.com  Sat Apr 11 23:17:13 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Sat, 11 Apr 2009 17:17:13 -0400
Subject: [Email-SIG] Append behavior of __setitem__
In-Reply-To: <87ocv3tz2b.fsf@xemacs.org>
References: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org>
	<87ocv3tz2b.fsf@xemacs.org>
Message-ID: <p04330103c6068a93c988@[192.168.123.162]>

At 16:43 +0900 04/11/2009, Stephen J. Turnbull wrote:
>Barry Warsaw writes:
>
> > So I'm just starting to read RFC 5322 and I'm starting by skimming
> > over Appendix A (differences between RFC 5322 and 2822).
>
>I know Barry's a big supporter of the Postel Principle.  As a
>guideline[1], how far back should we be lenient?  RFC 822 (no leading "2"
>;-)?

Sure.  The header field should be parsed, if possible, and possibly add a
defect to the message.  For some header fields, the data should be added to
the previous Header instance; for others, an extra Header instance might
need to be created.

Message /generation/ should comply with what was in RFC 2822, where this
requirement was added, and also the new RFC 5322.


>Footnotes:
>[1]  Presumably over time we'll accrete definitely non-conforming
>practices that we need to accept and do something sane with (eg, we
>can't just raise ArmageddonException because we get a header with
>8-bit characters in it).  But I think we also should have a plan for
>formerly acceptable syntax that has been restricted in more recent
>RFCs, etc.

Any email parser must cope with both obsolete-* syntax and common bad
practices.  Python's already does so in various places.
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From barry at python.org  Mon Apr 13 16:04:51 2009
From: barry at python.org (Barry Warsaw)
Date: Mon, 13 Apr 2009 10:04:51 -0400
Subject: [Email-SIG] Append behavior of __setitem__
In-Reply-To: <87ocv3tz2b.fsf@xemacs.org>
References: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org>
	<87ocv3tz2b.fsf@xemacs.org>
Message-ID: <BFDA207C-E488-4AEB-A4AB-64935981A76A@python.org>

On Apr 11, 2009, at 3:43 AM, Stephen J. Turnbull wrote:

> Barry Warsaw writes:
>
>> So I'm just starting to read RFC 5322 and I'm starting by skimming
>> over Appendix A (differences between RFC 5322 and 2822).
>
> I know Barry's a big supporter of the Postel Principle.  As a
> guideline[1], how far back should we be lenient?  RFC 822 (no  
> leading "2"
> ;-)?
>
> Footnotes:
> [1]  Presumably over time we'll accrete definitely non-conforming
> practices that we need to accept and do something sane with (eg, we
> can't just raise ArmageddonException because we get a header with
> 8-bit characters in it).  But I think we also should have a plan for
> formerly acceptable syntax that has been restricted in more recent
> RFCs, etc.

We could potentially have strict and lenient modes, or possibly RFC  
822, 2822, 5322 modes.  OTOH, I feel very strongly that the parser  
should accept just about any stream of bytes without throwing an  
exception.  Thinking about an application like Mailman, it's rather  
inconvenient for the parsing phase to throw any exception.  Much  
better is to register defects and then decide the disposition of  
messages based on the defect list.

OTOH, when creating messages from whole cloth, I think it's okay to  
raise an exception.  You just have to be careful because often the same  
APIs are used by the parser.
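
A short sketch of the "register defects, decide later" pattern using the
existing parser and its .defects attribute.  The triage() function and the
'hold'/'accept' dispositions are made up for illustration; a real
application such as Mailman would apply its own policy.

```python
from email import errors, message_from_string


def collect_defects(msg):
    """Gather defects from the message and all of its subparts."""
    found = []
    for part in msg.walk():
        found.extend(part.defects)
    return found


def triage(text):
    # Parsing does not raise on malformed input; problems are recorded
    # as defect objects on the affected message parts.
    msg = message_from_string(text)
    defects = collect_defects(msg)
    disposition = 'hold' if defects else 'accept'
    return disposition, defects


# Claims to be multipart but gives no boundary parameter.
broken = 'Content-Type: multipart/mixed\n\nno boundary was given\n'
clean = 'Subject: hi\n\nbody\n'
```

The parse step stays exception-free, and the decision about what to do
with a damaged message is made in one place, after the fact.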

Barry

From barry at python.org  Mon Apr 13 16:05:56 2009
From: barry at python.org (Barry Warsaw)
Date: Mon, 13 Apr 2009 10:05:56 -0400
Subject: [Email-SIG] Append behavior of __setitem__
In-Reply-To: <p04330103c6068a93c988@[192.168.123.162]>
References: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org>
	<87ocv3tz2b.fsf@xemacs.org>
	<p04330103c6068a93c988@[192.168.123.162]>
Message-ID: <24040806-9EE5-421E-A699-BEDB627CF8D1@python.org>

On Apr 11, 2009, at 5:17 PM, Tony Nelson wrote:

> Sure.  The header field should be parsed, if possible, and possibly  
> add a
> defect to the message.  For some header fields, the data should be  
> added to
> the previous Header instance; for others, an extra Header instance  
> might
> need to be created.

I don't follow this part.

> Message /generation/ should comply with what was in RFC 2822, where  
> this
> requirement was added, and also the new RFC 5322.

+1

-Barry

From barry at python.org  Mon Apr 13 16:11:09 2009
From: barry at python.org (Barry Warsaw)
Date: Mon, 13 Apr 2009 10:11:09 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <A286FA62-B1F0-4DB4-BC38-9D1E0F85A92A@fuhm.net>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<A286FA62-B1F0-4DB4-BC38-9D1E0F85A92A@fuhm.net>
Message-ID: <CBA855B3-2806-469D-A4A6-8AF279607A52@python.org>

On Apr 10, 2009, at 11:08 AM, James Y Knight wrote:

> Until you write a parser for every header, you simply cannot decode  
> to unicode. The only sane choices are:
> 1) raw bytes
> 2) parsed structured data

The email package does not need a parser for every header, but it  
should provide a framework that applications (or third party  
libraries) can use to extend the built-in header parsers.  A bare  
minimum for functionality requires a Content-Type parser.  I think the  
email package should also include an address header (Originator,  
Destination) parser, and a Message-ID header parser.  Possibly  
others.  The default would probably be some unstructured parser for  
headers like Subject.
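
One possible shape for such a framework, sketched under the assumption of
a simple registry keyed by lower-cased header name with an unstructured
fallback.  Every name below except email.utils.getaddresses() is
hypothetical, and a Content-Type parser is left out to keep the sketch
short.

```python
from email.utils import getaddresses

_HEADER_PARSERS = {}


def register_header_parser(*names):
    """Decorator: register a parser for one or more header names."""
    def decorator(func):
        for name in names:
            _HEADER_PARSERS[name.lower()] = func
        return func
    return decorator


def parse_unstructured(value):
    """Default parser for headers such as Subject."""
    return value.strip()


@register_header_parser('From', 'Sender', 'Reply-To', 'To', 'Cc', 'Bcc')
def parse_address_list(value):
    # Returns a list of (realname, address) pairs.
    return getaddresses([value])


@register_header_parser('Message-ID')
def parse_message_id(value):
    # Naive: strips the angle brackets and nothing more.
    return value.strip().strip('<>')


def parse_header(name, value):
    parser = _HEADER_PARSERS.get(name.lower(), parse_unstructured)
    return parser(value)
```

Applications or third party libraries would call
register_header_parser() for their own headers; anything unregistered
falls through to the unstructured parser.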

-Barry

From barry at python.org  Mon Apr 13 16:14:04 2009
From: barry at python.org (Barry Warsaw)
Date: Mon, 13 Apr 2009 10:14:04 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <49DF8956.5050501@g.nevcal.com>
References: <loom.20090408T110540-221@post.gmane.org>	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>	<loom.20090409T043042-835@post.gmane.org>	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>	<20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>
	<F40AE8EC-08CC-4634-AA82-264587552F47@python.org>
	<49DF8956.5050501@g.nevcal.com>
Message-ID: <7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org>

On Apr 10, 2009, at 2:00 PM, Glenn Linderman wrote:

> If one name has to be longer than the other, it should be the bytes  
> version.  Real user code is more likely to want to use the text  
> version, and hopefully there will be more of that type of code than  
> implementations using bytes.
>
> Of course, one could use message.header and message.bythdr and  
> they'd be the same length.

Actually, thinking about this over the weekend, it's much better for  
message['subject'] to return a Header instance in all cases.  Use  
bytes(header) to get the raw bytes.

A good API for getting the parsed and decoded header values needs to  
take into account that it won't always be a string.  For unstructured  
headers like Subject, str(header) would work just fine.  For an  
Originator or Destination address, what does str(header) return?  And  
what would be the API for getting the set of realname/addresses out of  
the header?
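
A minimal sketch of the str()/bytes() split for an unstructured header
such as Subject.  The class name UnstructuredHeader is hypothetical and
is not part of the email package; only decode_header() and make_header()
from email.header are real.  It assumes the raw value is ASCII on the
wire, with any non-ASCII text carried in RFC 2047 encoded-words.

```python
from email.header import decode_header, make_header


class UnstructuredHeader:
    """Hypothetical header value object; not part of the email package."""

    def __init__(self, raw: bytes):
        # Keep the wire bytes exactly as they were received.
        self._raw = raw

    def __bytes__(self) -> bytes:
        # bytes(header) hands back the raw bytes untouched.
        return self._raw

    def __str__(self) -> str:
        # str(header) decodes any RFC 2047 encoded-words for presentation.
        text = self._raw.decode("ascii", errors="replace")
        return str(make_header(decode_header(text)))


subject = UnstructuredHeader(b"=?utf-8?q?Caf=C3=A9?= menu")
print(str(subject))    # decoded text for people
print(bytes(subject))  # raw bytes for the wire
```

What str() should return for a structured header is exactly the open
question above, so this sketch does not attempt it.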

-Barry


From barry at python.org  Mon Apr 13 16:18:12 2009
From: barry at python.org (Barry Warsaw)
Date: Mon, 13 Apr 2009 10:18:12 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <87prfkfhzd.fsf@xemacs.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<grkodk$j4p$1@ger.gmane.org>
	<1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org>
	<87zlepf5hf.fsf@xemacs.org>
	<67879F1D-B386-4B9B-8203-86DB977BD7FF@python.org>
	<87prfkfhzd.fsf@xemacs.org>
Message-ID: <E7678DED-1813-4560-8F5D-0C96046C7F9B@python.org>

On Apr 10, 2009, at 3:04 PM, Stephen J. Turnbull wrote:

> Shouldn't this thread move lock stock and .signature to email-sig?

Yep.  I'll try to be more conscientious about removing python-dev from  
the CC.

> "Idempotency"?  I'm not sure what that means in the context of the
> email package ... multiplication by zero?<wink>  Do you mean that
> .parse().to_wire() should be idempotent?  Yes, I think that's a good
> idea, and it shouldn't be too hard to implement by (optionally?)
> caching the whole original message or individual components (headers
> with all whitespace including folding cached verbatim, etc).  I think
> caching has to be done, since stuff like "did the original fold with a
> leading tab or a leading space, and at what column" and so on seems
> kind of pointless to encode as attributes on Header objects.

I tend to agree.  I'm also happy if there's a way to tell, say, the  
parser that an application doesn't care about that.  All that extra  
caching will have a memory overhead that you should only pay for if  
you care.

-Barry


From barry at python.org  Mon Apr 13 16:20:19 2009
From: barry at python.org (Barry Warsaw)
Date: Mon, 13 Apr 2009 10:20:19 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <87myaofh5q.fsf@xemacs.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<1239382031.8682.11.camel@haku> <87myaofh5q.fsf@xemacs.org>
Message-ID: <203DDBFE-1B15-454A-95FC-61D863D10B97@python.org>

On Apr 10, 2009, at 3:22 PM, Stephen J. Turnbull wrote:

> Robert Brewer writes:
>
>> Syntactically, there's no sense in providing:
>>
>>    Message.set_header('Subject', 'Some text', encoding='utf-16')
>>
>> ...since you could more clearly write the same as:
>>
>>    Message.set_header('Subject', 'Some text'.encode('utf-16'))
>
> Which you now must *parse* and guess the encoding to determine how to
> RFC-2047-encode the binary mush.  I think the encoding parameter is
> necessary here.

Agreed!  In fact, it's redundant to explicitly encode the string.  So  
the first spelling is preferred.

>> But it would be far easier to do all the encoding at once in an
>> output() or serialize() method. Do different headers need different
>> encodings?
>
> You can have multiple encodings within a single header (and a naïve
> algorithm might very well encode "The price of Gödel-Escher-Bach is
> €25" as "The price of =?ISO-8859-1?Q?G=F6del-Escher-Bach?= is
> =?ISO-8859-15?Q?=A425?=").

Isn't email just wonderful?  Please, spam and Facebook, kill it off  
once and for all, won't you?
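
For anyone who wants to see the mixed-charset case concretely, here is
a small sketch using the existing email.header module (Header,
decode_header and make_header are real; the variable names are just for
illustration).  It decodes the header quoted above, then builds a
similar one chunk by chunk, each chunk with its own charset.

```python
from email.header import Header, decode_header, make_header

wire = ("The price of =?ISO-8859-1?Q?G=F6del-Escher-Bach?= is "
        "=?ISO-8859-15?Q?=A425?=")

# Decoding: each part comes back paired with the charset it was encoded in.
parts = decode_header(wire)
charsets = [charset for _, charset in parts if charset]
decoded = str(make_header(parts))
print(charsets)
print(decoded)

# Encoding: append() accepts a charset per chunk, so a single header
# can carry several encodings.
mixed = Header()
mixed.append("The price of", "us-ascii")
mixed.append("G\u00f6del-Escher-Bach", "iso-8859-1")
mixed.append("is", "us-ascii")
mixed.append("\u20ac25", "iso-8859-15")
encoded = mixed.encode()
print(encoded)
```

This is why a single encoding chosen once at serialization time is not
enough: the charset has to travel with each chunk.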

-Barry


From barry at python.org  Mon Apr 13 16:28:32 2009
From: barry at python.org (Barry Warsaw)
Date: Mon, 13 Apr 2009 10:28:32 -0400
Subject: [Email-SIG] [Python-Dev] headers api for email package
In-Reply-To: <49E08F8C.5030205@simplistix.co.uk>
References: <loom.20090408T110540-221@post.gmane.org>	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>	<loom.20090409T043042-835@post.gmane.org>	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<49E08F8C.5030205@simplistix.co.uk>
Message-ID: <FD0E133D-4944-4FE6-B3FD-865947F48E2F@python.org>

On Apr 11, 2009, at 8:39 AM, Chris Withers wrote:

> Barry Warsaw wrote:
>> >>> message['Subject']
>> The raw bytes or the decoded unicode?
>
> A header object.

Yep.  You got there before I did. :)

>> Okay, so you've picked one.  Now how do you spell the other way?
>
> str(message['Subject'])

Yes for unstructured headers like Subject.  For structured headers...  
hmm.

> bytes(message['Subject'])

Yes.

>> Now, setting headers.  Sometimes you have some unicode thing and  
>> sometimes you have some bytes.  You need to end up with bytes in  
>> the ASCII range and you'd like to leave the header value unencoded  
>> if so.  But in both cases, you might have bytes or characters  
>> outside that range, so you need an explicit encoding, defaulting to  
>> utf-8 probably.
>> >>> Message.set_header('Subject', 'Some text', encoding='utf-8')
>> >>> Message.set_header('Subject', b'Some bytes')
>
> Where you just want "a damned valid email and stop making my life  
> hard!":
>
> Message['Subject']='Some text'

Yes.  In which case I propose we guess the encoding as 1) ascii, 2)  
utf-8, 3) wtf?

> Where you care about what encoding is used:
>
> Message['Subject']=Header('Some text',encoding='utf-8')

Yes.

> If you have bytes, for whatever reason:
>
> Message['Subject']=b'some bytes'.decode('utf-8')
>
> ...because only you know what encoding those bytes use!

So you're saying that __setitem__() should not accept raw bytes?

-Barry


From rdmurray at bitdance.com  Mon Apr 13 17:49:35 2009
From: rdmurray at bitdance.com (R. David Murray)
Date: Mon, 13 Apr 2009 11:49:35 -0400 (EDT)
Subject: [Email-SIG] [Python-Dev] headers api for email package
In-Reply-To: <FD0E133D-4944-4FE6-B3FD-865947F48E2F@python.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<49E08F8C.5030205@simplistix.co.uk>
	<FD0E133D-4944-4FE6-B3FD-865947F48E2F@python.org>
Message-ID: <Pine.LNX.4.64.0904131117510.26362@kimball.webabinitio.net>

On Mon, 13 Apr 2009 at 10:28, Barry Warsaw wrote:
> On Apr 11, 2009, at 8:39 AM, Chris Withers wrote:
>
>> Barry Warsaw wrote:
>> > > > >  message['Subject']
>> > The raw bytes or the decoded unicode?
>> 
>> A header object.
>
> Yep.  You got there before I did. :)

+1

>> > Okay, so you've picked one.  Now how do you spell the other way?
>> 
>> str(message['Subject'])
>
> Yes for unstructured headers like Subject.  For structured headers... hmm.

Some "reasonable" printable interpretation that has no semantic meaning?

>> bytes(message['Subject'])
>
> Yes.
>
>> > Now, setting headers.  Sometimes you have some unicode thing and 
>> > sometimes you have some bytes.  You need to end up with bytes in the 
>> > ASCII range and you'd like to leave the header value unencoded if so. 
>> > But in both cases, you might have bytes or characters outside that range, 
>> > so you need an explicit encoding, defaulting to utf-8 probably.
>> > > > >  Message.set_header('Subject', 'Some text', encoding='utf-8')
>> > > > >  Message.set_header('Subject', b'Some bytes')
>> 
>> Where you just want "a damned valid email and stop making my life hard!":
>> 
>> Message['Subject']='Some text'
>
> Yes.  In which case I propose we guess the encoding as 1) ascii, 2) utf-8, 3) 
> wtf?

Given some usenet postings I've just dealt with, (3) appears to
sometimes be spelled 'x-unknown' and sometimes (in the most recent case)
'unknown-8bit'.  A quick google turns up a hit on RFC1428 for the latter,
and a bunch of trouble tickets for the former...so I think 'wtf' is
correctly spelled 'unknown-8bit'.

However, it's not supposed to be used by mail composers, who are
expected to know the encoding.  It's for mail gateways that are
transforming something and don't know the encoding.  I'm not
sure what this means for the email module, which certainly
will be used in mail gateways... maybe it's the responsibility
of the application code to explicitly say 'unknown encoding'?

>> Where you care about what encoding is used:
>> 
>> Message['Subject']=Header('Some text',encoding='utf-8')
>
> Yes.
>
>> If you have bytes, for whatever reason:
>> 
>> Message['Subject']=b'some bytes'.decode('utf-8')
>> 
>> ...because only you know what encoding those bytes use!
>
> So you're saying that __setitem__() should not accept raw bytes?

If I'm understanding things correctly, if it did accept bytes the
person using that interface would need to do whatever encoding (eg:
encoded-word) was needed, so the interface should check that the byte
string is 8 bit clean.  But having some sort of 'setraw' method on Header
might be better for that case.

--David

From stephen at xemacs.org  Mon Apr 13 19:15:20 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 14 Apr 2009 02:15:20 +0900
Subject: [Email-SIG] [Python-Dev] headers api for email package
In-Reply-To: <FD0E133D-4944-4FE6-B3FD-865947F48E2F@python.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<49E08F8C.5030205@simplistix.co.uk>
	<FD0E133D-4944-4FE6-B3FD-865947F48E2F@python.org>
Message-ID: <873accv5jr.fsf@xemacs.org>

Barry Warsaw writes:
 > On Apr 11, 2009, at 8:39 AM, Chris Withers wrote:
 > 
 > > Barry Warsaw wrote:
 > >> >>> message['Subject']
 > >> The raw bytes or the decoded unicode?
 > >
 > > A header object.
 > 
 > Yep.  You got there before I did. :)
 > 
 > >> Okay, so you've picked one.  Now how do you spell the other way?
 > >
 > > str(message['Subject'])
 > 
 > Yes for unstructured headers like Subject.  For structured headers...  
 > hmm.

Well, suppose we get really radical here.  *People* see email as
(rich-)text.  So ... message['Subject'] returns an object, partly to
be consistent with more complex headers' APIs, but partly to remind us
that nothing in email is as simple as it seems.  Now,
str(message['Subject']) is really for presentation to the user, right?
OK, so let's make it a presentation function!  Decode the MIME-words,
optionally unfold folded lines, optionally compress spaces, etc.  This
by default returns the subject field as a single, possibly quite long,
line.  Then a higher-level API can rewrap it, add fonts etc, for fancy
presentation.  This also suggests that we don't want the field tag (ie,
"Subject") to be part of this value.

Of course a *really* smart higher-level API would access structured
headers based on their structure, not on the one-size-fits-all str()
conversion.

Then MTAs see email as a string of octets.  So guess what:

 > > bytes(message['Subject'])

gives wire format.  Yow!  I think I'm just joking.  Right?

 > >> Now, setting headers.  Sometimes you have some unicode thing and  
 > >> sometimes you have some bytes.  You need to end up with bytes in  
 > >> the ASCII range and you'd like to leave the header value unencoded  
 > >> if so.  But in both cases, you might have bytes or characters  
 > >> outside that range, so you need an explicit encoding, defaulting to  
 > >> utf-8 probably.
 > >> >>> Message.set_header('Subject', 'Some text', encoding='utf-8')
 > >> >>> Message.set_header('Subject', b'Some bytes')
 > >
 > > Where you just want "a damned valid email and stop making my life  
 > > hard!":

-1  I mean, yeah, Brother, I feel your pain but it just isn't that
easy.  If that were feasible, it would be *criminal* to have a
.set_header() method at all!  In fact,

 > > Message['Subject']='Some text'

is going to (a) need to take *only* unicodes, or (b) raise Exceptions
at the slightest provocation when handed bytes.

And things only get worse if you try to provide this interface for say
"From" (let alone "Content-Type").  Is it really worth doing the
mapping interface if it's only usable with free-form headers (ie, only
Subject among the commonly used headers)?

 > Yes.  In which case I propose we guess the encoding as 1) ascii, 2)  
 > utf-8, 3) wtf?

Uh, what guessing?  If you don't know what you have but you believe it
to be a valid header field, then presumably you got it off the wire
and it's still in bytes and you just spit it out on the wire without
trying to decode or encode it.  But as I already said, I think that's
a bad idea.  Otherwise, you should have a unicode, and you simply look
at the range of the string.  If it fits in ASCII, Bob's your uncle.
If not, Bob's your aunt (and you use UTF-8).
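
A sketch of that rule, assuming the value is already a unicode string.
pick_charset() and make_subject() are hypothetical helper names;
email.header.Header and its encode() method are the existing API.

```python
from email.header import Header


def pick_charset(text: str) -> str:
    """Return 'us-ascii' if the text fits in ASCII, otherwise 'utf-8'."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return "utf-8"
    return "us-ascii"


def make_subject(text: str) -> str:
    """Encode a Subject value, leaving pure-ASCII text unencoded."""
    return Header(text, charset=pick_charset(text)).encode()


print(make_subject("Some text"))
print(make_subject("G\u00f6del-Escher-Bach"))
```

No guessing is involved: the decision depends only on the range of the
characters in the string.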

 > > Where you care about what encoding is used:
 > >
 > > Message['Subject']=Header('Some text',encoding='utf-8')
 > 
 > Yes.
 > 
 > > If you have bytes, for whatever reason:
 > >
 > > Message['Subject']=b'some bytes'.decode('utf-8')
 > >
 > > ...because only you know what encoding those bytes use!
 > 
 > So you're saying that __setitem__() should not accept raw bytes?

How do you distinguish "raw" bytes from "encoded bytes"?
__setitem__() shouldn't accept bytes at all.  There should be an API
which sets a .formatted_for_the_wire member, and it should have a
"validate" option (ie, when true the API attempts to parse the header
and raises an exception if it fails to do so; when false, it assumes
you know what you're doing and will send out the bytes verbatim).
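
Roughly what I have in mind, as a sketch only.  Every name here
(Message, WireHeader, set_wire, the validate flag) is hypothetical, and
the "parse" step is a deliberately crude stand-in: it only checks that
the bytes are 7-bit ASCII and that every line break is a proper fold
(CRLF followed by a space or tab).

```python
class WireHeader:
    """Hypothetical: a header supplied as pre-formatted wire bytes."""

    def __init__(self, raw: bytes, validate: bool = True):
        if validate:
            self._check(raw)
        # With validate=False the bytes are trusted and sent verbatim.
        self.formatted_for_the_wire = raw

    @staticmethod
    def _check(raw: bytes) -> None:
        try:
            raw.decode("ascii")
        except UnicodeDecodeError:
            raise ValueError("wire header must be 7-bit ASCII") from None
        # Remove legal folds; any CR or LF left over is a bare line break.
        unfolded = raw.replace(b"\r\n ", b"").replace(b"\r\n\t", b"")
        if b"\r" in unfolded or b"\n" in unfolded:
            raise ValueError("bare CR or LF in wire header")


class Message:
    """Hypothetical message whose mapping interface refuses bytes."""

    def __init__(self):
        self._headers = {}

    def __setitem__(self, name, value):
        if isinstance(value, (bytes, bytearray)):
            raise TypeError("decode bytes first, or use set_wire()")
        self._headers[name.lower()] = value

    def __getitem__(self, name):
        return self._headers[name.lower()]

    def set_wire(self, name, raw, validate=True):
        self._headers[name.lower()] = WireHeader(raw, validate)
```

Usage: msg['Subject'] = 'Some text' works, msg['Subject'] = b'...'
raises TypeError, and msg.set_wire('Subject', raw, validate=False) is
the explicit "I know what I'm doing" escape hatch.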


From tonynelson at georgeanelson.com  Mon Apr 13 19:09:36 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Mon, 13 Apr 2009 13:09:36 -0400
Subject: [Email-SIG] Append behavior of __setitem__
In-Reply-To: <BFDA207C-E488-4AEB-A4AB-64935981A76A@python.org>
References: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org>
	<87ocv3tz2b.fsf@xemacs.org>
	<BFDA207C-E488-4AEB-A4AB-64935981A76A@python.org>
Message-ID: <p04330101c6091b094713@[192.168.123.162]>

At 10:04 -0400 04/13/2009, Barry Warsaw wrote:
 ...
>We could potentially have strict and lenient modes, or possibly RFC
>822, 2822, 5322 modes.

Is there any need to produce emails that don't conform to the latest spec?
Those specs are crafted to produce backward-compatible messages.

>OTOH, I feel very strongly that the parser
>should accept just about any stream of bytes without throwing an
>exception.  Thinking about an application like Mailman, it's rather
>inconvenient for the parsing phase to throw any exception.  Much
>better is to register defects and then decide the disposition of
>messages based on the defect list.
>
>OTOH,

The second other hand should be the Gripping hand, as it should be the
overriding point.

>when creating messages from whole cloth, I think it's okay to
>raise an exception.  You just have to be careful because often the same
>APIs are used by the parser.

APIs raise exceptions, parser catches them, makes into defects?
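
A sketch of that arrangement, with hypothetical names throughout
(HeaderError, check_field_name, parse_headers); none of this is the
email package's actual API.  The strict function raises, and the
lenient parser wraps it and records each failure as a defect.

```python
class HeaderError(ValueError):
    """Hypothetical error raised by the strict, message-building API."""


def check_field_name(name: str) -> str:
    """Strict API: raise if the field name is not valid."""
    # Field names are printable ASCII without spaces or colons.
    if not name or any(not (33 <= ord(c) <= 126) or c == ":" for c in name):
        raise HeaderError(f"invalid header field name: {name!r}")
    return name


def parse_headers(lines):
    """Lenient parser: never raises, returns (headers, defects)."""
    headers, defects = [], []
    for line in lines:
        name, sep, value = line.partition(":")
        try:
            if not sep:
                raise HeaderError(f"missing colon: {line!r}")
            check_field_name(name)
        except HeaderError as exc:
            # Turn the exception into a defect and keep going.
            defects.append(exc)
            continue
        headers.append((name, value.strip()))
    return headers, defects


headers, defects = parse_headers(["Subject: hi", "garbage line", "Bad Name: x"])
print(headers)
print(len(defects))
```

The application then decides the disposition of the message from the
defect list, and code that builds messages from scratch calls the
strict functions directly.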
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From tonynelson at georgeanelson.com  Mon Apr 13 19:09:41 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Mon, 13 Apr 2009 13:09:41 -0400
Subject: [Email-SIG] Append behavior of __setitem__
In-Reply-To: <24040806-9EE5-421E-A699-BEDB627CF8D1@python.org>
References: <426749C6-27FD-4F97-BE00-08076386B2D8@python.org>
	<87ocv3tz2b.fsf@xemacs.org> <p04330103c6068a93c988@[192.168.123.162]>
	<24040806-9EE5-421E-A699-BEDB627CF8D1@python.org>
Message-ID: <p04330102c6091bdd78d4@[192.168.123.162]>

At 10:05 -0400 04/13/2009, Barry Warsaw wrote:

>On Apr 11, 2009, at 5:17 PM, Tony Nelson wrote:
>
>>Sure. The header field should be parsed, if possible, and possibly add a
>>defect to the message. For some header fields, the data should be added
>>to the previous Header instance; for others, an extra Header instance
>>might need to be created.
>
>I don't follow this part.

When a duplicate header field is parsed, errors are made into defects.  If
duplicates are not allowed for that header field, the contents should be
added to the previous header if that is possible (Subject:, just append
with whitespace; address headers, just append addresses), or a new
(improper) Header should be created if it is not possible to add to the
previous Header (Message-ID:, Content-Type:).

OK, those examples for extra Headers aren't good; it may be that they only
produce message defects.
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From tonynelson at georgeanelson.com  Mon Apr 13 19:09:43 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Mon, 13 Apr 2009 13:09:43 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <CBA855B3-2806-469D-A4A6-8AF279607A52@python.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<A286FA62-B1F0-4DB4-BC38-9D1E0F85A92A@fuhm.net>
	<CBA855B3-2806-469D-A4A6-8AF279607A52@python.org>
Message-ID: <p04330103c6091da6e46b@[192.168.123.162]>

At 10:11 -0400 04/13/2009, Barry Warsaw wrote:

>On Apr 10, 2009, at 11:08 AM, James Y Knight wrote:
>
>> Until you write a parser for every header, you simply cannot decode
>> to unicode. The only sane choices are:
>> 1) raw bytes
>> 2) parsed structured data
>
>The email package does not need a parser for every header, but it
>should provide a framework that applications (or third party
>libraries) can use to extend the built-in header parsers.  A bare
>minimum for functionality requires a Content-Type parser.  I think the
>email package should also include an address header (Originator,
>Destination) parser, and a Message-ID header parser.  Possibly
>others.  The default would probably be some unstructured parser for
>headers like Subject.

I think the email package should have a parser for every header.  All the
headers defined in normal mail RFCs should have their own parser, and there
would be a default parser for unhandled headers, probably the Unstructured
parser.  Users could add their own, probably by importing some module
that knew how to add its parsing to the email package parsers.
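
Something like the following sketch, where the registry is keyed on the
lower-cased header name and falls back to an unstructured parser.  All
names (register_header_parser, parse_header, parse_content_type) are
hypothetical, and the Content-Type parser is deliberately naive: it does
not handle semicolons inside quoted strings.

```python
_parsers = {}


def register_header_parser(name, parser):
    """Register a parser callable for a header, keyed case-insensitively."""
    _parsers[name.lower()] = parser


def parse_unstructured(value):
    """Default parser: treat the value as free-form text."""
    return value.strip()


def parse_content_type(value):
    """Naive Content-Type parser returning (type, parameter dict)."""
    main, _, rest = value.partition(";")
    params = {}
    for item in rest.split(";"):
        key, sep, val = item.partition("=")
        if sep:
            params[key.strip().lower()] = val.strip().strip('"')
    return main.strip().lower(), params


def parse_header(name, value):
    """Dispatch to the registered parser, or the unstructured default."""
    parser = _parsers.get(name.lower(), parse_unstructured)
    return parser(value)


register_header_parser("Content-Type", parse_content_type)

print(parse_header("content-type", 'text/plain; charset="utf-8"'))
print(parse_header("X-Unknown", "  anything goes  "))
```

A third-party module would only need to call register_header_parser()
at import time to add its own headers.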
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From tonynelson at georgeanelson.com  Mon Apr 13 19:13:25 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Mon, 13 Apr 2009 13:13:25 -0400
Subject: [Email-SIG] [Python-Dev]   Dropping bytes "support" in json
In-Reply-To: <7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>
	<F40AE8EC-08CC-4634-AA82-264587552F47@python.org>
	<49DF8956.5050501@g.nevcal.com>
	<7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org>
Message-ID: <p04330104c6091e8919a1@[192.168.123.162]>

At 10:14 -0400 04/13/2009, Barry Warsaw wrote:
 ...
>Actually, thinking about this over the weekend, it's much better for
>message['subject'] to return a Header instance in all cases.  Use
>bytes(header) to get the raw bytes.

I don't agree.  I'd want it to return the appropriate type for that header:
string for Subject:, a list of addresses for To:, and so on.  Either the
user knows what to expect, or they'll learn immediately.  If they get a
Header, they have to then extract the appropriate data from it, based on
its type (but they only know the name).

OK, Header instances could have a .useful field that returned the useful
data in all instances.  But in any case, the email package should guide
users in the correct usage, rather than leaving every choice seeming equal,
when only one choice is correct.


>A good API for getting the parsed and decoded header values needs to
>take into account that it won't always be a string.  For unstructured
>headers like Subject, str(header) would work just fine.  For an
>Originator or Destination address, what does str(header) return?  And
>what would be the API for getting the set of realname/addresses out of
>the header?

msg[<headername>] would be the preferred way.

msg.get_header(<headername>).useful would return the useful data form of
any header.

msg.get_header(<headername>).addresses would return the address list from
any address Header, and raise AttributeError with other Headers.
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From tonynelson at georgeanelson.com  Mon Apr 13 19:09:23 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Mon, 13 Apr 2009 13:09:23 -0400
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <E7678DED-1813-4560-8F5D-0C96046C7F9B@python.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<grkodk$j4p$1@ger.gmane.org>
	<1F3DC671-746B-425C-A847-4F6CB0DB9FD0@python.org>
	<87zlepf5hf.fsf@xemacs.org>
	<67879F1D-B386-4B9B-8203-86DB977BD7FF@python.org>
	<87prfkfhzd.fsf@xemacs.org>
	<E7678DED-1813-4560-8F5D-0C96046C7F9B@python.org>
Message-ID: <p04330105c609206c8b63@[192.168.123.162]>

At 10:18 -0400 04/13/2009, Barry Warsaw wrote:
>On Apr 10, 2009, at 3:04 PM, Stephen J. Turnbull wrote:
 ...
>> "Idempotency"?  I'm not sure what that means in the context of the
>> email package ... multiplication by zero?<wink>  Do you mean that
>> .parse().to_wire() should be idempotent?  Yes, I think that's a good
>> idea, and it shouldn't be too hard to implement by (optionally?)
>> caching the whole original message or individual components (headers
>> with all whitespace including folding cached verbatim, etc).  I think
>> caching has to be done, since stuff like "did the original fold with a
>> leading tab or a leading space, and at what column" and so on seems
>> kind of pointless to encode as attributes on Header objects.
>
>I tend to agree.  I'm also happy if there's a way to tell, say, the
>parser that an application doesn't care about that.  All that extra
>caching will have a memory overhead that you should only pay for if
>you care.

I'd expect the caching to have very low overhead.  Message bodies will not
be cached (an extra time), only some headers (when the Header isn't
idempotent already) and the preamble and epilogue around message bodies.
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From stephen at xemacs.org  Mon Apr 13 20:38:27 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 14 Apr 2009 03:38:27 +0900
Subject: [Email-SIG] API for Header objects [was: Dropping bytes "support"
	in json]
In-Reply-To: <p04330104c6091e8919a1@[192.168.123.162]>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>
	<F40AE8EC-08CC-4634-AA82-264587552F47@python.org>
	<49DF8956.5050501@g.nevcal.com>
	<7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org>
	<p04330104c6091e8919a1@[192.168.123.162]>
Message-ID: <87y6u4tn4s.fsf@xemacs.org>

Tony Nelson writes:

 > OK, Header instances could have a .useful field that returned the useful
 > data in all instances.  But in any case, the email package should guide
 > users in the correct usage, rather than leaving every choice seeming equal,
 > when only one choice is correct.

What do you mean by "only one choice is correct?"  For example, a
Destination field might be used for presentation (in which case the
display names are needed), or to compose a list of recipients (when
they should be discarded).  Some applications might prefer to receive
the combination as the original string (although that often is not
valid RFC-any), others might prefer it parsed into a pair of display
name and mailbox.

Quoth Barry Warsaw:

 > >A good API for getting the parsed and decoded header values needs to
 > >take into account that it won't always be a string.  For unstructured
 > >headers like Subject, str(header) would work just fine.  For an
 > >Originator or Destination address, what does str(header) return?

A string (not folded) of comma-separated addresses in "Display Name"
<po at box.example.com> form.

 > >And what would be the API for getting the set of
 > >realname/addresses out of the header?

Does there need to be one?  An AddressHeader object could support
indexing: message['To'][0] returns the first displayname,mailbox pair.
If you really want a list, what's wrong with list(header)?  (Yes, I
recall that you (Barry) said you don't think subclassing worked very
well, but I wonder if maybe we can't get it righter this time around.)
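
As a sketch of what I mean, assuming the header value is already
decoded text.  The AddressHeader class is hypothetical; getaddresses()
and formataddr() are the existing email.utils functions.

```python
from email.utils import formataddr, getaddresses


class AddressHeader:
    """Hypothetical address header; items are (display name, mailbox)."""

    def __init__(self, value: str):
        # getaddresses() takes a list of field values and returns pairs.
        self._pairs = getaddresses([value])

    def __getitem__(self, index):
        return self._pairs[index]

    def __len__(self):
        return len(self._pairs)

    def __str__(self):
        # Unfolded, comma-separated "Display Name <mailbox>" form.
        return ", ".join(formataddr(pair) for pair in self._pairs)


to = AddressHeader('"Ann Example" <ann@example.com>, bob@example.org')
print(to[0])     # first (display name, mailbox) pair
print(list(to))  # all pairs, via the sequence protocol
print(str(to))   # presentation string
```

An application that only wants recipients can then write
[mailbox for _, mailbox in to] and ignore the display names.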

 > msg[<headername>] would be the preferred way.

This goes against the principle that this returns a Header object.
For one thing, I really think that there need to be some common
methods all Header objects support, like str() and to_wire_format().
Also, if this returns a list for 'To', then str(msg['To']) won't work
right: it will return the list enclosed in square brackets and the
mailbox portions will be quoted, which isn't useful.

 > msg.get_header(<headername>).useful would return the useful data form of
 > any header.

Er, shouldn't we just throw away the data that is never useful?<wink>

 > msg.get_header(<headername>).addresses would return the address list from
 > any address Header, and raise AttributeError with other Headers.

Yes, but a list of what?  Strings?  Bytes?  Displayname/mailbox pairs?

From tonynelson at georgeanelson.com  Tue Apr 14 02:58:40 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Mon, 13 Apr 2009 20:58:40 -0400
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
	"support" in json]
In-Reply-To: <87y6u4tn4s.fsf@xemacs.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>
	<F40AE8EC-08CC-4634-AA82-264587552F47@python.org>
	<49DF8956.5050501@g.nevcal.com>
	<7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org>
	<p04330104c6091e8919a1@[192.168.123.162]> <87y6u4tn4s.fsf@xemacs.org>
Message-ID: <p04330109c6098b9dbdb3@[192.168.123.162]>

At 03:38 +0900 04/14/2009, Stephen J. Turnbull wrote:
>Tony Nelson writes:
>
> > OK, Header instances could have a .useful field that returned the useful
> > data in all instances.  But in any case, the email package should guide
> > users in the correct usage, rather than leaving every choice seeming equal,
> > when only one choice is correct.
>
>What do you mean by "only one choice is correct?"  For example, a
>Destination field might be used for presentation (in which case the
>display names are needed), or to compose a list of recipients (when
>they should be discarded).  Some applications might prefer to receive
>the combination as the original string (although that often is not
>valid RFC-any), others might prefer it parsed into a pair of display
>name and mailbox.
 ...

Assuming that by "Destination" you mean a class of Address header fields,
as there is no Destination: header field, such header fields contain
addresses, which can be considered to contain (as the email package does) a
list of (name, email address) pairs, or, at a lower level, to also have
Comments, there is indeed only one correct choice, which is the one the
email package currently provides the diligent user.  I wish it to be the
one obvious choice, so that less study is needed to properly use the email
package.

Any use that wishes to discard the email addresses in favor of the friendly
names can do so most easily from the parsed [(name, address)], not from the
bytes.  Parsing Address header fields is hard.  Note that Address headers
are not Text, as only certain tokens -- not part of the email addresses --
can be RFC 2047-encoded.
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From stephen at xemacs.org  Tue Apr 14 06:48:52 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 14 Apr 2009 13:48:52 +0900
Subject: [Email-SIG] [Python-Dev]   headers api for email package
In-Reply-To: <200904140432.25953.steve@pearwood.info>
References: <loom.20090408T110540-221@post.gmane.org>
	<FD0E133D-4944-4FE6-B3FD-865947F48E2F@python.org>
	<873accv5jr.fsf@xemacs.org>
	<200904140432.25953.steve@pearwood.info>
Message-ID: <87prffu9fv.fsf@xemacs.org>

Removing Python-Dev from the addressees.

Steven D'Aprano writes:
 > On Tue, 14 Apr 2009 03:15:20 am Stephen J. Turnbull wrote:
 > 
 > > *People* see email as (rich-)text.
 > 
 > We do?

Yup.  You don't see the email, you see a *presentation* of that email.
That presentation is usually text, plus possibly some other stuff
(fonts, highlighting, active links, images).  Thus the "(rich-)".

 > It's not clear what you actually mean by "(rich-)text".

I mean presentation.  I mean "human readable".  I mean Unicode.  I
mean "Do Not Feed The Program" (not for machine processing -- so your
associations with virii are completely off the mark).

 > rich-text. I guess you mean Unicode characters. Am I right?

No.  I mean presentation, which for Python purposes includes but is
not limited to Unicode.

 > Now, correct me if I'm wrong, but I don't think mail headers can 
 > actually be anything *but* bytes.

On the wire.  email's Headers have applications other than putting
bytes on the wire.

 > If you're proposing converting those bytes into characters, that's all 
 > very well and good, but what's your strategy for dealing with the 
 > inevitable wrongly-formatted headers?

Whatever you want it to be.  There are a number of such strategies,
some of which should be among the batteries we include.
Header.__str__() will need to know how to find out which is in effect,
of course.

 > If the header can't be correctly decoded into text, there still
 > needs to be a way to get to the raw bytes.

Sure.  That's what Header.__bytes__() will do.  Specifically, if you
have a Header that was parsed out of a message received over the wire,
it will return a verbatim copy of the header as received, folding
whitespace, CRLFs, and all.  If the Header was constructed (including
editing a received header), then __bytes__ will construct the wire
format, and optionally cache it as if it were a received header.  (But
this has some gotchas, see below.)

 > >     bytes(message['Subject'])
 > >
 > > gives wire format.  Yow!  I think I'm just joking.  Right?
 > 
 > Er, I'm not sure. Are you joking? I hope not, because it is important to 
 > be able to get to the raw, unmodified bytes that the MTA sees, without 
 > all the fancy processing you suggest.

Er, I'm not suggesting any processing in particular.  I'm suggesting
an API in which str(header) produces a text/plain rendering of the
field contents, with no folding, MIME words, or other wire format
detritus, suitable for human viewing, more or less (specifically, it
might be a rather long line).  bytes(header) produces the wire format,
either verbatim as received or as constructed based on client input.
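
A rough sketch of that str()/bytes() split, under stated assumptions:
SketchHeader and its raw/text arguments are hypothetical names, not the email
package's API.  Only email.header.Header, decode_header() and make_header()
are existing calls, and the sketch ignores the bogus-header question entirely.

```python
from email.header import Header, decode_header, make_header


class SketchHeader:
    """Hypothetical stand-in for the proposed Header class."""

    def __init__(self, name, raw=None, text=None):
        self.name = name
        self._raw = raw      # verbatim wire bytes, if parsed from a message
        self._text = text    # unicode value, if constructed by a client

    def __str__(self):
        if self._text is not None:
            return self._text
        # Drop the field name, unfold, then decode any RFC 2047 words.
        value = self._raw.decode('ascii').split(':', 1)[1]
        unfolded = ' '.join(value.split())
        return str(make_header(decode_header(unfolded)))

    def __bytes__(self):
        if self._raw is not None:
            return self._raw   # verbatim as received, folding and CRLFs included
        # Constructed header: build the wire format from the text.
        encoded = Header(self._text, header_name=self.name).encode()
        return ('%s: %s\r\n' % (self.name, encoded)).encode('ascii')
```

With this shape, str(header) is what a person reads and bytes(header) is what
goes on the wire; anything needing the structure would use other accessors.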

Note that an issue here is that a received header may be bogus, in
which case you *don't* want bytes(header) to simply return the
original and then spew over the wire.  Should it raise an Exception or
"fix up" the bytes?  I don't know, and thus I wonder if this proposed
API might just be a joke, not something you can dare use in a
production application.

Of course, str() and bytes() as proposed here are not necessarily what
you want.  So there will need to be ways to access the internal
representation of Header directly (or via further specialized
formatter functions if string or bytes format is preferred to
structured objects).

 > Again, correct me if I'm wrong, but *all* valid mail headers must fit in 
 > ASCII.

Of course, that's true on the wire.  I've assumed that everybody here
is assuming STD 11 (currently RFC 822 according to rfc-editor.org)
folding of long header lines and RFC 2047 encoding of characters
outside of the restricted-ASCII repertoire (RFC 5322 at least doesn't
permit all the ASCII control characters) before putting it on the
wire.  This is basically a solved problem, though, so I didn't bother
mentioning it.  Sorry for the confusion.

But what we're talking about *here* are email APIs that may or may not
be directly connected to a display or wire.  There is no reason why
headers *must* be represented as bytes, strings, or anything else in a
Header, and no reason why the bytes or str format *must* be RFC
compatible.

I think it's quite sensible to specify "bytes(header) will be RFC
5322-conforming", but we need to specify how to handle bogus headers
that we have received and not edited.  Should we ever raise an
Exception, and if so, in what contexts?  Should we "fix up" the
bogosity somehow?  Should we delete the offensive header?  Should we
pass it on verbatim, and leave it to a higher level to verify(!) and
decide what to do about it?  Do the RFCs say anything about all this
(eg, with broken trace headers I think it's implied that we pass them
on verbatim)?

From stephen at xemacs.org  Tue Apr 14 09:00:59 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 14 Apr 2009 16:00:59 +0900
Subject: [Email-SIG] [Python-Dev] Dropping bytes "support" in json
In-Reply-To: <49E3CA6E.1070501@canterbury.ac.nz>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<A286FA62-B1F0-4DB4-BC38-9D1E0F85A92A@fuhm.net>
	<CBA855B3-2806-469D-A4A6-8AF279607A52@python.org>
	<49E3CA6E.1070501@canterbury.ac.nz>
Message-ID: <87ocuzu3bo.fsf@xemacs.org>

Warning: Reply-To set to email-sig.

Greg Ewing writes:

 > Only for headers known to be unstructured, I think.
 > Completely unknown headers should be available only
 > as bytes.

Why do I get the feeling that you guys are feeling up an
elephant?<wink>

There are four things you might want to do with a header:

(1) Put it on the wire, which must be bytes (in fact, ASCII).
(2) Show it to a user (such as a rootin-tootin spam-fightin mail
    admin), which for consistency with well-behaved, implemented
    headers (ie, you might want to *gasp* *concatenate* your unknown
    header with a string), will sooner or later be string (ie,
    Unicode).
(3) (Try to) parse it, in which case an internal representation with
    some other structure may or may not be appropriate for storing the
    parsed data.
(4) Munge it, in which case an internal representation with some other
    structure may or may not be appropriate.

I see no particular reason for restricting these basic API classes for
any header.

From turnbull at sk.tsukuba.ac.jp  Tue Apr 14 11:11:53 2009
From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull)
Date: Tue, 14 Apr 2009 18:11:53 +0900
Subject: [Email-SIG] API for Header objects [was: Dropping
	bytes	"support" in json]
In-Reply-To: <p04330109c6098b9dbdb3@[192.168.123.162]>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>
	<F40AE8EC-08CC-4634-AA82-264587552F47@python.org>
	<49DF8956.5050501@g.nevcal.com>
	<7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org>
	<p04330104c6091e8919a1@[192.168.123.162]>
	<87y6u4tn4s.fsf@xemacs.org>
	<p04330109c6098b9dbdb3@[192.168.123.162]>
Message-ID: <87vdp761ly.fsf@xemacs.org>

Tony Nelson writes:

 > Assuming that by "Destination" you mean a class of Address header fields,
 > as there is no Destination: header field, such header fields contain
 > addresses, which can be considered to contain (as the email package does) a
 > list of (name, email address) pairs, or, at a lower level, to also have
 > Comments, there is indeed only one correct choice, which is the one the
 > email package currently provides the diligent user.  I wish it to be the
 > one obvious choice, so that less study is needed to properly use the email
 > package.

As you point out above, display names and comments are different.
It's *not* obvious to me that they should be confounded by default.

In any case, it would certainly be possible to implement both the
indexing feature, so that msg['To'][0] returns a (display, mailbox)
tuple, and a converter so that list(msg['to']) returns a list of such
tuples (in both cases, assuming that most users prefer not to
distinguish comments from display names).
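
As a sketch of how both could coexist (AddressListSketch is a hypothetical
name; only email.utils.getaddresses() is an existing call, and it already
folds a comment into the display-name slot when no display name is present):

```python
from email.utils import getaddresses


class AddressListSketch:
    """Hypothetical address-header value; not an existing email class."""

    def __init__(self, value):
        self.value = value                     # original field value, kept as-is
        self._pairs = getaddresses([value])    # [(display, mailbox), ...]

    def __getitem__(self, index):
        return self._pairs[index]

    def __iter__(self):
        return iter(self._pairs)

    def __len__(self):
        return len(self._pairs)

    def __str__(self):
        return self.value


# Made-up field value for illustration.
to = AddressListSketch('Alice Example <alice@example.org>, bob@example.org (Bob)')
first = to[0]          # a (display, mailbox) tuple
everyone = list(to)    # a list of such tuples
```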


From tonynelson at georgeanelson.com  Wed Apr 15 03:26:19 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Tue, 14 Apr 2009 21:26:19 -0400
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
 "support" in json]
In-Reply-To: <87vdp761ly.fsf@xemacs.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>
	<F40AE8EC-08CC-4634-AA82-264587552F47@python.org>
	<49DF8956.5050501@g.nevcal.com>
	<7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org>
	<p04330104c6091e8919a1@[192.168.123.162]>	<87y6u4tn4s.fsf@xemacs.org>
	<p04330109c6098b9dbdb3@[192.168.123.162]> <87vdp761ly.fsf@xemacs.org>
Message-ID: <p04330100c60ae51beb95@[192.168.123.162]>

At 18:11 +0900 04/14/2009, Stephen J. Turnbull wrote:
>Tony Nelson writes:
>
> > Assuming that by "Destination" you mean a class of Address header fields,
> > as there is no Destination: header field, such header fields contain
> > addresses, which can be considered to contain (as the email package does) a
> > list of (name, email address) pairs, or, at a lower level, to also have
> > Comments, there is indeed only one correct choice, which is the one the
> > email package currently provides the diligent user.  I wish it to be the
> > one obvious choice, so that less study is needed to properly use the email
> > package.
>
>As you point out above, display names and comments are different.
>It's *not* obvious to me that they should be confounded by default.

The examples in the RFC seem to use one or the other for the friendly name.
The problem comes when there are both.  Actually, I haven't seen comments
used, so I don't have any experience there.


>In any case, it would certainly be possible to implement both the
>indexing feature, so that msg['To'][0] returns a (display, mailbox)
>tuple, and a converter so that list(msg['to']) returns a list of such
>tuples (in both cases, assuming that most users prefer not to
>distinguish comments from display names).

Well, msg['To'] would return a list (or tuple) of addresses (which are
tuples), so msg['To'][0] would return the first such address, if any.  No
converter required.
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From stephen at xemacs.org  Wed Apr 15 10:47:07 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Wed, 15 Apr 2009 17:47:07 +0900
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
 "support" in json]
In-Reply-To: <p04330100c60ae51beb95@[192.168.123.162]>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>
	<F40AE8EC-08CC-4634-AA82-264587552F47@python.org>
	<49DF8956.5050501@g.nevcal.com>
	<7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org>
	<p04330104c6091e8919a1@[192.168.123.162]>
	<87y6u4tn4s.fsf@xemacs.org>
	<p04330109c6098b9dbdb3@[192.168.123.162]>
	<87vdp761ly.fsf@xemacs.org>
	<p04330100c60ae51beb95@[192.168.123.162]>
Message-ID: <87k55m5mno.fsf@xemacs.org>

Tony Nelson writes:

 > Well, msg['To'] would return a list (or tuple) of addresses (which
 > are tuples), so msg['To'][0] would return the first such address,
 > if any.  No converter required.

How do you propose to spell

    msg['To'].split_addresses()[0]

where the split_addresses method returns a list of addresses in their
original form?  And is it really worth losing the consistency that
str(msg[tag]) and bytes(msg[tag]) (especially the latter) do something
more or less useful regardless of whether 'tag' names a structured
field or a text field?

As I wrote elsewhere, I don't *know* that such features will be useful
or practically implementable, but I do think what you're suggesting is
premature and overly restrictive.  Especially since we are pretty sure
(due to the desire for idempotency) that internally msg['To'] will
*not* be a sequence of addresses parsed into display name and mailbox.

From tonynelson at georgeanelson.com  Thu Apr 16 02:39:52 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Wed, 15 Apr 2009 20:39:52 -0400
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
 "support" in json]
In-Reply-To: <87k55m5mno.fsf@xemacs.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>
	<F40AE8EC-08CC-4634-AA82-264587552F47@python.org>
	<49DF8956.5050501@g.nevcal.com>
	<7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org>
	<p04330104c6091e8919a1@[192.168.123.162]>	<87y6u4tn4s.fsf@xemacs.org>
	<p04330109c6098b9dbdb3@[192.168.123.162]>	<87vdp761ly.fsf@xemacs.org>
	<p04330100c60ae51beb95@[192.168.123.162]> <87k55m5mno.fsf@xemacs.org>
Message-ID: <p04330102c60c294d2185@[192.168.123.162]>

At 17:47 +0900 04/15/2009, Stephen J. Turnbull wrote:
>Tony Nelson writes:
>
> > Well, msg['To'] would return a list (or tuple) of addresses (which
> > are tuples), so msg['To'][0] would return the first such address,
> > if any.  No converter required.
>
>How do you propose to spell
>
>    msg['To'].split_addresses()[0]
>
>where the split_addresses method returns a list of addresses in their
>original form?  And is it really worth losing the consistency that
>str(msg[tag]) and bytes(msg[tag]) (especially the latter) do something
>more or less useful regardless of whether 'tag' names a structured
>field or a text field?

I don't want there to be any "str(msg['tag'])" or "bytes(msg['tag'])" at
all, so there would be no loss of consistency.  Messages need flattening to
bytes, but there is no use for converting individual header fields into
bytes or strings, outside of a message.  Some header field data /is/
strings, some is lists of address pairs, and so on.  If the data for a
header field is not properly a string, a means to get it as one is wrong.

I can't imagine that .split_addresses() would provide anything in its
original form.  I'd certainly want it to split something into a list or
tuple.  As individual addresses in an Address header field are accessed
from the list returned by "msg['To']" (or other Address header field name),
there is no need to "split" them any more.


>As I wrote elsewhere, I don't *know* that such features will be useful
>or practically implementable, but I do think what you're suggesting is
>premature and overly restrictive.  Especially since we are pretty sure
>(due to the desire for idempotency) that internally msg['To'] will
>*not* be a sequence of addresses parsed into display name and mailbox.

All the grotty internals of Header objects would be accessible by fetching
the Header object with "msg.get_header('name')".  "msg[...]" is an
abbreviation for convenience which should not mislead users or be complex
or magical in action.  I want to be able to get and put the proper type of
data for a particular header field, and to be told when I did it wrong,
rather than just get a corrupt message.

Internally, the Header whose .useful attribute is returned by "msg['foo']"
will contain parsed data, referring to parsed tokens.  Flattening those
parsed tokens will produce the original data.  Not a problem at all, simple
to implement, in the most direct way.
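
A toy illustration of the round-trip idea, assuming a tokenizer far simpler
than the RFC 5322 grammar (tokenize(), flatten() and the TOKEN pattern are
made up for this sketch).  The point is only that if no character is ever
discarded, joining the tokens gives back the original data:

```python
import re

# Toy token pattern (an assumption, not the real grammar).  Every character
# of the input lands in exactly one token, so nothing is lost.
TOKEN = re.compile(r'''
    \s+                      # folding whitespace, kept verbatim
  | "(?:[^"\\]|\\.)*"        # quoted-string
  | \((?:[^()\\]|\\.)*\)     # comment (no nesting in this sketch)
  | [<>@,;:.\[\]]            # specials
  | [^\s"()<>@,;:.\[\]]+     # atom
  | .                        # anything else, one character at a time
''', re.VERBOSE)


def tokenize(value):
    """Split a header value into tokens without discarding any character."""
    return TOKEN.findall(value)


def flatten(tokens):
    """Rebuild the header value from its tokens."""
    return ''.join(tokens)
```
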
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From rdmurray at bitdance.com  Thu Apr 16 03:19:44 2009
From: rdmurray at bitdance.com (R. David Murray)
Date: Wed, 15 Apr 2009 21:19:44 -0400 (EDT)
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
 "support" in json]
In-Reply-To: <p04330102c60c294d2185@[192.168.123.162]>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>
	<F40AE8EC-08CC-4634-AA82-264587552F47@python.org>
	<49DF8956.5050501@g.nevcal.com>
	<7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org>
	<p04330104c6091e8919a1@[192.168.123.162]>
	<87y6u4tn4s.fsf@xemacs.org> <p04330109c6098b9dbdb3@[192.168.123.162]>
	<87vdp761ly.fsf@xemacs.org> <p04330100c60ae51beb95@[192.168.123.162]>
	<87k55m5mno.fsf@xemacs.org> <p04330102c60c294d2185@[192.168.123.162]>
Message-ID: <Pine.LNX.4.64.0904152102010.1740@kimball.webabinitio.net>

On Wed, 15 Apr 2009 at 20:39, Tony Nelson wrote:
> Internally, the Header whose .useful attribute is returned by "msg['foo']"
> will contain parsed data, referring to parsed tokens.  Flattening those
> parsed tokens will produce the original data.  Not a problem at all, simple
> to implement, in the most direct way.

The first part of that is too magical and inconsistent for my tastes.
I want message['fooheader'] to return a Header object.  Which yes,
should contain the parsed token structure and be able to regenerate the
original bytes on demand (or vice versa, or keeping both the original
bytes and the parse tree if the parse tree is lossy).

For a header involving a list of addresses, I'd expect to get back a
Header subclass that I could iterate over to get individual Address
objects.  For other structured headers, I'd expect to get a subclass
with useful methods and attributes for accessing the structure.

And when I str the Header (for example, when presenting one or more
selected headers to a user), I would expect to get a string that a
user would expect to read, which is to say a fully-decoded-to-unicode
user-oriented representation of the structured data as one long string
(I'll do any folding formatting for presentation as needed).
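
For readers on a later Python 3: the email package eventually grew an API
close to this description, available through email.policy.default (header
classes live in email.headerregistry).  A short sketch, with a made-up
message:

```python
from email import message_from_string, policy

# Made-up message used only for illustration.
raw = (
    'To: Alice Example <alice@example.org>, bob@example.org\n'
    'Subject: =?utf-8?q?Caf=C3=A9?= menu\n'
    '\n'
    'body\n'
)
msg = message_from_string(raw, policy=policy.default)

# The To header is a header object exposing parsed Address objects.
pairs = [(a.display_name, a.addr_spec) for a in msg['To'].addresses]

# str() of a header gives the decoded, unfolded, human-readable value.
subject = str(msg['Subject'])
```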

Going the other way I have fewer opinions about, as I haven't written
any code to do that yet :)

--David

From stephen at xemacs.org  Thu Apr 16 08:24:47 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Thu, 16 Apr 2009 15:24:47 +0900
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
 "support" in json]
In-Reply-To: <p04330102c60c294d2185@[192.168.123.162]>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>
	<F40AE8EC-08CC-4634-AA82-264587552F47@python.org>
	<49DF8956.5050501@g.nevcal.com>
	<7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org>
	<p04330104c6091e8919a1@[192.168.123.162]>
	<87y6u4tn4s.fsf@xemacs.org>
	<p04330109c6098b9dbdb3@[192.168.123.162]>
	<87vdp761ly.fsf@xemacs.org>
	<p04330100c60ae51beb95@[192.168.123.162]>
	<87k55m5mno.fsf@xemacs.org>
	<p04330102c60c294d2185@[192.168.123.162]>
Message-ID: <8763h55d5c.fsf@xemacs.org>

Tony Nelson writes:

 > strings, some is lists of address pairs, and so on.  If the data
 > for a header field is not properly a string, a means to get it as
 > one is wrong.

Er, but the data for an address field is not "properly" a list of
pairs, either.  So I guess you would agree that a means to get it as
one is wrong, then?

 > All the grotty internals of Header objects would be accessible by
 > fetching the Header object with "msg.get_header('name')".
 > "msg[...]" is an abbreviation for convenience which should not
 > mislead users or be complex or magical in action.

A message or so back you made the point that an address header is a
rather complex object that is *not* easy to parse.  For example (this
is a trick question), in your opinion, what should

    msg['To'][0]

return if the original header was

To: Stephen J. Turnbull <stephen at xemacs.org>

?

 > Internally, the Header whose .useful attribute is returned by
 > "msg['foo']" will contain parsed data, referring to parsed tokens.
 > Flattening those parsed tokens will produce the original data.  Not
 > a problem at all, simple to implement, in the most direct way.

And horrid to use, if you mean that the internal representation will
be a full parse tree according to the augmented BNF in RFCs 822, 2822,
5322, 2045-2049, etc etc., and that the only other way to access that
data is via an arbitrarily defined .useful attribute (which, BTW, is
quite unpythonic if you intend for it to be available as msg['foo'] as
well: TOOWTDI).

From steve at pearwood.info  Thu Apr 16 15:02:13 2009
From: steve at pearwood.info (Steven D'Aprano)
Date: Thu, 16 Apr 2009 23:02:13 +1000
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
	"support" in json]
In-Reply-To: <p04330102c60c294d2185@[192.168.123.162]>
References: <loom.20090408T110540-221@post.gmane.org>
	<87k55m5mno.fsf@xemacs.org>
	<p04330102c60c294d2185@[192.168.123.162]>
Message-ID: <200904162302.14641.steve@pearwood.info>

On Thu, 16 Apr 2009 10:39:52 am Tony Nelson wrote:

> I don't want there to be any "str(msg['tag'])" or "bytes(msg['tag'])"
> at all, so there would be no loss of consistency.

That's ... different.


> Messages need 
> flattening to bytes, but there is no use for converting individual
> header fields into bytes or strings, outside of a message.

Of course there is. You create each header individually, so you should 
be able to extract each header individually. Here, for example, is a 
use-case: I want to send postmaster a copy of the X-Spam-Evidence 
header so she can see why a particular piece of ham got wrongly flagged 
as spam, or vice versa:

X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; '(which': 0.03;
  'attribute': 0.04; 'objects': 0.04; 'returns': 0.05; 'split':
  0.05; ...

I need to be able to extract just that one header, and while some 
applications (mail client?) may choose to give me the entire message as 
text and expect me to manually hunt for the relevant line and 
copy-and-paste it, other applications may wish to automatically extract 
the appropriate header and email it to postmaster at localhost. Or write 
it to a log file, or whatever. Whatever they do, they probably need it 
as a string (of characters or bytes), not a binary blob.


> Some 
> header field data /is/ strings, some is lists of address pairs, and
> so on.

But "lists of address pairs" themselves are strings.

> If the data for a header field is not properly a string, 

But it always is. 

Even badly formatted emails with corrupt headers containing binary 
characters are strings -- they're just byte (non-Unicode) strings 
containing binary characters. Your mail server might not accept it as 
part of a valid header, but it's a valid byte string.


> a  
> means to get it as one is wrong.

Email *is* text. It's built on top of a restricted range of ASCII bytes, 
which we can legitimately call "text" because it is a subset of Unicode 
text. Even if a particular header contains binary data, it must be 
encoded as ASCII text before it can be placed into the header.

X-Some-Header: \0\0\01\0\xff3G\04 

(where \0 means byte('\0') etc) is not a valid email header -- the 
binary data must be encoded as ASCII text first. So any valid header 
must have a bytes form and a Unicode form (since the restricted range 
of allowed bytes are always valid Unicode as well). Corrupted headers 
may not have a valid Unicode form, but they will always have a byte 
form -- after all, the header eventually must be written to disk in 
some mail box somewhere, and it can only do so as bytes. 

So for any header, there is always a way of writing it in bytes, and 
nearly always a way of writing it as characters; that is, there is a valid
text version of nearly any header.
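
A small sketch of that distinction.  The valid header is made up; the corrupt
one reuses the bytes from the example above:

```python
# A valid header is restricted ASCII, so it has both a bytes and a text form.
valid = b'X-Some-Header: hello'
text = valid.decode('ascii')

# The corrupt header always exists as bytes, but the 0xff byte has no ASCII
# meaning, so a strict decode to text fails.
corrupt = b'X-Some-Header: \x00\x00\x01\x00\xff3G\x04'
try:
    corrupt.decode('ascii')
    corrupt_has_text_form = True
except UnicodeDecodeError:
    corrupt_has_text_form = False
```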

Furthermore, in general for arbitrary headers, we can't tell what the 
header means *except* as a text string:

X-Some-Header: AB34F8702D6

We have no way of telling whether the payload "AB34F8702D6" is a string 
of characters meaningful to some application just as they are, or 
whether it is a string encoded from binary data. We might *guess* that 
the encoding *could be* some known encoding (quoted-printable, base64, 
etc) but we can't tell unless it is a known standard header.


> I want to be able to get and put
> the proper type of data for a particular header field, and to be told
> when I did it wrong, rather than just get a corrupt message.

But in general, you can't know what the "proper type of data" is for 
arbitrary headers.

What are valid data for X-policyd-weight headers? What about 
X-Some-Random-Header-I-Just-Made-Up?


-- 
Steven D'Aprano

From tonynelson at georgeanelson.com  Thu Apr 16 20:08:59 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Thu, 16 Apr 2009 14:08:59 -0400
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
 "support" in json]
In-Reply-To: <8763h55d5c.fsf@xemacs.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>
	<F40AE8EC-08CC-4634-AA82-264587552F47@python.org>
	<49DF8956.5050501@g.nevcal.com>
	<7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org>
	<p04330104c6091e8919a1@[192.168.123.162]>	<87y6u4tn4s.fsf@xemacs.org>
	<p04330109c6098b9dbdb3@[192.168.123.162]>	<87vdp761ly.fsf@xemacs.org>
	<p04330100c60ae51beb95@[192.168.123.162]>	<87k55m5mno.fsf@xemacs.org>
	<p04330102c60c294d2185@[192.168.123.162]> <8763h55d5c.fsf@xemacs.org>
Message-ID: <p04330102c60d08bfa85a@[192.168.123.162]>

At 15:24 +0900 04/16/2009, Stephen J. Turnbull wrote:
>Tony Nelson writes:
>
> > strings, some is lists of address pairs, and so on.  If the data
> > for a header field is not properly a string, a means to get it as
> > one is wrong.
>
>Er, but the data for an address field is not "properly" a list of
>pairs, either.  So I guess you would agree that a means to get it as
>one is wrong, then?

No.  The useful data for an address field is *properly* a list of pairs of
friendly name, address -- you should read RFC 5322 section 3.4.  You need
to understand this about email in order to continue this discussion, though
your confusion does bring up the important point that people have poor
understanding of email, and need guidance in how to use and compose it.
This makes it very important that the easy way of doing things be the
correct way.  With Address fields, that way is a sequence of pairs of
friendly name and address.  Though the address could be parsed further,
there is seldom any need to do so (outside of the Header parser itself).


> > All the grotty internals of Header objects would be accessible by
> > fetching the Header object with "msg.get_header('name')".
> > "msg[...]" is an abbreviation for convenience which should not
> > mislead users or be complex or magical in action.
>
>A message or so back you made the point that an address header is a
>rather complex object that is *not* easy to parse.

Which is exactly why the email package already has an address parser,
though it also needs a more general parser for the other header field types.

>For example (this
>is a trick question), in your opinion, what should
>
>    msg['To'][0]
>
>return if the original header was
>
>To: Stephen J. Turnbull <stephen at xemacs.org>
>
>?

('Stephen J. Turnbull', 'stephen at xemacs.org')

You must be very confused to think this is a trick question.  Try it with
the current email package's email.utils.parseaddr().  Again, see RFC5322
section 3.4.
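
For reference, the call looks like this.  The mailbox below is a placeholder
at example.org, since the archive obscures the real address:

```python
from email.utils import parseaddr

# parseaddr() splits one address into a (display name, mailbox) pair.
name, mailbox = parseaddr('Stephen J. Turnbull <stephen@example.org>')
```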


> > Internally, the Header whose .useful attribute is returned by
> > "msg['foo']" will contain parsed data, referring to parsed tokens.
> > Flattening those parsed tokens will produce the original data.  Not
> > a problem at all, simple to implement, in the most direct way.
>
>And horrid to use, if you mean that the internal representation will
>be a full parse tree according to the augmented BNF in RFCs 822, 2822,
>5322, 2045-2049, etc etc., and that the only other way to access that
>data is via an arbitrarily defined .useful attribute (which, BTW, is
>quite unpythonic if you intend for it to be available as msg['foo'] as
>well: TOOWTDI).

You put words in my mouth.  Why assume that I am incompetent, or a fool?
Of course the internal representation would include the full parse tree.
Of course the external interface would provide read and write access to the
relevant data.  The .useful attribute (need a better name) is the way to
read the useful part of the data extracted from the parse tree, whatever
type of data that is, which depends on the header field type, determined by
its name.  Each Header subclass would have its own other attributes.  The
.useful attribute guides users and is used by .__getitem__() to return that
data.
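
A rough sketch of that arrangement.  Every class name, the registry, and
add_header() here are hypothetical; only email.utils.getaddresses() is an
existing call, and the real parse tree is replaced by the stored text:

```python
from email.utils import getaddresses


class SketchHeader:
    """Hypothetical base class: keeps the original text, exposes .useful."""

    def __init__(self, name, value):
        self.name = name
        self.value = value

    @property
    def useful(self):
        return self.value                    # unstructured field: the text itself

    def flatten(self):
        return '%s: %s' % (self.name, self.value)


class SketchAddressHeader(SketchHeader):
    @property
    def useful(self):
        return getaddresses([self.value])    # list of (name, address) pairs


# Header subclass chosen by lower-cased field name (illustrative registry).
HEADER_CLASSES = {
    'to': SketchAddressHeader,
    'cc': SketchAddressHeader,
    'from': SketchAddressHeader,
}


class SketchMessage:
    def __init__(self):
        self._headers = {}

    def add_header(self, name, value):
        cls = HEADER_CLASSES.get(name.lower(), SketchHeader)
        self._headers[name.lower()] = cls(name, value)

    def get_header(self, name):
        return self._headers[name.lower()]       # the full Header object

    def __getitem__(self, name):
        return self.get_header(name).useful      # convenience shortcut
```
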
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From tonynelson at georgeanelson.com  Thu Apr 16 20:08:57 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Thu, 16 Apr 2009 14:08:57 -0400
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
 "support" in json]
In-Reply-To: <200904162302.14641.steve@pearwood.info>
References: <loom.20090408T110540-221@post.gmane.org>
	<87k55m5mno.fsf@xemacs.org>	<p04330102c60c294d2185@[192.168.123.162]>
	<200904162302.14641.steve@pearwood.info>
Message-ID: <p04330103c60d1bbc1f00@[192.168.123.162]>

At 23:02 +1000 04/16/2009, Steven D'Aprano wrote:
>On Thu, 16 Apr 2009 10:39:52 am Tony Nelson wrote:
>
>> I don't want there to be any "str(msg['tag'])" or "bytes(msg['tag'])"
>> at all, so there would be no loss of consistency.
>
>That's ... different.
>
>
>> Messages need
>> flattening to bytes, but there is no use for converting individual
>> header fields into bytes or strings, outside of a message.
>
>Of course there is. You create each header individually, so you should
>be able to extract each header individually. Here, for example, is a
>use-case: I want to send postmaster a copy of the X-Spam-Evidence
>header so she can see why a particular piece of ham got wrongly flagged
>as spam, or vice versa:
>
>X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; '(which': 0.03;
>  'attribute': 0.04; 'objects': 0.04; 'returns': 0.05; 'split':
>  0.05; ...
>
>I need to be able to extract just that one header, and while some
>applications (mail client?) may choose to give me the entire message as
>text and expect me to manually hunt for the relevant line and
>copy-and-paste it, other applications may wish to automatically extract
>the appropriate header and email it to postmaster at localhost. Or write
>it to a log file, or whatever. Whatever they do, they probably need it
>as a string (of characters or bytes), not a binary blob.

This example seems tortured and contrived.  Custom code to extract a single
header one time to send to someone?  Just hit "reply" and trim it yourself.
If you must, you can use .get_header('X-Spam-Evidence').flatten().  I doubt
that anyone would actually do that, outside of a debugging session.

Any automatic process for sending reflected spam should include more of the
message, using the relevant MIME type message/partial (or message/rfc822).


>> Some
>> header field data /is/ strings, some is lists of address pairs, and
>> so on.
>
>But "lists of address pairs" themselves are strings.

Wrong!  They are *lists* (or at least sequences) of address pairs of
friendly name, email address.  Just as bytes are not strings, dicts are
not strings, and JPEG images are not strings, lists are not strings.  For
a better understanding of what an Address is, see RFC 5322 (the current
incarnation of RFC x822), section 3.4, which describes both the best way
and current or obsolete practice.


>> If the data for a header field is not properly a string,
>
>But it always is.

No.  This is important, and you will not understand RFC x822 email until
you understand this:  email messages are not character strings.  They are
byte sequences.  This confusion pervades the email package only because in
Python before 3.x, bytes were represented as strings.


>Even badly formatted emails with corrupt headers containing binary
>characters are strings -- they're just byte (non-Unicode) strings
>containing binary characters. Your mail server might not accept it as
>part of a valid header, but it's a valid byte string.

Strings are not bytes.  Sequences of bytes are not strings.  Converting
between them demands an encoding.  Sometimes the encoding exists, sometimes
it mostly exists, and sometimes there is no such encoding, as for a JPEG
image, which is a structured byte sequence.

>> a means to get it as one is wrong.
>
>Email *is* text. It's built on top of a restricted range of ASCII bytes,
>which we can legitimately call "text" because it is a subset of Unicode
>text. Even if a particular header contains binary data, it must be
>encoded as ASCII text before it can be placed into the header.
 ...

No, email is not text.  Email message bodies and some header fields may
represent text.  An email message is a byte sequence.  One really needs to
understand this in order to work with email at a low level.  When one does
not understand, then the email package should lead the user in the right
direction.
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From rdmurray at bitdance.com  Thu Apr 16 21:42:07 2009
From: rdmurray at bitdance.com (R. David Murray)
Date: Thu, 16 Apr 2009 15:42:07 -0400 (EDT)
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
 "support" in json]
In-Reply-To: <p04330103c60d1bbc1f00@[192.168.123.162]>
References: <loom.20090408T110540-221@post.gmane.org>
	<87k55m5mno.fsf@xemacs.org>
	<p04330102c60c294d2185@[192.168.123.162]>
	<200904162302.14641.steve@pearwood.info>
	<p04330103c60d1bbc1f00@[192.168.123.162]>
Message-ID: <Pine.LNX.4.64.0904161434290.1740@kimball.webabinitio.net>

On Thu, 16 Apr 2009 at 14:08, Tony Nelson wrote:
> At 23:02 +1000 04/16/2009, Steven D'Aprano wrote:
>> On Thu, 16 Apr 2009 10:39:52 am Tony Nelson wrote:
>>
>>> I don't want there to be any "str(msg['tag'])" or "bytes(msg['tag'])"
>>> at all, so there would be no loss of consistency.
>>
>> That's ... different.

Indeed.

>>> Messages need
>>> flattening to bytes, but there is no use for converting individual
>>> header fields into bytes or strings, outside of a message.
>>
>> Of course there is. You create each header individually, so you should
>> be able to extract each header individually. Here, for example, is a
>> use-case: I want to send postmaster a copy of the X-Spam-Evidence
>> header so she can see why a particular piece of ham got wrongly flagged
>> as spam, or vice versa:
>>
>> X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; '(which': 0.03;
>>  'attribute': 0.04; 'objects': 0.04; 'returns': 0.05; 'split':
>>  0.05; ...
>>
>> I need to be able to extract just that one header, and while some
>> applications (mail client?) may choose to give me the entire message as
>> text and expect me to manually hunt for the relevant line and
>> copy-and-paste it, other applications may wish to automatically extract
>> the appropriate header and email it to postmaster at localhost. Or write
>> it to a log file, or whatever. Whatever they do, they probably need it
>> as a string (of characters or bytes), not a binary blob.
>
> This example seems tortured and contrived.  Custom code to extract a single
> header one time to send to someone?  Just hit "reply" and trim it yourself.
> If you must, you can use .get_header('X-Spam-Evidence').flatten().  I doubt
> that anyone would actually do that, outside of a debugging session.
>
> Any automatic process for sending reflected spam should include more of the
> message, using the relevant MIME type message/partial (or message/rfc822).

Have you written a user interface using the email package?  I have.
In that user interface, I most definitely want to turn individual headers
into strings.  Specifically, this is a usenet news reader, and when
presenting messages I want to display _only_ the Date and From headers.
You will note that 'From' is an address header, and in this particular
use case I want to use "str(message['From'])", and I don't care two
hoots that the thing is properly a list of friendly-name address pairs.

That is not a contrived example, that's _production code_ that I
use every day.

Nor is the quoted example all that contrived...after reading it I was
considering if it would be useful to run a program over my incoming mail
to extract the X-Spam-Evidence headers and a couple other headers and
email them to me in a report daily.  It's not useful enough that I'll
write the code, I've too many other priorities, but it's potentially
useful enough (for tuning my spam filters) that I don't consider it a
contrived use case.  And if the spam gets worse I may just come back
to that idea.

>>> Some
>>> header field data /is/ strings, some is lists of address pairs, and
>>> so on.
>>
>> But "lists of address pairs" themselves are strings.
>
> Wrong!  They are *lists* (or at least sequences) of address pairs of
> friendly name, email address.  Just as bytes are not strings, dicts are
> not strings, and JPEG images are not strings, lists are not strings.  For
> a better understanding of what an Address is, see RFC 5322 (the current
> incarnation of RFC x822), section 3.4, which describes both the best way
> and current or obsolete practice.

I suspect that most or all of us do understand the RFC.

When Steve says 'but lists of address pairs are themselves strings' I hear
him saying that each element of the pair is a string.  I think you would
have to agree with that.  Unless you want them to remain as byte strings?
Or, as I would prefer, make them into Address objects with appropriate
methods and an appropriate str.  But even then, the friendly name and
address data elements of the Address should be unicode strings.

>>> If the data for a header field is not properly a string,
>>
>> But it always is.
>
> No.  This is important, and you will not understand RFC x822 email until
> you understand this:  email messages are not character strings.  They are
> byte sequences.  This confusion pervades the email package only because in
> Python before 3.x, bytes were represented as strings.

A header always has a string representation, though.  It's the one a
dumb-text UI would present to the user.  IMO the email package needs to
support building such UIs.  The string representation is also useful
for debugging (as is the bytes representation).  I see no reason
it should not be accessible through the normal Python 'str' method.
Why obfuscate access to it?

>> Even badly formatted emails with corrupt headers containing binary
>> characters are strings -- they're just byte (non-Unicode) strings
>> containing binary characters. Your mail server might not accept it as
>> part of a valid header, but it's a valid byte string.
>
> Strings are not bytes.  Sequences of bytes are not strings.  Converting
> between them demands an encoding.  Sometimes the encoding exists, sometimes
> it mostly exists, and sometimes there is no such encoding, as for a JPEG
> image, which is a structured byte sequence.

I agree with you that Unicode strings are not bytes, and that email is
encoded as (ASCII) bytes.

As for the JPEG, sure there's no encoding in the Unicode sense.  There
certainly is an encoding, though: JPEG wrapped up in the appropriate
mime type encoding.

>>> a means to get it as one is wrong.

IMO it is always appropriate to be able to get a header body as a string.
It may not be a meaningful format in which to _manipulate_ the header
body information (which is why I think message's __getitem__ needs
to return a Header object), but it is a legitimate representation for
user consumption.

>> Email *is* text. It's built on top of a restricted range of ASCII bytes,
>> which we can legitimately call "text" because it is a subset of Unicode
>> text. Even if a particular header contains binary data, it must be
>> encoded as ASCII text before it can be placed into the header.
> ...
>
> No, email is not text.  Email message bodies and some header fields may
> represent text.  An email message is a byte sequence.  One really needs to
> understand this in order to work with email at a low level.  When one does
> not understand, then the email package should lead the user in the right
> direction.

You and Steve are defining terms differently here, I think, but other
than that I suspect you are not that far apart on this particular point.

What I want the email package to do is make it easy to pass text in
and have the email package create the syntactically correct bytes
representation to go out on the wire.  I'm visualizing building the
'From' header, for example, something like this:

     message['From'] = AddressHeader(Address('John Smith', 'john at foo.com'))

and have it default to UTF-8 encoding....or maybe the encoding gets
specified when I say message.serialize('utf-8').  But as I said, I
haven't actually written code that builds messages yet.
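
A minimal sketch of what that could look like, under stated assumptions:
the Address and AddressHeader classes below are toy illustrations of the
idea, not the email package's API, and the choice of UTF-8 as the default
charset is only the default floated above.  The one real library call is
email.utils.formataddr(), which RFC 2047-encodes a non-ASCII display name.

```python
from email.utils import formataddr


class Address:
    """Hypothetical (display name, addr-spec) pair; both parts are str."""

    def __init__(self, display_name, addr_spec):
        self.display_name = display_name
        self.addr_spec = addr_spec

    def __str__(self):
        # Display form: plain Unicode, no transfer encoding applied.
        if self.display_name:
            return "%s <%s>" % (self.display_name, self.addr_spec)
        return self.addr_spec

    def wire(self, charset="utf-8"):
        # Wire form: formataddr() quotes or RFC 2047-encodes the name.
        return formataddr((self.display_name, self.addr_spec), charset=charset)


class AddressHeader:
    """Hypothetical header body holding a sequence of Address objects."""

    def __init__(self, *addresses):
        self.addresses = list(addresses)

    def __iter__(self):
        return iter(self.addresses)

    def __str__(self):
        return ", ".join(str(a) for a in self.addresses)

    def __bytes__(self):
        # After encoding, the wire form contains ASCII only.
        return ", ".join(a.wire() for a in self.addresses).encode("ascii")


sender = AddressHeader(Address("John Smith", "john@foo.example"))
print(str(sender))
print(bytes(sender))
```

The point of the sketch is only that the display form and the wire form
can hang off the same object, reached through str() and bytes().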

Note that while I want to be able to do str(someHeader) to get a
string representation of a header body, I'm not so enamored of being
able to do

     message['From'] = 'John Smith <john at foo.com>'

and have it get turned into a Header or AddressHeader object.
Frankly, that looks too magical to me.

--David

From v+python at g.nevcal.com  Thu Apr 16 22:44:14 2009
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Thu, 16 Apr 2009 13:44:14 -0700
Subject: [Email-SIG] API for Header objects [was: Dropping
 bytes	"support" in json]
In-Reply-To: <200904162302.14641.steve@pearwood.info>
References: <loom.20090408T110540-221@post.gmane.org>	<87k55m5mno.fsf@xemacs.org>	<p04330102c60c294d2185@[192.168.123.162]>
	<200904162302.14641.steve@pearwood.info>
Message-ID: <49E7989E.60402@g.nevcal.com>

On approximately 4/16/2009 6:02 AM, came the following characters from 
the keyboard of Steven D'Aprano:
> On Thu, 16 Apr 2009 10:39:52 am Tony Nelson wrote:
>   
>> I don't want there to be any "str(msg['tag'])" or "bytes(msg['tag'])"
>> at all, so there would be no loss of consistency.
>>     
>
> That's ... different.
>   
>> If the data for a header field is not properly a string, 
>>     
> But it always is. 
>
> Even badly formatted emails with corrupt headers containing binary 
> characters are strings -- they're just byte (non-Unicode) strings 
> containing binary characters. Your mail server might not accept it as 
> part of a valid header, but it's a valid byte string.
>   

Wire format email headers are composed of a subset of ASCII text.  There 
should be a way to obtain them, either as bytes, or via the trivial str 
conversion of those bytes to Unicode.  Even corrupt headers containing 
binary characters should be obtainable that way.  There are no header 
encoding or decoding algorithms that cannot be reworked to function 
properly on either the raw_bytes or raw_str version of a header, since 
the numeric values and sequence of all binary octets would be preserved 
via both raw_bytes and raw_str.  *The key is to know what is in hand.*  
For both raw_bytes and raw_str, all characters would be in the range 0 - 
0xFF.  This is simple transliteration, not interpretation or parsing.  A 
non-corrupt header would have a smaller range, 0x20 - 0x7F.  Any header 
should be obtainable or settable in this form, using either bytes or str 
parameters/results.  Yes, it should be possible to create corrupt 
headers in this manner.  Useful mostly for testing, or for idempotency 
(which I also call GIGO).
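
A minimal sketch of that transliteration, assuming Python 3 and its
latin-1 codec, which maps each octet 0x00 - 0xFF to the code point with
the same number.  The function names are hypothetical, borrowed from the
raw_bytes and raw_str spellings used here.

```python
def to_raw_str(raw_bytes):
    """Transliterate octets to code points 0-0xFF; no interpretation."""
    return raw_bytes.decode("latin-1")


def to_raw_bytes(raw_str):
    """Inverse transliteration; raises if any character is above 0xFF."""
    return raw_str.encode("latin-1")


# A deliberately corrupt header line survives the round trip unchanged.
corrupt = b"X-Junk: caf\xe9 \xff\x00 tail"
as_text = to_raw_str(corrupt)
assert to_raw_bytes(as_text) == corrupt
assert all(ord(ch) <= 0xFF for ch in as_text)
```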

However, obtaining headers in that way should be "hard", but only in the 
sense of having to type more because it is part of a lower level 
interface, not the primary APIs... like  msg['tag'].raw_bytes or 
msg['tag'].raw_str... because it is actually the easiest way 
(implementation-wise) to obtain a copy of the data... but that copy may 
not be as useful as one might like.

str(msg['tag'])  or  msg['tag'].str   (or some such spelling[s]) should 
always produce a displayable form of the header.  If it is a known, 
standardized header that may contain data that was encoded for 
transmission, such encodings should be reversed, and Unicode characters 
outside the range of U+0020 - U+007F may be included.  Remember the goal 
here is "displayable".  So if the encoding is bad for a standard header, 
or a standard header is corrupt, or a non-standard header contains what 
is apparently binary gibberish, and non-displayable Unicode control 
characters are generated, they should be escaped as 7 ASCII characters 
representing a Unicode code point "\U+0017".  All such display strings 
must always have "\" converted to "\\" so that there is no ambiguity 
when interpreting strings that may contain text that looks like one of 
the escape strings.
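
A minimal sketch of that escaping rule.  The function name is
hypothetical, and treating the C0 controls, DEL, and the C1 controls
(U+0080 - U+009F) as "non-displayable" is an assumption; the text above
does not pin down the exact set.

```python
def displayable(text):
    """Escape backslashes and control characters for display.

    Assumption: "non-displayable" means U+0000-U+001F and U+007F-U+009F.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if ch == "\\":
            # Double every backslash so the escapes stay unambiguous.
            out.append("\\\\")
        elif code < 0x20 or 0x7F <= code <= 0x9F:
            # Seven ASCII characters: backslash, "U+", four hex digits.
            out.append("\\U+%04X" % code)
        else:
            out.append(ch)
    return "".join(out)


print(displayable("gibberish\x17 and a \\ backslash"))
```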

Known standard headers should have additional APIs (these already exist 
for the most useful ones) to obtain the interesting subcomponents 
(encodings, names, addresses, MIME types, etc.).  These should have str 
parameters and results interfaces only, and specification of an encoding 
can be optional, defaulting to UTF-8 (or possibly defaulting to a 
Message-level encoding specification, which in turn may default to 
UTF-8), overridable in some of the APIs via optional parameters (some, 
because overloaded assignment APIs may not have room for such overrides, 
not having optional parameters).

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


From stephen at xemacs.org  Fri Apr 17 12:04:39 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 17 Apr 2009 19:04:39 +0900
Subject: [Email-SIG] API for Header objects
In-Reply-To: <p04330103c60d1bbc1f00@[192.168.123.162]>
References: <loom.20090408T110540-221@post.gmane.org>
	<87k55m5mno.fsf@xemacs.org>
	<p04330102c60c294d2185@[192.168.123.162]>
	<200904162302.14641.steve@pearwood.info>
	<p04330103c60d1bbc1f00@[192.168.123.162]>
Message-ID: <87iql34mvc.fsf@xemacs.org>

Tony Nelson writes:

 > This example seems tortured and contrived.

Not at all.  I currently use grep, not the email package, but in fact
I extract several headers for use in mailing list moderation.  It's
getting to the point where my gradually accreting shell script doesn't
cut it (more because I'm recruiting additional moderators than because
I'm not happy with it), and if I'm going to do this in Python I
definitely want an obvious and elegant way to produce a displayable
string (ie, Unicode) because not all of the messages I get in Chinese
and Korean are spam.

 > Custom code to extract a single header one time to send to someone?

That is precisely why we want a simple readable short elegant API.

Like str(msg['To']).

This also suggests the sequence interface of msg['To'] should not
contain tuples of strings, but rather NameAddr objects (taken from the
RFC 5322 grammar).  Then to flatten a NameAddr, use str or bytes as
appropriate.  So to present a list of addressees in a moderation
interface, you could use

    recips = list(msg['To']) + list(msg['Cc'])

    # We have a utf-8 codec on stdout, between us and the wire.
    print("<ul>\n")
    for recip in recips:
        print("  <li>")
        print(htmlesc(str(recip)))
        print("</li>\n")
    print("</ul>\n")

Of course for wire protocol, you just use "bytes" instead of "str".
Hey! that's not bad, even if I do say so myself.
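
For comparison, a rough equivalent can be sketched with calls that
already exist in the standard library (email.utils.getaddresses,
email.utils.formataddr, html.escape).  This is not the NameAddr
interface proposed above; the recipients_html name and the sample
message are made up for illustration.

```python
import email
from email.utils import formataddr, getaddresses
from html import escape


def recipients_html(msg):
    """Render the To and Cc recipients of msg as an HTML list."""
    fields = msg.get_all("To", []) + msg.get_all("Cc", [])
    items = [
        "  <li>%s</li>" % escape(formataddr(pair))
        for pair in getaddresses(fields)
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


sample = email.message_from_string(
    "To: Alice <alice@example.org>\n"
    "Cc: bob@example.org\n"
    "\n"
    "body\n"
)
print(recipients_html(sample))
```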

 > Just hit "reply" and trim it yourself.

That won't work, for several reasons.

 > If you must, you can use .get_header('X-Spam-Evidence').flatten().
 > I doubt that anyone would actually do that, outside of a debugging
 > session.

<sigh />  I do it.

 > No.  This is important, and you will not understand RFC x822 email
 > until you understand this: email messages are not character
 > strings.  They are byte sequences.  This confusion pervades the
 > email package only because in Python before 3.x, bytes were
 > represented as strings.

That's a bit generous and ungenerous at the same time.  The people who
worked on email were trying to come up with a reasonable interface
that on the one side treated wire format as bytes (Python 1.x, 2.x
str) and display format as text (Python 1.x str, oops, Python 2.x
unicode).  They failed, unfortunately, but not really because the
tools were unavailable.  They just treated the difficulties with
insufficient respect.  On the other hand, these difficulties are
inherent in the medium.  People (by which I mean nobody participating
in this thread) think of email as text.  MTAs think of email as octet
sequences.  Developers (especially Americans) have been sloppy about
that distinction for *five* decades, and because until 2000 at least
email was the sine qua non of networking, backward compatibility has
long demanded incorporating all those mistakes in current practice.

And now you're doing the same thing.  Email messages have at *least*
four ways of manifesting in our world that email-sig needs to worry
about: as byte sequences on the wire, as (mostly, anyway, and
certainly the headers) texts in our MUAs, as whatever-they-really-are,
and as the internal representation of the email package.  So depending
on which side of the argument you feel like taking, you insist
(inconsistently) that "an email is a byte string" or "a header is not
a string at all, it's a structured thingie".  But it's not that easy.

What we need to do is come up with an API that respects all of those
aspects *simultaneously*, and allows us to elegantly but accurately
change the perspective we use to view this "whatever-it-really-is".

 > No, email is not text.  Email message bodies and some header fields
 > may represent text.  An email message is a byte sequence.  One
 > really needs to understand this in order to work with email at a
 > low level.

Hm.  And here I was hoping that the email package would *implement*
the low level, leaving me free to think about high-level things.

 > When one does not understand, then the email package should lead
 > the user in the right direction.

No, thank you.  Python is a double-opt-in language.  We're all
consenting adults here.  Programmers who don't understand the RFCs are
likely to be surprised in many places, but they asked for it, they got
it.


From stephen at xemacs.org  Fri Apr 17 12:09:42 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 17 Apr 2009 19:09:42 +0900
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
 "support" in json]
In-Reply-To: <p04330102c60d08bfa85a@[192.168.123.162]>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>
	<F40AE8EC-08CC-4634-AA82-264587552F47@python.org>
	<49DF8956.5050501@g.nevcal.com>
	<7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org>
	<p04330104c6091e8919a1@[192.168.123.162]>
	<87y6u4tn4s.fsf@xemacs.org>
	<p04330109c6098b9dbdb3@[192.168.123.162]>
	<87vdp761ly.fsf@xemacs.org>
	<p04330100c60ae51beb95@[192.168.123.162]>
	<87k55m5mno.fsf@xemacs.org>
	<p04330102c60c294d2185@[192.168.123.162]>
	<8763h55d5c.fsf@xemacs.org>
	<p04330102c60d08bfa85a@[192.168.123.162]>
Message-ID: <87hc0n4mmx.fsf@xemacs.org>

Tony Nelson writes:

 > No.  The useful data for an address field is *properly* a list of
 > pairs of friendly name, address -- you should read RFC 5322 section
 > 3.4.

The fact that you think I didn't suggests there's really no point in
continuing to talk to you.  But I'll give it another try.

The issues we are dealing with at this point really have very little
to do with accurate implementation of the RFCs.  We all know that's
necessary, but ... it's a Simple Matter Of Programming.  At least,
that's why Postel, Crocker, et al put so much effort into writing the
RFCs, so it would be a SMOP.  I think they did a pretty good job.

I agree with you that we should make it relatively difficult to put
things that *don't* conform to the RFCs on the wire.  But that should
be the responsibility of the middleware that talks to the file system
and to the MTA.  I see no reason *at this stage* to burden MUA (in the
general sense) developers with all the RFC rules, and MDA/MTA writers
"should" only need to worry about it for error handling (__bytes__()
should normally do the job for them).  (For values of "should"
equivalent to "in my dreams", I do fear.)

 > This makes it very important that the easy way of doing things be
 > the correct way.  With Address fields, that way is

Nonsense.  You are ignoring the fact that *people* (ie, nobody
participating in this thread<wink>) read an address field *as text*,
and they type in addresses *as text*.  We do not extract and inject
this information as pickles of Header objects via Firewire sockets
implanted in their skulls.  There is *no /unique/ correct way* here.

 > >For example (this is a trick question), in your opinion, what
 > >should
 > >
 > >    msg['To'][0]
 > >
 > >return if the original header was
 > >
 > >To: Stephen J. Turnbull <stephen at xemacs.org>
 > >
 > >?
 > 
 > ('Stephen J. Turnbull', 'stephen at xemacs.org')
 > 
 > You must be very confused to think this is a trick question.
 > Try it with the current email package's email.utils.parseaddr().
 > Again, see RFC5322 section 3.4.

But section 3.4 is not relevant to the trickiness, and parseaddr is
not strictly conforming.  See the definitions of name-addr,
display-name, phrase, word, atom, and atext in sections 3.2.3, 3.2.5,
and 3.4 of the RFC you cite.  Also see the definition of special.
Finally, I commend to your attention the definition of obs-phrase in
section 4.1, and the *very* special nature of this particular gotcha
as described there.

The point is that by parsing that and claiming it's an RFC 5322
section 3.4 name-addr, you have invoked the rather magical Postel
Principle.  You either have to say "for my purpose I want magic in the
API" (which you previously denied), or you have to admit that this is
harder than it looks.
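
To make the gotcha concrete, here is a rough sketch of a strict check
for the display-name.  It is a simplification made for illustration, not
a full RFC 5322 parser: comments and folding white space are ignored,
words must be separated by white space, and quoted-string contents are
only loosely checked.  The is_strict_phrase name is hypothetical.

```python
import re

# atext from RFC 5322 section 3.2.3: letters, digits and these symbols.
ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]"
# Simplified quoted-string: anything but a bare quote, backslash or newline.
QUOTED = r'"(?:[^"\\\r\n]|\\.)*"'
WORD = r"(?:%s+|%s)" % (ATEXT, QUOTED)
STRICT_PHRASE = re.compile(r"\s*%s(?:\s+%s)*\s*\Z" % (WORD, WORD))


def is_strict_phrase(display_name):
    """True if display_name is a phrase without obsolete syntax."""
    return STRICT_PHRASE.match(display_name) is not None


# "." is a special, not atext, so the unquoted name is not a strict phrase.
assert not is_strict_phrase("Stephen J. Turnbull")
# Quoting the name makes it a single quoted-string word.
assert is_strict_phrase('"Stephen J. Turnbull"')
```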

It is true that section 4.1 says that the obsolete ("interpreting")
syntax must be accepted *off the wire*.  So there certainly is a
justification for having a short obvious elegant spelling for "make an
address Header into a sequence".  But IMHO that spelling should be
"list(msg['To'])", not "msg['To']".

The rationale is that---assuming it can be implemented---several of us
would like to be able to spell "wire format" as "bytes(msg['To'])" and
"display format" as "str(msg['To'])".  I bet there are other uses that
would be well-served by such indirection.  And I would be disappointed
if we can't do way better than "msg.get_header('To').flatten()" to get
bytes---or should that be string?---out.
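
A minimal sketch of that indirection for an unstructured header body,
assuming the wire form is kept as ASCII bytes.  SketchHeader is a
hypothetical name; decode_header() and make_header() are the existing
email.header functions for undoing RFC 2047 encoded-words.

```python
from email.header import decode_header, make_header


class SketchHeader:
    """Hypothetical header body: bytes() for the wire, str() for display."""

    def __init__(self, wire_body):
        self._wire = bytes(wire_body)

    def __bytes__(self):
        # Wire format: exactly what was received or will be sent.
        return self._wire

    def __str__(self):
        # Display format: RFC 2047 encoded-words decoded to Unicode.
        text = self._wire.decode("ascii")
        return str(make_header(decode_header(text)))


subject = SketchHeader(b"=?utf-8?b?SsO8cmdlbg==?=")
print(bytes(subject))
print(str(subject))
```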

 > > > Internally, the Header whose .useful attribute is returned by
 > > > "msg['foo']" will contain parsed data, referring to parsed tokens.
 > > > Flattening those parsed tokens will produce the original data.  Not
 > > > a problem at all, simple to implement, in the most direct way.
 > >
 > >And horrid to use, if you mean that the internal representation will
 > >be a full parse tree according to the augmented BNF in RFCs 822, 2822,
 > >5322, 2045-2049, etc etc., and that the only other way to access that
 > >data is via an arbitrarily defined .useful attribute (which, BTW, is
 > >quite unpythonic if you intend for it to be available as msg['foo'] as
 > >well: TOOWTDI).
 > 
 > You put words in my mouth.

Of course I don't put words in your mouth.  The phrase "if you mean
that" clearly indicates that what follows is *my* understanding of the
implications of what you wrote.  I think that interpretation is quite
justifiable based on your insistence that the OOWTDI be your "sequence
of (address, display-name) pairs."

 > Why assume that I am incompetent, or a fool?

I don't assume any such thing.  But I become less and less trustful of
your goodwill toward requirements other than your own.

 > Of course the internal representation would include the full parse tree.
 > Of course the external interface would provide read and write access to the
 > relevant data.

Note that I didn't say it wouldn't.  I said it *would*.  But I think
it's justified, by what you have written so far, to expect that it
would be an inconvenient interface (maybe even "horridly" so).

 > The .useful attribute (need a better name)

I like __getitem__(), __str__(), and __bytes__(), for starters.  I
think we do *need* multiple names, because different presentations are
"useful" in different contexts.

 > is the way to read the useful part of the data extracted from the
 > parse tree, whatever type of data that is, which depends on the
 > header field type, determined by its name.  Each Header subclass

Please remember that Barry says he doesn't like subclassing to deal
with issues of variation in header semantics, based on his experience
with it in past versions of the email package.  I'm not sure how he
plans to avoid it (I suspect he'll be forced to give it up because
what he comes up with will be horrid<wink>), but at this stage we
really shouldn't assume that we can freely subclass Header.

 > would have its own other attributes.  The .useful attribute guides
 > users and is used by .__getitem__() to return that data.

As I said before, I agree with RDM (not to mention pretty much
everybody but you that has posted on this topic) that there should be
one more level of indirection here.  Ie, __getitem__ should return a
Header object.


From stephen at xemacs.org  Fri Apr 17 12:20:30 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 17 Apr 2009 19:20:30 +0900
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
 "support" in json]
In-Reply-To: <Pine.LNX.4.64.0904161434290.1740@kimball.webabinitio.net>
References: <loom.20090408T110540-221@post.gmane.org>
	<87k55m5mno.fsf@xemacs.org>
	<p04330102c60c294d2185@[192.168.123.162]>
	<200904162302.14641.steve@pearwood.info>
	<p04330103c60d1bbc1f00@[192.168.123.162]>
	<Pine.LNX.4.64.0904161434290.1740@kimball.webabinitio.net>
Message-ID: <87fxg74m4x.fsf@xemacs.org>

R. David Murray writes:

 > Note that while I want to be able to do str(someHeader) to get a
 > string representation of a header body, I'm not so enamored of being
 > able to do
 > 
 >      message['From'] = 'John Smith <john at foo.com>'
 > 
 > and have it get turned into a Header or AddressHeader object.
 > Frankly, that looks too magical to me.

+1

Well, that would make it easy to write scripts that parse lists of
addresses and do things with them.  Eg, a mailing list manager's "mass
subscribe" interface.  That would be nice ... but on reflection it's
clear that we would want that to be parsed *strictly*.  So it raises
exceptions, which must be caught and handled, etc etc.  In other
words, it's actually not so easy to write scripts, no matter what you
do, and you also want to be able to specify what kind of magical
fixups (the ever-popular "display-name with unquoted period"
immediately comes to mind as one example) are acceptable, and which
are not, not to mention encoding for non-ASCII text.

How about unstructured header bodies, like "Subject"?  Should we allow
it, for convenience, or not, for consistency?

How about unknown fields, eg "X-Are-We-Not-Structured-No-We-Are-Devo"?

I think, in the first draft, we should be *consistent* in both cases.


From rdmurray at bitdance.com  Fri Apr 17 13:19:31 2009
From: rdmurray at bitdance.com (R. David Murray)
Date: Fri, 17 Apr 2009 07:19:31 -0400 (EDT)
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
 "support" in json]
In-Reply-To: <87fxg74m4x.fsf@xemacs.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<87k55m5mno.fsf@xemacs.org>
	<p04330102c60c294d2185@[192.168.123.162]>
	<200904162302.14641.steve@pearwood.info>
	<p04330103c60d1bbc1f00@[192.168.123.162]>
	<Pine.LNX.4.64.0904161434290.1740@kimball.webabinitio.net>
	<87fxg74m4x.fsf@xemacs.org>
Message-ID: <Pine.LNX.4.64.0904170659580.1740@kimball.webabinitio.net>

On Fri, 17 Apr 2009 at 19:20, Stephen J. Turnbull wrote:
> R. David Murray writes:
>
> > Note that while I want to be able to do str(someHeader) to get a
> > string representation of a header body, I'm not so enamored of being
> > able to do
> >
> >      message['From'] = 'John Smith <john at foo.com>'
> >
> > and have it get turned into a Header or AddressHeader object.
> > Frankly, that looks too magical to me.
>
> +1
>
> Well, that would make it easy to write scripts that parse lists of
> addresses and do things with them.  Eg, a mailing list manager's "mass
> subscribe" interface.  That would be nice ... but on reflection it's
> clear that we would want that to be parsed *strictly*.  So it raises
> exceptions, which must be caught and handled, etc etc.  In other
> words, it's actually not so easy to write scripts, no matter what you
> do, and you also want to be able to specify what kind of magical
> fixups (the ever-popular "display-name with unquoted period"
> immediately comes to mind as one example) are acceptable, and which
> are not, not to mention encoding for non-ASCII text.
>
> How about unstructured header bodies, like "Subject"?  Should we allow
> it, for convenience, or not, for consistency?
>
> How about unknown fields, eg "X-Are-We-Not-Structured-No-We-Are-Devo"?
>
> I think, in the first draft, we should be *consistent* in both cases.

Yes, I think consistency is good.  Since I'm visualizing a message as
being a container for headers (as well as the body...but we aren't
talking about that right now), I would expect that I'd have to
put Header objects into it.  I don't think the "overhead" of
having to do

     message['Subject'] = Header('subject string')

is very large, and the code feels better to me that way (if I get
Headers out, I should be putting Headers in..."explicit is better
than implicit").
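
A small sketch of that rule, assuming it is enforced in __setitem__.
ExplicitMessage is a hypothetical subclass written for illustration; the
existing email.message.Message accepts both strings and Header objects.

```python
from email.header import Header
from email.message import Message


class ExplicitMessage(Message):
    """Hypothetical Message that only accepts Header objects as values."""

    def __setitem__(self, name, value):
        if not isinstance(value, Header):
            raise TypeError(
                "header values must be Header objects, got %s"
                % type(value).__name__
            )
        super().__setitem__(name, value)


message = ExplicitMessage()
message['Subject'] = Header('subject string')
print(str(message['Subject']))

try:
    message['From'] = 'John Smith <john@foo.example>'
except TypeError as exc:
    print("rejected:", exc)
```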

As for parsing a list of addresses in a manager interface...presumably
we want to provide address-parsing tools to make that job easier.
Folding address parsing in to a header setting operation would make
that scripting task _harder_.  Header creation should provide an easy
way to pass in user address input as Unicode strings, but underlying
that should be a more atomic address parsing interface.
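
A minimal sketch of what such an atomic address-parsing helper could look
like, kept separate from header assignment.  It leans on the existing
email.utils.getaddresses(); the name parse_subscribers and the crude
"must contain @" check are assumptions for illustration only, not a
proposed API and not RFC validation.

```python
from email.utils import getaddresses


def parse_subscribers(text):
    """Split user-entered text into (display_name, addr_spec) pairs.

    Hypothetical helper: raises ValueError instead of silently
    accepting entries that do not look like addresses.
    """
    pairs = getaddresses([text])
    # Crude strictness check (an assumption, not RFC validation).
    bad = [addr for _name, addr in pairs if "@" not in addr]
    if bad or not pairs:
        raise ValueError("unparseable address(es): %r" % (bad,))
    return pairs


subscribers = parse_subscribers("John Smith <john@foo.com>, jane@bar.org")
```

The caller of a "mass subscribe" script catches ValueError once, around the
parse, and only afterwards builds header objects from the clean pairs.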

--David

From stephen at xemacs.org  Fri Apr 17 17:13:06 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sat, 18 Apr 2009 00:13:06 +0900
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
 "support" in json]
In-Reply-To: <Pine.LNX.4.64.0904170659580.1740@kimball.webabinitio.net>
References: <loom.20090408T110540-221@post.gmane.org>
	<87k55m5mno.fsf@xemacs.org>
	<p04330102c60c294d2185@[192.168.123.162]>
	<200904162302.14641.steve@pearwood.info>
	<p04330103c60d1bbc1f00@[192.168.123.162]>
	<Pine.LNX.4.64.0904161434290.1740@kimball.webabinitio.net>
	<87fxg74m4x.fsf@xemacs.org>
	<Pine.LNX.4.64.0904170659580.1740@kimball.webabinitio.net>
Message-ID: <877i1j48l9.fsf@xemacs.org>

R. David Murray writes:

 > put Header objects into it.  I don't think the "overhead" of
 > having to do
 > 
 >      message['Subject'] = Header('subject string')

Hm.  Should a Header know which header it is?  Ie, should that be

    message['Subject'] = Header('subject', 'subject string')

?  (I assume you would be less than in love with having the assignment
magically stuffing "Subject" into the Header as it gets assigned.)

From rdmurray at bitdance.com  Fri Apr 17 17:21:34 2009
From: rdmurray at bitdance.com (R. David Murray)
Date: Fri, 17 Apr 2009 11:21:34 -0400 (EDT)
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
 "support" in json]
In-Reply-To: <877i1j48l9.fsf@xemacs.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<87k55m5mno.fsf@xemacs.org>
	<p04330102c60c294d2185@[192.168.123.162]>
	<200904162302.14641.steve@pearwood.info>
	<p04330103c60d1bbc1f00@[192.168.123.162]>
	<Pine.LNX.4.64.0904161434290.1740@kimball.webabinitio.net>
	<87fxg74m4x.fsf@xemacs.org>
	<Pine.LNX.4.64.0904170659580.1740@kimball.webabinitio.net>
	<877i1j48l9.fsf@xemacs.org>
Message-ID: <Pine.LNX.4.64.0904171116010.1740@kimball.webabinitio.net>

On Sat, 18 Apr 2009 at 00:13, Stephen J. Turnbull wrote:
> R. David Murray writes:
>
> > put Header objects into it.  I don't think the "overhead" of
> > having to do
> >
> >      message['Subject'] = Header('subject string')
>
> Hm.  Should a Header know which header it is?  Ie, should that be
>
>    message['Subject'] = Header('subject', 'subject string')
>
> ?  (I assume you would be less than in love with having the assignment
> magically stuffing "Subject" into the Header as it gets assigned.)

Hmm.  Probably.  But:

     message.addHeader(Header('subject', 'subject string'))

would seem sensible.  That loses the nice collections interface...but
if a Header knows its keyword then it makes sense.

However, I'm not convinced a Header should know its keyword.  After all,
the only difference between a From: header and a To: header is the
keyword, and one can easily imagine wanting to do something like:

     replymessage['to'] = frommessage['from']
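
For comparison, a small sketch of that reuse with the current package,
where header values are plain strings and so carry no keyword of their
own.  Only the existing email.message.Message class is used; the address
is a made-up example.

```python
from email.message import Message

frommessage = Message()
frommessage["From"] = "John Smith <john@foo.com>"

replymessage = Message()
# The value carries no field name, so it can be reused under another one.
replymessage["To"] = frommessage["From"]
```

A Header that stored its own keyword would need either a copy with a new
name or a rename step at this point.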

--David

From barry at python.org  Fri Apr 17 18:38:20 2009
From: barry at python.org (Barry Warsaw)
Date: Fri, 17 Apr 2009 12:38:20 -0400
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
	"support" in json]
In-Reply-To: <Pine.LNX.4.64.0904171116010.1740@kimball.webabinitio.net>
References: <loom.20090408T110540-221@post.gmane.org>
	<87k55m5mno.fsf@xemacs.org>
	<p04330102c60c294d2185@[192.168.123.162]>
	<200904162302.14641.steve@pearwood.info>
	<p04330103c60d1bbc1f00@[192.168.123.162]>
	<Pine.LNX.4.64.0904161434290.1740@kimball.webabinitio.net>
	<87fxg74m4x.fsf@xemacs.org>
	<Pine.LNX.4.64.0904170659580.1740@kimball.webabinitio.net>
	<877i1j48l9.fsf@xemacs.org>
	<Pine.LNX.4.64.0904171116010.1740@kimball.webabinitio.net>
Message-ID: <14BC0DB2-1F8B-4872-BD55-E9824D9D43BA@python.org>

Folks, just a quick followup explanation on why I've been radio silent  
on this sig.  Taxes, work, and personal stuff just caught up to me  
this week.  I'm hoping to find some time this weekend to review and  
respond to the blizzard of messages on this thread.

-Barry

From tonynelson at georgeanelson.com  Fri Apr 17 19:25:59 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Fri, 17 Apr 2009 13:25:59 -0400
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
 "support" in json]
In-Reply-To: <14BC0DB2-1F8B-4872-BD55-E9824D9D43BA@python.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<87k55m5mno.fsf@xemacs.org>	<p04330102c60c294d2185@[192.168.123.162]>
	<200904162302.14641.steve@pearwood.info>
	<p04330103c60d1bbc1f00@[192.168.123.162]>
	<Pine.LNX.4.64.0904161434290.1740@kimball.webabinitio.net>
	<87fxg74m4x.fsf@xemacs.org>
	<Pine.LNX.4.64.0904170659580.1740@kimball.webabinitio.net>
	<877i1j48l9.fsf@xemacs.org>
	<Pine.LNX.4.64.0904171116010.1740@kimball.webabinitio.net>
	<14BC0DB2-1F8B-4872-BD55-E9824D9D43BA@python.org>
Message-ID: <p04330103c60e6afcedbb@[192.168.123.162]>

At 12:38 -0400 04/17/2009, Barry Warsaw wrote:

>Folks, just a quick followup explanation on why I've been radio silent
>on this sig.  Taxes, work, and personal stuff just caught up to me
>this week.  I'm hoping to find some time this weekend to review and
>respond to the blizzard of messages on this thread.

Nothing else?  A Python release?

Your adult supervision will be welcome.  I, for one, did not learn
everything I needed from kindergarten.

Also, we need some new threads.  We've been keeping everything in this JSON
thread.
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From tonynelson at georgeanelson.com  Fri Apr 17 19:26:16 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Fri, 17 Apr 2009 13:26:16 -0400
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
 "support" in json]
In-Reply-To: <877i1j48l9.fsf@xemacs.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<87k55m5mno.fsf@xemacs.org>	<p04330102c60c294d2185@[192.168.123.162]>
	<200904162302.14641.steve@pearwood.info>
	<p04330103c60d1bbc1f00@[192.168.123.162]>
	<Pine.LNX.4.64.0904161434290.1740@kimball.webabinitio.net>
	<87fxg74m4x.fsf@xemacs.org>
	<Pine.LNX.4.64.0904170659580.1740@kimball.webabinitio.net>
	<877i1j48l9.fsf@xemacs.org>
Message-ID: <p04330102c60e691f7dcb@[192.168.123.162]>

At 00:13 +0900 04/18/2009, Stephen J. Turnbull wrote:
>R. David Murray writes:
>
> > put Header objects into it.  I don't think the "overhead" of
> > having to do
> >
> >      message['Subject'] = Header('subject string')
>
>Hm.  Should a Header know which header it is?  Ie, should that be
>
>    message['Subject'] = Header('subject', 'subject string')
 ...

How about:

    message['Subject'] = 'subject string'
    message['To'] = ('joe', 'joe123 at foo.com')

Since the Header does indeed know what it is, a Subject: Header could
expect a string for input, and an Address header could expect an address
tuple or list of them (and cope with possibly needing to coerce the
addr-spec into bytes with the ASCII codec).

Internally, Message.__setitem__() would look up the name, making and
assigning the proper Header subclass if missing, and pass that object the
data.  The Header subclass knows what type of data it expects and raises
(ValueError?) if it gets something inappropriate.
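
A rough sketch of that lookup, under stated assumptions: every class name
here (Header, UnstructuredHeader, AddressHeader, Message), the HEADER_TYPES
registry and the set() method are hypothetical, invented for illustration,
and are not the existing email package API.

```python
class Header:
    """Base class for the hypothetical per-field header types."""

    def __init__(self):
        self.value = None

    def set(self, data):
        raise NotImplementedError


class UnstructuredHeader(Header):
    def set(self, data):
        if not isinstance(data, str):
            raise ValueError("expected a string, got %r" % (data,))
        self.value = data


class AddressHeader(Header):
    def set(self, data):
        # Accept one (name, addr-spec) tuple or a list of them.
        if isinstance(data, tuple):
            pairs = [data]
        elif isinstance(data, list):
            pairs = data
        else:
            raise ValueError("expected tuple or list, got %r" % (data,))
        for pair in pairs:
            if not (isinstance(pair, tuple) and len(pair) == 2
                    and all(isinstance(part, str) for part in pair)):
                raise ValueError("expected (name, addr-spec), got %r" % (pair,))
        self.value = list(pairs)


# Registry keyed on the lower-cased field name (an assumption).
HEADER_TYPES = {"subject": UnstructuredHeader, "to": AddressHeader}


class Message:
    def __init__(self):
        self._headers = {}

    def __setitem__(self, name, data):
        key = name.lower()
        header = self._headers.get(key)
        if header is None:
            # Unknown fields fall back to the unstructured type.
            header = HEADER_TYPES.get(key, UnstructuredHeader)()
        header.set(data)              # may raise ValueError
        self._headers[key] = header

    def __getitem__(self, name):
        return self._headers[name.lower()]


message = Message()
message["Subject"] = "subject string"
message["To"] = ("joe", "joe123@foo.com")
```
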
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From rdmurray at bitdance.com  Fri Apr 17 19:32:11 2009
From: rdmurray at bitdance.com (R. David Murray)
Date: Fri, 17 Apr 2009 13:32:11 -0400 (EDT)
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
 "support" in json]
In-Reply-To: <p04330102c60e691f7dcb@[192.168.123.162]>
References: <loom.20090408T110540-221@post.gmane.org>
	<87k55m5mno.fsf@xemacs.org>
	<p04330102c60c294d2185@[192.168.123.162]>
	<200904162302.14641.steve@pearwood.info>
	<p04330103c60d1bbc1f00@[192.168.123.162]>
	<Pine.LNX.4.64.0904161434290.1740@kimball.webabinitio.net>
	<87fxg74m4x.fsf@xemacs.org>
	<Pine.LNX.4.64.0904170659580.1740@kimball.webabinitio.net>
	<877i1j48l9.fsf@xemacs.org> <p04330102c60e691f7dcb@[192.168.123.162]>
Message-ID: <Pine.LNX.4.64.0904171331360.1740@kimball.webabinitio.net>

On Fri, 17 Apr 2009 at 13:26, Tony Nelson wrote:
> At 00:13 +0900 04/18/2009, Stephen J. Turnbull wrote:
>> R. David Murray writes:
>>
>>> put Header objects into it.  I don't think the "overhead" of
>>> having to do
>>>
>>>      message['Subject'] = Header('subject string')
>>
>> Hm.  Should a Header know which header it is?  Ie, should that be
>>
>>    message['Subject'] = Header('subject', 'subject string')
> ...
>
> How about:
>
>    message['Subject'] = 'subject string'
>    message['To'] = ('joe', 'joe123 at foo.com')

Like I said, personally I find that too magical for my tastes.

--David

From tonynelson at georgeanelson.com  Fri Apr 17 19:37:17 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Fri, 17 Apr 2009 13:37:17 -0400
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
 "support" in json]
In-Reply-To: <87hc0n4mmx.fsf@xemacs.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>
	<F40AE8EC-08CC-4634-AA82-264587552F47@python.org>
	<49DF8956.5050501@g.nevcal.com>
	<7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org>
	<p04330104c6091e8919a1@[192.168.123.162]>	<87y6u4tn4s.fsf@xemacs.org>
	<p04330109c6098b9dbdb3@[192.168.123.162]>	<87vdp761ly.fsf@xemacs.org>
	<p04330100c60ae51beb95@[192.168.123.162]>	<87k55m5mno.fsf@xemacs.org>
	<p04330102c60c294d2185@[192.168.123.162]>	<8763h55d5c.fsf@xemacs.org>
	<p04330102c60d08bfa85a@[192.168.123.162]> <87hc0n4mmx.fsf@xemacs.org>
Message-ID: <p04330101c60e68b66511@[192.168.123.162]>

At 19:09 +0900 04/17/2009, Stephen J. Turnbull wrote:
>Tony Nelson writes:
>
> > No.  The useful data for an address field is *properly* a list of
> > pairs of friendly name, address -- you should read RFC 5322 section
> > 3.4.
>
>The fact that you think I didn't suggests there's really no point in
>continuing to talk to you.  But I'll give it another try.
>
>The issues we are dealing with at this point really have very little
>to do with accurate implementation of the RFCs.  We all know that's
>necessary, but ... it's a Simple Matter Of Programming.  At least,
>that's why Postel, Crocker, et al put so much effort into writing the
>RFCs, so it would be a SMOP.  I think they did a pretty good job.
>
>I agree with you that we should make it relatively difficult to put
>things that *don't* conform to the RFCs on the wire.  But that should
>be the responsibility of the middleware that talks to the file system
>and to the MTA.  I see no reason *at this stage* to burden MUA (in the
>general sense) developers with all the RFC rules, and MDA/MTA writers
>"should" only need to worry about it for error handling (__bytes__()
>should normally do the job for them).  (For values of "should"
>equivalent to "in my dreams", I do fear.)

What you are insisting on is so burdening them.  I propose lifting that burden.


> > This makes it very important that the easy way of doing things be
> > the correct way.  With Address fields, that way is
>
>Nonsense.  You are ignoring the fact that *people* (ie, nobody
>participating in this thread<wink>) read an address field *as text*,
>and they type in addresses *as text*.  We do not extract and inject
>this information as pickles of Header objects via Firewire sockets
>implanted in their skulls.  There is *no /unique/ correct way* here.

If only "People" did that in a way that survived transport.

>
> > >For example (this is a trick question), in your opinion, what
> > >should
> > >
> > >    msg['To'][0]
> > >
> > >return if the original header was
> > >
> > >To: Stephen J. Turnbull <stephen at xemacs.org>
> > >
> > >?
> >
> > ('Stephen J. Turnbull', 'stephen at xemacs.org')
> >
> > You must be very confused to think this is a trick question.
> > Try it with the current email package's email.utils.parseaddr().
> > Again, see RFC5322 section 3.4.
>
>But section 3.4 is not relevant to the trickiness, and parseaddr is
>not strictly conforming.  See the definitions of name-addr,
>display-name, phrase, word, atom, and atext in sections 3.2.3, 3.2.5,
>and 3.4 of the RFC you cite.  Also see the definition of special.
>Finally, I commend to your attention the definition of obs-phrase in
>section 4.1, and the *very* special nature of this particular gotcha
>as described there.

What parseaddr() doesn't support is groups.  I haven't seen groups used,
though.  It does support Comments when a name-addr is not present.

I still don't see any trick.  "Stephen J. Turnbull" has always been
accepted as a display-name, RFC 822 notwithstanding.  Any useful
implementation must take such things into account, even if conformance
would have required the display-name (or at least the ".") to have been
quoted.
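
A short sketch of the lenient behaviour in question, using the existing
email.utils.parseaddr().  The unquoted "." in the display-name is the
point of contention: the strict grammar admits it only through the
obsolete obs-phrase form, yet the parser is expected to hand back the
(display-name, addr-spec) pair given above.

```python
from email.utils import parseaddr

# The display-name contains an unquoted period, which the strict
# grammar only admits through the obsolete obs-phrase form.
name, addr = parseaddr("Stephen J. Turnbull <stephen@xemacs.org>")
```

A validating generator would have to decide separately whether to quote
the name or reject it; parseaddr() itself reports no defect here.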


>The point is that by parsing that and claiming it's an RFC 5322
>section 3.4 name-addr, you have invoked the rather magical Postel
>Principle.  You either have to say "for my purpose I want magic in the
>API" (which you previously denied), or you have to admit that this is
>harder than it looks.
 ...

No.  You want to make it hard for the user of the email package.  I want to
make it easy for the user of the email package.  How hard it is for the
programmer is not an issue, but thank you for your concern.
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From tonynelson at georgeanelson.com  Fri Apr 17 19:37:43 2009
From: tonynelson at georgeanelson.com (Tony Nelson)
Date: Fri, 17 Apr 2009 13:37:43 -0400
Subject: [Email-SIG] API for Header objects
In-Reply-To: <87iql34mvc.fsf@xemacs.org>
References: <loom.20090408T110540-221@post.gmane.org>
	<87k55m5mno.fsf@xemacs.org>	<p04330102c60c294d2185@[192.168.123.162]>
	<200904162302.14641.steve@pearwood.info>
	<p04330103c60d1bbc1f00@[192.168.123.162]> <87iql34mvc.fsf@xemacs.org>
Message-ID: <p04330100c60e6237de14@[192.168.123.162]>

At 19:04 +0900 04/17/2009, Stephen J. Turnbull wrote:
>Tony Nelson writes:
>
> > This example seems tortured and contrived.
>
>Not at all.  I currently use grep, not the email package, but in fact
>I extract several headers for use in mailing list moderation.  It's
>getting to the point where my gradually accreting shell script doesn't
>cut it (more because I'm recruiting additional moderators than because
>I'm not happy with it), and if I'm going to do this in Python I
>definitely want an obvious and elegant way to produce a displayable
>string (ie, Unicode) because not all of the messages I get in Chinese
>and Korean are spam.

Now /that/ is a use case.  Spam headers are a poor one in any case, as
there are so many different ones.


> > Custom code to extract a single header one time to send to someone?
>
>That is precisely why we want a simple readable short elegant API.
>
>Like str(msg['To']).

Would that return the display-name (friendly name) for the listed mailboxes
in one string, presumably separated by commas?  How would you get the
addr-specs?  How would you get both?  Use bytes() to flatten all the data,
or just the addr-specs?


>This also suggests the sequence interface of msg['To'] should not
>contain tuples of strings, but rather NameAddr objects (taken from the
>RFC 5322 grammar).  Then to flatten a NameAddr, use str or bytes as
>appropriate.  So to present a list of addressees in a moderation
>interface, you could use

I was a bit sloppy.  The tuples would be character string, byte string:  in
2.x, unicode and string; in 3.x, string and bytes.  Flattening to bytes
(2.x: string) for export would be ._flatten().

In practice, the display-names and addr-specs may have had "defects" when
parsing the message.  Addr-specs are supposed to be ASCII, but the
local-part sometimes isn't.  Display-names don't always RFC 2047 decode
properly, or may have non-ASCII characters in them.

>    recips = list(msg['To']) + list(msg['Cc'])
>
>    # We have a utf-8 codec on stdout, between us and the wire.
>    print("<ul>\n")
>    for recip in recips:
>        print("  <li>")
>        print(htmlesc(str(recip)))
>        print("</li>\n")
>    print("</ul>\n")
>
>Of course for wire protocol, you just use "bytes" instead of "str".
>Hey! that's not bad, even if I do say so myself.

You wouldn't like

    for name, addr in msg['To'] + msg['Cc'] + msg['Bcc']:

instead?  str(addr) should work (IIUC Py3K) if addr is ASCII, as it should be.


>...People (by which I mean nobody participating
>in this thread) think of email as text. ...

No, they don't.  You have to ask them the right questions.  Sure they'll
say text, but they really expect styled fancy structured colored text with
pictures and links and attached documents.  Roughly, they think of email as
web pages (archived HTML, if they knew the word).  Only the most
sophisticated or old and stubborn think of it otherwise.

>MTAs think of email as octet sequences.

Well, disagree with the MTA, and the MTA wins.  Messages on the wire /are/
bytes.

>Developers (especially Americans) have been sloppy about
>that distinction for *five* decades, and because until 2000 at least
>email was the sine qua non of networking, backward compatibility has
>long demanded incorporating all those mistakes in current practice.
>
>And now you're doing the same thing.  Email messages have at *least*
>four ways of manifesting in our world that email-sig needs to worry
>about: as byte sequences on the wire, as (mostly, anyway, and
>certainly the headers) texts in our MUAs, as whatever-they-really-are,
>and as the internal representation of the email package.  So depending
>on which side of the argument you feel like taking, you insist
>(inconsistently) that "an email is a byte string" or "a header is not
>a string at all, it's a structured thingie".  But it's not that easy.
>
>What we need to do is come up with an API that respects all of those
>aspects *simultaneously*, and allows us to elegantly but accurately
>change the perspective we use to view this "whatever-it-really-is".

That's why my proposal is so good, as it does this.


> > No, email is not text.  Email message bodies and some header fields
> > may represent text.  An email message is a byte sequence.  One
> > really needs to understand this in order to work with email at a
> > low level.
>
>Hm.  And here I was hoping that the email package would *implement*
>the low level, leaving me free to think about high-level things.

You have that now, and it is terribly hard to use.


> > When one does not understand, then the email package should lead
> > the user in the right direction.
>
>No, thank you.  Python is a double-opt-in language.  We're all
>consenting adults here.  Programmers who don't understand the RFCs are
>likely to be surprised in many places, but they asked for it, they got
>it.

Battery materials included!  Build your own batteries if you can learn how!
Some have done it in as little as two years.

There are other languages competing with Python, and users can choose to
use them instead.  Python's email package needs to stop requiring years of
study to use correctly.
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

From rdmurray at bitdance.com  Fri Apr 17 22:25:42 2009
From: rdmurray at bitdance.com (R. David Murray)
Date: Fri, 17 Apr 2009 16:25:42 -0400 (EDT)
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
 "support" in json]
In-Reply-To: <p04330101c60e68b66511@[192.168.123.162]>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>
	<F40AE8EC-08CC-4634-AA82-264587552F47@python.org>
	<49DF8956.5050501@g.nevcal.com>
	<7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org>
	<p04330104c6091e8919a1@[192.168.123.162]>
	<87y6u4tn4s.fsf@xemacs.org> <p04330109c6098b9dbdb3@[192.168.123.162]>
	<87vdp761ly.fsf@xemacs.org> <p04330100c60ae51beb95@[192.168.123.162]>
	<87k55m5mno.fsf@xemacs.org> <p04330102c60c294d2185@[192.168.123.162]>
	<8763h55d5c.fsf@xemacs.org> <p04330102c60d08bfa85a@[192.168.123.162]>
	<87hc0n4mmx.fsf@xemacs.org> <p04330101c60e68b66511@[192.168.123.162]>
Message-ID: <Pine.LNX.4.64.0904171610130.1740@kimball.webabinitio.net>

On Fri, 17 Apr 2009 at 13:37, Tony Nelson wrote:
> At 19:09 +0900 04/17/2009, Stephen J. Turnbull wrote:
>> Tony Nelson writes:
>>
>> I agree with you that we should make it relatively difficult to put
>> things that *don't* conform to the RFCs on the wire.  But that should
>> be the responsibility of the middleware that talks to the file system
>> and to the MTA.  I see no reason *at this stage* to burden MUA (in the
>> general sense) developers with all the RFC rules, and MDA/MTA writers
>> "should" only need to worry about it for error handling (__bytes__()
>> should normally do the job for them).  (For values of "should"
>> equivalent to "in my dreams", I do fear.)
>
> What you are insisting on is so burdening them.  I propose lifting that burden.

I don't see how Stephen's and my proposals burden the developer
more than yours.  In fact, I'm pretty sure it's the opposite
way around.

>>> This makes it very important that the easy way of doing things be
>>> the correct way.  With Address fields, that way is
>>
>> Nonsense.  You are ignoring the fact that *people* (ie, nobody
>> participating in this thread<wink>) read an address field *as text*,
>> and they type in addresses *as text*.  We do not extract and inject
>> this information as pickles of Header objects via Firewire sockets
>> implanted in their skulls.  There is *no /unique/ correct way* here.
>
> If only "People" did that in a way that survived transport.

I don't understand that comment.  It's the email package's job to provide
a way for the programmer (the user of the email package's API) to allow
the text entered by the user (the person actually sending and receiving
messages) to survive transport.

--David

From stephen at xemacs.org  Sat Apr 18 09:02:28 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sat, 18 Apr 2009 16:02:28 +0900
Subject: [Email-SIG] API for Header objects
In-Reply-To: <p04330100c60e6237de14@[192.168.123.162]>
References: <loom.20090408T110540-221@post.gmane.org>
	<87k55m5mno.fsf@xemacs.org>
	<p04330102c60c294d2185@[192.168.123.162]>
	<200904162302.14641.steve@pearwood.info>
	<p04330103c60d1bbc1f00@[192.168.123.162]>
	<87iql34mvc.fsf@xemacs.org>
	<p04330100c60e6237de14@[192.168.123.162]>
Message-ID: <87y6ty30mz.fsf@xemacs.org>

Tony Nelson writes:

 > > > Custom code to extract a single header one time to send to someone?
 > >
 > >That is precisely why we want a simple readable short elegant API.
 > >
 > >Like str(msg['To']).
 > 
 > Would that return the display-name (friendly name) for the listed mailboxes
 > in one string, presumably separated by commas?  How would you get the
 > addr-specs?  How would you get both?  Use bytes() to flatten all the data,
 > or just the addr-specs?

Who knows?  Who cares?  AFAICS it's a SMOP.  Are we not hackers?
We'll design the internal representation to be lossless, then flatten
it out in a simple, straightforward way, as a newline-less,
comma-separated string.  The question at issue here is "how does an
email client request flattening to bytes? to str?"  Not "how does
email do those things?"

 > I was a bit sloppy.  The tuples would be character string, byte string:  in
 > 2.x, unicode and string; in 3.x, string and bytes.  Flattening to bytes
 > (2.x: string) for export would be ._flatten().

You're still being sloppy.  In this part of the thread we were talking
about a *text* representation ("extract a single header ... to send to
someone", where the header in question is intended to be
human-readable).  Why are we suddenly using bytes?

 > In practice, the display-names and addr-specs may have had "defects" when
 > parsing the message.

So what?  That is, yes, we all already know that, and you don't need
to repeat it unless you're going to tell us something new.  Like, how
are we going to represent the choice of how to deal with them in the
API?  What choices are we going to offer?

 > You wouldn't like
 > 
 >     for name, addr in msg['To'] + msg['Cc'] + msg['Bcc']:
 > 
 > instead? 

Not necessarily, because that requires me to special-case situations
where name is None, at least, and maybe cases where addr is None (that
depends on what the parser does with a header like

    To: me at home.com, , you at earth.li

of course.)
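
A minimal sketch of the alternative, assuming a hypothetical NameAddr
element type whose str() hides the missing-name case from the caller.
The class and its attribute names are invented for illustration; only
email.utils.formataddr() is an existing call.

```python
from email.utils import formataddr


class NameAddr:
    """Hypothetical element type for the sequence view of an address header."""

    def __init__(self, display_name, addr_spec):
        self.display_name = display_name   # may be None or ""
        self.addr_spec = addr_spec

    def __str__(self):
        # formataddr() omits the name part when it is empty and quotes
        # display-names containing specials such as ".".
        return formataddr((self.display_name or "", self.addr_spec))


recips = [NameAddr(None, "me@home.com"),
          NameAddr("Stephen J. Turnbull", "stephen@xemacs.org")]
lines = [str(recip) for recip in recips]
```

The loop body stays the same whether or not a display-name was present,
which is the special-casing that tuple unpacking pushes onto the caller.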

 > >...People (by which I mean nobody participating
 > >in this thread) think of email as text. ...
 > 
 > No, they don't.

Will you please cut this out?  Everything you say is true.  The
problem is that you are inconsistently choosing half-truths for the
purpose of winning a debate, rather than trying to design a coherent
API.  The latter is the purpose of this SIG, not the former.

I do not have a lot of confidence that I'm *right*.  However, the idea
that bytes(object) gives wire format, str(object) gives a simple text
presentation, and more complex presentation requires either massaging
str(object) or direct access to the internal representation of object
is a unifying theme in my (so far partial) proposal.

You have no unity, just confidence that your API for Headers that are
structured as address lists is "right".  If you continue with that
approach on a Header type by Header type basis, experience suggests
that you *will* end up with a horrid API.

 > >What we need to do is come up with an API that respects all of those
 > >aspects *simultaneously*, and allows us to elegantly but accurately
 > >change the perspective we use to view this "whatever-it-really-is".
 > 
 > That's why my proposal is so good, as it does this.

But only for the To: header.  There's no generality to it, you will
propose a different representation for the "useful data" of other
headers, and you still don't deal with the fact that what's useful to
you may not serve the needs of others.

 > >Hm.  And here I was hoping that the email package would *implement*
 > >the low level, leaving me free to think about high-level things.
 > 
 > You have that now, and it is terribly hard to use.

So let's fix the implementation to be easy to use.  I see no proof whatsoever
 > 
 > 
 > > > When one does not understand, then the email package should lead
 > > > the user in the right direction.
 > >
 > >No, thank you.  Python is a double-opt-in language.  We're all
 > >consenting adults here.  Programmers who don't understand the RFCs are
 > >likely to be surprised in many places, but they asked for it, they got
 > >it.
 > 
 > Battery materials included!  Build your own batteries if you can learn how!
 > Some have done it in as little as two years.
 > 
 > There are other languages competing with Python, and users can choose to
 > use them instead.  Python's email package needs to stop requiring years of
 > study to use correctly.

The API you propose will require such study, I'm pretty sure.  If you
want to convince me otherwise (and I believe I'm representative in
this), you need to show how your approach will lead to regularity and
coherence in the API for *all* headers, not just the example-du-jour.


From stephen at xemacs.org  Sat Apr 18 11:14:29 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sat, 18 Apr 2009 18:14:29 +0900
Subject: [Email-SIG] API for Header objects
In-Reply-To: <p04330101c60e68b66511@[192.168.123.162]>
References: <loom.20090408T110540-221@post.gmane.org>
	<ca471dc20904081736j2d80d924p6b30bab66666625f@mail.gmail.com>
	<loom.20090409T043042-835@post.gmane.org>
	<86F681EB-2645-4C8C-B02F-06E9F4344139@python.org>
	<eae285400904090855n539cf97cx29dd25dbd1898470@mail.gmail.com>
	<07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org>
	<20090410051902.12555.1059181741.divmod.xquotient.7720@weber.divmod.com>
	<F40AE8EC-08CC-4634-AA82-264587552F47@python.org>
	<49DF8956.5050501@g.nevcal.com>
	<7DF370A6-88E4-4710-9CF8-B0B3D7249383@python.org>
	<p04330104c6091e8919a1@[192.168.123.162]>
	<87y6u4tn4s.fsf@xemacs.org>
	<p04330109c6098b9dbdb3@[192.168.123.162]>
	<87vdp761ly.fsf@xemacs.org>
	<p04330100c60ae51beb95@[192.168.123.162]>
	<87k55m5mno.fsf@xemacs.org>
	<p04330102c60c294d2185@[192.168.123.162]>
	<8763h55d5c.fsf@xemacs.org>
	<p04330102c60d08bfa85a@[192.168.123.162]>
	<87hc0n4mmx.fsf@xemacs.org>
	<p04330101c60e68b66511@[192.168.123.162]>
Message-ID: <87ws9i2uiy.fsf@xemacs.org>

Tony Nelson writes:

 > What you are insisting on is so burdening them.  I propose lifting that burden.

I disagree.  You have made no concrete arguments that what I propose
is a burden, except in the long-since discredited sense of a few extra
keystrokes for a single use-case.  However, you *could* in principle
do so, because I've proposed (the principles for) a fairly generic API.

Specifically, to assess the burden of understanding, so far I've
implicitly accepted the existing API for Messages:

to get one instance of a header:            msg[tag]
to get the payload:                         msg.payload

which is clearly flawed (see footnote [1] and the random thoughts,
below) but not fatally so (and it's not obvious what would be better),
and proposed a *generic* API for all types in email:

to get the wire format (validated) of obj:  bytes(obj)
to get a text display (unformatted) of obj: str(obj)
to get access to all attributes of obj:     obj

Guess what?  We already have an API nearly sufficient[1] for reading
and generating Unicode text/plain messages!  It requires four (count
them, *four*) identifiers not in Python itself: the classes Message,
Header, and Payload, and the Message attribute .payload.  Is that
burdensome?  Very well then, it is burdensome.  I *joyfully* impose
that burden on you, and of course accept it myself.

(I'm cheating a little bit, because I've ignored the issue of how to
get valid data into structured Headers when generating a new message.
But you haven't addressed that issue for the case of "msg['To'] shall
return a list of (display-name, mailbox) tuples" yet, either, and I
can use whatever method you define so there's no additional burden.)

I've also suggested for many object types where structuring as a
sequence makes sense:

to get a sequence of subobjects of obj:     list(obj)

With that addition, I think we're almost ready to write a mailing list
manager.<0.5 wink>

[[ Some random thoughts apropos this outline ]]

It may make sense to apply the list API to msg['Received'], returning
the list of values of 'Received' headers.  I think it does *not* make
sense to apply it to msg['Resent-To'], as resent headers generally
come in blocks, and the API should reflect that, I think.  That being
so, I wonder if it *really* makes sense for msg[tag] to return the
list of all instances of the tag field instead of a more or less
arbitrary individual, even in the case of a header defined to be
unique by the RFCs.

Then we'd need an API for accessing blocks (maybe for parsed incoming
only, rather than something mutable for setting on outgoing
messages).  Something like

to get the list of blocks of resent headers: msg.blocks['Resent']

where you'd have to define each type of block to the parser.  Each
block would be a dictionary of the related headers, so you could get
the most recent Resent-To field with msg.blocks['Resent'][0]['To'].
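
A rough sketch of that grouping, assuming the parser hands over headers
as (name, value) pairs in top-to-bottom order.  The function name
`resent_blocks` and the rule "a repeated field or an intervening
non-Resent field starts a new block" are my assumptions, not anything
the package defines.

```python
# Hypothetical sketch: group Resent-* fields into blocks, topmost first.

def resent_blocks(headers):
    """Return a list of dicts, one per block of Resent-* fields.

    `headers` is an iterable of (name, value) pairs in the order they
    appear in the message.  Keys drop the "Resent-" prefix and keep the
    original capitalization of the rest of the field name.
    """
    blocks = []
    current = None
    for name, value in headers:
        if not name.lower().startswith("resent-"):
            current = None          # a non-Resent field ends the block
            continue
        key = name[len("Resent-"):]
        if current is None or key in current:
            current = {}            # repeated field: a new block begins
            blocks.append(current)
        current[key] = value
    return blocks


headers = [
    ("Resent-From", "b@example.org"),
    ("Resent-To", "c@example.org"),
    ("Resent-From", "a@example.org"),
    ("Resent-To", "b@example.org"),
    ("From", "a@example.org"),
    ("To", "z@example.org"),
]
blocks = {"Resent": resent_blocks(headers)}
print(blocks["Resent"][0]["To"])
```

Since resent fields are prepended, block 0 is the most recent resend,
which matches the msg.blocks['Resent'][0]['To'] spelling above.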

[[ end random thoughts ]]

 > What parseaddr() doesn't support is groups.  I haven't seen groups used,
 > though.  It does support Comments when a name-addr is not present.
 > 
 > I still don't see any trick.  "Stephen J. Turnbull" has always been
 > accepted as a display-name, RFC 822 notwithstanding.

Not when *validating* a header generator or user input!

Note that what you're implying is that your standard of correct is not
the RFCs, it's "what has always been done."  That's problematic.  In
fact, the RFC process is carefully designed to account for "what has
always been done."  And in this case, the RFC authors have had *four*
chances to accept 'Stephen J. Turnbull' as valid syntax, and they have
refused every single time.

 > You want to make it hard for the user of the email package.

The first time I let it slide.  Now that you've repeated it, I think
you owe me an apology.


Footnotes: 
[1]  What's missing is a way to handle multiple instances of a given
field.  This is a defect in the Message class, not the Header class,
and we haven't really discussed Message at all, so I beg the reader's
indulgence.


From stephen at xemacs.org  Sat Apr 18 11:45:43 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sat, 18 Apr 2009 18:45:43 +0900
Subject: [Email-SIG] API for Header objects [was: Dropping bytes
 "support" in json]
In-Reply-To: <p04330102c60e691f7dcb@[192.168.123.162]>
References: <loom.20090408T110540-221@post.gmane.org>
	<87k55m5mno.fsf@xemacs.org>
	<p04330102c60c294d2185@[192.168.123.162]>
	<200904162302.14641.steve@pearwood.info>
	<p04330103c60d1bbc1f00@[192.168.123.162]>
	<Pine.LNX.4.64.0904161434290.1740@kimball.webabinitio.net>
	<87fxg74m4x.fsf@xemacs.org>
	<Pine.LNX.4.64.0904170659580.1740@kimball.webabinitio.net>
	<877i1j48l9.fsf@xemacs.org>
	<p04330102c60e691f7dcb@[192.168.123.162]>
Message-ID: <87vdp22t2w.fsf@xemacs.org>

Tony Nelson writes:

 > How about:
 > 
 >     message['Subject'] = 'subject string'
 >     message['To'] = ('joe', 'joe123 at foo.com')
 > 
 > Since the Header does indeed know what it is

How?  Since there's no explicit constructor, in fact the Header
doesn't know what it is until the Message tells it what it is.  That
means that the registry of Header types must be known to the Message
class.  That may not be a burden on the clients of email, but I can't
see it as a warm fuzzy for the maintainers of email.

 > Internally, Message.__setitem__() would look up the name, making
 > and assigning the proper Header subclass if missing, and pass that
 > object the data.  The Header subclass knows what type of data it
 > expects and raises (ValueError?) if it gets something
 > inappropriate.

I don't think the FLUFL will accept that.  First of all, "Mama don'
'low no raisin's round heya."  Second, the duck-typing on the 'To'
example is a little hairy.  Since in your model message['To']
*produces* a sequence of pairs when evaluated, I would expect it to
require a sequence of pairs on input.  But you seem to be suggesting
that if it's a sequence but not a sequence of sequences, it should
handle that by assuming it's a (name,addr) pair.  Heck, maybe we
should do something reasonable if it's not a sequence.

I think this is *way* too magical for the basic API of email.
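
To show where the magic creeps in, here is a minimal sketch of the
dispatch as I understand the quoted proposal.  Everything in it is
hypothetical: HEADER_REGISTRY, AddressHeader, and the promotion of a
bare pair to a list of pairs are my guesses at what would be needed,
not code from the email package.

```python
# Hypothetical sketch of registry-based dispatch in Message.__setitem__.

class Header:
    def __init__(self, name, value):
        if not isinstance(value, str):
            raise ValueError("%s expects a string" % name)
        self.name = name
        self.value = value


class AddressHeader(Header):
    def __init__(self, name, value):
        # The duck-typing in question: a bare (display-name, addr) pair
        # is silently promoted to a one-element list of pairs.
        if (isinstance(value, tuple) and len(value) == 2
                and all(isinstance(v, str) for v in value)):
            value = [value]
        pairs = [tuple(p) for p in value]
        if not all(len(p) == 2 for p in pairs):
            raise ValueError("%s expects (display-name, addr) pairs" % name)
        self.name = name
        self.value = pairs


# Message has to know this table, which is the maintenance concern.
HEADER_REGISTRY = {"to": AddressHeader}


class Message:
    def __init__(self):
        self._headers = {}

    def __setitem__(self, name, value):
        cls = HEADER_REGISTRY.get(name.lower(), Header)
        self._headers[name.lower()] = cls(name, value)

    def __getitem__(self, name):
        return self._headers[name.lower()].value


message = Message()
message["Subject"] = "subject string"
message["To"] = ("joe", "joe123@foo.com")
print(message["To"])
```

Note that what goes in (one pair) is not what comes out (a list of
pairs), and that any ValueError surfaces through Message.__setitem__.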

I'm also bothered by the complexity of having Message accept
responsibility for the validity of data to be input to Header, then
delegate that responsibility back to Header.  Users don't care, I
suspect, but maintainers will.  There will be an extra frame for
Message in any trace caused by Header raising ValueError (or
whatever), which is annoying since Message should be just passing the
argument on ... but it needs to be remembered or looked up, and there
will always be the temptation for developers to add a little smarts to
Message's handling of Header's arguments (especially in derived
classes, eg, a ListPost class that optionally prepends "[listname] "
to the front of a post).

Finally, I know that if I'm away from the email package for more than
24 hours, I'll forget which order the display name and address come
in.


From barry at python.org  Wed Apr 22 19:09:19 2009
From: barry at python.org (Barry Warsaw)
Date: Wed, 22 Apr 2009 13:09:19 -0400
Subject: [Email-SIG] [issue1078919] Email.Header encodes non-ASCII
	content incorrectly
In-Reply-To: <1240416199.6.0.32712734439.issue1078919@psf.upfronthosting.co.za>
References: <1240416199.6.0.32712734439.issue1078919@psf.upfronthosting.co.za>
Message-ID: <1DFFDE3A-29F5-45D3-A228-9B779AE076C1@python.org>

On Apr 22, 2009, at 12:03 PM, Daniel Diniz wrote:

>
> Changes by Daniel Diniz <ajaksu at gmail.com>:
>
>
> ----------
> keywords: +easy

I say nothing in email is easy.  :-O

-Barry


From janssen at parc.com  Fri Apr 24 22:31:09 2009
From: janssen at parc.com (Bill Janssen)
Date: Fri, 24 Apr 2009 13:31:09 PDT
Subject: [Email-SIG] Message instances compare as False
Message-ID: <79012.1240605069@parc.com>

I spent the morning finding and fixing a problem in my IMAP server,
which was caused by the fact that certain message instances evaluate as
"False", if they have no headers.  But that's a valid thing to have as
part of a multipart body, I believe.  So I had some code that looked
like this:

   foo = msg1 or msg2

and I was getting msg2 even though I thought I should be getting msg1.

Might make sense to explicitly bind __nonzero__ in this class.
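
A minimal sketch of the workaround, assuming (as described above) that
a Message with no headers tests false.  The subclass name AlwaysTrueMessage
is made up for illustration; __bool__ is the Python 3 spelling of
__nonzero__.

```python
# Sketch: force Message instances to be truthy regardless of header count.
from email.message import Message


class AlwaysTrueMessage(Message):
    def __bool__(self):          # Python 3 truth hook
        return True

    __nonzero__ = __bool__       # Python 2 spelling of the same hook


msg1 = AlwaysTrueMessage()       # no headers at all
msg2 = AlwaysTrueMessage()
msg2["Subject"] = "fallback"

foo = msg1 or msg2
print(foo is msg1)
```

With the explicit hook, `msg1 or msg2` picks msg1 even when msg1 has no
headers; testing `msg1 is not None` at the call site avoids the problem
without a subclass.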

Bill