From v+python at g.nevcal.com Mon Feb 1 20:06:34 2010 From: v+python at g.nevcal.com (Glenn Linderman) Date: Mon, 01 Feb 2010 11:06:34 -0800 Subject: [Email-SIG] Thoughts on the general API, and the Header API. In-Reply-To: <4B6245E8.3060402@g.nevcal.com> References: <20100125201034.190CC1BC4B4@kimball.webabinitio.net> <4B5E3D73.1070900@g.nevcal.com> <20100126025146.CF0AC1BC2FF@kimball.webabinitio.net> <4B5E6B47.9090307@g.nevcal.com> <4B6245E8.3060402@g.nevcal.com> Message-ID: <4B67263A.9090705@g.nevcal.com> Another thought occurred to me regarding this "Access API"... an IMAP implementation could defer obtaining data parts from the server until requested, under the covers of this same API. Of course, for devices with limited resources, that would probably be the optimal approach, but for devices with lots of resources, an IMAP implementation might also want to offer other options. On approximately 1/28/2010 6:20 PM, came the following characters from the keyboard of Glenn Linderman: > On approximately 1/25/2010 8:10 PM, came the following characters from > the keyboard of Glenn Linderman: >>> That's true. The Bytes and String versions of binary MIME parts, >>> which are likely to be the large ones, will probably have a common >>> representation for the payload, and could potentially point to the same >>> object. That breaking of the expectation that 'encode' and 'decode' >>> return new objects (in analogy to how encode and decode of >>> strings/bytes >>> works) might not be a good thing, though. >> >> Well, one generator could provide the expectation that everything is >> new; another could provide different expectations. The differences >> between them, and the tradeoffs would be documented, of course, were >> both provided. I'm not convinced that treating headers and data >> exactly the same at all times is a good thing... a convenient option >> at times, perhaps, but I can see it as a serious inefficiency in many >> use cases involving large data. 
>> >> This deserves a bit more thought/analysis/discussion, perhaps. More >> than I have time for tonight, but I may reply again, perhaps after >> others have responded, if they do. > > I guess no one else is responding here at the moment. Read the ideas > below, and then afterward, consider building the APIs you've suggested > on top of them. And then, with the full knowledge that the messages > may be either in fast or slow storage, I think that you'll agree that > converting the whole tree in one swoop isn't always appropriate... all > headers, probably could be. Data, because of its size, should > probably be done on demand. > > > In earlier discussions about the registry, there was the idea of > having a registry for transport encoding handling, and a registry for > MIME encoding handling. There were also vague comments about doing an > external storage protocol "somehow", but it was a vague concept to be > defined later, or at least I don't recall any definitions. > > Given a raw bytes representation of an incoming email, mail servers > need to choose how to handle it... this may need to be a dynamic > choice based on current server load, as well as the obvious static > server resources, as well as configured limits. > > Unfortunately, the SMTP protocol does not require predeclaration of > the size of the incoming DATA part, so servers cannot enforce size > limits until they are exceeded. So as the data streams in, a dynamic > adjustment to the handling strategy might be appropriate. Gateways > may choose to route messages, and stall the input until the output > channel is ready to receive it, and basically "pass through" the data, > with limited need to buffer messages on disk... unless the output > channel doesn't respond... then they might reject the message. 
An > SMTP server should be willing to act as a store-and-forward server, > and also must do individual delivery of messages to each RCPT (or at > least one per destination domain), so must have a way of dealing with > large messages, probably via disk buffering. The case of disk > buffering and retrying generally means that the whole message, not > just the large data parts, must be stored on disk, so the external > storage protocol should be able to deal with that case. > > The minimal external storage format capability is to store the > received bytestream to disk, associate it with the envelope > information, and be able to retrieve it in whole later. This would > require having the whole thing in RAM at those two points in time, > however, and doesn't solve the real problem. Incremental writing and > reading to the external storage would be much more useful. Even more > useful, would be "partially parsed" seek points. > > An external storage system that provides "partially parsed" > information could include: > > 1) envelope information. This section is useful to SMTP servers, but > not other email tools, so should be optional. This could be a copy of > the received RCPT command texts, complete with CRLF endings. > > 2) header information. This would be everything between DATA and the > first CRLF CRLF sequence. > > 3) data. Pre-MIME this would simply be the rest of the message, but > post-MIME it would be usefully more complex. If MIME headers can be > observed and parsed as the data passes through, then additional > metadata could be saved that could enhance performance of the later > processing steps. Such additional metadata could include the > beginning of each MIME part, the end of the headers for that part, and > the end of the data for that part. 
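For concreteness, the three "partially parsed" sections described above might be recorded in an index structure along these lines. This is only a sketch of the idea in the message, not code from any actual proposal; all the class and field names are my invention:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PartIndex:
    """Seek points for one MIME part within the stored bytestream."""
    start: int          # offset where the part begins
    headers_end: int    # offset just past the part's headers
    end: int            # offset just past the part's data

@dataclass
class MessageIndex:
    """Partially-parsed index kept alongside the raw message on disk."""
    envelope: Optional[bytes] = None   # raw RCPT lines; optional, SMTP only
    headers_end: int = 0               # offset of the first CRLF CRLF
    parts: List[PartIndex] = field(default_factory=list)

# Reading just one part's body later is then a seek and a bounded read:
def read_part_body(f, part: PartIndex) -> bytes:
    f.seek(part.headers_end)
    return f.read(part.end - part.headers_end)
```

With such an index, building the tree only requires reading the header regions; part bodies stay on disk until read_part_body is called for them.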
> > The result of saving that information would mean that minimal data > (just headers) would need to be read in to create a tree representing the > email, the rest could be left in external storage until it is > accessed... and then obtained directly from there when needed, and > converted to the form required by the request... either the whole > part, or some piece in a buffer. > > So there could be a variety of external storage systems... one that > stores in memory, one that stores on disk per the ideas above, and a > variety that retain some amount of cached information about the email, > even though they store it all on disk. Sounds like this could be a > plug-in, or an attribute of a message object creation. > > But to me, it sounds like the foundation upon which the whole email > lib should be built, not something that is shoveled in later. > > A further note about access to data parts... clearly "data for the > whole MIME part" could be provided, but even for a single part that > could be large. So access to smaller chunks might be desired. > > The data access/conversion functions, therefore, should support a > buffer-at-a-time access interface. Base64 supports random access > easily, unless it contains characters outside the 64-character alphabet > that are to be ignored; those can throw off the size calculations. So maybe > providing sequential buffer-at-a-time access with rewind is the best > that can be done -- quoted-printable doesn't support random access > very well, and neither would some sort of compression or encryption > technique -- they usually like to start from the beginning -- and > those are the sorts of things that I would consider likely to be > standardized in the future, to reduce the size of the payload, and to > increase the security of the payload. > -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. 
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From rdmurray at bitdance.com Mon Feb 1 23:05:33 2010 From: rdmurray at bitdance.com (R. David Murray) Date: Mon, 01 Feb 2010 17:05:33 -0500 Subject: [Email-SIG] Thoughts on the general API, and the Header API. In-Reply-To: <4B67263A.9090705@g.nevcal.com> References: <20100125201034.190CC1BC4B4@kimball.webabinitio.net> <4B5E3D73.1070900@g.nevcal.com> <20100126025146.CF0AC1BC2FF@kimball.webabinitio.net> <4B5E6B47.9090307@g.nevcal.com> <4B6245E8.3060402@g.nevcal.com> <4B67263A.9090705@g.nevcal.com> Message-ID: <20100201220533.AC7101FC475@kimball.webabinitio.net> On Mon, 01 Feb 2010 11:06:34 -0800, Glenn Linderman wrote: > Another thought occurred to me regarding this "Access API"... an IMAP > implementation could defer obtaining data parts from the server until > requested, under the covers of this same API. Of course, for devices > with limited resources, that would probably be the optimal approach, but > for devices with lots of resources, an IMAP implementation might also > want to offer other options. I like your thought about treating memory as just another backing store and designing the API accordingly. I will keep it in mind as I go along. > On approximately 1/28/2010 6:20 PM, came the following characters from > the keyboard of Glenn Linderman: > > I guess no one else is responding here at the moment. Read the ideas > > below, and then afterward, consider building the APIs you've suggested > > on top of them. And then, with the full knowledge that the messages > > may be either in fast or slow storage, I think that you'll agree that > > converting the whole tree in one swoop isn't always appropriate... all > > headers, probably could be. Data, because of its size, should > > probably be done on demand. 
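The on-demand handling of data that Glenn describes in the quoted text could look something like the following in practice. This is a purely hypothetical sketch; none of these names come from the actual proposal or repository:

```python
class LazyPayload:
    """Defer fetching/decoding a large MIME body until first access."""
    def __init__(self, fetch):
        self._fetch = fetch    # callable that reads from the backing store
        self._cached = None

    @property
    def data(self):
        # Fetch once, on demand; building the tree stays cheap.
        if self._cached is None:
            self._cached = self._fetch()
        return self._cached

# Constructing the tree stores only the fetcher; nothing large is read yet.
calls = []
def fetch_from_store():
    calls.append(1)            # stand-in for a disk or IMAP read
    return b"...large decoded body..."

payload = LazyPayload(fetch_from_store)
assert calls == []             # nothing fetched at construction time
payload.data                   # first access triggers the read
payload.data                   # second access hits the cache
assert calls == [1]
```

Headers for the whole tree can be converted eagerly while each part's data stays behind a proxy like this until someone actually asks for it.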
I hope the fact that no one is responding means that they think I'm at least on the right track :) I've committed a skeleton of the new Header classes to the lp:python-email6 repository, along with my testing framework. More test cases to come. --David From chr-gaillard at orange.fr Sat Feb 6 17:37:14 2010 From: chr-gaillard at orange.fr (Chr GAILLARD) Date: Sat, 6 Feb 2010 17:37:14 +0100 (CET) Subject: [Email-SIG] DJANGO 111 manage.py syncdb (no module named mime) Message-ID: <4022707.49492.1265474234268.JavaMail.www@wwinf1g17> Hello Sorry, I don't speak English well. I am using Django for the first time. When I run "manage.py syncdb" I get: import email.mime ImportError: No module named mime Python26\Lib\email\__init__.py (line 118) REMARK: mime is a folder (dossier), not a Python file. The other imports before it are OK, I think. Many thanks for your attention and help -------------- next part -------------- An HTML attachment was scrubbed... URL: From chr-gaillard at orange.fr Mon Feb 8 16:14:47 2010 From: chr-gaillard at orange.fr (Chr GAILLARD) Date: Mon, 8 Feb 2010 16:14:47 +0100 (CET) Subject: [Email-SIG] =?utf-8?q?DJANGO_111_=C2=A0manage=2Epy_syncdb_=C2=A0?= =?utf-8?q?=28no_module_named_mime=29?= Message-ID: <7187151.18105.1265642087460.JavaMail.www@wwinf1g02> All is now OK. Sorry for the previous email. Thanks -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From imgrey at gmail.com Thu Feb 18 17:12:52 2010 From: imgrey at gmail.com (Vitaliyi) Date: Thu, 18 Feb 2010 18:12:52 +0200 Subject: [Email-SIG] memory consumption Message-ID: <3aac341002180812v75fcbca5r63d85e7562f768f2@mail.gmail.com> Good Day I tried to feed a 1.5 MB email message to Header(); it consumed about 1 GB of memory and then the process was killed: string = read('email_message').decode('utf-8') decode_header(Header(string)) strace -f -c -p showed: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 97.98 0.001504 9 173 munmap 1.04 0.000016 0 124 mremap 0.98 0.000015 0 248 mmap2 0.00 0.000000 0 75 brk ------ ----------- ----------- --------- --------- ---------------- 100.00 0.001535 620 total Could you please tell me where to look to find a solution for this issue? From mark at msapiro.net Thu Feb 18 18:26:08 2010 From: mark at msapiro.net (Mark Sapiro) Date: Thu, 18 Feb 2010 09:26:08 -0800 Subject: [Email-SIG] memory consumption In-Reply-To: <3aac341002180812v75fcbca5r63d85e7562f768f2@mail.gmail.com> Message-ID: Vitaliyi wrote: > >I tried to feed a 1.5 MB email message to Header(); it consumed about 1 GB >of memory and then the process was killed: > >string = read('email_message').decode('utf-8') >decode_header(Header(string)) > [...] > >Could you please tell me where to look to find a solution for this issue? Start with the documentation. The Header() constructor accepts a single header value, not an entire email message. The decode_header() function also accepts a single header value, not a Header instance. You probably want something like msg = email.message_from_file(open('email_message')) subj = msg['subject'] Then you could do things like email.header.Header(subj) to create a Header instance, or decode_header(subj) to decode an RFC 2047 encoded subject. -- Mark Sapiro The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. 
Dylan From barry at python.org Sat Feb 20 03:23:52 2010 From: barry at python.org (Barry Warsaw) Date: Fri, 19 Feb 2010 21:23:52 -0500 Subject: [Email-SIG] Thoughts on the general API, and the Header API. In-Reply-To: <20100125201034.190CC1BC4B4@kimball.webabinitio.net> References: <20100125201034.190CC1BC4B4@kimball.webabinitio.net> Message-ID: <20100219212352.536c82fe@freewill.wooz.org> On Jan 25, 2010, at 03:10 PM, R. David Murray wrote: >After setting it aside for a bit, I had what I think is a little epiphany: >our need is to deal with messages (and parts of messages) that could be >in either bytes form or text form. The things we need to do with them >are similar regardless of their form, and so we have been talking about a >"dual API": one method for bytes and a parallel method for text. > >What if we recognize that we have two different data types, bytes messages >and text messages? Then the "dual API" becomes a more uniform, almost >single, API, but with two possible underlying data types. I really like this, especially because it kind of mirrors the transformations between bytes and strings. I have one suggestion that might clean up the API and make some other things possible or easier. >In the context specifically of the proposed new Header object, I propose >that we have a StringHeader and a BytesHeader, and an API that looks >something like this: > >StringHeader > > properties: > raw_header (None unless from_full_header was used) > raw_name > raw_value > name > value > > __init__(name, value) > from_full_header(header) > serialize(max_line_len=78, > newline='\n', > use_raw_data_if_possible=False) > encode(charset='utf-8') > >BytesHeader would be exactly the same, with the exception of the signature >for serialize and the fact that it has a 'decode' method rather than an >'encode' method. Serialize would be different only in the fact that >it would have an additional keyword parameter, must_be_7bit=True. 
The one thing that I think is unwieldy is the signature of the serialize() and deserialize() methods. I've been thinking about "policy" objects that can be used to control formatting and I think that perhaps substituting an API like this might work: serialize(policy=None) deserialize(policy=None) The idea is that the policy object would describe how and when to fold header lines, what EOL characters to use, but also such choices as whether to use raw data if possible, and must_be_7bit. A first-order improvement is that it would be much easier to pass the policy object up and down the call stack than a slew of independent parameters. Further, it might be interesting to allow policy objects in the generator, which would control default formatting options, and on Message objects in the hierarchy which would control formatting for that Message and all the ones below it in the tree (unless overridden by a policy object on a sub-message). Maybe headers themselves also support policy objects. I think this could be interesting for supporting output of the same message tree to different destinations. E.g. if the message is being output directly to an SMTP server, you'd stick a policy object on there that had the RFC 5321 required EOL, but you'd have a different policy object for output to a web server. >(Encoding or decoding a Message would cause the Message to recursively >encode or decode its subparts. This means you are making a complete >new copy of the Message in memory. If you don't want to do that you >can walk the Message and convert it piece by piece (we could provide a >generator that does this).) It sounds like there's overlap between the encoding/decoding API and the serialize/deserialize API. Are you thinking along those lines? Differences in signature could be papered over with the policy objects. 
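A minimal policy object along the lines Barry sketches might be no more than a frozen bag of settings. The field names here are guesses assembled from this thread, not an actual API:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Policy:
    max_line_len: int = 78
    newline: str = "\n"
    use_raw_data_if_possible: bool = False
    must_be_7bit: bool = True

# Premade policies for common destinations, as suggested above:
SMTP = Policy(newline="\r\n")                    # RFC 5321 requires CRLF
HTTP = Policy(newline="\r\n", must_be_7bit=False)

# Freezing means a sub-message can derive a variant without mutating
# the policy shared by the rest of the tree:
relaxed = replace(SMTP, max_line_len=998)
```

A serialize(policy=None) method would then read all its knobs from one argument instead of a slew of keywords, and a Message could simply carry the policy that applies to it and everything below it.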
>Subclasses of these classes for structured headers would have additional >methods that would return either specialized object types (datetimes, >address objects) or bytes/strings, and these may or may not exist in >both Bytes and String forms (that depends on the use cases, I think). Is it crackful to think about the policy object also containing a MIME type registry for conversion to the specialized object types? >So, those are my thoughts, and I'm sure I haven't thought of all the >corner cases. The biggest question is, does it seem like this general >scheme is worth pursuing? Definitely! I think it's a great idea. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 835 bytes Desc: not available URL: From v+python at g.nevcal.com Sat Feb 20 04:34:26 2010 From: v+python at g.nevcal.com (Glenn Linderman) Date: Fri, 19 Feb 2010 19:34:26 -0800 Subject: [Email-SIG] Thoughts on the general API, and the Header API. In-Reply-To: <20100219212352.536c82fe@freewill.wooz.org> References: <20100125201034.190CC1BC4B4@kimball.webabinitio.net> <20100219212352.536c82fe@freewill.wooz.org> Message-ID: <4B7F5842.9060901@g.nevcal.com> On approximately 2/19/2010 6:23 PM, came the following characters from the keyboard of Barry Warsaw: > Is it crackful to think about the policy object also containing a MIME > type > registry for conversion to the specialized object types? > While the MIME type registry (and other registries) were (I think) conceptualized as global objects, having them be "just objects" means you could have as many as you want, for different purposes, and means that you could pass them in to the encoding and decoding methods, and might even solve issues with different threads wanting different registries concurrently... they could have them. I like the idea, although clearly it needs to be fleshed out a bit. 
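The "just objects" idea is easy to picture: a registry becomes an ordinary mapping you instantiate and pass in, rather than module-level state. This is a hypothetical sketch; in particular, a decode() that accepts a registry argument is an assumption, not a settled API:

```python
class MimeTypeRegistry:
    """A per-caller registry mapping MIME type -> handler class."""
    def __init__(self):
        self._handlers = {}

    def register(self, mime_type, handler):
        self._handlers[mime_type.lower()] = handler

    def lookup(self, mime_type, default=None):
        return self._handlers.get(mime_type.lower(), default)

# Two threads (or two parts of one app) can hold independent registries:
text_only = MimeTypeRegistry()
text_only.register("text/plain", str)

full = MimeTypeRegistry()
full.register("text/plain", str)
full.register("application/octet-stream", bytes)

# message.decode(registry=text_only)   # hypothetical call site
```

Because each registry is a plain object, there is no shared global to race on, and different subsystems can see different sets of specialized types concurrently.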
-- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From rdmurray at bitdance.com Sat Feb 20 06:50:38 2010 From: rdmurray at bitdance.com (R. David Murray) Date: Sat, 20 Feb 2010 00:50:38 -0500 Subject: [Email-SIG] Thoughts on the general API, and the Header API. In-Reply-To: <20100219212352.536c82fe@freewill.wooz.org> References: <20100125201034.190CC1BC4B4@kimball.webabinitio.net> <20100219212352.536c82fe@freewill.wooz.org> Message-ID: <20100220055038.D70461FD23B@kimball.webabinitio.net> On Fri, 19 Feb 2010 21:23:52 -0500, Barry Warsaw wrote: > On Jan 25, 2010, at 03:10 PM, R. David Murray wrote: > The one thing that I think is unwieldy is the signature of the serialize() and > deserialize() methods. I've been thinking about "policy" objects that can be > used to control formatting and I think that perhaps substituting an API like > this might work: > > serialize(policy=None) > deserialize(policy=None) I love the idea of policy objects. I'm clear on what they do for serialization. What do you visualize them doing for deserialization (parsing)? > I think this could be interesting for supporting output of the same message > tree to different destinations. E.g. if the message is being output directly > to an SMTP server, you'd stick a policy object on there that had the RFC 5321 > required EOL, but you'd have a different policy object for output to a web > server. Yes, this was my intent in providing the newline and max_line_length parameters, but a policy object is a much cleaner way to do that. Especially since we can then provide premade policy objects to support common output scenarios such as SMTP and HTTP. > >(Encoding or decoding a Message would cause the Message to recursively > >encode or decode its subparts. This means you are making a complete > >new copy of the Message in memory. 
If you don't want to do that you > >can walk the Message and convert it piece by piece (we could provide a > >generator that does this).) > > It sounds like there's overlap between the encoding/decoding API and the > serialize/deserialize API. Are you thinking along those lines? Differences > in signature could be papered over with the policy objects. No, I'm thinking of encode/decode as exactly parallel to encode/decode on string/bytes. In my prototype API, for example, StringHeader values are unicode, and do *not* contain any rfc2047 encoded words. Decoding a BytesHeader decodes the RFC2047 stuff. Contrariwise, encoding a StringHeader does the RFC2047 encoding (using whatever charset you specify or utf-8 by default). (This means you lose the ability to piece together headers from bits in different charsets, but what is the actual use case for that? And in any case, there will be a way to get at the underlying header-translation machinery to do it if you really need to.) Serializing a StringHeader, in my design, produces *text* not bytes. This is to support the use case of using the email package to manipulate generic 'name:value // body' formatted data in unicode form (presumably utf-8 on disk). To get something that is RFC compliant, you have to encode the StringMessage object (and thus the headers) to a BytesMessage object, and then serialize that. (That's where the incremental encoder may be needed). The advantage of doing it this way is we support all possible combinations of input and output format via two strictly parallel interfaces and their encode/decode methods. Hmm. It occurs to me now that another possible way to do this would be to put the output data format into the policy object. Then you could serialize a StringMessage object, and it would know to do the string to bytes conversion as it went along doing the serialization. 
I don't think that would eliminate the need for encode/decode methods: first, that's what serialize would use when converting for output, and second, you will sometimes want to manipulate, eg, individual header values, and it seems like the natural way to do that is something like this: mybytesmessage['subject'].decode().value You don't want to serialize using a to-string policy object, because what you want is the decoded value, and you can't do mybytesmessage['subject'].value.decode() because you have to rfc2047 decode.... Hmm. Here's a thought: could we write an rfc2047 codec? Then we could use that second, more python-intuitive form like this: mybytesmessage['subject'].value.decode('mimeheader') Well, looking at that I'm not sure it's better :( > >Subclasses of these classes for structured headers would have additional > >methods that would return either specialized object types (datetimes, > >address objects) or bytes/strings, and these may or may not exist in > >both Bytes and String forms (that depends on the use cases, I think). > > Is it crackful to think about the policy object also containing a MIME type > registry for conversion to the specialized object types? Oooh. I *like* that idea. I dislike global registries. Like Glenn says, this could make a lot of things safer threading-wise, and certainly makes things more flexible. I was worrying that there might be a case of a complex app needing the registry to have different states in different parts of the app, and while I don't have an actual use-case in mind, this would make that a non-problem. > >So, those are my thoughts, and I'm sure I haven't thought of all the > >corner cases. The biggest question is, does it seem like this general > >scheme is worth pursuing? > > Definitely! I think it's a great idea. Thanks. The repository (lp:python-email6) contains the beginnings of the implementation of the StringHeader and BytesHeader classes. 
I'm currently working on fleshing out the part where it says "this is a temporary hack, need to handle folding encoded words", which is, needless to say, a bit complicated...I may set that aside for a bit and work on the policy object stuff. Though I also need to put a bunch more tests into the test database... --David From barry at python.org Sun Feb 21 20:07:32 2010 From: barry at python.org (Barry Warsaw) Date: Sun, 21 Feb 2010 14:07:32 -0500 Subject: [Email-SIG] Thoughts on the general API, and the Header API. In-Reply-To: <20100220055038.D70461FD23B@kimball.webabinitio.net> References: <20100125201034.190CC1BC4B4@kimball.webabinitio.net> <20100219212352.536c82fe@freewill.wooz.org> <20100220055038.D70461FD23B@kimball.webabinitio.net> Message-ID: <20100221140732.71aa3670@freewill.wooz.org> On Feb 20, 2010, at 12:50 AM, R. David Murray wrote: >> serialize(policy=None) >> deserialize(policy=None) > >I love the idea of policy objects. I'm clear on what they do for >serialization. What do you visualize them doing for deserialization >(parsing)? As Glenn points out, they could contain the MIME type registry for producing more specific instance types. I also think they'll serve as a container for any other configuration variables that we'll find convenient for controlling the parsing process. E.g. we might enable strict parsing this way. It's basically just a hand-wavy way of saying, let's define the API in terms of the policy object to keep our signatures small and sane (at the cost of course of making the policy objects huge and insane ;). >Yes, this was my intent in providing the newline and max_line_length >parameters, but a policy object is a much cleaner way to do that. >Especially since we can then provide premade policy objects to support >common output scenarios such as SMTP and HTTP. +1 >> It sounds like there's overlap between the encoding/decoding API and the >> serialize/deserialize API. Are you thinking along those lines? 
Differences >> in signature could be papered over with the policy objects. > >No, I'm thinking of encode/decode as exactly parallel to encode/decode >on string/bytes. In my prototype API, for example, StringHeader >values are unicode, and do *not* contain any rfc2047 encoded words. >Decoding a BytesHeader decodes the RFC2047 stuff. Contrariwise, encoding >a StringHeader does the RFC2047 encoding (using whatever charset you >specify or utf-8 by default). Makes sense, thanks. Yep, we probably don't need the policy API for that. It makes me wonder whether 'serialize' and 'deserialize' are the right names for functionality we've traditionally called 'parsing' and 'generating'. But we can paint that bikeshed later. >(This means you lose the ability to piece together headers from bits in >different charsets, but what is the actual use case for that? And in any >case, there will be a way to get at the underlying header-translation >machinery to do it if you really need to.) The degenerate case is to mix ASCII and non-ASCII header chunks, which I think is fairly common. Of course the RFCs allow it, so we have to support it, even if doing so is via a different API. >Serializing a StringHeader, in my design, produces *text* not bytes. >This is to support the use case of using the email package to manipulate >generic 'name:value // body' formatted data in unicode form (presumably >utf-8 on disk). > >To get something that is RFC compliant, you have to encode the StringMessage >object (and thus the headers) to a BytesMessage object, and then >serialize that. (That's where the incremental encoder may be needed). > >The advantage of doing it this way is we support all possible combinations >of input and output format via two strictly parallel interfaces and >their encode/decode methods. This all sounds great. >Hmm. It occurs to me now that another possible way to do this would be to >put the output data format into the policy object. Indeed, that's an interesting idea. 
>Then you could serialize a StringMessage object, and it would know to do the >string to bytes conversion as it went along doing the serialization. I don't >think that would eliminate the need for encode/decode methods: first, that's >what serialize would use when converting for output, and second, you will >sometimes want to manipulate, eg, individual header values, and it seems like >the natural way to do that is something like this: > > mybytesmessage['subject'].decode().value > >You don't want to serialize using a to-string policy object, because >what you want is the decoded value, and you can't do > > mybytesmessage['subject'].value.decode() > >because you have to rfc2047 decode.... I'm with ya! >Hmm. Here's a thought: could we write an rfc2047 codec? Then we >could use that second, more python-intuitive form like this: > > mybytesmessage['subject'].value.decode('mimeheader') > >Well, looking at that I'm not sure it's better :( Yeah. >Thanks. The repository (lp:python-email6) contains the beginnings >of the implementation of the StringHeader and BytesHeader classes. >I'm currently working on fleshing out the part where it says "this >is a temporary hack, need to handle folding encoded words", which is, >needless to say, a bit complicated...I may set that aside for a bit and >work on the policy object stuff. Though I also need to put a bunch more >tests into the test database... +1 -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 835 bytes Desc: not available URL: From rdmurray at bitdance.com Mon Feb 22 05:47:54 2010 From: rdmurray at bitdance.com (R. David Murray) Date: Sun, 21 Feb 2010 23:47:54 -0500 Subject: [Email-SIG] Thoughts on the general API, and the Header API. 
In-Reply-To: <20100221140732.71aa3670@freewill.wooz.org> References: <20100125201034.190CC1BC4B4@kimball.webabinitio.net> <20100219212352.536c82fe@freewill.wooz.org> <20100220055038.D70461FD23B@kimball.webabinitio.net> <20100221140732.71aa3670@freewill.wooz.org> Message-ID: <20100222044754.225881FD05D@kimball.webabinitio.net> On Sun, 21 Feb 2010 14:07:32 -0500, Barry Warsaw wrote: > On Feb 20, 2010, at 12:50 AM, R. David Murray wrote: > > >> serialize(policy=None) > >> deserialize(policy=None) > > > >I love the idea of policy objects. I'm clear on what they do for > >serialization. What do you visualize them doing for deserialization > >(parsing)? > > As Glenn points out, they could contain the MIME type registry for producing > more specific instance types. I also think they'll serve as a container for Arg. I was of course writing that email late at night and sleep-deprived or I'd have noticed that :) > any other configuration variables that we'll find convenient for controlling > the parsing process. E.g. we might enable strict parsing this way. It's > basically just a hand-wavy way of saying, let's define the API in terms of > the policy object to keep our signatures small and sane (at the cost of course > of making the policy objects huge and insane ;). Sounds good. > Makes sense, thanks. Yep, we probably don't need the policy API for that. It > makes me wonder whether 'serialize' and 'deserialize' are the right names for > functionality we've traditionally called 'parsing' and 'generating'. But we > can paint that bikeshed later. Yes. I'm thinking of serialization as the replacement for generating, with the idea that the 'generator' api at the top level will be convenience functions wrapped around the serialization API. But we can deal with that when I get up to that level. > >(This means you lose the ability to piece together headers from bits in > >different charsets, but what is the actual use case for that? 
And in any > >case, there will be a way to get at the underlying header-translation > >machinery to do it if you really need to.) > > The degenerate case is to mix ASCII and non-ASCII header chunks, which I think > is fairly common. Of course the RFCs allow it, so we have to support it, even > if doing so is via a different API. I'd better talk about what I'm thinking about in that regard. My notion is that the serializer will actually try to minimize the amount of encoded text (modulo caring about how long the encoded bits are when the RFC2047 chrome is included) and putting anything that can be put in ascii in ascii. But also using us-ascii encoded words to do things like wrap tokens that won't fit in 77 chars and even to preserve whitespace in unstructured headers in certain situations (this bit would be the more controversial bit, I think). So combining ascii chunks and chunks encoded in the charset specified to the encode method happens naturally. You could also modify the value of a BytesHeader, stuffing into it ascii or encoded words created 'manually' using a low level function I plan to expose. So I think that's the 'different API', and I think it fits in pretty logically, I think. If you want to control *exactly* how the encoded words appear, then I think it would be reasonable to also require that you do your own header wrapping, which means using the low level tools to build the encoded words, putting in the appropriate folding yourself, adding the fieldname on the front, passing the result to BytesHeader.from_full_header, and using a policy that says to use the raw header data. --David