[Email-SIG] rfc822 parser (the elephant has landed)

Fri Jun 10 22:42:49 CEST 2011

On Jun 08, 2011, at 06:46 PM, R. David Murray wrote:

>One of my ideas is to eventually decouple the header dictionary from the
>Message.  That is, you access the headers through msg.headers instead
>of directly on msg.  At that point we could get away with changing
>the semantics of __setitem__, and have msg.headers[X] be 'replace'.
>Having append be spelled 'msg.headers.append(X)' seems slightly more
>natural than having replace spelled msg.headers.replace(X), so that's
>what I'd be in favor of.

I agree that it probably does make sense to eventually relegate the headers to
msg.headers.  But I think you'll want both .append() and .replace() methods
for explicitness, with one of them being mapped to __setitem__() for
convenience.  Heck, as is pointed out elsewhere, __setitem__() will probably
be mapped to .magical_rfc_compliant_manipulation_of_header(X, policy) anyway.
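
To make the explicit-methods idea concrete, here is a minimal sketch.  The
Headers class and its get_all() helper are hypothetical names invented for
illustration, not anything email6 provides, and mapping __setitem__() to
'replace' is just one of the two choices discussed above.

```python
class Headers:
    """Hypothetical stand-in for msg.headers; not an email6 API."""

    def __init__(self):
        self._items = []  # ordered (name, value) pairs

    def append(self, name, value):
        # Add another occurrence, keeping any existing ones.
        self._items.append((name, value))

    def replace(self, name, value):
        # Drop every existing occurrence (case-insensitively), then add one.
        lowered = name.lower()
        self._items = [(n, v) for n, v in self._items if n.lower() != lowered]
        self._items.append((name, value))

    # Convenience spelling: headers[name] = value means "replace".
    __setitem__ = replace

    def get_all(self, name):
        # Return every value stored under this name, in insertion order.
        lowered = name.lower()
        return [v for n, v in self._items if n.lower() == lowered]


headers = Headers()
headers.append('Received', 'from a')
headers.append('Received', 'from b')   # second occurrence is kept
headers['Subject'] = 'first'
headers['subject'] = 'second'          # replaces, does not append
```

Both operations stay available by name, so code that cares about the
difference never has to guess what the subscript form does.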

>An alternative would be to take the uniqueness check out of __setitem__
>and do that check only at message generation time, if the policy says to
>do so.  I'd prefer that the immediate raise be available as an option,
>myself, since it seems like it would catch programming errors sooner
>and thus make for a better user experience.

Definitely.

>>   Also, while some fields like CC allow only one occurrence, it can contain
>>   multiple values in that single field.  Is it totally insane to say that
>>   `msg['cc'] = 'address'` would append `address` to the existing value?  It
>>   probably is, but having to do that manually also kind of sucks.
>
>Yeah I think that would be insane :).  But += isn't and I want to support
>that, as you note later.

+=1!

>>   Some headers have other constraints (RFC 5322, §3.6).  For example
>>   Message-ID can technically appear zero times, but "SHOULD be present".  Part
>>   of me thinks it should be out of scope for email6 to enforce this, and I'm
>>   not sure where that would get enforced anyway, but I'm just wondering if
>>   you've thought about that.
>
>That one I think can only be enforced when the message is known to be
>"complete", which would be when it is transmitted.  So the generator
>could have a policy setting that controls whether or not a lack of
>a Message-ID is a raisable error.

It might also make sense for Messages to have a .validate(policy) method.  The
application using email6 should essentially know when it's done parsing or
manipulating the message, so it could call .validate() at that point.
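
A rough sketch of what such a check could look like, written against the
existing email.message.Message class.  The validate() function and its
require_message_id flag are hypothetical; the flag stands in for whatever a
real policy object would carry.

```python
from email.message import Message

# RFC 5322 section 3.6: each of these fields may appear at most once.
SINGLETON_FIELDS = (
    'Date', 'From', 'Sender', 'Reply-To', 'To', 'Cc', 'Bcc',
    'Message-ID', 'In-Reply-To', 'References', 'Subject',
)


def validate(msg, require_message_id=True):
    """Return a list of problem descriptions for msg (hypothetical helper)."""
    problems = []
    for name in SINGLETON_FIELDS:
        count = len(msg.get_all(name, []))
        if count > 1:
            problems.append('%s appears %d times' % (name, count))
    # Message-ID is only "SHOULD be present", so the check is optional.
    if require_message_id and msg['Message-ID'] is None:
        problems.append('Message-ID is missing')
    return problems


msg = Message()
msg['Cc'] = 'a@example.com'
msg['Cc'] = 'b@example.com'   # the current Message appends a second Cc
problems = validate(msg)
```

Returning a list rather than raising lets the application decide which
problems are fatal; a stricter policy could raise on the first one instead.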

>> * Datetimes: \o/.  It will be awesome when I can `msg['date'] = a_datetime`.
>>   While it does seem reasonable that a naive datetime uses -0000, it should
>>   also be very easy for folks to add a Date header that references the local
>>   timezone, since I suspect that will be a more common use case than UTC.  I
>>   don't know what the answer for that is though.
>
>Well, Alexander has an answer (a function that returns an aware localtime
>in the datetime module) but hasn't gotten consensus on adding it.
>Perhaps I'll add such a function to email6, at least for the field trials.

Nice.
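
For reference, a sketch of the naive-versus-aware distinction using helpers
that exist in email.utils from Python 3.3 onward (format_datetime and
localtime).  This only illustrates the behaviour under discussion and assumes
nothing about how email6 itself will spell it; the +0200 offset is an
arbitrary example value.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, localtime

# A naive datetime carries no zone information, so it is rendered as -0000.
naive = datetime(2011, 6, 10, 22, 42, 49)
naive_header = format_datetime(naive)

# An aware datetime keeps its own UTC offset in the header value.
aware = naive.replace(tzinfo=timezone(timedelta(hours=2)))
aware_header = format_datetime(aware)

# localtime() returns the current time as an aware datetime in the local zone.
now_header = format_datetime(localtime())

print(naive_header)   # Fri, 10 Jun 2011 22:42:49 -0000
print(aware_header)   # Fri, 10 Jun 2011 22:42:49 +0200
```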

>> * As for header parsing, have you looked at the pyparsing module?  I don't
>>   write many parsers, and have no direct experience with pyparsing, but I keep
>>   hearing really good things about it.  OTOH, it's not in the stdlib, so it
>>   would present problems if email6 were to adopt it.  Still, I don't envy this
>>   part of the job, and I sympathize with the rabbit-hole effect of "just one
>>   more little thing..." ;)  Oh, and I'm just blown away impressed by the work
>>   you've done on the parser.
>
>I thought about pyparsing (though I haven't tried it out myself), but
>I think its scope is much wider than email6 needs, and getting it into
>the stdlib should be an independent project if doing so seems worthwhile.
>I don't think email6 should depend on anything not already in the stdlib.

Agreed.

>In any case, at this point I think the hard part of the parser is done,
>and everything else is incremental additions and tweaks.
>
>Something I didn't say in my blog post is that I'm thinking of marking
>rfc822_parser as a private module for the 3.3 release, but that a long
>term goal would be to expose it, if it proves to be worthwhile and useful
>apart from its internal use in email6.  I think there are occasions when
>programs need to do non-email rfc822 parsing, where it could come in handy
>(perhaps with a few API tweaks to optionally suppress email-specific hacks).

Again, agreed.  There are *lots* of file formats that follow rfc822 style
layouts.  One that I'm particularly interested in these days is Debian control
files.  It's essentially rfc822 headers with no bodies, with sections
separated by a blank line.  It would be kind of neat if the stdlib could help
me parse those.
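
As a rough illustration of how far the current email package already gets,
here is a sketch that splits the text on blank lines and feeds each section to
email.parser.HeaderParser.  The parse_control() function and the sample data
are made up for this example, and it ignores control-file details such as
comment lines.

```python
import re
from email.parser import HeaderParser

SAMPLE = """\
Source: hello
Maintainer: Jane Doe <jane@example.com>

Package: hello
Architecture: any
Description: example package
 A longer description on a
 continuation line.
"""


def parse_control(text):
    """Parse blank-line-separated sections of rfc822-style headers."""
    parser = HeaderParser()
    sections = []
    # One or more blank lines separate the sections.
    for chunk in re.split(r'\n\s*\n', text.strip()):
        if chunk.strip():
            sections.append(parser.parsestr(chunk))
    return sections


sections = parse_control(SAMPLE)
for section in sections:
    print(section.keys())
```

Each section comes back as a Message with no body, so the usual
section['Package'] lookups work.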

>Yes.  Headers are immutable, so 'append' is not the appropriate operation
>for this.  + or += is.  What I'm thinking is that the current Mailbox
>and Group objects should be enhanced so that there is a nice API for
>creating them from various kinds of input data, and an explicit AddressList
>object added, and then they can be passed around, summed, and maybe even
>subtracted with each other and with AddressList valued header fields.

Sounds good to me.
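
A minimal sketch of the immutable, summable idea.  This AddressList is a
hypothetical toy that holds plain address strings and compares them exactly;
the real object would presumably hold Mailbox and Group objects instead.

```python
class AddressList:
    """Hypothetical immutable address list; not the email6 class."""

    def __init__(self, addresses=()):
        self._addresses = tuple(addresses)

    def __add__(self, other):
        # Return a new list; neither operand is modified.
        return AddressList(self._addresses + tuple(other))

    def __sub__(self, other):
        # Return a new list without the addresses found in other.
        removed = set(other)
        return AddressList(a for a in self._addresses if a not in removed)

    def __iter__(self):
        return iter(self._addresses)

    def __str__(self):
        return ', '.join(self._addresses)


cc = AddressList(['a@example.com'])
original = cc
cc += ['b@example.com']            # no __iadd__, so this rebinds cc to a new object
trimmed = cc - ['a@example.com']
```

Because there is no __iadd__, += falls back to __add__ and rebinds the name,
which is what lets the header value itself stay immutable.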

-Barry
