[Email-SIG] rfc822 parser (the elephant has landed)
Barry Warsaw
barry at python.org
Fri Jun 10 22:42:49 CEST 2011
On Jun 08, 2011, at 06:46 PM, R. David Murray wrote:
>One of my ideas is to eventually decouple the header dictionary from the
>Message. That is, you access the headers through msg.headers instead
>of directly on msg. At that point we could get away with changing
>the semantics of __setitem__, and have msg.headers[X] be 'replace'.
>Having append be spelled 'msg.headers.append(X)' seems slightly more
>natural than having replace spelled msg.headers.replace(X), so that's
>what I'd be in favor of.
I agree that it probably does make sense to eventually relegate the headers to
msg.headers. But I think you'll want both .append() and .replace() methods
for explicitness, with one of them being mapped to __setitem__() for
convenience. Heck, as is pointed out elsewhere, __setitem__() will probably
be mapped to .magical_rfc_compliant_manipulation_of_header(X, policy) anyway.
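
To make the append/replace split concrete, here is a minimal sketch of such a header container. The `Headers` class and its `get_all` helper are hypothetical names used only for illustration, not the email6 API; mapping `__setitem__` to replace is the assumption under discussion, not a settled decision.

```python
class Headers:
    """Hypothetical header container; illustrative only, not the email6 API."""

    def __init__(self):
        self._items = []  # ordered list of (name, value) pairs

    def append(self, name, value):
        # Always add another occurrence, even if the field already exists.
        self._items.append((name, value))

    def replace(self, name, value):
        # Overwrite the first occurrence and drop any later duplicates;
        # fall back to appending when the field is not present yet.
        lowered = name.lower()
        for i, (existing, _) in enumerate(self._items):
            if existing.lower() == lowered:
                rest = [(n, v) for n, v in self._items[i + 1:]
                        if n.lower() != lowered]
                self._items[i:] = [(name, value)] + rest
                return
        self._items.append((name, value))

    def __setitem__(self, name, value):
        # Assumption: the convenience spelling means 'replace'.
        self.replace(name, value)

    def get_all(self, name):
        lowered = name.lower()
        return [v for n, v in self._items if n.lower() == lowered]


headers = Headers()
headers.append('Received', 'from a.example.com')
headers.append('Received', 'from b.example.com')
headers['Subject'] = 'first draft'
headers['Subject'] = 'final'
```

With this spelling, repeated `append()` calls keep both Received fields, while the second `headers['Subject']` assignment leaves a single Subject.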
>An alternative would be to take the uniqueness check out of __setitem__
>and do that check only at message generation time, if the policy says to
>do so. I'd prefer that the immediate raise be available as an option,
>myself, since it seems like it would catch programming errors sooner
>and thus make for a better user experience.
Definitely.
>> Also, while some fields like CC allow only one occurrence, it can contain
>> multiple values in that single field. Is it totally insane to say that
>> `msg['cc'] = 'address'` would append `address` to the existing value? It
>> probably is, but having to do that manually also kind of sucks.
>
>Yeah I think that would be insane :). But += isn't and I want to support
>that, as you note later.
+=1!
>> Some headers have other constraints (RFC 5322, §3.6). For example
>> Message-ID can technically appear zero times, but "SHOULD be present". Part
>> of me thinks it should be out of scope for email6 to enforce this, and I'm
>> not sure where that would get enforced anyway, but I'm just wondering if
>> you've thought about that.
>
>That one I think can only be enforced when the message is known to be
>"complete", which would be when it is transmitted. So the generator
>could have a policy setting that controls whether or not a lack of
>a Message-ID is a raisable error.
It might also make sense for Messages to have a .validate(policy) method. The
application using email6 should essentially know when it's done parsing or
manipulating the message, so it could call .validate() at that point.
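
A rough sketch of what such a check could look like, written as a standalone function over today's `email.message.Message`. The `ValidationPolicy` class, its `require_message_id` flag, and the `validate` function are all hypothetical names; only `Message` and its `get_all()` method come from the stdlib.

```python
from dataclasses import dataclass
from email.message import Message

# Fields that RFC 5322 section 3.6 allows at most once per message.
UNIQUE_FIELDS = ('Date', 'From', 'Sender', 'Reply-To', 'To', 'Cc', 'Bcc',
                 'Message-ID', 'In-Reply-To', 'References', 'Subject')


@dataclass
class ValidationPolicy:
    """Hypothetical policy object; not part of the email package."""
    require_message_id: bool = False


def validate(msg, policy):
    """Return a list of human-readable problems found in msg."""
    problems = []
    for name in UNIQUE_FIELDS:
        if len(msg.get_all(name, [])) > 1:
            problems.append('duplicate ' + name)
    if policy.require_message_id and msg['Message-ID'] is None:
        problems.append('missing Message-ID')
    return problems


msg = Message()
msg['To'] = 'a@example.com'
msg['To'] = 'b@example.com'  # the legacy Message appends rather than replaces
problems = validate(msg, ValidationPolicy(require_message_id=True))
```

Returning a list instead of raising lets the application decide whether a given problem is fatal; a policy could just as well turn each entry into an exception.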
>> * Datetimes: \o/. It will be awesome when I can `msg['date'] = a_datetime`.
>> While it does seem reasonable that a naive datetime uses -0000, it should
>> also be very easy for folks to add a Date header that references the local
>> timezone, since I suspect that will be a more common use case than UTC. I
>> don't know what the answer for that is though.
>
>Well, Alexander has an answer (a function that returns an aware localtime
>in the datetime module) but hasn't gotten consensus on adding it.
>Perhaps I'll add such a function to email6, at least for the field trials.
Nice.
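
For illustration, the naive-versus-local distinction can be sketched as follows. This assumes a Python version that ships `email.utils.format_datetime` (3.3 or later), which is not the `msg['date'] = a_datetime` API discussed above, just a way to show the two renderings.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# A naive datetime carries no zone information, so it is rendered
# with the "-0000" zone.
naive = datetime(2011, 6, 10, 22, 42, 49)
naive_header = format_datetime(naive)

# An aware datetime in the local zone: take the current time in UTC
# and convert it to the system's local timezone.
aware_local = datetime.now(timezone.utc).astimezone()
local_header = format_datetime(aware_local)
```

The second form is what most applications would want in a Date header, which is why a one-call way to get an aware local time matters.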
>> * As for header parsing, have you looked at the pyparsing module? I don't
>> write many parsers, and have no direct experience with pyparsing, but I keep
>> hearing really good things about it. OTOH, it's not in the stdlib, so it
>> would present problems if email6 were to adopt it. Still, I don't envy this
>> part of the job, and I sympathize with the rabbit-hole effect of "just one
>> more little thing..." ;) Oh, and I'm just blown away impressed by the work
>> you've done on the parser.
>
>I thought about pyparsing (though I haven't tried it out myself), but
>I think its scope is much wider than email6 needs, and getting it into
>the stdlib should be an independent project if doing so seems worthwhile.
>I don't think email6 should depend on anything not already in the stdlib.
Agreed.
>In any case, at this point I think the hard part of the parser is done,
>and everything else is incremental additions and tweaks.
>
>Something I didn't say in my blog post is that I'm thinking of marking
>rfc822_parser as a private module for the 3.3 release, but that a long
>term goal would be to expose it, if it proves to be worthwhile and useful
>apart from its internal use in email6. I think there are occasions when
>programs need to do non-email rfc822 parsing, where it could come in handy
>(perhaps with a few API tweaks to optionally suppress email-specific hacks).
Again, agreed. There are *lots* of file formats that follow rfc822 style
layouts. One that I'm particularly interested in these days is Debian control
files. It's essentially rfc822 headers with no bodies, with sections
separated by a blank line. It would be kind of neat if the stdlib could help
me parse those.
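
As a sketch of that idea with what the stdlib already offers, each blank-line-separated paragraph can be fed to the existing email parser. The `parse_control` helper and the sample text are made up for illustration, and real control files have details (comment lines, multi-line field semantics) that this does not handle.

```python
import re
from email import message_from_string


def parse_control(text):
    """Split text on blank lines and parse each paragraph as headers only."""
    paragraphs = re.split(r'\n\s*\n', text.strip())
    return [message_from_string(p) for p in paragraphs if p.strip()]


# Invented sample in the general shape of a Debian control file.
SAMPLE = """\
Source: hello
Maintainer: Jane Doe <jane@example.com>

Package: hello
Architecture: any
Description: friendly greeter
 A longer description on a continuation line.
"""

stanzas = parse_control(SAMPLE)
source_name = stanzas[0]['Source']
package_arch = stanzas[1]['Architecture']
```

This works only because the email parser tolerates header-only input; an exposed rfc822 parser could drop the email-specific behavior.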
>Yes. Headers are immutable, so 'append' is not the appropriate operation
>for this. + or += is. What I'm thinking is that the current Mailbox
>and Group objects should be enhanced so that there is a nice API for
>creating them from various kinds of input data, and an explicit AddressList
>object added, and then they can be passed around, summed, and maybe even
>subtracted with each other and with AddressList valued header fields.
Sounds good to me.
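
A minimal sketch of how an immutable address list with + and - might behave. The `AddressList` class here is a hypothetical stand-in, not the proposed email6 object; it leans on the existing `email.utils.getaddresses` and `email.utils.formataddr` helpers, and matching addresses case-insensitively in subtraction is an assumption.

```python
from email.utils import formataddr, getaddresses


class AddressList:
    """Hypothetical immutable list of (display name, address) pairs."""

    def __init__(self, pairs=()):
        self._pairs = tuple(pairs)

    @classmethod
    def from_string(cls, value):
        # getaddresses() takes a list of header values.
        return cls(getaddresses([value]))

    def __add__(self, other):
        # Build a new object; neither operand is modified.
        return AddressList(self._pairs + other._pairs)

    def __sub__(self, other):
        drop = {addr.lower() for _, addr in other._pairs}
        return AddressList(p for p in self._pairs
                           if p[1].lower() not in drop)

    def __str__(self):
        return ', '.join(formataddr(p) for p in self._pairs)


cc = AddressList.from_string('a@example.com')
cc += AddressList.from_string('Bob <bob@example.com>')
without_a = cc - AddressList.from_string('A@EXAMPLE.COM')
```

Because only `__add__` is defined, `+=` rebinds the name to a new object rather than mutating in place, which matches the "headers are immutable" constraint.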
-Barry