[Email-SIG] email.header.decode_header eats my spaces

Barry Warsaw barry at python.org
Wed Mar 28 18:02:25 CEST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mar 28, 2007, at 11:25 AM, Stephen J. Turnbull wrote:

> Idempotency is a test, not a requirement.  The requirement is "first,
> do no harm".  Ie, if you process the header, the result should be as
> much "like" the original as possible.  This is not actually
> implementable (different people will have different opinions about
> what that means, except only *really different* people will have the
> opinion that idempotency is undesirable<wink>), but the email package
> should make it possible for people to get pretty close without
> rewriting the package.

I agree that idempotency can't be a hard requirement; there are too  
many constraints, too much variability in the inputs, and too many  
ambiguities in the rfcs.  This is exactly like our stance on MIME  
parsing and generating, where broken MIME can break idempotency.

But I think we can do better than we currently do by opting to  
preserve whitespace when we break lines instead of substituting  
existing whitespace for continuation_ws.

>>> continuation_ws should be used only when we're forced
>>> to break at a non-existing FWS location, e.g. if we've split a
>>> non-ascii header or at a non-whitespace header-specific syntactic
>>> break.  In the case of RFC 2047 headers, the FWS gets consumed
>>> anyway so it isn't idempotentially (?!) significant.
>
> Only in RFC 2047 conformant MUAs.  IMHO, RFC 2047 conformance is a
> requirement, but it's not sufficient.  There are too many MUAs out
> there that do not correctly handle headers folded between encoded words
> (eg, Kyle Jones's VM).  I don't know if you *should* care, but I think
> that RFC 2047 is (unfortunately) insufficient grounds for refusing to
> care at this stage.

Oh, I know all about VM.  I think the first bug I sent to Kyle on  
that has got to be approaching its 10th anniversary. :)

It's a no-win situation if we try to care about broken MUAs.  OTOH,  
let's have some pity on the poor MUA authors, 'cause the rfcs don't  
make it easy for them. ;).  Still, I think there's no perfect  
solution if we try to also support non-conformant MUAs.

> AFAICS the implication is that you need to make a judicious choice of
> the default for continuation_ws.

Combined with the preference to preserve existing fws when present,  
and not insert continuation_ws unless absolutely necessary.
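
To make that preference concrete, here is a minimal sketch.  The names
fold_preserving() and unfold() are hypothetical, not anything in the
email package, and the 20-column limit is only for illustration.  It
breaks at existing whitespace and keeps that whitespace verbatim,
falling back to continuation_ws only when there is no whitespace to
break at:

```python
def fold_preserving(value, maxlen=20, continuation_ws=" "):
    """Fold value into lines, keeping the original whitespace at each break.

    Hypothetical helper for illustration; assumes maxlen is larger than
    len(continuation_ws).
    """
    lines = []
    rest = value
    while len(rest) > maxlen:
        # Last space or tab at or before the limit; index 0 is skipped
        # because it may be the whitespace that starts this line.
        cut = max(rest.rfind(" ", 1, maxlen + 1),
                  rest.rfind("\t", 1, maxlen + 1))
        if cut == -1:
            # No existing whitespace: this is a forced break, so this is
            # the only place where continuation_ws gets inserted.
            lines.append(rest[:maxlen])
            rest = continuation_ws + rest[maxlen:]
        else:
            # The existing whitespace is kept as the start of the
            # continuation line instead of being replaced.
            lines.append(rest[:cut])
            rest = rest[cut:]
    lines.append(rest)
    return "\n".join(lines)


def unfold(folded):
    """Undo the folding by removing only the line breaks."""
    return folded.replace("\n", "")
```

As long as no forced break was needed, unfold(fold_preserving(value))
gives back value unchanged, tabs and double spaces included.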

>> Well, this will surely break my contribution to Mailman 2.2
>> CookHeaders.py, which unifies the code for subject prefix munging for
>> both ascii and rfc2047.  :-(
>
> I don't see why it should, although there might be technical reasons
> why it would.  What I want, and what I think Barry is proposing, is
> simply that the email package never does anything to disturb FWS by
> default.

Correct.

> If you munge a header (even as trivially as removing a "Re:" prefix),
> you must accept responsibility for formatting the result.  At that
> point, I see no reason why the email package shouldn't help you
> "reflow" a header if that's desirable in your application---but the
> application should have to request that explicitly.  It shouldn't be
> implicit in the setting of continuation_ws.
>
>> Maybe we should add an option for email.header.Header(), like
>> idempotent=True/False.  ;-)
>
> I think it would be better to add an option, or even a hook function,
> for formatting.  For example, I often use a docstring-like convention
> for long subject headers, where the gist is in the first line, and the
> rest is formatted nicely (ie, indented to align with the initial
> character of the first line of the subject).  It would be nice if that
> kind of thing could be done with an application-supplied function (of
> course email could provide a number of common ones itself).

I've been thinking about something like this too, not just for  
headers, but also for message bodies.  One of the things that comes  
up often is the request to use wire-protocol line separators for  
lines within the body, so you could take the output of a Message and  
spew it directly on a port-25 socket for example.  I've always taken  
the position that the email package should use native line endings  
and that protocol modules such as smtplib and nntplib would do the  
line-ending transformation.  But for a variety of reasons, this isn't  
satisfying, and it's a use case I think the email package should handle.
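
The transformation itself is small.  A rough sketch, where to_wire()
is a hypothetical helper and not part of the email package, smtplib,
or nntplib:

```python
import re
from email.message import Message

# Matches any single line ending: CRLF first, then bare CR or bare LF.
_EOL = re.compile(r"\r\n|\r|\n")


def to_wire(text):
    """Return text with every line ending rewritten as CRLF.

    Hypothetical helper for illustration only.
    """
    return _EOL.sub("\r\n", text)


# Flatten a message with native line endings, then convert the result.
msg = Message()
msg["Subject"] = "line endings"
msg.set_payload("first line\nsecond line\n")
wire = to_wire(msg.as_string())
```

Matching CRLF before the bare forms means text that is already in wire
format passes through unchanged.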

Of course, doing this means a radical redesign of some of the classes  
in the email package.  I'd be happy to go down that road because I  
think it will give people important options, though of course we're  
now talking new features (i.e. Python 2.6) not bug fixes (Python 2.5  
and earlier).

For example, an rfc2047 formatter could get involved during  
Header.append().  It might accept two word chunks and return the  
whitespace to insert between them.  Different formatters could be  
used for different interpretations of rfc2047.
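
A sketch of what such formatters might look like.  Everything here is
an assumption for illustration: chunks are (text, charset) pairs with
charset None for unencoded text, and strict_ws(), always_space() and
join_chunks() are made-up names, not the Header API.

```python
def strict_ws(prev, nxt):
    """RFC 2047 reading: whitespace separating two adjacent
    encoded-words is ignored when the header is displayed."""
    both_encoded = prev[1] is not None and nxt[1] is not None
    return "" if both_encoded else " "


def always_space(prev, nxt):
    """Lenient reading: always put one space between chunks."""
    return " "


def join_chunks(chunks, formatter):
    """Join (text, charset) chunks, asking formatter for the whitespace
    to insert between each neighbouring pair."""
    if not chunks:
        return ""
    parts = [chunks[0][0]]
    for prev, nxt in zip(chunks, chunks[1:]):
        parts.append(formatter(prev, nxt))
        parts.append(nxt[0])
    return "".join(parts)
```

The caller picks the interpretation by passing a different function;
the joining code itself never changes.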

Similarly, the formatter could get involved for breaking long lines.  
It could decide not to break them at all, or it could return the two  
lines, broken and formatted.
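
One possible shape for that hook, again purely hypothetical: a breaker
takes a line and a length limit and returns either None (leave the
line alone) or a (head, tail) pair.  The driver fold() keeps asking
until the breaker declines.

```python
def never_break(line, maxlen):
    """Formatter that always declines to break the line."""
    return None


def break_at_last_space(line, maxlen):
    """Break at the last space within maxlen, keeping that space at the
    start of the second line.  Returns None when the line already fits
    or contains no usable space."""
    if len(line) <= maxlen:
        return None
    cut = line.rfind(" ", 1, maxlen + 1)
    if cut == -1:
        return None
    return line[:cut], line[cut:]


def fold(line, maxlen, breaker):
    """Apply breaker repeatedly and return the resulting list of lines."""
    out = []
    while True:
        pair = breaker(line, maxlen)
        if pair is None:
            out.append(line)
            return out
        head, line = pair
        out.append(head)
```

Because break_at_last_space() keeps the whitespace it breaks at,
joining the returned lines with the empty string gives back the
original line.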

We'd need a mini-library of useful formatters, and we'd need to  
choose some reasonable defaults.  We'd need to design a good api,  
figuring out where the hook points ought to be.  I'm up for it, but  
it's a lot of work, so I'd need to get help from this group on  
getting there.  Who's up for some pair programming? :)

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Darwin)

iQCVAwUBRgqRknEjvBPtnXfVAQJ4JwQAkHo07eF5i3EawH5RN0MyduNrYyJBPjeK
5qU9uxRdPYMLlIIMDUk5PILryobzyomWwsXjzPuPjDcOFAuUN5Md5leKh/KHyJ0+
oeevd/tHZJXY2qxAK6VnmrFFYLelwmFWvk+/1QORAgaPJld+wmbVbS0NeSZ2BkZg
NwYx+fbTkxE=
=lPkF
-----END PGP SIGNATURE-----

