Part of RFC 822 ignored by email module

Fri Jan 21 01:38:28 EST 2011

On Jan 20, 9:58 am, Dennis Lee Bieber <wlfr... at ix.netcom.com> wrote:
> On Thu, 20 Jan 2011 10:08:40 -0500, Bob Kline <bkl... at rksystems.com>
> declaimed the following in gmane.comp.python.general:
>
>
>
> > I just noticed that the following passage in RFC 822:
>
> >          The process of moving  from  this  folded   multiple-line
> >          representation  of a header field to its single line represen-
> >          tation is called "unfolding".  Unfolding  is  accomplished  by
> >          regarding   CRLF   immediately  followed  by  a  LWSP-char  as
> >          equivalent to the LWSP-char.
>
> > is not being honored by the email module.  The following two invocations
> > of message_from_string() should return the same value, but that's not
> > what happens:
>
> >  >>> import email
> >  >>> email.message_from_string("Subject: blah").get('SUBJECT')
> > 'blah'
> >  >>> email.message_from_string("Subject:\n blah").get('SUBJECT')
> > ' blah'
>
> > Note the space in front of the second value returned, but missing from
> > the first.  Can someone convince me that this is not a bug?
>
>         I'd first be concerned about the absence of the line ending sequence
> specified by the RFC: carriage return (CR; \r) followed by line feed
> (LF; \n).
>
>         \n by itself is not an RFC compliant line ending (even if writing to
> a file on Windows in text mode converts \n into \r\n). Though it does
> appear the module accepts it as such.

Well, I think that message_from_string() would have to accept any
RFC822-compatible message, since it's documented as such, but it
doesn't necessarily have to reject technically invalid ones.

Well, the RFC is concerned mainly with the transmission format, and
doesn't require libraries to force the user to input CRLF, so in
general there's no reason \n isn't acceptable if the library allows
it.

However, message_from_string() is part of the transmission (it's
documented as being able to accept RFC822-formatted messages), so it
has to respect RFC 822 formatting.  But I think it is ok for it to
also accept non-compliant messages that use a bare LF or CR as the
line ending.  (Any tool that isn't brain dead should, too.)  The RFC
doesn't specifically prohibit this.
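
For illustration, a tolerant reader could normalize line endings before
doing any RFC 822 processing.  This is only a sketch:
normalize_line_endings() is a made-up helper, not part of the email
module, and I'm not claiming the module works this way internally.

```python
import re

# Matches CRLF first, then a bare CR or a bare LF.
_ANY_EOL = re.compile(r"\r\n|\r|\n")

def normalize_line_endings(text):
    """Rewrite every line ending as the CRLF that RFC 822 specifies."""
    return _ANY_EOL.sub("\r\n", text)

# A bare-LF message and a CRLF message become identical after normalizing.
lf_message = "Subject:\n blah\n\nbody\n"
crlf_message = "Subject:\r\n blah\r\n\r\nbody\r\n"
assert normalize_line_endings(lf_message) == crlf_message
```

After this step the rest of the parser only ever has to deal with CRLF.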

>         Secondly, nothing in the spec says it trims leading whitespace from
> the unwrapped lines. In fact, all it says is that the line ending itself
> is removed from the string.
>
> >>> email.message_from_string("Subject:\r\n     blah").get('SUBJECT')
> '     blah'

I don't think this behavior is covered by the RFC, since it happens
after transmission.  I.e., message_from_string "receives" the RFC822-
compliant message, then it's mostly free to do what it wants with it,
including stripping leading whitespace from header values.

>         However, the module does appear to trim leading whitespace that
> occurs between the : and text (and the line end is considered for that
> trimming, but not any whitespace after it).
>
> >>> email.message_from_string("Subject:      blah\r\n     blah").get('SUBJECT')
> 'blah\r\n     blah'
> >>> email.message_from_string("Subject:      blah\r\n     blah   ").get('SUBJECT')
> 'blah\r\n     blah   '
> >>> email.message_from_string("Subject: \r\n     blah   ").get('SUBJECT')
> '     blah   '

In this case the RFC does address the behavior downstream of
transmission: it says that a CRLF immediately followed by whitespace
is equivalent to that whitespace alone, so a folded header and its
unfolded form should yield the same value.  The email module treats
them differently.  This is unquestionably a bug.
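
To make the RFC's rule concrete, here is a minimal sketch of unfolding
exactly as the quoted passage describes it.  unfold() is a hypothetical
helper, not something the email module provides, and it only handles
strict CRLF folds.

```python
import re

# RFC 822 unfolding: a CRLF immediately followed by a LWSP-char (space or
# horizontal tab) is equivalent to the LWSP-char alone, so drop the CRLF
# and keep the whitespace that follows it.
_FOLD = re.compile(r"\r\n(?=[ \t])")

def unfold(raw_header):
    """Return the single-line representation of a folded header field."""
    return _FOLD.sub("", raw_header)

# The folded and single-line forms are the same header once unfolded.
assert unfold("Subject:\r\n blah") == unfold("Subject: blah")
```

Since both inputs unfold to the same string, whatever a parser does
with one of them (trimming leading whitespace or not) it should do
with the other.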

Carl Banks