[Spambayes] Email package and the CRLF pair

Paul Moore lists at morpheus.demon.co.uk
Sat Apr 19 17:59:53 EDT 2003


"Meyer, Tony" <T.A.Meyer at massey.ac.nz> writes:

>> Not an expert, but I'll do my best...
>
> Well, your comments matched those of *the* expert, so that's not bad!

Blind luck :-)

>> Interesting. Does it say anything about line terminators in 
>> the body? It probably should, as email is a pure-text medium, 
>> so you should be considering line termination for the whole 
>> message, not just the headers.
>
> I hadn't bothered to look that far, but yes it does.  It (RFC2822) says:
>     "CR and LF MUST only occur together as CRLF; they MUST NOT appear
>      independently in the body."

OK, that's definitive. But reality differs. Look at any mbox file on a
Unix system and you'll see LF terminators.

Actually, if you look at RFC822, section 1.1 (Scope), you'll see:

          This standard specifies a syntax for text messages that  are
     sent  among  computer  users, within the framework of "electronic
     mail".

and later:

     Note:  This standard is NOT intended to dictate the internal for-
            mats  used  by sites,
 
So it's pretty clear that RFC822 defines a network format, not a
local file (or any other) format. And mandating a specific line
termination convention is crucial for wire transfer formats.

You *could* argue that RFC822 has nothing to say outside the context
of network transfers, and so saying that the email package conforms to
RFC822 is meaningless. But that's hair splitting (which is one of my
hobbies, but I try not to inflict it on others :-)).

The practical fact is that "colloquial" use of the term "RFC822"
refers to the header and body structure, but not such things as the
line termination, or the mandatory headers, etc. And the email package
works with that "colloquial" version.
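
As a quick illustration of that "colloquial" handling, here is a
minimal sketch of the email package parsing a message that uses bare
\n terminators. It assumes a current Python 3 rather than the Python
of this thread, and the message text is made up for the example.

```python
import email

# A made-up message using bare LF terminators, as found in a Unix mbox.
raw = "From: alice@example.com\nSubject: test\n\nHello\n"

msg = email.message_from_string(raw)
subject = msg["Subject"]     # header is parsed despite LF-only line endings
body = msg.get_payload()     # body text, still using \n internally
```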

As a general set of rules (which aren't stated anywhere) it's probably
fair to say that:

   1. Modules which manipulate internet-format data (like email)
      should work with line terminators of \n internally (just like
      Python strings do).
   2. Modules which transmit files across TCP/IP should canonicalise
      any form of line ending to CRLF.
   3. Modules which present data *received* from TCP/IP (like POP3)
      should convert data to \n line endings before returning it to
      the program.
   4. Reading from the filesystem should be handled like (3), and
      should support files opened in text or binary modes (or
      universal newline mode in Python 2.3).
   5. Writing to the filesystem should be done by assuming the data
      uses \n internally (the above rules make this true) and writing
      either in binary format (which leaves LFs in the files, i.e. Unix
      format) or in text format (which converts the \n characters to
      the platform native newline sequence).

This is basically "be lenient in what you accept, and strict in what
you send", plus "use \n internally as a line terminator".
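
Rules 2 and 3 can be sketched as a pair of small helpers. This is
only an illustration of the idea, not code from any stdlib module;
the names to_internal and to_wire are hypothetical, and the sketch
assumes text strings in a current Python 3.

```python
import re

# Matches CRLF first, then a lone CR or a lone LF.
_ANY_NEWLINE = re.compile(r"\r\n|\r|\n")

def to_internal(text):
    """Rule 3: accept any line ending, return \\n-terminated text."""
    return _ANY_NEWLINE.sub("\n", text)

def to_wire(text):
    """Rule 2: canonicalise any line ending to CRLF before sending."""
    return _ANY_NEWLINE.sub("\r\n", text)
```

Because CRLF is matched as a single unit, to_wire can safely be
called on text that is already in CRLF form.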

The only places I know of where these rules don't work currently are
the imaplib bug you just raised, and a bug in the mailbox module I
raised a long time ago (http://www.python.org/sf/586899) which
basically notes that passing a file opened in text mode to the mailbox
constructor doesn't work.

It might be nice to document these rules (or the right ones, rather
than just my unsubstantiated opinions :-)) somewhere. But I don't know
where, so I'm not volunteering :-)

Paul.
-- 
This signature intentionally left blank


