From esj at harvee.org  Tue Jun  9 22:58:20 2009
From: esj at harvee.org (Eric S. Johansson)
Date: Tue, 09 Jun 2009 16:58:20 -0400
Subject: [Email-SIG] header api
Message-ID: <4A2ECCEC.6040805@harvee.org>

have a specific question on headers and the related api.  according to another 
listmember (which I've forgotten), all headers should be used at most once. 
I've seen many apps that use the same header repeatedly to hold info specific to 
that app.  what is the preferred way of storing multiple lines of application 
specific info?  unique headers? multi-line header?  to what extent should the 
api support/enforce this info management ideal?

From stephen at xemacs.org  Sun Jun 14 18:33:39 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Mon, 15 Jun 2009 01:33:39 +0900
Subject: [Email-SIG]  header api
In-Reply-To: <4A2ECCEC.6040805@harvee.org>
References: <4A2ECCEC.6040805@harvee.org>
Message-ID: <8763eyojvg.fsf@uwakimon.sk.tsukuba.ac.jp>

Eric S. Johansson writes:

 > have a specific question on headers and the related api.  according to another 
 > listmember (which I've forgotten), all headers should be used at
 > most once. 

This is false.  A few standard headers may appear only once.  Most
headers may appear multiple times.  See the table in section 3.6 of
RFC 5322.  It's not clear from the wording whether appearing more
often than the stated maximum is a "SHOULD NOT" or a "MUST NOT".

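A minimal sketch of what repeated headers look like from Python.  It
assumes the email package's Message class with its legacy default
behaviour; "X-App-Data" is only an illustrative header name.

```python
from email.message import Message

msg = Message()
msg["Subject"] = "status report"
# Assigning to the same name twice appends a second header
# rather than replacing the first one.
msg["X-App-Data"] = "step=1"
msg["X-App-Data"] = "step=2"

first_value = msg["X-App-Data"]          # only the first instance
all_values = msg.get_all("X-App-Data")   # every instance, in stored order
print(first_value, all_values)
```
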
 > I've seen many apps that use the same header repeatedly to hold info specific to 
 > that app.  what is the preferred way of storing multiple lines of application 
 > specific info?  unique headers? multi-line header?  to what extent should the 
 > api support/enforce this info management ideal?

Conceptually, each header is a single variable, containing a single
logical line (which may be *folded* into several physical lines).
Within that line you can structure the data however you like.  For
example, the MUA VM, which is written in Lisp, keeps its highly
structured internal bookkeeping data in a Lisp list, which works fine
because Lisp doesn't care about whitespace at all (and it's very
unusual for such a header to be used by anything but VM).  MIME
headers often contain multiple parameters, separated by semicolons.
There are other such conventions you could use.

On the other hand, if you are sending these headers through the mail
system, then you need to be aware that older MTAs and filtering
programs may rebreak lines at inopportune places; you cannot be sure
that data structured into lines will not be corrupted in that way (for
example, the email package itself has some such, er, "undesigned
features").  Also, if you use multiple instances of the same header
(say "X-App-Data"), you cannot guarantee that the headers will not be
reordered by some intermediate MTA or MUA.  (I believe RFC 5322
forbids that, but sufficiently old versions of the Internet Message
Format standard did not.)
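
As a rough illustration of the single-header, semicolon-separated
convention, here is a sketch that reuses the package's MIME parameter
helpers on an application header.  The header name "X-App-Data" and
the parameter names and values are made up for the example.

```python
from email.message import Message

msg = Message()
# add_header() renders keyword arguments as ';'-separated parameters
# on one logical header line.
msg.add_header("X-App-Data", "state", user="alice", count="3")

raw_value = msg["X-App-Data"]
user = msg.get_param("user", header="X-App-Data")
count = msg.get_param("count", header="X-App-Data")
print(raw_value)
print(user, count)
```

Keeping everything in one header avoids depending on the relative
order of several headers, though a long value may still be refolded in
transit.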

IMO the email package should allow the app to request warnings if an
incoming message is not standard-conforming, and should strongly
discourage (but not necessarily make impossible) construction of
messages that contain more instances of a header than the standard
allows.
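
For comparison, a sketch of how this kind of discouragement can look
in practice.  It assumes a recent Python 3 release in which the email
package provides EmailMessage and email.policy.default (both postdate
this thread); "X-App-Data" is again just an illustrative name.

```python
from email import policy
from email.message import EmailMessage

msg = EmailMessage(policy=policy.default)
msg["Subject"] = "first"
try:
    # Under this policy a second Subject: is refused at construction time.
    msg["Subject"] = "second"
    rejected = False
except ValueError as exc:
    rejected = True
    print("refused:", exc)

# Headers without a defined maximum may still be repeated.
msg["X-App-Data"] = "step=1"
msg["X-App-Data"] = "step=2"
print(len(msg.get_all("X-App-Data")))
```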


From mark at msapiro.net  Mon Jun 15 19:17:09 2009
From: mark at msapiro.net (Mark Sapiro)
Date: Mon, 15 Jun 2009 10:17:09 -0700
Subject: [Email-SIG] [Mailman-Users] Garbled headers - was: gmail marks
 mailman confirmation mail as spam...
In-Reply-To: <871vpmcg5b.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <200906081341.03500.repsons@gmail.com>	<4A351376.8070902@msapiro.net>	<4A352F76.4020309@msapiro.net>	<200906141718.07047.repsons@gmail.com>	<4A3535D4.507@msapiro.net>	<4A35B3A8.8080609@msapiro.net>
	<871vpmcg5b.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <4A368215.5080002@msapiro.net>

I am trying to move this thread to email-sig at python.org since the
underlying issue is in the email package. Further, since as of Mailman
2.1.12, we no longer install a Mailman-specific version of the email
package, it really has to be addressed in the email package.

Stephen J. Turnbull wrote:
> Mark Sapiro writes:
> 
>  > I think there is a minor bug in decode_header() in that it won't
>  > recognize a RFC 2047 encoded word in a comment if the encoded word is
>  > not separated by whitespace from the ")" that terminates the comment.
>  > However, this is the only place where an encoded word need not be
>  > followed by whitespace or the end of the header.
> 
> Indeed that's a bug.  I gather that you're saying that this bug is not
> the cause of the OP's problem, though?


Correct.


>  > The Subject: header above is non-compliant in two respects. It is too
>  > long.  [...]  However, decode_header will accept it anyway and do
>  > the right thing.
> 
> As it should, according to the Postel Principle.  Anyway, IIRC the
> length limit is a SHOULD NOT, not a MUST NOT, right?


The RFC (8|28|53)22 limits are MUST BE <= 998 and SHOULD BE <= 78. RFC
2047 seems to want to impose stricter limits on encoded words, but
unfortunately does not use the defined terms MUST and SHOULD. Section 2
says in part:

   An 'encoded-word' may not be more than 75 characters long, including
   'charset', 'encoding', 'encoded-text', and delimiters.  If it is
   desirable to encode more text than will fit in an 'encoded-word' of
   75 characters, multiple 'encoded-word's (separated by CRLF SPACE) may
   be used.

   While there is no limit to the length of a multiple-line header
   field, each line of a header field that contains one or more
   'encoded-word's is limited to 76 characters.

so it is not clear whether these are 'recommendations' or
'requirements'. In any case, email.header.decode_header() is not
enforcing any limits so we are being generous in what we accept in this
respect.
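
A small sketch of that generosity, assuming Python 3's
email.header.decode_header().  The encoded word is artificial and
deliberately longer than the 75 characters RFC 2047 allows.

```python
from email.header import decode_header

# 10 + 100 + 2 = 112 characters, well past the 75-character limit.
encoded_word = "=?utf-8?q?" + "a" * 100 + "?="
parts = decode_header(encoded_word)

decoded_bytes, charset = parts[0]
print(len(encoded_word), len(decoded_bytes), charset)
```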


>  > real problem is item (1) in section 5 of the RFC says in part:
>  > 
>  >     Ordinary ASCII text and 'encoded-word's may appear together in the
>  >     same header field.  However, an 'encoded-word' that appears in a
>  >     header field defined as '*text' MUST be separated from any adjacent
>  >     'encoded-word' or 'text' by 'linear-white-space'.
>  > 
>  > The header above does not comply with this.
> 
> Agreed, but I think that by default[1] email should try to parse this
> header as the user intended it.  It's not like encoded-words are that
> easy to confuse with intended text; it's unlikely that changing
> 'linear-white-space' above to 'linear-white-space or specials' would
> harm anyone.


I fully agree. There is a regexp (ecre) in email/header.py that ends
with the lookahead assertion "(?=[ \t]|$)". Even in "strict mode", I
think the lookahead needs to accept ")" as well as space and tab, but I
think by default, it should just be removed.
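
To make the three options concrete, here is a sketch using simplified,
hypothetical patterns.  They are written for this illustration and are
not copied from email/header.py; only the trailing lookahead differs
between them.

```python
import re

# Simplified encoded-word pattern; an assumption made for illustration.
CORE = r"=\?(?P<charset>[^?]*?)\?(?P<encoding>[qQbB])\?(?P<encoded>.*?)\?="

strict = re.compile(CORE + r"(?=[ \t]|$)")       # whitespace or end must follow
with_paren = re.compile(CORE + r"(?=[ \t)]|$)")  # also accept a closing ")"
lenient = re.compile(CORE)                       # no lookahead at all

in_comment = "(=?iso-8859-1?q?caf=E9?=)"      # encoded word ends a comment
glued = "=?iso-8859-1?q?caf=E9?=au lait"      # no whitespace after the word

for name, pattern in [("strict", strict),
                      ("with_paren", with_paren),
                      ("lenient", lenient)]:
    print(name,
          bool(pattern.search(in_comment)),
          bool(pattern.search(glued)))
```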


>  > This is a problem with the MUA (mail client) that encoded the Subject:
>  > header in the first place.
> 
> Agreed, but I think following the Postel Principle here is likely to
> do less harm than adhering strictly to the RFC.


I agree here too, and note that some MUAs (all three I tried including
mutt and Thunderbird) decode the original header as intended.


> That said, I'm not in a position to contribute code, and this is a
> pretty invasive change, so the user is unlikely to see a version of
> Mailman that handles this any time soon.  They are likely to have more
> luck switching clients.
> 
> Footnotes: 
> [1]  Ie, there should be an option to be strict.
> 
> 


-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


From rdmurray at bitdance.com  Thu Jun 18 19:47:35 2009
From: rdmurray at bitdance.com (R. David Murray)
Date: Thu, 18 Jun 2009 13:47:35 -0400 (EDT)
Subject: [Email-SIG] fixing the current email module
Message-ID: <Pine.LNX.4.64.0906181327030.32513@kimball.webabinitio.net>

So, designing a new interface is one thing.  Making the current
interface usable in py3k is another.  I presume that the latter
is desirable?

I'm porting a small application that uses the email module to py3k.
I've run into two problems, one of which was already reported, the other
of which was not:

     http://bugs.python.org/issue4661
     http://bugs.python.org/issue6302

(Then there are the string issues relating to email and unicode
organized under Issue1685453, but I'm going to ignore those for the
moment.)

I'd like to try fixing these, but there are design issues involved.
The fundamental one is, what format should 'message' be handling message
data in?  4661 addresses this obliquely, and we've talked about this
somewhat at the higher design level.  But the question before me is,
how to fix feedparser, message, and decode_header so that I can actually
parse a message and display it correctly.

I need to be able to feed bytes to feedparser, that much is clear.
I've implemented a proof-of-concept fix that has feedparser handle all
its input as bytes, has message decode headers and values using the
ASCII codec if handed bytes, and has decode_header expect strings and
consistently return bytes.

With this fix in place my application works.  But of course, the
email module tests do not pass, and I don't know what other use
cases I have broken.

My specific question, as posted in issue4661, is: is there any
use case for passing strings to feedparser that is not a design
error waiting to trap the programmer?
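
For illustration, a sketch of the bytes-in workflow described above.
It is not the proof-of-concept patch; it assumes a Python 3 release
whose email.parser module provides BytesFeedParser (added to the
standard library after this message was written), and the sample
message is invented.

```python
from email.header import decode_header
from email.parser import BytesFeedParser

# Raw network data arrives as bytes, possibly in arbitrary chunks.
raw_chunks = [
    b"From: sender@example.com\r\nSubject: =?iso-8859-1?q?caf=E9?=\r\n",
    b"Content-Type: text/plain; charset=iso-8859-1\r\n\r\n",
    b"caf\xe9\r\n",
]

parser = BytesFeedParser()
for chunk in raw_chunks:
    parser.feed(chunk)
msg = parser.close()

# decode_header() takes the header as a string and returns bytes
# plus the declared charset for each encoded word.
subject_parts = decode_header(msg["Subject"])
body = msg.get_payload(decode=True)   # payload as bytes
print(subject_parts, body)
```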

--David