From esj at harvee.org Tue Jun 9 22:58:20 2009
From: esj at harvee.org (Eric S. Johansson)
Date: Tue, 09 Jun 2009 16:58:20 -0400
Subject: [Email-SIG] header api
Message-ID: <4A2ECCEC.6040805@harvee.org>

have a specific question on headers and the related api. according to another
listmember (which I've forgotten), all headers should be used at most once.
I've seen many apps that use the same header repeatedly to hold info specific to
that app. what is the preferred way of storing multiple lines of application
specific info? unique headers? multi-line header? to what extent should the
api support/enforce this info management ideal?

From stephen at xemacs.org Sun Jun 14 18:33:39 2009
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Mon, 15 Jun 2009 01:33:39 +0900
Subject: [Email-SIG] header api
In-Reply-To: <4A2ECCEC.6040805@harvee.org>
References: <4A2ECCEC.6040805@harvee.org>
Message-ID: <8763eyojvg.fsf@uwakimon.sk.tsukuba.ac.jp>

Eric S. Johansson writes:

> have a specific question on headers and the related api. according to another
> listmember (which I've forgotten), all headers should be used at
> most once.

This is false. A few standard headers may appear only once. Most
headers may appear multiple times. See the table in section 3.6 of
RFC 5322. It's not clear from the wording whether the maximum is a
"SHOULD NOT" or a "MUST NOT" appear more than once.

> I've seen many apps that use the same header repeatedly to hold info specific to
> that app. what is the preferred way of storing multiple lines of application
> specific info? unique headers? multi-line header? to what extent should the
> api support/enforce this info management ideal?

Conceptually, each header is a single variable, containing a single
logical line (which may be *folded* into several physical lines).
The MUA VM is written in Lisp.
It keeps its highly structured internal bookkeeping data in a Lisp list,
which works fine because Lisp doesn't care about whitespace at all (and
it's very unusual for such a header to be used by anything but VM).
MIME headers often contain multiple parameters, separated by
semicolons. There are other such conventions you could use.

On the other hand, if you are sending these headers through the mail
system, then you need to be aware that older MTAs and filtering
programs may rebreak lines at inopportune places; you cannot be sure
that data structured into lines will not be corrupted in that way (for
example, the email package itself has some such, er, "undesigned
features"). Also, if you use multiple instances of the same header
(say "X-App-Data"), you cannot guarantee that the headers will not be
reordered by some intermediate MTA or MUA. (I believe RFC 5322
forbids that, but sufficiently old versions of the Internet Message
Format standard did not.)

IMO the email package should allow the app to request warnings if an
incoming message is not standard-conforming, and should strongly
discourage (but not necessarily make impossible) construction of
messages that exceed the limits on the number of certain headers that
are allowed.

From mark at msapiro.net Mon Jun 15 19:17:09 2009
From: mark at msapiro.net (Mark Sapiro)
Date: Mon, 15 Jun 2009 10:17:09 -0700
Subject: [Email-SIG] [Mailman-Users] Garbled headers - was: gmail marks
 mailman confirmation mail as spam...
In-Reply-To: <871vpmcg5b.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <200906081341.03500.repsons@gmail.com>
 <4A351376.8070902@msapiro.net> <4A352F76.4020309@msapiro.net>
 <200906141718.07047.repsons@gmail.com> <4A3535D4.507@msapiro.net>
 <4A35B3A8.8080609@msapiro.net>
 <871vpmcg5b.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <4A368215.5080002@msapiro.net>

I am trying to move this thread to email-sig at python.org since the
underlying issue is in the email package.
Further, since as of Mailman 2.1.12, we no longer install a Mailman
specific version of the email package, it really has to be addressed
in the email package.

Stephen J. Turnbull wrote:
> Mark Sapiro writes:
>
> > I think there is a minor bug in decode_header() in that it won't
> > recognize a RFC 2047 encoded word in a comment if the encoded word is
> > not separated by whitespace from the ")" that terminates the comment.
> > However, this is the only place where an encoded word need not be
> > followed by whitespace or the end of the header.
>
> Indeed that's a bug. I gather that you're saying that this bug is not
> the cause of the OP's problem, though?

Correct.

> > The Subject: header above is non-compliant in two respects. It is too
> > long. [...] However, decode_header will accept it anyway and do
> > the right thing.
>
> As it should, according to the Postel Principle. Anyway, IIRC the
> length limit is a SHOULD NOT, not a MUST NOT, right?

The RFC (8|28|53)22 limits are MUST BE <= 998 and SHOULD BE <= 78.

RFC 2047 seems to want to impose stricter limits on encoded words, but
unfortunately does not use the defined terms MUST and SHOULD. Section
2 says in part:

   An 'encoded-word' may not be more than 75 characters long, including
   'charset', 'encoding', 'encoded-text', and delimiters. If it is
   desirable to encode more text than will fit in an 'encoded-word' of
   75 characters, multiple 'encoded-word's (separated by CRLF SPACE) may
   be used.

   While there is no limit to the length of a multiple-line header
   field, each line of a header field that contains one or more
   'encoded-word's is limited to 76 characters.

so it is not clear whether these are 'recommendations' or 'requirements'.

In any case, email.header.decode_header() is not enforcing any limits
so we are being generous in what we accept in this respect.
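
The limits quoted above can be sketched as a small checker. This is an
illustration only, not code from the email package: check_line() and the
ENCODED_WORD pattern are hypothetical helpers invented for this example,
and the sketch treats every limit as a warning because the thread leaves
open which ones are requirements. The last lines show that
decode_header() still decodes a line that breaks the length limits.

```python
import re
from email.header import decode_header, make_header

# Hypothetical, simplified pattern for an RFC 2047 encoded-word.
# This is NOT the regexp used inside email/header.py.
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def check_line(line):
    """Return the length limits broken by one physical header line.

    The line is expected without its trailing CRLF, since the RFC 5322
    limits exclude the line terminator.
    """
    problems = []
    if len(line) > 998:
        problems.append("longer than 998 characters (RFC 5322 MUST)")
    elif len(line) > 78:
        problems.append("longer than 78 characters (RFC 5322 SHOULD)")
    words = ENCODED_WORD.findall(line)
    if any(len(word) > 75 for word in words):
        problems.append("encoded-word longer than 75 characters (RFC 2047)")
    if words and len(line) > 76:
        problems.append("line with encoded-word longer than 76 characters (RFC 2047)")
    return problems


# A deliberately over-long Subject line that contains one encoded-word.
long_line = "Subject: " + "x" * 80 + " =?utf-8?q?w=C3=B6rld?="
problems = check_line(long_line)

# decode_header() does not enforce any of these limits; it just decodes.
value = long_line.split(": ", 1)[1]
decoded = str(make_header(decode_header(value)))

for problem in problems:
    print("warning:", problem)
print(decoded[-5:])
```

A checker like this would belong in an optional "strict" or "warn" mode;
the default path stays lenient, as decode_header() is.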
> > real problem is item (1) in section 5 of the RFC says in part:
> >
> >    Ordinary ASCII text and 'encoded-word's may appear together in the
> >    same header field. However, an 'encoded-word' that appears in a
> >    header field defined as '*text' MUST be separated from any adjacent
> >    'encoded-word' or 'text' by 'linear-white-space'.
> >
> > The header above does not comply with this.
>
> Agreed, but I think that by default[1] email should try to parse this
> header as the user intended it. It's not like encoded-words are that
> easy to confuse with intended text; it's unlikely that changing
> 'linear-white-space' above to 'linear-white-space or specials' would
> harm anyone.

I fully agree. There is a regexp (ecre) in email/header.py that ends
with the lookahead assertion "(?=[ \t]|$)". Even in "strict mode", I
think the lookahead needs to accept ")" as well as space and tab, but I
think by default, it should just be removed.

> > This is a problem with the MUA (mail client) that encoded the Subject:
> > header in the first place.
>
> Agreed, but I think following the Postel Principle here is likely to
> do less harm than adhering strictly to the RFC.

I agree here too, and note that some MUAs (all three I tried including
mutt and Thunderbird) decode the original header as intended.

> That said, I'm not in a position to contribute code, and this is a
> pretty invasive change, so the user is unlikely to see a version of
> Mailman that handles this any time soon. They are likely to have more
> luck switching clients.
>
> Footnotes:
> [1] Ie, there should be an option to be strict.
>
>

--
Mark Sapiro                           The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

From rdmurray at bitdance.com Thu Jun 18 19:47:35 2009
From: rdmurray at bitdance.com (R. David Murray)
Date: Thu, 18 Jun 2009 13:47:35 -0400 (EDT)
Subject: [Email-SIG] fixing the current email module
Message-ID:

So, designing a new interface is one thing.
Making the current interface usable in py3k is another. I presume that
the latter is desirable?

I'm porting a small application that uses the email module to py3k.
I've run into two problems, one of which was already reported, the
other of which was not:

    http://bugs.python.org/issue4661
    http://bugs.python.org/issue6302

(Then there's the whole set of string issues relating to email and
unicode organized under Issue1685453, but I'm going to ignore those for
the moment.)

I'd like to try fixing these, but there are design issues involved.
The fundamental one is, what format should 'message' be handling
message data in? 4661 addresses this obliquely, and we've talked about
this somewhat at the higher design level. But the question before me
is, how to fix feedparser, message, and decode_header so that I can
actually parse a message and display it correctly.

I need to be able to feed bytes to feedparser, that much is clear.

I've implemented a proof-of-concept fix that has feedparser handle all
its input as bytes, has message decode headers and values using the
ASCII codec if handed bytes, and has decode_header expect strings and
consistently return bytes. With this fix in place my application
works. But of course, the email module tests do not pass, and I don't
know what other use cases I have broken.

My specific question, as posted in issue4661, is: is there any use
case for passing strings to feedparser that is not a design error
waiting to trap the programmer?

--David
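
For readers following up on the bytes question: Python 3 releases after
this thread ship email.parser.BytesFeedParser, which accepts bytes
directly. The sketch below uses that class. It is not David's
proof-of-concept patch, and the sample message and addresses are made up
for illustration.

```python
from email.header import decode_header, make_header
from email.parser import BytesFeedParser

# Made-up message, delivered in chunks as it might arrive from a socket.
raw_chunks = [
    b"From: sender@example.com\r\n",
    b"Subject: =?utf-8?q?caf=C3=A9?=\r\n",
    b"\r\n",
    b"Body text\r\n",
]

parser = BytesFeedParser()
for chunk in raw_chunks:
    parser.feed(chunk)      # bytes in, no manual decoding by the caller
msg = parser.close()        # returns the parsed Message object

# With the default policy the header value is still RFC 2047 encoded,
# so decode it explicitly before display.
subject = str(make_header(decode_header(msg["Subject"])))
print(subject)
print(msg.get_payload())
```

The caller never has to guess a character set for the raw message; only
the individual encoded-words are decoded, using the charset each one
declares.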