From nando at acapela.com.br Wed Feb 13 14:32:13 2008
From: nando at acapela.com.br (Nando)
Date: Wed, 13 Feb 2008 11:32:13 -0200
Subject: [Email-SIG] Patch: Improve recognition of attachment file name
Message-ID: <47B2F15D.9030709@acapela.com.br>

Greetings, Mr. Barry Warsaw and all other Pythonistas,

How do you like this little patch?

$ svn diff
Index: message.py
===================================================================
--- message.py (revision 60758)
+++ message.py (working copy)
@@ -671,7 +671,10 @@
         filename = self.get_param('filename', missing, 'content-disposition')
         if filename is missing:
             filename = self.get_param('name', missing, 'content-disposition')
+        # nando: Some messages specify the file name of attachment this way:
         if filename is missing:
+            filename = self.get_param('name', missing, 'content-type')
+        if filename is missing:
             return failobj
         return utils.collapse_rfc2231_value(filename).strip()


This is the first time I collaborate this way, so if there is anything
else I can do to help, let me know, cause I am sort of ignorant.

-- 
Nando Florestan
===============
[skype] nandoflorestan
[phone] + 55 (11) 3675-3038
[mobile] + 55 (11) 9820-5451
[internet] http://oui.com.br/
[? Capela] http://acapela.com.br/
[location] São Paulo - SP - Brasil


From nando at acapela.com.br Wed Feb 13 18:20:30 2008
From: nando at acapela.com.br (Nando)
Date: Wed, 13 Feb 2008 15:20:30 -0200
Subject: [Email-SIG] Patch: Improve recognition of attachment file name, with encodings
In-Reply-To: <47B2F15D.9030709@acapela.com.br>
References: <47B2F15D.9030709@acapela.com.br>
Message-ID: <47B326DE.7030607@acapela.com.br>

I have a second suggestion to that same Message.get_filename() method.

It needs to understand filenames that come with text encodings.

The proposed patch is in the attached text file.

Thank you for your time...

Nando Florestan
===============
[skype] nandoflorestan
[phone] + 55 (11) 3675-3038
[mobile] + 55 (11) 9820-5451
[internet] http://oui.com.br/
[?
Capela] http://acapela.com.br/
[location] São Paulo - SP - Brasil


Nando wrote:
> Greetings, Mr. Barry Warsaw and all other Pythonistas,
>
> How do you like this little patch?
>
> $ svn diff
> Index: message.py
> ===================================================================
> --- message.py (revision 60758)
> +++ message.py (working copy)
> @@ -671,7 +671,10 @@
>          filename = self.get_param('filename', missing,
> 'content-disposition')
>          if filename is missing:
>              filename = self.get_param('name', missing,
> 'content-disposition')
> +        # nando: Some messages specify the file name of attachment this
> way:
>          if filename is missing:
> +            filename = self.get_param('name', missing, 'content-type')
> +        if filename is missing:
>              return failobj
>          return utils.collapse_rfc2231_value(filename).strip()
>
>
> This is the first time I collaborate this way, so if there is anything
> else I can do to help, let me know, cause I am sort of ignorant.
>
>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: diff.txt
Url: http://mail.python.org/pipermail/email-sig/attachments/20080213/fc27c5fe/attachment.txt

From stephen at xemacs.org Wed Feb 13 21:47:54 2008
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Thu, 14 Feb 2008 05:47:54 +0900
Subject: [Email-SIG] Patch: Improve recognition of attachment file name, with encodings
In-Reply-To: <47B326DE.7030607@acapela.com.br>
References: <47B2F15D.9030709@acapela.com.br>
    <47B326DE.7030607@acapela.com.br>
Message-ID: <87tzkc7fd1.fsf@uwakimon.sk.tsukuba.ac.jp>

Nando writes:

> I have a second suggestion to that same Message.get_filename() method.
>
> It needs to understand filenames that come with text encodings.

It does, already, by use of .collapse_rfc2231_value. That uses RFC
2231 however, not RFC 2047, as you propose. Use of RFC 2047 encodings
in parameters is specifically forbidden by that standard.
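
To illustrate the distinction, here is a minimal sketch, assuming
Python 3 and the standard email package with its default policy. The
two sample parts and the file name are made up for illustration. It
shows that an RFC 2231 encoded filename parameter is decoded by
get_filename() as it stands, while an RFC 2047 encoded-word in the same
position comes back untouched unless the caller decodes it explicitly:

```python
from email import message_from_string
from email.header import decode_header, make_header

# Hypothetical message parts; the file name is invented for this sketch.
RFC2231_PART = (
    "Content-Type: application/octet-stream\n"
    "Content-Disposition: attachment; filename*=iso-8859-1''z%E9%20chunk.wmv\n"
    "\n"
    "payload\n"
)
RFC2047_PART = (
    "Content-Type: application/octet-stream\n"
    'Content-Disposition: attachment; filename="=?ISO-8859-1?Q?z=E9_chunk.wmv?="\n'
    "\n"
    "payload\n"
)

# RFC 2231 form: get_filename() collapses it to a decoded string.
rfc2231_name = message_from_string(RFC2231_PART).get_filename()

# RFC 2047 form (not permitted in parameters): returned undecoded.
raw_name = message_from_string(RFC2047_PART).get_filename()

# Decoding the nonconforming form is an explicit step taken by the caller.
decoded_name = str(make_header(decode_header(raw_name)))

print(repr(rfc2231_name))
print(repr(raw_name))
print(repr(decoded_name))
```

Keeping the second decoding step in the caller's hands is one way to
make the lenient behaviour opt-in rather than automatic.
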

> +        # nando: Some messages specify the file name of attachment this way:
>          if filename is missing:
> +            filename = self.get_param('name', missing, 'content-type')
> +        if filename is missing:
>              return failobj
> +        """The following line takes care of cases such as this:
> +Content-Disposition: attachment;
> +    filename="=?ISO-8859-1?Q?z=C7D-_Zoltan=5Fchunk=5F5.wmv?="
> +        """
> +        filename = decode_header(filename)[0][0]
>          return utils.collapse_rfc2231_value(filename).strip()

I feel your pain; Japanese MUAs do this kind of thing all the time,
too. However, decoding such garbage should not be done without
specific permission from a human user, because it's forbidden by the
standard.


From janssen at parc.com Thu Feb 14 02:33:14 2008
From: janssen at parc.com (Bill Janssen)
Date: Wed, 13 Feb 2008 17:33:14 PST
Subject: [Email-SIG] Patch: Improve recognition of attachment file name, with encodings
In-Reply-To: <87tzkc7fd1.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <47B2F15D.9030709@acapela.com.br>
    <47B326DE.7030607@acapela.com.br>
    <87tzkc7fd1.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <08Feb13.173321pst."58696"@synergy1.parc.xerox.com>

> > +        # nando: Some messages specify the file name of attachment this way:
> >          if filename is missing:
> > +            filename = self.get_param('name', missing, 'content-type')
> > +        if filename is missing:
> >              return failobj
> > +        """The following line takes care of cases such as this:
> > +Content-Disposition: attachment;
> > +    filename="=?ISO-8859-1?Q?z=C7D-_Zoltan=5Fchunk=5F5.wmv?="
> > +        """
> > +        filename = decode_header(filename)[0][0]
> >          return utils.collapse_rfc2231_value(filename).strip()
>
> I feel your pain; Japanese MUAs do this kind of thing all the time,
> too. However, decoding such garbage should not be done without
> specific permission from a human user, because it's forbidden by the
> standard.

Would it be possible to make this a configurable option, so that if
the user enables it, it's done?

Bill


From stephen at xemacs.org Thu Feb 14 07:12:51 2008
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Thu, 14 Feb 2008 15:12:51 +0900
Subject: [Email-SIG] Patch: Improve recognition of attachment file name, with encodings
In-Reply-To: <08Feb13.173321pst."58696"@synergy1.parc.xerox.com>
References: <47B2F15D.9030709@acapela.com.br>
    <47B326DE.7030607@acapela.com.br>
    <87tzkc7fd1.fsf@uwakimon.sk.tsukuba.ac.jp>
    <08Feb13.173321pst."58696"@synergy1.parc.xerox.com>
Message-ID: <87odak6p7g.fsf@uwakimon.sk.tsukuba.ac.jp>

Bill Janssen writes:

> Would it be possible to make this a configurable option, so that if
> the user enables it, it's done?

I don't like it at all, but it has to be on the table, because I get
such malformed messages daily. I don't think it's going to stop.
Users of the email module are going to want to read their mail,
they're going to want to read the file names, so they know where
they're saving attachments and what the content probably is.


From nando at acapela.com.br Thu Feb 14 12:20:20 2008
From: nando at acapela.com.br (Nando)
Date: Thu, 14 Feb 2008 09:20:20 -0200
Subject: [Email-SIG] Just give me the decoded header?
Message-ID: <47B423F4.2080201@acapela.com.br>

Gentlemen, please consider the following ipython session:


In [98]: m = email.message_from_file(f)

In [99]: print m["subject"]
=?utf-8?b?W291aS5jb20uYnJdIENhcnTDo28gZGUgY3LDqWRpdG8gdGVyw6EgbGVn?=
 =?utf-8?b?aXNsYcOnw6NvIGVzcGVjw61maWNh?=


It gives me the raw subject header value. Now of course I just wanted
the header in unicode. So I have to do:


In [100]: from email.header import decode_header

In [101]: decode_header(m["subject"])
Out[101]:
[('[oui.com.br] Cart\xc3\xa3o de cr\xc3\xa9dito ter\xc3\xa1 legisla\xc3\xa7\xc3\xa3o espec\xc3\xadfica',
  'utf-8')]

In [102]: print decode_header(m["subject"])[0][0]
[oui.com.br] Cartão de crédito terá legislação específica


My questions are:
1) Why does not it currently return the *decoded* header?
2) Would it break too many apps if we changed it?
2.1) If it would, can we add a function such as
message.getheader("subject") for this?
2.1.1) Would you like me to propose a patch with the obvious implementation?

Sometimes, for things more or less like this, I just feel like
*subclassing* Message. But I can't. The MIME parser is wired to create
Messages. I don't think I can tell it to create a MyMessageSubclass.
This also happens with the convenience function
email.message_from_file(f). It creates a Message. I *think* I could make
it into a class method of Message, then I would be able to call
MyMessage.from_file(). Is this idea -- making things more
object-oriented -- interesting for you?

For starters, isn't it high time Message became a new-style class by
inheriting from object?

-- 
Nando Florestan
===============
[skype] nandoflorestan
[phone] + 55 (11) 3675-3038
[mobile] + 55 (11) 9820-5451
[internet] http://oui.com.br/
[? Capela] http://acapela.com.br/
[location] São Paulo - SP - Brasil


From mark at msapiro.net Thu Feb 14 21:42:28 2008
From: mark at msapiro.net (Mark Sapiro)
Date: Thu, 14 Feb 2008 12:42:28 -0800
Subject: [Email-SIG] Just give me the decoded header?
In-Reply-To: <47B423F4.2080201@acapela.com.br>
References: <47B423F4.2080201@acapela.com.br>
Message-ID: <47B4A7B4.6090602@msapiro.net>

Nando wrote:
>
> Sometimes, for things more or less like this, I just feel like
> *subclassing* Message. But I can't. The MIME parser is wired to create
> Messages. I don't think I can tell it to create a MyMessageSubclass.
> This also happens with the convenience function
> email.message_from_file(f). It creates a Message. I *think* I could make
> it into a class method of Message, then I would be able to call
> MyMessage.from_file(). Is this idea -- making things more
> object-oriented -- interesting for you?


You can do this now, albeit somewhat differently.

See the _class argument at and the _factory argument at .

e.g.
if your mymessage module defines a MyMessage class as a subclass of
email.message.Message, you can do

import email
import mymessage
f = open('/path/to/message/file')
msg = email.message_from_file(f, mymessage.MyMessage)

to create a MyMessage instance. You can also do

import email.parser
import mymessage
p = email.parser.Parser(mymessage.MyMessage)

to create a parser which will create MyMessage instances.

-- 
Mark Sapiro                           The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


From stephen at xemacs.org Thu Feb 14 22:26:11 2008
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 15 Feb 2008 06:26:11 +0900
Subject: [Email-SIG] Just give me the decoded header?
In-Reply-To: <47B423F4.2080201@acapela.com.br>
References: <47B423F4.2080201@acapela.com.br>
Message-ID: <87ir0r6xho.fsf@uwakimon.sk.tsukuba.ac.jp>

Nando writes:

> My questions are:
> 1) Why does not it currently return the *decoded* header?

Because Message is an implementation of RFC 2822, which says nothing
about decoding headers. It is very helpful to model your programs
directly on the standards they claim to conform to.

Why restrict the base interface to such a low-level API? Well,
Internet email is an ancient system going back to RFC 561 at least
(published in 1973), and many things that seem unnecessary today with
modern technology remain necessary because you cannot know what
generation of technology you are communicating with (or even if the
remote user is a dog, as the famous joke goes). Often optimizations
in modern programs depend on assumptions about standard conformance.

> 2) Would it break too many apps if we changed it?

It probably would. Multiply decoding headers will probably result in
passing non-ASCII to the ASCII codec, and boom! you're down. For
example, Mailman is vulnerable to this.

> 2.1) If it would, can we add a function such as
> message.getheader("subject") for this?

You could, but why would you need that particular implementation?

> Sometimes, for things more or less like this, I just feel like
> *subclassing* Message.

Why do that? In my experience, you will eventually find a need to
pass the original Message to some routine (or even the original
message, in digital signing applications). If you want to work with a
SmartMessage so that it contains the same data but returns the decoded
headers, just include the original Message as an attribute:

import email.header

class SmartMessage(object):
    def __init__(self, email_message):
        # Keep the untouched Message around for routines that need it.
        self.raw_message = email_message
    def __getitem__(self, key):
        return email.header.decode_header(self.raw_message[key])
    # etc.

However, the problem you're going to run into is that this kind of
behavior (whether implemented as a subclass or by enveloping the
raw_message attribute) will make it impossible for apps to distinguish
between Messages and SmartMessages in contexts where it matters.

> But I can't. The MIME parser is wired to create
> Messages. I don't think I can tell it to create a MyMessageSubclass.

Again, why do you want to? Everything you need to implement the
behavior you want is in the Message already.

> For starters, isn't it high time Message became a new-style class by
> inheriting from object?

Sure, but code speaks louder than words. Nobody has been willing to
speak up yet. :-(


From hpj at urpla.net Fri Feb 15 13:47:23 2008
From: hpj at urpla.net (Hans-Peter Jansen)
Date: Fri, 15 Feb 2008 13:47:23 +0100
Subject: [Email-SIG] Just give me the decoded header?
In-Reply-To: <47B423F4.2080201@acapela.com.br>
References: <47B423F4.2080201@acapela.com.br>
Message-ID: <200802151347.23900.hpj@urpla.net>

On Thursday, 14 February 2008, Nando wrote:
> Gentlemen, please consider the following ipython session:
>
>
> In [98]: m = email.message_from_file(f)
>
> In [99]: print m["subject"]
> =?utf-8?b?W291aS5jb20uYnJdIENhcnTDo28gZGUgY3LDqWRpdG8gdGVyw6EgbGVn?=
>  =?utf-8?b?aXNsYcOnw6NvIGVzcGVjw61maWNh?=
>
>
> It gives me the raw subject header value.
> Now of course I just wanted
> the header in unicode. So I have to do:
>
>
> In [100]: from email.header import decode_header
>
> In [101]: decode_header(m["subject"])
> Out[101]:
> [('[oui.com.br] Cart\xc3\xa3o de cr\xc3\xa9dito ter\xc3\xa1
> legisla\xc3\xa7\xc3\xa3o espec\xc3\xadfica',
>   'utf-8')]

Nando, you're just a lucky camper in that case. How would you handle a
mixture of say: big5, euc_jp, koi8_r _and_ utf-8 encodings? Please don't
claim that this is unlikely. Sure it is, but nevertheless it happens,
and does your code get this pathological case right? Wait, let's
normalize them - but how do we handle encoding failures? Remember, there
are way too many MUAs, mailing list managers, email gateways,
autoresponders, etc. out there which get this wrong!

Next you ask for email.Message to reparse email addresses to conform to
RFC 2822, and voila, you created an unmanageable creature called
Frankenstein..

If you think about the consequences, you will understand that Barry and
friends will do _everything_ to keep this can o'worms closed in this
context.

Pete


From nando at acapela.com.br Sat Feb 16 22:09:06 2008
From: nando at acapela.com.br (Nando)
Date: Sat, 16 Feb 2008 19:09:06 -0200
Subject: [Email-SIG] Patch: Improve recognition of attachment file name, with encodings
In-Reply-To: <87odak6p7g.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <47B2F15D.9030709@acapela.com.br>
    <47B326DE.7030607@acapela.com.br>
    <87tzkc7fd1.fsf@uwakimon.sk.tsukuba.ac.jp>
    <08Feb13.173321pst."58696"@synergy1.parc.xerox.com>
    <87odak6p7g.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <47B750F2.8040805@acapela.com.br>

Decoding of RFC 2047 encoded filenames... I attach an updated patch. Now
it is off by default, but can be enabled by flipping a flag. I have
updated the docstring for the get_filename() method. Let me know if I am
forgetting something.

Two questions:

1) I have done this for the get_filename() method only. The flag that
needs to be set is called *garbage_filename_decoding*.
Look, it says
"filename" in there. But are there any other parameters where the
improper usage of RFC 2047 also commonly occurs? If so, maybe a single
flag for all of them would be more appropriate...

2) Is there some flaw in decode_header()? Something that Thunderbird
displays as "Eduardo & Mônica" is being decoded with the wrong character
in place of the ô:
    repr(decode_header(m["subject"])[0][0])
    'Eduardo & M\xf4nica'
The header being tested is:
    Subject: =?iso-8859-1?Q?Eduardo_&_M=F4nica?=
In case we are again doing the Right Thing, then why does Thunderbird
display it the way it was intended?

I am not familiar with the RFCs. When I read Stephen Turnbull's message
explaining that these are in fact malformed messages, I was very
worried. (I want the email library to just work...) Fortunately we can
do the right thing by default, while still supporting decoding of the
malformed messages.

I hope you can approve this small patch...

Nando Florestan
===============
[skype] nandoflorestan
[phone] + 55 (11) 3675-3038
[mobile] + 55 (11) 9820-5451
[internet] http://oui.com.br/
[? Capela] http://acapela.com.br/
[location] São Paulo - SP - Brasil


Stephen J. Turnbull wrote:
> Bill Janssen writes:
>
> > Would it be possible to make this a configurable option, so that if
> > the user enables it, it's done?
>
> I don't like it at all, but it has to be on the table, because I get
> such malformed messages daily. I don't think it's going to stop.
> Users of the email module are going to want to read their mail,
> they're going to want to read the file names, so they know where
> they're saving attachments and what the content probably is.
>

From nando at acapela.com.br Sat Feb 16 22:25:13 2008
From: nando at acapela.com.br (Nando)
Date: Sat, 16 Feb 2008 19:25:13 -0200
Subject: [Email-SIG] Patch: Improve recognition of attachment file name, with encodings
In-Reply-To: <47B750F2.8040805@acapela.com.br>
References: <47B2F15D.9030709@acapela.com.br>
    <47B326DE.7030607@acapela.com.br>
    <87tzkc7fd1.fsf@uwakimon.sk.tsukuba.ac.jp>
    <08Feb13.173321pst."58696"@synergy1.parc.xerox.com>
    <87odak6p7g.fsf@uwakimon.sk.tsukuba.ac.jp>
    <47B750F2.8040805@acapela.com.br>
Message-ID: <47B754B9.5000705@acapela.com.br>

Looks like I forgot to attach the patch. Sorry. Here it is.

Nando Florestan
===============
[skype] nandoflorestan
[phone] + 55 (11) 3675-3038
[mobile] + 55 (11) 9820-5451
[internet] http://oui.com.br/
[? Capela] http://acapela.com.br/
[location] São Paulo - SP - Brasil


Nando wrote:
> Decoding of RFC 2047 encoded filenames... I attach an updated patch. Now
> it is off by default, but can be enabled by flipping a flag. I have
> updated the docstring for the get_filename() method. Let me know if I am
> forgetting something.
>
> Two questions:
>
> 1) I have done this for the get_filename() method only. The flag that
> needs to be set is called *garbage_filename_decoding*. Look, it says
> "filename" in there. But are there any other parameters where the
> improper usage of RFC 2047 also commonly occurs? If so, maybe a single
> flag for all of them would be more appropriate...
>
> 2) Is there some flaw in decode_header()? Something that Thunderbird
> displays as "Eduardo & Mônica" is being decoded with the wrong character
> in place of the ô:
>     repr(decode_header(m["subject"])[0][0])
>     'Eduardo & M\xf4nica'
> The header being tested is:
>     Subject: =?iso-8859-1?Q?Eduardo_&_M=F4nica?=
> In case we are again doing the Right Thing, then why does Thunderbird
> display it the way it was intended?
>
> I am not familiar with the RFCs.
> When I read Stephen Turnbull's message
> explaining that these are in fact malformed messages, I was very
> worried. (I want the email library to just work...) Fortunately we can
> do the right thing by default, while still supporting decoding of the
> malformed messages.
>
> I hope you can approve this small patch...
>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: diff.txt
Url: http://mail.python.org/pipermail/email-sig/attachments/20080216/bee0203b/attachment.txt

From nando at acapela.com.br Sun Feb 17 03:24:00 2008
From: nando at acapela.com.br (Nando)
Date: Sat, 16 Feb 2008 23:24:00 -0300
Subject: [Email-SIG] Patch: Improve recognition of attachment file name, with encodings
In-Reply-To: <47B750F2.8040805@acapela.com.br>
References: <47B2F15D.9030709@acapela.com.br>
    <47B326DE.7030607@acapela.com.br>
    <87tzkc7fd1.fsf@uwakimon.sk.tsukuba.ac.jp>
    <08Feb13.173321pst."58696"@synergy1.parc.xerox.com>
    <87odak6p7g.fsf@uwakimon.sk.tsukuba.ac.jp>
    <47B750F2.8040805@acapela.com.br>
Message-ID: <47B79AC0.2090604@acapela.com.br>

OK, I get question number 2 now. My question was:

2) Is there some flaw in decode_header()? Something that Thunderbird
displays as "Eduardo & Mônica" is being decoded with the wrong character
in place of the ô:
    repr(decode_header(m["subject"])[0][0])
    'Eduardo & M\xf4nica'
The header being tested is:
    Subject: =?iso-8859-1?Q?Eduardo_&_M=F4nica?=
In case we are again doing the Right Thing, then why does Thunderbird
display it the way it was intended?


The answer is I have to use codecs.decode():

import codecs

In [20]: [(s, encoding)] = decode_header("=?iso-8859-1?Q?P=F4nei?=")

In [21]: s
Out[21]: 'P\xf4nei'

In [22]: encoding
Out[22]: 'iso-8859-1'

In [23]: print codecs.decode(s, encoding)
Pônei

Well, that just makes it even harder to use the return value of the
decode_header() function.
And instead of encapsulating all that
complexity in the email library, you are forcing every user of the
library to find all this out by himself, just as I had to.

This is very un-MartinFowler-like, if you pardon that expression :p

I understand Stephen Turnbull's point that it is useful to map the
Message class to RFC 2822, because some users need that. However, that
is not what *I* need - I want a high-level email library, and I am sure
many others do too.

Other mail libraries have faced the challenges of encodings before. I
don't really see why we in Python should hide from that can o'worms (as
Hans-Peter Jansen put it). It is a dirty job, but someone gotta do it!

"How would you handle a mixture of say: big5, euc_jp, koi8_r _and_ utf-8
encodings?"

Well I don't know what the flabbergast you are talking about, but: Are
you scared? Why should the application developer have to deal with
something that you e-mail experts are much more qualified to implement?

What is it, are you afraid of having a module accused of being "buggy"?
(If so, you know very well that this is not the free software way.)

What about code reuse? Did you see how much I had to do just in order to
print a Subject header?

I do think that a Message subclass (HighLevelMessage?) could play this
role nicely - a high-level interface. Has anyone done this before? (It
is a very obvious idea.) Is anybody else interested at all? Most of the
vibes I get here are like "don't do this, don't do that"...

Thanks to Mark Sapiro for showing me a way to do what I want.

Nando Florestan
===============
[skype] nandoflorestan
[phone] + 55 (11) 3675-3038
[mobile] + 55 (11) 9820-5451
[internet] http://oui.com.br/
[?
Capela] http://acapela.com.br/
[location] São Paulo - SP - Brasil


From nando at acapela.com.br Tue Feb 19 10:55:25 2008
From: nando at acapela.com.br (Nando)
Date: Tue, 19 Feb 2008 06:55:25 -0300
Subject: [Email-SIG] Patch: Improve recognition of attachment file name, with encodings
In-Reply-To: <47B79AC0.2090604@acapela.com.br>
References: <47B2F15D.9030709@acapela.com.br>
    <47B326DE.7030607@acapela.com.br>
    <87tzkc7fd1.fsf@uwakimon.sk.tsukuba.ac.jp>
    <08Feb13.173321pst."58696"@synergy1.parc.xerox.com>
    <87odak6p7g.fsf@uwakimon.sk.tsukuba.ac.jp>
    <47B750F2.8040805@acapela.com.br>
    <47B79AC0.2090604@acapela.com.br>
Message-ID: <47BAA78D.7080704@acapela.com.br>

You have succeeded in confusing me to the point I don't know whether my
own proposed patch is useful anymore. It would seem to mean a change in
philosophy, from "just implement the standards closely" to "give the
application developer what he needs". But this goal would only be
accomplished with many alterations to the codebase.

So if this is true:

Stephen J. Turnbull wrote:
> > 1) Why does not it currently return the *decoded* header?
>
> Because Message is an implementation of RFC 2822, which says nothing
> about decoding headers. It is very helpful to model your programs
> directly on the standards they claim to conform to.
>

I cannot see any documentation stating that the way of this project is
to have each class implement one RFC. If there *were* a writeup on this
somewhere, maybe I wouldn't have annoyed you with so many questions and
absurd propositions.

But why would you lie to me? So it must be true...

Then I don't agree with my own patch anymore.

Anyway, the necessity of a high-level interface remains and nobody has
answered my question: whither?

Nando Florestan
===============
[skype] nandoflorestan
[phone] + 55 (11) 3675-3038
[mobile] + 55 (11) 9820-5451
[internet] http://oui.com.br/
[?
Capela] http://acapela.com.br/
[location] São Paulo - SP - Brasil


From stephen at xemacs.org Tue Feb 19 23:51:51 2008
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Wed, 20 Feb 2008 07:51:51 +0900
Subject: [Email-SIG] Patch: Improve recognition of attachment file name, with encodings
In-Reply-To: <47BAA78D.7080704@acapela.com.br>
References: <47B2F15D.9030709@acapela.com.br>
    <47B326DE.7030607@acapela.com.br>
    <87tzkc7fd1.fsf@uwakimon.sk.tsukuba.ac.jp>
    <08Feb13.173321pst."58696"@synergy1.parc.xerox.com>
    <87odak6p7g.fsf@uwakimon.sk.tsukuba.ac.jp>
    <47B750F2.8040805@acapela.com.br>
    <47B79AC0.2090604@acapela.com.br>
    <47BAA78D.7080704@acapela.com.br>
Message-ID: <87tzk4woe0.fsf@uwakimon.sk.tsukuba.ac.jp>

Nando writes:

> > Because Message is an implementation of RFC 2822, which says nothing
> > about decoding headers. It is very helpful to model your programs
> > directly on the standards they claim to conform to.

> I cannot see any documentation stating that the way of this project is
> to have each class implement one RFC. If there *were* a writeup on this
> somewhere, maybe I wouldn't have annoyed you with so many questions and
> absurd propositions.

But I didn't say that it was a general principle. I said that Message
is an implementation of RFC 2822. For historical reasons it lives in
the rfc822 module. Its main docstring says:

    RFC 2822 message manipulation.

    Note: This is only a very rough sketch of a full RFC-822 parser; in
    particular the tokenizing of addresses does not adhere to all the
    quoting rules.

    Note: RFC 2822 is a long awaited update to RFC 822. This module
    should conform to RFC 2822, and is thus mis-named (it's not worth
    renaming it). Some effort at RFC 2822 updates have been made, but
    a thorough audit has not been performed. Consider any RFC 2822
    non-conformance to be a bug.

> Anyway, the necessity of a high-level interface remains and nobody has
> answered my question: whither?

As I wrote earlier, I think writing a *general* high-level interface
is going to be very hard. I don't have time to devote to it.

If you have a set of use cases that have a lot of common needs, then
you can and should write a module to serve those needs. You'll
probably find other people with similar needs, and some of them will
help.


From stuart at stuartbishop.net Tue Feb 26 04:54:48 2008
From: stuart at stuartbishop.net (Stuart Bishop)
Date: Tue, 26 Feb 2008 10:54:48 +0700
Subject: [Email-SIG] Just give me the decoded header?
In-Reply-To: <47B423F4.2080201@acapela.com.br>
References: <47B423F4.2080201@acapela.com.br>
Message-ID: <47C38D88.8000102@stuartbishop.net>

Nando wrote:
> Gentlemen, please consider the following ipython session:
>
>
> In [98]: m = email.message_from_file(f)
>
> In [99]: print m["subject"]
> =?utf-8?b?W291aS5jb20uYnJdIENhcnTDo28gZGUgY3LDqWRpdG8gdGVyw6EgbGVn?=
>  =?utf-8?b?aXNsYcOnw6NvIGVzcGVjw61maWNh?=
>
>
> It gives me the raw subject header value. Now of course I just wanted
> the header in unicode. So I have to do:
>
>
> In [100]: from email.header import decode_header
>
> In [101]: decode_header(m["subject"])
> Out[101]:
> [('[oui.com.br] Cart\xc3\xa3o de cr\xc3\xa9dito ter\xc3\xa1
> legisla\xc3\xa7\xc3\xa3o espec\xc3\xadfica',
>   'utf-8')]
>
> In [102]: print decode_header(m["subject"])[0][0]
> [oui.com.br] Cartão de crédito terá legislação específica
>
>
> My questions are:
> 1) Why does not it currently return the *decoded* header?

Because you often need access to the raw header. Also, not all headers are
encoded the same. While what you have works for Subject:, it doesn't work
for To:, Reply-To:, From: etc.

> 2) Would it break too many apps if we changed it?

Yes. Particularly apps that need to log or report broken email headers that
cannot be decoded.

> 2.1) If it would, can we add a function such as
> message.getheader("subject") for this?
> 2.1.1) Would you like me to propose a patch with the obvious implementation?

I'd love to see things become more Unicode aware. Perhaps return an object
implementing __str__() and __unicode__() (or decode()). The cast-to-unicode
conversion would decode headers with known encodings and raise an exception
on headers with unknown encodings. Similarly, setting headers using Unicode
strings would use the known encodings to perform the reverse operation. And
you still have access to the raw value if you want to round trip.

-- 
Stuart Bishop
http://www.stuartbishop.net/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: OpenPGP digital signature
Url : http://mail.python.org/pipermail/email-sig/attachments/20080226/984d632d/attachment.pgp

From stuart at stuartbishop.net Tue Feb 26 05:11:51 2008
From: stuart at stuartbishop.net (Stuart Bishop)
Date: Tue, 26 Feb 2008 11:11:51 +0700
Subject: [Email-SIG] Patch: Improve recognition of attachment file name, with encodings
In-Reply-To: <47B79AC0.2090604@acapela.com.br>
References: <47B2F15D.9030709@acapela.com.br>
    <47B326DE.7030607@acapela.com.br>
    <87tzkc7fd1.fsf@uwakimon.sk.tsukuba.ac.jp>
    <08Feb13.173321pst."58696"@synergy1.parc.xerox.com>
    <87odak6p7g.fsf@uwakimon.sk.tsukuba.ac.jp>
    <47B750F2.8040805@acapela.com.br>
    <47B79AC0.2090604@acapela.com.br>
Message-ID: <47C39187.6040303@stuartbishop.net>

Nando wrote:
> OK, I get question number 2 now. My question was:
>
> 2) Is there some flaw in decode_header()? Something that Thunderbird
> displays as "Eduardo & Mônica" is being decoded with the wrong character
> in place of the ô:
>     repr(decode_header(m["subject"])[0][0])
>     'Eduardo & M\xf4nica'
> The header being tested is:
>     Subject: =?iso-8859-1?Q?Eduardo_&_M=F4nica?=
> In case we are again doing the Right Thing, then why does Thunderbird
> display it the way it was intended?
>
>
> The answer is I have to use codecs.decode():
>
> import codecs
>
> In [20]: [(s, encoding)] = decode_header("=?iso-8859-1?Q?P=F4nei?=")
>
> In [21]: s
> Out[21]: 'P\xf4nei'
>
> In [22]: encoding
> Out[22]: 'iso-8859-1'
>
> In [23]: print codecs.decode(s, encoding)
> Pônei
>
> Well, that just makes it even harder to use the return value of the
> decode_header() function. And instead of encapsulating all that
> complexity in the email library, you are forcing every user of the
> library to find all this out by himself, just as I had to.

It gets harder, as you are not handling Unicode domain names. Code to
convert email addresses between their ASCII and Unicode representations can
be found at http://stuartbishop.net/Software/EmailAddress/ (Barry - we
should discuss getting code to do this into the standard library again. I
think I opened a bug on this soon after I wrote it - in 2004!)

It is a bit of a learning curve, and I suspect that most users of the
library have written the same or similar helpers, possibly several times.
e.g. the nearly mandatory header decoder:

import email.Header

def decode_header(s):
    '''Decode an RFC2047 email header into a Unicode string.'''
    # Each item is a (bytes, charset) pair; charset is None for plain ASCII.
    s = email.Header.decode_header(s)
    s = [b[0].decode(b[1] or 'ascii') for b in s]
    return u''.join(s)

-- 
Stuart Bishop
http://www.stuartbishop.net/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: OpenPGP digital signature
Url : http://mail.python.org/pipermail/email-sig/attachments/20080226/dfbd5963/attachment.pgp
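
For comparison, the same kind of helper can be sketched for Python 3,
where email.header.decode_header() returns bytes chunks for headers
that contain encoded-words and a plain str otherwise. This is only a
sketch under those assumptions: the name decode_header_to_str and the
latin-1 fallback are hypothetical choices, not part of the email
package. It also covers the mixed-charset and unknown-charset cases
raised earlier in the thread by decoding each chunk on its own:

```python
from email.header import decode_header


def decode_header_to_str(raw, fallback="latin-1"):
    """Decode an RFC 2047 header value into one str (hypothetical helper)."""
    pieces = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, str):
            # Headers without any encoded-word come back as str already.
            pieces.append(chunk)
            continue
        try:
            pieces.append(chunk.decode(charset or "ascii"))
        except (LookupError, UnicodeDecodeError):
            # Unknown or wrong charset label: fall back instead of failing.
            pieces.append(chunk.decode(fallback, errors="replace"))
    return "".join(pieces)


subject = decode_header_to_str("=?iso-8859-1?Q?Eduardo_&_M=F4nica?=")
mixed = decode_header_to_str("=?iso-8859-1?Q?P=F4nei?= =?utf-8?b?w6k=?=")
plain = decode_header_to_str("hello")
unknown = decode_header_to_str("=?x-unknown-foo?Q?abc?=")
print(subject, mixed, plain, unknown)
```

Whether to fall back silently or raise on an unknown charset is a
design choice; applications that must report broken headers would
probably prefer to raise and keep the raw value.
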