From nando at acapela.com.br  Wed Feb 13 14:32:13 2008
From: nando at acapela.com.br (Nando)
Date: Wed, 13 Feb 2008 11:32:13 -0200
Subject: [Email-SIG] Patch: Improve recognition of attachment file name
Message-ID: <47B2F15D.9030709@acapela.com.br>

Greetings, Mr. Barry Warsaw and all other Pythonistas,

How do you like this little patch?

$ svn diff
Index: message.py
===================================================================
--- message.py  (revision 60758)
+++ message.py  (working copy)
@@ -671,7 +671,10 @@
         filename = self.get_param('filename', missing, 'content-disposition')
         if filename is missing:
             filename = self.get_param('name', missing, 'content-disposition')
+        # nando: Some messages specify the file name of attachment this way:
         if filename is missing:
+            filename = self.get_param('name', missing, 'content-type')
+        if filename is missing:
             return failobj
         return utils.collapse_rfc2231_value(filename).strip()
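
The lookup order in this patch can be sketched as a standalone function. This is a minimal sketch written against a current Python 3 email package, not the patch itself; the name attachment_name and the sample message are made up for illustration.

```python
import email
from email import utils


def attachment_name(part, failobj=None):
    """Return the attachment file name of a message part, or failobj.

    Hypothetical helper: tries Content-Disposition first and only then
    falls back to the 'name' parameter of Content-Type.
    """
    missing = object()  # sentinel that no real parameter value can equal
    lookups = (
        ("filename", "content-disposition"),
        ("name", "content-disposition"),
        ("name", "content-type"),  # the fallback this patch adds
    )
    for param, header in lookups:
        value = part.get_param(param, missing, header)
        if value is not missing:
            return utils.collapse_rfc2231_value(value).strip()
    return failobj


# Made-up sample: the name appears only on Content-Type.
sample = email.message_from_string(
    'Content-Type: application/pdf; name="report.pdf"\n'
    "\n"
    "(body omitted)\n"
)
print(attachment_name(sample))
```

The sentinel object keeps "parameter absent" distinct from any value a caller might pass as failobj.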


This is my first time contributing this way, so if there is anything 
else I can do to help, let me know; I am still new to the process.

-- 
Nando Florestan
===============
[skype]    nandoflorestan
[phone]  + 55 (11) 3675-3038
[mobile] + 55 (11) 9820-5451
[internet] http://oui.com.br/
[? Capela] http://acapela.com.br/
[location] São Paulo - SP - Brasil


From nando at acapela.com.br  Wed Feb 13 18:20:30 2008
From: nando at acapela.com.br (Nando)
Date: Wed, 13 Feb 2008 15:20:30 -0200
Subject: [Email-SIG] Patch: Improve recognition of attachment file name,
	with encodings
In-Reply-To: <47B2F15D.9030709@acapela.com.br>
References: <47B2F15D.9030709@acapela.com.br>
Message-ID: <47B326DE.7030607@acapela.com.br>

I have a second suggestion to that same Message.get_filename() method.

It needs to understand filenames that come with text encodings.

The proposed patch is in the attached text file.

Thank you for your time...

Nando Florestan


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: diff.txt
Url: http://mail.python.org/pipermail/email-sig/attachments/20080213/fc27c5fe/attachment.txt 

From stephen at xemacs.org  Wed Feb 13 21:47:54 2008
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Thu, 14 Feb 2008 05:47:54 +0900
Subject: [Email-SIG]  Patch: Improve recognition of attachment file name,
	with encodings
In-Reply-To: <47B326DE.7030607@acapela.com.br>
References: <47B2F15D.9030709@acapela.com.br> <47B326DE.7030607@acapela.com.br>
Message-ID: <87tzkc7fd1.fsf@uwakimon.sk.tsukuba.ac.jp>

Nando writes:

 > I have a second suggestion to that same Message.get_filename() method.
 > 
 > It needs to understand filenames that come with text encodings.

It already does, through utils.collapse_rfc2231_value.  That function
implements RFC 2231, however, not RFC 2047 as you propose.  RFC 2047
itself (section 5) specifically forbids encoded-words in the
parameters of Content-Type and Content-Disposition fields.

 > +        # nando: Some messages specify the file name of attachment this way:
 >          if filename is missing:
 > +            filename = self.get_param('name', missing, 'content-type')
 > +        if filename is missing:
 >              return failobj
 > +        """The following line takes care of cases such as this:
 > +Content-Disposition: attachment;
 > +  filename="=?ISO-8859-1?Q?z=C7D-_Zoltan=5Fchunk=5F5.wmv?="
 > +        """
 > +        filename = decode_header(filename)[0][0]
 >          return utils.collapse_rfc2231_value(filename).strip()

I feel your pain; Japanese MUAs do this kind of thing all the time,
too.  However, decoding such garbage should not be done without
specific permission from a human user, because it's forbidden by the
standard.
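
For comparison, this is roughly what the standard-conforming form looks like on the wire and how it decodes. A minimal sketch against a current Python 3 email package; the sample header is made up, and the file name is borrowed from an example later in this thread.

```python
import email

# RFC 2231 extended parameter: charset'language'percent-encoded-value.
raw = (
    "Content-Type: application/octet-stream\n"
    "Content-Disposition: attachment; filename*=iso-8859-1''P%F4nei.txt\n"
    "\n"
    "(body omitted)\n"
)
part = email.message_from_string(raw)

# get_filename() collapses the (charset, language, value) triple that
# get_param() reports for such a parameter into one text string.
filename = part.get_filename()
print(filename)
```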

From janssen at parc.com  Thu Feb 14 02:33:14 2008
From: janssen at parc.com (Bill Janssen)
Date: Wed, 13 Feb 2008 17:33:14 PST
Subject: [Email-SIG] Patch: Improve recognition of attachment file name,
	with encodings
In-Reply-To: <87tzkc7fd1.fsf@uwakimon.sk.tsukuba.ac.jp> 
References: <47B2F15D.9030709@acapela.com.br> <47B326DE.7030607@acapela.com.br>
	<87tzkc7fd1.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <08Feb13.173321pst."58696"@synergy1.parc.xerox.com>

>  > +        # nando: Some messages specify the file name of attachment this way:
>  >          if filename is missing:
>  > +            filename = self.get_param('name', missing, 'content-type')
>  > +        if filename is missing:
>  >              return failobj
>  > +        """The following line takes care of cases such as this:
>  > +Content-Disposition: attachment;
>  > +  filename="=?ISO-8859-1?Q?z=C7D-_Zoltan=5Fchunk=5F5.wmv?="
>  > +        """
>  > +        filename = decode_header(filename)[0][0]
>  >          return utils.collapse_rfc2231_value(filename).strip()
> 
> I feel your pain; Japanese MUAs do this kind of thing all the time,
> too.  However, decoding such garbage should not be done without
> specific permission from a human user, because it's forbidden by the
> standard.

Would it be possible to make this a configurable option, so that if
the user enables it, it's done?

Bill

From stephen at xemacs.org  Thu Feb 14 07:12:51 2008
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Thu, 14 Feb 2008 15:12:51 +0900
Subject: [Email-SIG] Patch: Improve recognition of attachment file name,
	with encodings
In-Reply-To: <08Feb13.173321pst."58696"@synergy1.parc.xerox.com>
References: <47B2F15D.9030709@acapela.com.br> <47B326DE.7030607@acapela.com.br>
	<87tzkc7fd1.fsf@uwakimon.sk.tsukuba.ac.jp>
	<08Feb13.173321pst."58696"@synergy1.parc.xerox.com>
Message-ID: <87odak6p7g.fsf@uwakimon.sk.tsukuba.ac.jp>

Bill Janssen writes:

 > Would it be possible to make this a configurable option, so that if
 > the user enables it, it's done?

I don't like it at all, but it has to be on the table, because I get
such malformed messages daily.  I don't think it's going to stop.
Users of the email module are going to want to read their mail, and
they're going to want to read the file names so they know where
they're saving attachments and what the content probably is.


From nando at acapela.com.br  Thu Feb 14 12:20:20 2008
From: nando at acapela.com.br (Nando)
Date: Thu, 14 Feb 2008 09:20:20 -0200
Subject: [Email-SIG] Just give me the decoded header?
Message-ID: <47B423F4.2080201@acapela.com.br>

Gentlemen, please consider the following ipython session:


In [98]: m = email.message_from_file(f)

In [99]: print m["subject"]
=?utf-8?b?W291aS5jb20uYnJdIENhcnTDo28gZGUgY3LDqWRpdG8gdGVyw6EgbGVn?=
        =?utf-8?b?aXNsYcOnw6NvIGVzcGVjw61maWNh?=


It gives me the raw subject header value. Now of course I just wanted 
the header in unicode. So I have to do:


In [100]: from email.header import decode_header

In [101]: decode_header(m["subject"])
Out[101]:
[('[oui.com.br] Cart\xc3\xa3o de cr\xc3\xa9dito ter\xc3\xa1 legisla\xc3\xa7\xc3\xa3o espec\xc3\xadfica',
  'utf-8')]

In [102]: print decode_header(m["subject"])[0][0]
[oui.com.br] Cartão de crédito terá legislação específica


My questions are:
1) Why doesn't it currently return the *decoded* header?
2) Would it break too many apps if we changed it?
2.1) If it would, can we add a function such as 
message.getheader("subject") for this?
2.1.1) Would you like me to propose a patch with the obvious implementation?

Sometimes, for things more or less like this, I just feel like 
*subclassing* Message. But I can't. The MIME parser is wired to create 
Messages. I don't think I can tell it to create a MyMessageSubclass. 
This also happens with the convenience function 
email.message_from_file(f). It creates a Message. I *think* I could make 
it into a class method of Message, then I would be able to call 
MyMessage.from_file(). Is this idea -- making things more 
object-oriented -- interesting for you?

For starters, isn't it high time Message became a new-style class by 
inheriting from object?

-- 
Nando Florestan


From mark at msapiro.net  Thu Feb 14 21:42:28 2008
From: mark at msapiro.net (Mark Sapiro)
Date: Thu, 14 Feb 2008 12:42:28 -0800
Subject: [Email-SIG] Just give me the decoded header?
In-Reply-To: <47B423F4.2080201@acapela.com.br>
References: <47B423F4.2080201@acapela.com.br>
Message-ID: <47B4A7B4.6090602@msapiro.net>

Nando wrote:
> 
> Sometimes, for things more or less like this, I just feel like 
> *subclassing* Message. But I can't. The MIME parser is wired to create 
> Messages. I don't think I can tell it to create a MyMessageSubclass. 
> This also happens with the convenience function 
> email.message_from_file(f). It creates a Message. I *think* I could make 
> it into a class method of Message, then I would be able to call 
> MyMessage.from_file(). Is this idea -- making things more 
> object-oriented -- interesting for you?

You can do this now, albeit somewhat differently. See the _class 
argument at <http://docs.python.org/lib/node149.html> and the _factory 
argument at <http://docs.python.org/lib/node148.html>.

e.g. if your mymessage module defines a MyMessage class as a subclass 
of email.message.Message, you can do

import email
import mymessage

f = open('/path/to/message/file')

msg = email.message_from_file(f, mymessage.MyMessage)


to create a MyMessage instance. You can also do

import email.parser
import mymessage

p = email.parser.Parser(mymessage.MyMessage)

to create a parser which will create MyMessage instances.
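
Put together, the two entry points above might be used like this. A minimal sketch for a current Python 3 email package; the MyMessage class and its decoded() method are hypothetical, and message_from_string stands in for message_from_file so the example needs no file on disk.

```python
import email
import email.parser
from email.header import decode_header, make_header
from email.message import Message


class MyMessage(Message):
    """Hypothetical Message subclass that adds a decoded-header accessor."""

    def decoded(self, name, failobj=None):
        raw = self.get(name)
        if raw is None:
            return failobj
        return str(make_header(decode_header(raw)))


text = "Subject: =?iso-8859-1?Q?P=F4nei?=\n\n(body omitted)\n"

# Either entry point accepts the class to instantiate.
msg = email.message_from_string(text, _class=MyMessage)
parsed = email.parser.Parser(_class=MyMessage).parsestr(text)

print(type(msg).__name__, msg.decoded("subject"))
print(msg["subject"])  # item access still returns the raw header value
```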

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


From stephen at xemacs.org  Thu Feb 14 22:26:11 2008
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 15 Feb 2008 06:26:11 +0900
Subject: [Email-SIG]  Just give me the decoded header?
In-Reply-To: <47B423F4.2080201@acapela.com.br>
References: <47B423F4.2080201@acapela.com.br>
Message-ID: <87ir0r6xho.fsf@uwakimon.sk.tsukuba.ac.jp>

Nando writes:

 > My questions are:
 > 1) Why doesn't it currently return the *decoded* header?

Because Message is an implementation of RFC 2822, which says nothing
about decoding headers.  It is very helpful to model your programs
directly on the standards they claim to conform to.

Why restrict the base interface to such a low-level API?  Well,
Internet email is an ancient system going back to RFC 561 at least
(published in 1973), and many things that seem unnecessary today with
modern technology remain necessary because you cannot know what
generation of technology you are communicating with (or even if the
remote user is a dog, as the famous joke goes).  Often optimizations
in modern programs depend on assumptions about standard conformance.

 > 2) Would it break too many apps if we changed it?

It probably would.  Decoding headers more than once will probably
result in passing non-ASCII to the ASCII codec, and boom! you're down.
For example, Mailman is vulnerable to this.

 > 2.1) If it would, can we add a function such as 
 > message.getheader("subject") for this?

You could, but why would you need that particular implementation?

 > Sometimes, for things more or less like this, I just feel like 
 > *subclassing* Message.

Why do that?  In my experience, you will eventually find a need to
pass the original Message to some routine (or even the original
message, in digital signing applications).  If you want to work with a
SmartMessage so that it contains the same data but returns the decoded
headers, just include the original Message as an attribute:

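import email.header

class SmartMessage(object):
    def __init__(self, email_message):
        # Keep the original Message so it can still be passed around.
        self.raw_message = email_message
    def __getitem__(self, key):
        # Returns the list of (string, charset) pairs from decode_header.
        return email.header.decode_header(self.raw_message[key])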

etc.

However, the problem you're going to run into is that this kind of
behavior (whether implemented as a subclass or by enveloping the
raw_message attribute) will make it impossible for apps to distinguish
between Messages and SmartMessages in contexts where it matters.
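
The wrapper idea can be sketched in runnable form as follows. This is an illustration for a current Python 3 email package under stated assumptions, not a recommended design; the class name DecodedView is hypothetical, and item access is made to return joined text rather than the list that decode_header produces.

```python
import email
from email.header import decode_header, make_header


class DecodedView:
    """Hypothetical read-only wrapper; the wrapped Message is never modified."""

    def __init__(self, raw_message):
        # Keep the original around, e.g. for signature verification.
        self.raw_message = raw_message

    def __getitem__(self, name):
        raw = self.raw_message[name]
        if raw is None:
            return None
        return str(make_header(decode_header(raw)))


original = email.message_from_string(
    "Subject: =?utf-8?Q?Cart=C3=A3o_de_cr=C3=A9dito?=\n\n(body omitted)\n"
)
view = DecodedView(original)
print(view["subject"])              # decoded text
print(view.raw_message["subject"])  # untouched raw header
```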

 > But I can't. The MIME parser is wired to create 
 > Messages. I don't think I can tell it to create a MyMessageSubclass. 

Again, why do you want to?  Everything you need to implement the
behavior you want is in the Message already.

 > For starters, isn't it high time Message became a new-style class by 
 > inheriting from object?

Sure, but code speaks louder than words.  Nobody has been willing to
speak up yet. :-(


From hpj at urpla.net  Fri Feb 15 13:47:23 2008
From: hpj at urpla.net (Hans-Peter Jansen)
Date: Fri, 15 Feb 2008 13:47:23 +0100
Subject: [Email-SIG] Just give me the decoded header?
In-Reply-To: <47B423F4.2080201@acapela.com.br>
References: <47B423F4.2080201@acapela.com.br>
Message-ID: <200802151347.23900.hpj@urpla.net>

Am Donnerstag, 14. Februar 2008 schrieb Nando:
> Gentlemen, please consider the following ipython session:
>
>
> In [98]: m = email.message_from_file(f)
>
> In [99]: print m["subject"]
> =?utf-8?b?W291aS5jb20uYnJdIENhcnTDo28gZGUgY3LDqWRpdG8gdGVyw6EgbGVn?=
>         =?utf-8?b?aXNsYcOnw6NvIGVzcGVjw61maWNh?=
>
>
> It gives me the raw subject header value. Now of course I just wanted
> the header in unicode. So I have to do:
>
>
> In [100]: from email.header import decode_header
>
> In [101]: decode_header(m["subject"])
> Out[101]:
> [('[oui.com.br] Cart\xc3\xa3o de cr\xc3\xa9dito ter\xc3\xa1 legisla\xc3\xa7\xc3\xa3o espec\xc3\xadfica',
>   'utf-8')]

Nando, you're just a lucky camper in that case. How would you handle a 
mixture of, say, big5, euc_jp, koi8_r _and_ utf-8 encodings? Please don't 
claim that this is unlikely. Sure it is, but nevertheless it happens, 
and does your code get this pathological case right? 

Wait, let's normalize them - but how do we handle encoding failures? 
Remember, there are way too many MUAs, mailing list managers, email 
gateways, autoresponders, etc. out there which get this wrong! 

Next you ask for email.Message to reparse email addresses to conform to RFC 
2822, and voila, you have created an unmanageable creature called 
Frankenstein.

If you think about the consequences, you will understand that Barry and 
friends will do _everything_ to keep this can o'worms closed in this 
context.
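
To make the mixed-charset case concrete, here is a minimal sketch for a current Python 3 email package. The helper name decode_mixed and the sample header are made up, and replacing undecodable input instead of raising is only one of several possible policies, which is exactly the kind of decision at issue.

```python
from email.header import decode_header


def decode_mixed(raw, fallback="ascii"):
    """Decode a header whose encoded-words may use different charsets.

    Hypothetical helper: unknown charsets and undecodable bytes are
    replaced rather than raised.
    """
    pieces = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, str):  # header contained no encoded-words
            pieces.append(chunk)
            continue
        try:
            pieces.append(chunk.decode(charset or fallback, "replace"))
        except LookupError:  # charset name Python does not know
            pieces.append(chunk.decode(fallback, "replace"))
    return "".join(pieces)


mixed = (
    "=?iso-8859-1?Q?P=F4nei?= =?utf-8?Q?_caf=C3=A9?= "
    "=?x-no-such-charset?Q?_ok?="
)
print(decode_mixed(mixed))
```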

Pete

From nando at acapela.com.br  Sat Feb 16 22:09:06 2008
From: nando at acapela.com.br (Nando)
Date: Sat, 16 Feb 2008 19:09:06 -0200
Subject: [Email-SIG] Patch: Improve recognition of attachment file name,
 with encodings
In-Reply-To: <87odak6p7g.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <47B2F15D.9030709@acapela.com.br>	<47B326DE.7030607@acapela.com.br>	<87tzkc7fd1.fsf@uwakimon.sk.tsukuba.ac.jp>	<08Feb13.173321pst."58696"@synergy1.parc.xerox.com>
	<87odak6p7g.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <47B750F2.8040805@acapela.com.br>

Decoding of RFC 2047 encoded filenames... I attach an updated patch. Now 
it is off by default, but can be enabled by flipping a flag. I have 
updated the docstring for the get_filename() method. Let me know if I am 
forgetting something.

Two questions:

1) I have done this for the get_filename() method only. The flag that 
needs to be set is called *garbage_filename_decoding*. Look, it says 
"filename" in there. But are there any other parameters where the 
improper usage of RFC 2047 also commonly occurs? If so, maybe a single 
flag for all of them would be more appropriate...

2) Is there some flaw in decode_header()? Something that Thunderbird 
displays as "Eduardo & Mônica" is being decoded with the wrong character 
in place of the ô:
repr(decode_header(m["subject"])[0][0])
'Eduardo & M\xf4nica'
The header being tested is:
Subject: =?iso-8859-1?Q?Eduardo_&_M=F4nica?=
In case we are again doing the Right Thing, then why does Thunderbird 
display it the way it was intended?

I am not familiar with the RFCs. When I read Stephen Turnbull's message 
explaining that these are in fact malformed messages, I was very 
worried. (I want the email library to just work...) Fortunately we can 
do the right thing by default, while still supporting decoding of the 
malformed messages.
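
The off-by-default behaviour described here might look like the sketch below. It is written for a current Python 3 email package and is not the attached patch; the function name get_filename_lenient and its decode_rfc2047 parameter are hypothetical stand-ins for the flag.

```python
import email
from email.errors import HeaderParseError
from email.header import decode_header, make_header


def get_filename_lenient(part, decode_rfc2047=False, failobj=None):
    """Hypothetical wrapper around Message.get_filename().

    With decode_rfc2047=False (the default) it behaves like get_filename().
    With decode_rfc2047=True it also decodes RFC 2047 encoded-words, which
    conforming mailers never put in a parameter.
    """
    filename = part.get_filename(failobj)
    if filename is failobj or not decode_rfc2047:
        return filename
    try:
        return str(make_header(decode_header(filename)))
    except (HeaderParseError, LookupError, UnicodeError):
        # Malformed encoded-word or unknown charset: keep the raw value.
        return filename


part = email.message_from_string(
    "Content-Disposition: attachment;"
    ' filename="=?ISO-8859-1?Q?P=F4nei.txt?="\n'
    "\n"
    "(body omitted)\n"
)
print(get_filename_lenient(part))                       # raw parameter value
print(get_filename_lenient(part, decode_rfc2047=True))  # decoded text
```

Making the lenient path an explicit argument keeps the strict behaviour as the default, so existing callers see no change.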

I hope you can approve this small patch...

Nando Florestan



From nando at acapela.com.br  Sat Feb 16 22:25:13 2008
From: nando at acapela.com.br (Nando)
Date: Sat, 16 Feb 2008 19:25:13 -0200
Subject: [Email-SIG] Patch: Improve recognition of attachment file name,
 with encodings
In-Reply-To: <47B750F2.8040805@acapela.com.br>
References: <47B2F15D.9030709@acapela.com.br>	<47B326DE.7030607@acapela.com.br>	<87tzkc7fd1.fsf@uwakimon.sk.tsukuba.ac.jp>	<08Feb13.173321pst."58696"@synergy1.parc.xerox.com>	<87odak6p7g.fsf@uwakimon.sk.tsukuba.ac.jp>
	<47B750F2.8040805@acapela.com.br>
Message-ID: <47B754B9.5000705@acapela.com.br>

Looks like I forgot to attach the patch. Sorry. Here it is.

Nando Florestan


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: diff.txt
Url: http://mail.python.org/pipermail/email-sig/attachments/20080216/bee0203b/attachment.txt 

From nando at acapela.com.br  Sun Feb 17 03:24:00 2008
From: nando at acapela.com.br (Nando)
Date: Sat, 16 Feb 2008 23:24:00 -0300
Subject: [Email-SIG] Patch: Improve recognition of attachment file name,
 with encodings
In-Reply-To: <47B750F2.8040805@acapela.com.br>
References: <47B2F15D.9030709@acapela.com.br>	<47B326DE.7030607@acapela.com.br>	<87tzkc7fd1.fsf@uwakimon.sk.tsukuba.ac.jp>	<08Feb13.173321pst."58696"@synergy1.parc.xerox.com>	<87odak6p7g.fsf@uwakimon.sk.tsukuba.ac.jp>
	<47B750F2.8040805@acapela.com.br>
Message-ID: <47B79AC0.2090604@acapela.com.br>

OK, I get question number 2 now. My question was:

2) Is there some flaw in decode_header()? Something that Thunderbird 
displays as "Eduardo & Mônica" is being decoded with the wrong character 
in place of the ô:
repr(decode_header(m["subject"])[0][0])
'Eduardo & M\xf4nica'
The header being tested is:
Subject: =?iso-8859-1?Q?Eduardo_&_M=F4nica?=
In case we are again doing the Right Thing, then why does Thunderbird 
display it the way it was intended?


The answer is I have to use codecs.decode():

import codecs

In [20]: [(s, encoding)] = decode_header("=?iso-8859-1?Q?P=F4nei?=")

In [21]: s
Out[21]: 'P\xf4nei'

In [22]: encoding
Out[22]: 'iso-8859-1'

In [23]: print codecs.decode(s, encoding)
Pônei

Well, that just makes it even harder to use the return value of the 
decode_header() function. And instead of encapsulating all that 
complexity in the email library, you are forcing every user of the 
library to find all this out by himself, just as I had to.

This is very un-MartinFowler-like, if you pardon that expression :p

I understand Stephen Turnbull's point that it is useful to map the 
Message class to RFC 2822, because some users need that. However, that 
is not what *I* need - I want a high-level email library, and I am sure 
many others do too.

Other mail libraries have faced the challenges of encodings before. I 
don't really see why we in Python should hide from that can o'worms (as 
Hans-Peter Jansen put it). It is a dirty job, but someone gotta do it!

"How would you handle a mixture of say: big5, euc_jp, koi8_r _and_ utf-8 
encodings?"

Well I don't know what the flabbergast you are talking about, but:

Are you scared?

Why should the application developer have to deal with something that 
you e-mail experts are much more qualified to implement?

What is it, are you afraid of having a module accused of being "buggy"? 
(If so, you know very well that this is not the free software way.)

What about code reuse? Did you see how much I had to do just in order to 
print a Subject header?

I do think that a Message subclass (HighLevelMessage?) could play this 
role nicely - a high-level interface. Has anyone done this before? (It 
is a very obvious idea.) Is anybody else interested at all? Most of the 
vibes I get here are like "don't do this, don't do that"...

Thanks to Mark Sapiro for showing me a way to do what I want.

Nando Florestan


From nando at acapela.com.br  Tue Feb 19 10:55:25 2008
From: nando at acapela.com.br (Nando)
Date: Tue, 19 Feb 2008 06:55:25 -0300
Subject: [Email-SIG] Patch: Improve recognition of attachment file name,
 with encodings
In-Reply-To: <47B79AC0.2090604@acapela.com.br>
References: <47B2F15D.9030709@acapela.com.br>	<47B326DE.7030607@acapela.com.br>	<87tzkc7fd1.fsf@uwakimon.sk.tsukuba.ac.jp>	<08Feb13.173321pst."58696"@synergy1.parc.xerox.com>	<87odak6p7g.fsf@uwakimon.sk.tsukuba.ac.jp>
	<47B750F2.8040805@acapela.com.br> <47B79AC0.2090604@acapela.com.br>
Message-ID: <47BAA78D.7080704@acapela.com.br>

You have succeeded in confusing me to the point I don't know whether my 
own proposed patch is useful anymore.

It would seem to mean a change in philosophy, from "just implement the 
standards closely" to "give the application developer what he needs". 
But this goal would only be accomplished with many alterations to the 
codebase.

So if this is true:

Stephen J. Turnbull wrote:
>  > 1) Why doesn't it currently return the *decoded* header?
>
> Because Message is an implementation of RFC 2822, which says nothing
> about decoding headers.  It is very helpful to model your programs
> directly on the standards they claim to conform to.
>   
I cannot see any documentation stating that the way of this project is 
to have each class implement one RFC. If there *were* a writeup on this 
somewhere, maybe I wouldn't have annoyed you with so many questions and 
absurd propositions.

But why would you lie to me? So it must be true...

Then I don't agree with my own patch anymore.

Anyway, the necessity of a high-level interface remains and nobody has 
answered my question: whither?

Nando Florestan


From stephen at xemacs.org  Tue Feb 19 23:51:51 2008
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Wed, 20 Feb 2008 07:51:51 +0900
Subject: [Email-SIG] Patch: Improve recognition of attachment file name,
 with encodings
In-Reply-To: <47BAA78D.7080704@acapela.com.br>
References: <47B2F15D.9030709@acapela.com.br> <47B326DE.7030607@acapela.com.br>
	<87tzkc7fd1.fsf@uwakimon.sk.tsukuba.ac.jp>
	<08Feb13.173321pst."58696"@synergy1.parc.xerox.com>
	<87odak6p7g.fsf@uwakimon.sk.tsukuba.ac.jp>
	<47B750F2.8040805@acapela.com.br> <47B79AC0.2090604@acapela.com.br>
	<47BAA78D.7080704@acapela.com.br>
Message-ID: <87tzk4woe0.fsf@uwakimon.sk.tsukuba.ac.jp>

Nando writes:

 > > Because Message is an implementation of RFC 2822, which says nothing
 > > about decoding headers.  It is very helpful to model your programs
 > > directly on the standards they claim to conform to.

 > I cannot see any documentation stating that the way of this project is 
 > to have each class implement one RFC. If there *were* a writeup on this 
 > somewhere, maybe I wouldn't have annoyed you with so many questions and 
 > absurd propositions.

But I didn't say that it was a general principle.  I said that Message
is an implementation of RFC 2822.  For historical reasons it lives in
the rfc822 module.  Its main docstring says:

  RFC 2822 message manipulation.

  Note: This is only a very rough sketch of a full RFC-822 parser; in
  particular the tokenizing of addresses does not adhere to all the
  quoting rules.

  Note: RFC 2822 is a long awaited update to RFC 822.  This module
  should conform to RFC 2822, and is thus mis-named (it's not worth
  renaming it).  Some effort at RFC 2822 updates have been made, but a
  thorough audit has not been performed.  Consider any RFC 2822
  non-conformance to be a bug.

 > Anyway, the necessity of a high-level interface remains and nobody has 
 > answered my question: whither?

As I wrote earlier, I think writing a *general* high-level interface
is going to be very hard.  I don't have time to devote to it.  If you
have a set of use cases that have a lot of common needs, then you can
and should write a module to serve those needs.  You'll probably find
other people with similar needs, and some of them will help.


From stuart at stuartbishop.net  Tue Feb 26 04:54:48 2008
From: stuart at stuartbishop.net (Stuart Bishop)
Date: Tue, 26 Feb 2008 10:54:48 +0700
Subject: [Email-SIG] Just give me the decoded header?
In-Reply-To: <47B423F4.2080201@acapela.com.br>
References: <47B423F4.2080201@acapela.com.br>
Message-ID: <47C38D88.8000102@stuartbishop.net>

Nando wrote:
> Gentlemen, please consider the following ipython session:
> 
> 
> In [98]: m = email.message_from_file(f)
> 
> In [99]: print m["subject"]
> =?utf-8?b?W291aS5jb20uYnJdIENhcnTDo28gZGUgY3LDqWRpdG8gdGVyw6EgbGVn?=
>         =?utf-8?b?aXNsYcOnw6NvIGVzcGVjw61maWNh?=
> 
> 
> It gives me the raw subject header value. Now of course I just wanted 
> the header in unicode. So I have to do:
> 
> 
> In [100]: from email.header import decode_header
> 
> In [101]: decode_header(m["subject"])
> Out[101]:
> [('[oui.com.br] Cart\xc3\xa3o de cr\xc3\xa9dito ter\xc3\xa1 legisla\xc3\xa7\xc3\xa3o espec\xc3\xadfica',
>   'utf-8')]
> 
> In [102]: print decode_header(m["subject"])[0][0]
> [oui.com.br] Cartão de crédito terá legislação específica
> 
> 
> My questions are:
> 1) Why doesn't it currently return the *decoded* header?

Because you often need access to the raw header. Also, not all headers are
encoded the same. While what you have works for Subject:, it doesn't work
for To:, Reply-To:, From: etc.
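
One reason address headers need separate handling is that RFC 2047 allows encoded-words only in the display-name phrase (or a comment), never in the address itself, so the header has to be parsed before anything is decoded. A minimal sketch for a current Python 3 email package; the helper name decode_address and the sample address are made up.

```python
from email.header import decode_header, make_header
from email.utils import parseaddr


def decode_address(raw):
    """Hypothetical helper: return (display name as text, address).

    Only the display name is RFC 2047-decoded; the address is returned
    exactly as parsed.
    """
    name, addr = parseaddr(raw)
    if name:
        name = str(make_header(decode_header(name)))
    return name, addr


print(decode_address("=?iso-8859-1?Q?M=F4nica?= <monica@example.com>"))
```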

> 2) Would it break too many apps if we changed it?

Yes. Particularly apps that need to log or report broken email headers that
cannot be decoded.

> 2.1) If it would, can we add a function such as 
> message.getheader("subject") for this?
> 2.1.1) Would you like me to propose a patch with the obvious implementation?

I'd love to see things become more Unicode aware.

Perhaps return an object implementing __str__() and __unicode__() (or
decode()). The cast-to-unicode conversion would decode headers with known
encodings and raise an exception on headers with unknown encodings.
Similarly, setting headers using Unicode strings would use the known
encodings to perform the reverse operation. And you still have access to the
raw value if you want to round trip.
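
A rough sketch of such an object, for a current Python 3 email package. The class name HeaderValue is hypothetical; Python 3 has no __unicode__(), so only the decode() variant is shown, and the reverse (setting headers from Unicode) is left out.

```python
from email.header import decode_header


class HeaderValue:
    """Hypothetical header wrapper: raw value plus an explicit decode()."""

    def __init__(self, raw):
        self.raw = raw

    def __str__(self):
        return self.raw  # round-trips exactly what was parsed

    def decode(self):
        """Return decoded text; raise LookupError on an unknown charset."""
        parts = []
        for chunk, charset in decode_header(self.raw):
            if isinstance(chunk, bytes):
                chunk = chunk.decode(charset or "ascii")
            parts.append(chunk)
        return "".join(parts)


value = HeaderValue("=?iso-8859-1?Q?Eduardo_&_M=F4nica?=")
print(str(value))      # raw header, unchanged
print(value.decode())  # decoded text
```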


-- 
Stuart Bishop <stuart at stuartbishop.net>
http://www.stuartbishop.net/


From stuart at stuartbishop.net  Tue Feb 26 05:11:51 2008
From: stuart at stuartbishop.net (Stuart Bishop)
Date: Tue, 26 Feb 2008 11:11:51 +0700
Subject: [Email-SIG] Patch: Improve recognition of attachment file name,
 with encodings
In-Reply-To: <47B79AC0.2090604@acapela.com.br>
References: <47B2F15D.9030709@acapela.com.br>	<47B326DE.7030607@acapela.com.br>	<87tzkc7fd1.fsf@uwakimon.sk.tsukuba.ac.jp>	<08Feb13.173321pst."58696"@synergy1.parc.xerox.com>	<87odak6p7g.fsf@uwakimon.sk.tsukuba.ac.jp>	<47B750F2.8040805@acapela.com.br>
	<47B79AC0.2090604@acapela.com.br>
Message-ID: <47C39187.6040303@stuartbishop.net>

Nando wrote:
> OK, I get question number 2 now. My question was:
> 
> 2) Is there some flaw in decode_header()? Something that Thunderbird 
> displays as "Eduardo & Mônica" is being decoded with the wrong character 
> in place of the ô:
> repr(decode_header(m["subject"])[0][0])
> 'Eduardo & M\xf4nica'
> The header being tested is:
> Subject: =?iso-8859-1?Q?Eduardo_&_M=F4nica?=
> In case we are again doing the Right Thing, then why does Thunderbird 
> display it the way it was intended?
> 
> 
> The answer is I have to use codecs.decode():
> 
> import codecs
> 
> In [20]: [(s, encoding)] = decode_header("=?iso-8859-1?Q?P=F4nei?=")
> 
> In [21]: s
> Out[21]: 'P\xf4nei'
> 
> In [22]: encoding
> Out[22]: 'iso-8859-1'
> 
> In [23]: print codecs.decode(s, encoding)
> Pônei
> 
> Well, that just makes it even harder to use the return value of the 
> decode_header() function. And instead of encapsulating all that 
> complexity in the email library, you are forcing every user of the 
> library to find all this out by himself, just as I had to.

It gets harder, as you are not handling Unicode domain names. Code to
convert email addresses between their ASCII and Unicode representations can
be found at http://stuartbishop.net/Software/EmailAddress/

(Barry - we should discuss getting code to do this into the standard library
again. I think I opened a bug on this soon after I wrote it - in 2004!)

It is a bit of a learning curve, and I suspect that most users of the
library have written the same or similar helpers, possibly several times.
eg. the nearly mandatory header decoder:

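import email.Header

def decode_header(s):
    '''Decode an RFC2047 email header into a Unicode string.'''
    # email.Header.decode_header returns (string, charset) pairs;
    # charset is None for unencoded runs, so fall back to ASCII.
    parts = email.Header.decode_header(s)
    return u''.join([b[0].decode(b[1] or 'ascii') for b in parts])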


-- 
Stuart Bishop <stuart at stuartbishop.net>
http://www.stuartbishop.net/
