From richardjones at optushome.com.au Mon Nov 10 17:28:15 2003
From: richardjones at optushome.com.au (Richard Jones)
Date: Mon Nov 10 17:48:23 2003
Subject: [Email-SIG] Long header continuations
Message-ID: <200311110928.15385.richardjones@optushome.com.au>

[sorry, I don't have the time to sign up to another mailing list :( ]

Long ago, I posted the following bug regarding Python's rfc822 module's
handling of long header continuation lines:

https://sf.net/tracker/?func=detail&atid=105470&aid=504152&group_id=5470

... and it was lost to the ether. I'd really like for it to be fixed.

Barry informed me of this email-sig effort to develop the next generation
of the email package. Apparently it has the same bug, and needs fixing too.

    Richard

From donn at u.washington.edu Mon Nov 17 16:49:17 2003
From: donn at u.washington.edu (Donn Cave)
Date: Mon Nov 17 16:58:43 2003
Subject: [Email-SIG] RFC 2231 continuations vs. encoding
Message-ID:

The following leads to an "unpack list of wrong size" error in
Utils.decode_rfc2231()

| Content-Type: IMAGE/GIF; NAME*0="est_pais.pl?s_xref=20020502elpepiint_12.Tes&i_type=3&s_anchor=el pepiint&end=11696";
|     NAME*1="69203.gif"
| Content-Transfer-Encoding: BASE64
| Content-ID:
| Content-Description:
|
| R0lGODlA...base64stuff...2lmDQ==

As far as I can tell, the code assumes that a continued value like
NAME*0 must be encoded, where the RFC says an encoded value should be
named NAME*0*. I fixed it by checking for param[1][0].endswith('*'),
otherwise substitute default value for charset and language.

    Donn Cave, University Computing Services, University of Washington
    donn@u.washington.edu

(not subscribed.)
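
The distinction drawn above between NAME*0 (a plain continuation) and
NAME*0* (an encoded continuation) can be illustrated with the present-day
standard-library email package. This is only a sketch: it assumes Python 3,
it is not the 2003 Utils.decode_rfc2231() code nor the fix described above,
and the header values are invented stand-ins for the real message.

```python
# Illustrative only: assumes the Python 3 standard-library email package,
# not the 2003 email.Utils code discussed above. Header values are invented.
from email import message_from_string
from email.utils import collapse_rfc2231_value

# Plain continuation: name*0 and name*1 carry no charset or language,
# so the pieces are simply joined in order.
plain = message_from_string(
    'Content-Type: image/gif; name*0="part-one-"; name*1="two.gif"\n\n'
)
plain_name = plain.get_param("name")

# Encoded value: the trailing "*" marks a value of the form
# charset'language'percent-encoded-text.
encoded = message_from_string(
    "Content-Type: image/gif; name*=us-ascii'en'hello%20world.gif\n\n"
)
encoded_name = encoded.get_param("name")  # a (charset, language, text) tuple
encoded_text = collapse_rfc2231_value(encoded_name)

print(plain_name)
print(encoded_text)
```

Only the parameter whose name ends in "*" comes back as a
(charset, language, text) tuple; the plain continuation comes back as an
ordinary string, which is the case the original code did not allow for.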
From barry at python.org Fri Nov 21 10:19:30 2003
From: barry at python.org (Barry Warsaw)
Date: Fri Nov 21 10:19:36 2003
Subject: [Email-SIG] A suggestion: HTML stripping
Message-ID: <1069427970.2383.37.camel@anthem>

I had a suggestion from a happy email package user that I thought might
be interesting to consider. He was using email as a replacement for the
Perl demime thingie. He was generally happy about what email allowed him
to do, except for one thing. He was using a DecodedGenerator but wanted
to strip text/html parts of its tags, leaving just plain text.

In Mailman, I actually call out to something like lynx to render
text/html into plain text, but I think he wanted something simpler. He
just wanted to rip out all the tags, and ended up using an HTMLParser
class to do this.

Something to think about for email 3.0.

-Barry

From matt at mondoinfo.com Fri Nov 21 15:18:55 2003
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Fri Nov 21 15:51:50 2003
Subject: [Email-SIG] A suggestion: HTML stripping
In-Reply-To: <1069427970.2383.37.camel@anthem>
References: <1069427970.2383.37.camel@anthem>
Message-ID: <1069433052.27.1622@mint-julep.mondoinfo.com>

> I had a suggestion from a happy email package user that I thought
> might be interesting to consider. He was using email as a
> replacement for the Perl demime thingie. He was generally happy
> about what email allowed him to do, except for one thing. He was
> using a DecodedGenerator but wanted to strip text/html parts of its
> tags, leaving just plain text.

> In Mailman, I actually call out to something like lynx to render
> text/html into plain text, but I think he wanted something simpler.
> He just wanted to rip out all the tags, and ended up using an
> HTMLParser class to do this.

It's a sad state we've come to when we have to turn mail back into
text.

I have code that's just like what you describe.
In fact it's a slightly-twiddled version of some code that Alex Martelli
posted to comp.lang.python a while back. It's sufficiently trivial that I
may as well just paste it here in case it's of use to anyone:

# Very slightly modified from Alex Martelli's news post
# <9cpm4202cv1@news1.newsguy.com> of May 2, 2001,
# Subject: Stripping HTML tags from a string
# Thanks, Alex

import sgmllib

class Cleaner(sgmllib.SGMLParser):

    entitydefs={"nbsp": " "} # I'll break if I want to

    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.result = []

    def do_p(self, *junk):
        self.result.append('\n')

    def do_br(self, *junk):
        self.result.append('\n')

    def handle_data(self, data):
        self.result.append(data)

    def cleaned_text(self):
        return ''.join(self.result)

def stripHTML(text):
    c=Cleaner()
    try:
        c.feed(text)
    except sgmllib.SGMLParseError:
        raise ValueError,"Unable to parse HTML"
    else:
        t=c.cleaned_text()
    return t

Regards,
Matt

From barry at python.org Fri Nov 21 16:27:52 2003
From: barry at python.org (Barry Warsaw)
Date: Fri Nov 21 16:28:03 2003
Subject: [Email-SIG] Generator.HeaderParsedGenerator
In-Reply-To:
References:
Message-ID: <1069450072.2383.159.camel@anthem>

On Sun, 2003-10-05 at 02:51, Jason R.Mastaler wrote:
> Can the attached patch be considered for inclusion in email? This
> issue is a former mimelib tracker item, but those trackers are now
> disabled. I've included the previous commentary leading to the patch
> below. FWIW, we've been using this in TMDA successfully for months
> now.

So I've been thinking a little bit about this recently. I'm not sure I
feel comfortable adding this to email 2.x/Python 2.3. It's definitely a
new feature that is probably not appropriate for a patch release.

But there's a deeper issue which we might want to think about for email
3.0. Currently we decide how to render a message by its Content-Type
header, but that may not be optimal.
If we had to resort to the HeaderParser to parse a message, the
Content-Type header may lie, or at least it won't accurately describe
the algorithm we should use to flatten the message.

It doesn't make sense to use some other header, or change the
Content-Type header, so I'm thinking we want individual messages to have
some other say in how they get flattened, either via attribute setting
or method call. Perhaps messages should have an "effective" content type
which, if present is used instead to determine how to flatten the
message.

I'm just thinking out loud here.

-Barry

From tim at catseye.net Fri Nov 21 17:37:30 2003
From: tim at catseye.net (Tim Legant)
Date: Fri Nov 21 17:41:01 2003
Subject: [Email-SIG] Re: Generator.HeaderParsedGenerator
References: <1069450072.2383.159.camel@anthem>
Message-ID: <86isldcpqt.fsf@skitty.catseye.net>

Barry Warsaw writes:

> On Sun, 2003-10-05 at 02:51, Jason R.Mastaler wrote:
> > Can the attached patch be considered for inclusion in email? This
> > issue is a former mimelib tracker item, but those trackers are now
> > disabled. I've included the previous commentary leading to the patch
> > below. FWIW, we've been using this in TMDA successfully for months
> > now.
>
> So I've been thinking a little bit about this recently. I'm not sure I
> feel comfortable adding this to email 2.x/Python 2.3. It's definitely a
> new feature that is probably not appropriate for a patch release.

Turns out it's not a complete solution anyhow. For example, if you
build a new Message from scratch and attach the HeaderParsed Message,
it will still blow. We do this in TMDA to generate auto-responses. I
hacked an unhappy solution around it for now, but it needs to be
addressed "correctly". The problem is that Generator clones itself to
flatten sub-parts and thus uses the wrong class of generator to
generate the HeaderParsed sub-part.

> But there's a deeper issue which we might want to think about for email
> 3.0.
> Currently we decide how to render a message by its Content-Type
> header, but that may not be optimal.

We don't actually look at the Content-Type, in at least one case, which
is the cause of another problem Jason just discovered. We had a message
show up with the following header field:

Content-Type: multipour alternative; boundary="57C41D49D2E1982C2C.B7A"

Parser._parsebody assumes that, if it finds a valid boundary, the
message is a multipart message and proceeds to parse it as such.
Generator comes along, asks for the content-type and gets 'text/plain',
because Message.get_content_type() can't make any sense out of
'multipour alternative'. Unfortunately, Message._payload is a list of
sub-Message objects and Generator._handle_text raises a TypeError.

> If we had to resort to the HeaderParser to parse a message, the
> Content-Type header may lie, or at least it won't accurately
> describe the algorithm we should use to flatten the message.

For any HeaderParsed Message that isn't 'text/plain', Content-Type will
definitely be lying.

> It doesn't make sense to use some other header, or change the
> Content-Type header, so I'm thinking we want individual messages to have
> some other say in how they get flattened, either via attribute setting
> or method call. Perhaps messages should have an "effective" content
> type which, if present is used instead to determine how to flatten the
> message.

Essentially, the data structure of Message._payload must always "win",
regardless of Content-Type, or there will be errors. The "effective"
type is one way to implement that...

The email package has to work in two different scenarios; in some
cases, email users will want to know about the errors and in other
cases they need a valid rfc2822 message generated, as close as possible
to the broken original, yet still able to be presented in an MUA, or
re-sent, or whatever. TMDA clearly falls into the second category
because, as a delivery agent, we can't live with dropped/lost mail.
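
The idea that the payload's data structure should "win" over the declared
Content-Type can be sketched as follows. This is an assumption-laden
illustration, not code from the email package: it uses the Python 3
standard library, and effective_content_type() is a hypothetical helper
whose fallback choices ('multipart/mixed', 'text/plain') are just one
plausible reading of the proposal.

```python
# Illustrative only: assumes the Python 3 standard-library email package.
# effective_content_type() is a hypothetical helper, not part of the package.
from email.message import Message
from email.parser import HeaderParser


def effective_content_type(msg):
    """Return the type a generator could act on, letting payload shape win."""
    if msg.is_multipart() and msg.get_content_maintype() != "multipart":
        # The payload is a list of sub-messages, whatever the header claims.
        return "multipart/mixed"
    if not msg.is_multipart() and msg.get_content_maintype() == "multipart":
        # The payload is a plain string, e.g. after header-only parsing.
        return "text/plain"
    return msg.get_content_type()


# Case 1: a bogus Content-Type whose payload is nevertheless a list.
part = Message()
part.set_payload("hello\n")
broken = Message()
broken["Content-Type"] = 'multipour alternative; boundary="57C41D49D2E1982C2C.B7A"'
broken.attach(part)

# Case 2: a multipart header on a message parsed headers-only,
# so the body stays one undivided string.
header_only = HeaderParser().parsestr(
    'Content-Type: multipart/mixed; boundary="b"\n\n--b\n\nhi\n--b--\n'
)

declared_type = broken.get_content_type()
print(declared_type, effective_content_type(broken))
print(header_only.get_content_type(), effective_content_type(header_only))
```

In both cases the declared type and the payload disagree, and the helper
reports the type that matches what is actually stored in the message.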
I know Matthew has been thinking about an entirely different parsing
framework for email 3.0. I'm looking forward to that, hoping we can
address some of these issues there.

Tim

From PYTHON at telefonica.net Mon Nov 24 14:43:41 2003
From: PYTHON at telefonica.net (PYTHON@telefonica.net)
Date: Mon Nov 24 16:09:27 2003
Subject: [Email-SIG] OffTopic (email lib) : attachments in a forwarded email
Message-ID:

Hi all,

First of all, sorry for this off-topic post (I didn't know where I could
post it).

I have a problem when I try to get the attachments of a forwarded email;
I can only get the attached file that contains the "real attached files
I want", not those files themselves.

# here is my code:

import os
import string
import StringIO
import poplib
import email.Parser

# userpop3, passpop3 and dir_out are defined elsewhere
a = poplib.POP3("pop3.telefonica.net")
a.user(userpop3)
a.pass_(passpop3)
try:
    (numMsgs, totalSize) = a.stat()
    if numMsgs == 0:
        print "Sorry - there are no messages in the mailbox"
    else:
        for thisNum in range(1, numMsgs + 1):
            (server_msg, body, octets) = a.retr(thisNum)
            # "cuerpo" is now a file-like object wrapping the message text
            cuerpo = StringIO.StringIO(string.join(body, '\n'))
            p = email.Parser.Parser()
            msg = p.parse(cuerpo)
            os.chdir(dir_out)
            for part in msg.walk():
                name = part.get_param("name")
                if name != None:
                    f = open(dir_out + "\\" + name, "wb")
                    f.write(part.get_payload(decode=1))
                    f.close()
                    print "Name of the attachment: ", name
finally:
    a.quit()

#thanks

From jason at mastaler.com Mon Nov 24 22:54:12 2003
From: jason at mastaler.com (Jason R. Mastaler)
Date: Mon Nov 24 22:54:22 2003
Subject: [Email-SIG] support CJKCodecs in Charset.py
Message-ID:

Should we support CJKCodecs in Charset.py? See
http://cjkpython.i18n.org/.

Currently we use a combination of 3 separate packages to support
Japanese, Chinese, and Korean. Switching to cjkcodecs would support
all of them with just one package.

In addition, 2 of the 3 packages we currently support are obsolete.
KoreanCodecs and ChineseCodecs have been revoked in favor of
cjkcodecs, so the code is no longer even available.
JapaneseCodecs is still available, but has not been developed in over a
year. cjkcodecs is actively developed, and available in port/package
form for a number of operating systems.

I have a diff against CVS Charset.py that applies support for
cjkcodecs if you are interested.

From barry at python.org Mon Nov 24 23:14:14 2003
From: barry at python.org (Barry Warsaw)
Date: Mon Nov 24 23:14:26 2003
Subject: [Email-SIG] support CJKCodecs in Charset.py
In-Reply-To:
References:
Message-ID: <1069733653.31869.24.camel@anthem>

On Mon, 2003-11-24 at 22:54, Jason R.Mastaler wrote:
> Should we support CJKCodecs in Charset.py? See
> http://cjkpython.i18n.org/.
>
> Currently we use a combination of 3 separate packages to support
> Japanese, Chinese, and Korean. Switching to cjkcodecs would support
> all of them with just one package.
>
> In addition, 2 of the 3 packages we currently support are obsolete.
> KoreanCodecs and ChineseCodecs have been revoked in favor of
> cjkcodecs, so the code is no longer even available. JapaneseCodecs is
> still available, but has not been developed in over a year. cjkcodecs
> is actively developed, and available in port/package form for a number
> of operating systems.
>
> I have a diff against CVS Charset.py that applies support for
> cjkcodecs if you are interested.

There were some similar discussions on mailman-developers a while back.
I'm not qualified to say, and I don't know if folks like Tokio Kikuchi,
Ben Gertzfield, and Martin von Loewis are on this list, but IIRC, there
was some controversy here.

-Barry

From jason at mastaler.com Mon Nov 24 23:33:56 2003
From: jason at mastaler.com (Jason R. Mastaler)
Date: Mon Nov 24 23:34:01 2003
Subject: [Email-SIG] Re: support CJKCodecs in Charset.py
References: <1069733653.31869.24.camel@anthem>
Message-ID:

Barry Warsaw writes:

> There were some similar discussions on mailman-developers a while
> back.
> I'm not qualified to say, and I don't know if folks like Tokio
> Kikuchi, Ben Gertzfield, and Martin von Loewis are on this list, but
> IIRC, there was some controversy here.

Controversy regarding all of cjkcodecs, or just the Japanese portion?

The maintainer of cjkcodecs was also the maintainer of KoreanCodecs,
and he has revoked KoreanCodecs. He also says ChineseCodecs was never
formally released, and it hasn't been maintained for 3 years (a quick
look at the source code confirms this). So, I think these two can
safely be replaced.

Should I modify my patch to leave the Japanese alone, and only replace
Korean and Chinese with cjkcodecs? Do you want to investigate this
issue of cjkcodecs vs. JapaneseCodecs further?

From tkikuchi at is.kochi-u.ac.jp Tue Nov 25 19:10:48 2003
From: tkikuchi at is.kochi-u.ac.jp (Tokio Kikuchi)
Date: Tue Nov 25 19:10:55 2003
Subject: [Email-SIG] Re: support CJKCodecs in Charset.py
In-Reply-To:
References: <1069733653.31869.24.camel@anthem>
Message-ID: <3FC3EF88.1080208@is.kochi-u.ac.jp>

Hi,

Jason R. Mastaler wrote:
>
> Should I modify my patch to leave the Japanese alone, and only replace
> Korean and Chinese with cjkcodecs? Do you want to investigate this
> issue of cjkcodecs vs. JapaneseCodecs further?
>

Both JapaneseCodecs and CJKCodecs include 'euc-jp' and 'iso-2022-jp'
which are required in internal conversion of messages in Mailman.
While CJKCodecs include jis-x0213 ('iso-2022-jp-[23]'), JapaneseCodecs
can handle half-width-kana by 'iso-2022-jp-ext'.

I think we can handle well behaved japanese messages by CJKCodecs and
users can add JapaneseCodecs if they want to do with extended charsets.

+1 for CJKCodecs

--
Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/

From jason at mastaler.com Wed Nov 26 14:12:04 2003
From: jason at mastaler.com (Jason R.
Mastaler)
Date: Wed Nov 26 14:12:07 2003
Subject: [Email-SIG] Re: support CJKCodecs in Charset.py
References: <1069733653.31869.24.camel@anthem> <3FC3EF88.1080208@is.kochi-u.ac.jp>
Message-ID:

Tokio Kikuchi writes:

> Both JapaneseCodecs and CJKCodecs include 'euc-jp' and 'iso-2022-jp'
> which are required in internal conversion of messages in Mailman.
> While CJKCodecs include jis-x0213 ('iso-2022-jp-[23]'),
> JapaneseCodecs can handle half-width-kana by 'iso-2022-jp-ext'.
>
> I think we can handle well behaved japanese messages by CJKCodecs
> and users can add JapaneseCodecs if they want to do with extended
> charsets.
>
> +1 for CJKCodecs

Thank-you for the clarification Kikuchi-san.

As {CJK,Japanese,Korean}Codecs have their own .pth file that registers
their encoding aliases, there's no need to use a specific module prefix
like 'japanese.' or 'korean.' as is currently done in Charset.py.

If we remove these prefixes there won't be an incompatibility issue
with {CJK,Japanese,Korean}Codecs. The only compatibility issue will be
with ChineseCodecs which will no longer be supported.
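
Looking codecs up by their plain charset names, without a 'japanese.' or
'korean.' module prefix, can be sketched like this. The sketch assumes a
present-day Python 3, where CJK codecs ship with the standard library; it
does not reproduce Charset.py or the patch discussed above, and the sample
strings and the round_trips() helper are made up for illustration.

```python
# Illustrative only: assumes a Python 3 whose standard library provides the
# CJK codecs under their plain names. round_trips() is a hypothetical helper.
import codecs

# Plain charset names, with no 'japanese.' or 'korean.' module prefix.
SAMPLES = {
    "euc-jp": "日本語",
    "iso-2022-jp": "日本語",
    "shift_jis": "日本語",
    "euc-kr": "한국어",
    "gb2312": "中文",
    "big5": "中文",
}


def round_trips(charset, text):
    """Check that the codec is registered and encodes/decodes losslessly."""
    codecs.lookup(charset)  # raises LookupError if the name is unknown
    return text.encode(charset).decode(charset) == text


results = {cs: round_trips(cs, text) for cs, text in SAMPLES.items()}
for charset, ok in results.items():
    print(charset, "ok" if ok else "FAILED")
```

Because the lookup goes through the codec registry by name, the calling
code does not need to know which package supplies a given codec.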