From barry at python.org Sun Oct 3 05:23:42 2004
From: barry at python.org (Barry Warsaw)
Date: Sun Oct 3 05:23:47 2004
Subject: [Email-SIG] email 3.0 API changes
Message-ID: <1096773822.7306.63.camel@geddy.wooz.org>

I've finally gotten around to making the major email 3.0 API changes I'd
planned for Python 2.4. These are now checked into the Py2.4 tree and
I'll try to spin a standalone email 3.0 alpha distutils package soon.

In brief, here are the changes:

* All features that in email 2.x raised DeprecationWarnings have now
  been removed. These include: the _encoder arg to the MIMEText
  constructor, Message.add_payload(), Utils.dump_address_pair(),
  Utils.decode(), Utils.encode()
* These now raise DeprecationWarnings: Generator.__call__(),
  Message.get_type(), Message.get_main_type(), Message.get_subtype(),
  and the 'strict' argument to the Parser constructor.
* Support for Python versions earlier than 2.3 is removed.
* Renamed the Defect classes.

Is there anything else that people want to try to get into email 3.0?
Note that Python 2.4 beta 1 is scheduled to be released on October 11.
That same event will freeze email 3.0's API until Python 2.5.

Note that I'm not promising that I will have any time to develop new
code for the email package between now and then, but if you have
patches, I'm willing to talk about it. :)

-Barry

From barry at python.org Sun Oct 3 05:36:43 2004
From: barry at python.org (Barry Warsaw)
Date: Sun Oct 3 05:36:46 2004
Subject: [Email-SIG] Two other API nits
Message-ID: <1096774603.7313.69.camel@geddy.wooz.org>

Two other things we've talked about in the past, which, if we're going
to do, we should do now.
However, they will break backward compatibility:

* Change the mangle_from_ default in Generator's constructor from True
  to False
* Change Message.__str__() so that by default it does not include the
  Unix From line.

These are two default settings that I think were wrong. However, we
can't change these in a backward compatible way, so we will probably
break code if we change them.

Does anybody have any clever ways of changing the defaults without
breaking gobs of code? We could potentially fix Generator.__init__() by
using a different argument name and deprecating mangle_from_, but I
don't see any good way of changing __str__().

I'm open to suggestions and opinions, including "yah, they're broken but
we've lived with it this long and it's better not to change them".

-Barry

From andrew.stuart at xse.com.au Sun Oct 3 12:54:48 2004
From: andrew.stuart at xse.com.au (Andrew Stuart)
Date: Sun Oct 3 12:54:14 2004
Subject: [Email-SIG] catchmail - a new open source python utility for storing emails in a database
Message-ID: <00f101c4a937$693939f0$4001a8c0@beast>

Hello Python email-sig people,

I am releasing a utility called catchmail to open source. Catchmail
writes emails into a Postgres database. It is based on an extended
version of the Yukatan data model (a SQL schema for relational storage
of RFC822 email messages).

I commissioned Mark Hammond to write the catchmail code for my company
XSE and now I have decided to release it to the public.

Here is the catchmail homepage:
http://www.users.bigpond.net.au/mysite/catchmail.htm

It's not quite ready for release, however. It needs more people to try
it and check it out before a full-scale public release. I thought the
Python email-sig was the best place to do this.
If anyone has the time or the inclination, I would value a code review
and advice on how to do things differently or better. I'm no great
Python programmer, so any volunteers who might be interested in helping
to enhance and support catchmail would be much appreciated.

I have set up a newsgroup at http://groups-beta.google.com/catchmail

There is also a final known problem that I would value advice on.
Everything seems to be working fine except one thing: unicode.

If I create the Postgres database using this command, everything seems
to run fine and I can import 4000 emails:

    createdb -U postgres catchmail;

If I create the Postgres database using this command, Postgres starts
to come back with unicode errors when I do the import:

    createdb --encoding=UNICODE -U postgres test

The import process starts to fail on lots of messages with this error:

    Database error: ERROR: invalid byte sequence for encoding "UNICODE": 0xe92062

The objective is to have the database in Unicode, so I suppose it's
quite an important problem to resolve. It looks to me like some sort of
encoding/decoding requirement, but although I had a good look I
couldn't sort it out. I'm afraid I don't much understand how unicode is
meant to be used in this sort of application. If you can throw any
light on it for me it would be appreciated. How SHOULD unicode be
implemented for a utility such as this? I'd like catchmail to be as
flexible as possible and to lose as little data as possible through
things like character set conversions.

I found some references to client encoding and multibyte in the
Postgres docs here, but maybe it should be fixed in the Python code?
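A minimal sketch of one way the conversion could be done on the Python
side, under the assumption (not confirmed anywhere in this thread) that
the rejected bytes are ISO-8859-1 text: 0xe9 0x20 0x62 is not valid
UTF-8, but it reads as "é b" in ISO-8859-1. The function name to_utf8
and its parameters are invented for this illustration. The declared
charset could, for instance, come from Message.get_content_charset().

```python
def to_utf8(raw, declared_charset=None):
    """Return UTF-8 bytes for a chunk of raw message bytes.

    Tries the charset the message declares (if any), then UTF-8 itself,
    and finally falls back to ISO-8859-1 (an assumption of this sketch).
    """
    for charset in (declared_charset, "utf-8"):
        if not charset:
            continue
        try:
            return raw.decode(charset).encode("utf-8")
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            continue
    # ISO-8859-1 maps every byte to a code point, so this cannot fail,
    # although the result is only correct if the guess is right.
    return raw.decode("iso-8859-1").encode("utf-8")


# The byte sequence from the error message above, 0xe9 0x20 0x62.
converted = to_utf8(b"\xe9 b")
```

Because the fallback is a guess, keeping the original bytes alongside
the converted text would avoid losing data when the guess is wrong.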
    SET CLIENT_ENCODING TO 'value';

http://jamesthornton.com/postgres/7.3/postgres/multibyte.html
http://www.postgresql.org/docs/7.4/static/multibyte.html#MULTIBYTE-TRANSLATION-TABLE

The latest version of catchmail is the one found on the website at
http://www.users.bigpond.net.au/mysite/catchmail.htm

Any feedback on catchmail or your experience with catchmail is valued.

Thanks to the great work of Mark Hammond and Jukka Zitting!

Andrew Stuart
andrew.stuart@xse.com.au

From phd at phd.pp.ru Sun Oct 3 14:16:17 2004
From: phd at phd.pp.ru (Oleg Broytmann)
Date: Sun Oct 3 14:16:20 2004
Subject: [Email-SIG] catchmail - a new open source python utility for storing emails in a database
Message-ID: <20041003121617.GA5370@phd.pp.ru>

On Sun, Oct 03, 2004 at 08:54:48PM +1000, Andrew Stuart wrote:
> http://groups-beta.google.com/catchmail

Redirected me to www.google.com.

Also, I recommend using the Python DB API for portability instead of
concentrating on Postgres. Even better (more portable) would probably
be an object-relational mapper.

Oleg.
-- 
Oleg Broytmann http://phd.pp.ru/ phd@phd.pp.ru
Programmers don't die, they just GOSUB without RETURN.

From barry at python.org Mon Oct 4 02:33:11 2004
From: barry at python.org (Barry Warsaw)
Date: Mon Oct 4 02:33:18 2004
Subject: [Email-SIG] Handling large emails: DiskMessage and DiskFeedParser
In-Reply-To: <40B29E3D.5010004@netbox.biz>
References: <40B29E3D.5010004@netbox.biz>
Message-ID: <1096849991.21012.27.camel@geddy.wooz.org>

On Mon, 2004-05-24 at 21:15, Menno Smits wrote:

Yes, this was months ago. ;)

> FeedParser is great because it doesn't load the entire message into
> memory during parsing (yes, I realise there are other reasons for
> FeedParser existing too). However, once the message is parsed the
> attachment bodies are still loaded entirely into memory when Message
> instances are created and populated. This is a big problem for real
> world environments where large messages are possible.
> All available memory is consumed and the machine grinds to a halt.
> We see large (40MB+) emails all the time and problems start to occur
> when several of these are being processed simultaneously.
>
> To cope with this problem I've created 2 classes DiskMessage and
> DiskFeedParser (see http://oss.netboxblue.com).

I've prototyped a different approach, see if you like it. If you do,
there's still time to get it into Python 2.4.

We define a new protocol whereby if the message object returned by the
factory has the following three methods, we use those when capturing
the payload of non-MIME messages. If not, then we capture those lines
in an internal list object just like normal, calling set_payload() at
the end. The methods are:

    def storage_open(self)
    def storage_write(self, data)
    def storage_close(self)

So you could use something like this (from the unit test):

    class ExternalStorageMessage(Message):
        def storage_open(self):
            fd, self._path = tempfile.mkstemp()
            self._fp = os.fdopen(fd, 'w')

        def storage_write(self, data):
            self._fp.write(data)

        def storage_close(self):
            self._fp.close()
            fp = open(self._path)
            payload = fp.read()
            self.set_payload(payload)

Season to taste. Thoughts?

-Barry

From barry at python.org Tue Oct 5 05:11:51 2004
From: barry at python.org (Barry Warsaw)
Date: Tue Oct 5 05:11:58 2004
Subject: [Email-SIG] email 3.0a0
Message-ID: <1096945911.7679.35.camel@geddy.wooz.org>

Python 2.4 will come with version 3.0 of the email package. I've made a
standalone distutils package of email 3.0 for folks who don't want to
download the Python CVS. I plan on freezing the API when Python 2.4
beta 1 is released.
For documentation and download links please see the email-sig home
page:

http://www.python.org/sigs/email-sig

Changes in email 3.0 include:

* New FeedParser provides an incremental parsing API for applications
  that may read email messages from blocking sources. FeedParser is
  also more standards compliant than the old parser, and is
  "non-strict" so that it should never raise parse errors when parsing
  broken messages.
* Previously deprecated API features have been removed, while a few
  more deprecations have been added.
* Support for Python versions earlier than 2.3 has been removed.
* Lots and lots of fixes.

Feel free to join the email-sig mailing list for further discussion.

-Barry

From barry at python.org Sat Oct 9 22:36:13 2004
From: barry at python.org (Barry Warsaw)
Date: Sat Oct 9 22:36:16 2004
Subject: [Email-SIG] Handling large emails: DiskMessage and DiskFeedParser
In-Reply-To: <41610768.4090403@NetBoxBlue.com>
References: <40B29E3D.5010004@netbox.biz> <1096849991.21012.27.camel@geddy.wooz.org> <41610768.4090403@NetBoxBlue.com>
Message-ID: <1097354173.29448.30.camel@presto.wooz.org>

[Following up to the list, with permission -BAW]

> > We define a new protocol whereby if the message object returned by the
> > factory has the following three methods, we use those when capturing the
> > payload of non-MIME messages. If not, then we capture those lines in an
> > internal list object just like normal, calling set_payload() at the
> > end. The methods are:
> >
> > def storage_open(self)
> > def storage_write(self, data)
> > def storage_close(self)
>
> This generally seems good. I like the idea of having this protocol
> optionally invoked if the methods are there.
> The one thing missing for me is a "read" API. The whole point of this
> for me was so the message payloads never touch RAM, and in order to do
> this you need to be able to read the payload out in chunks.
>
> This of course requires support in the Generator class too. It would
> need to read message payloads out in chunks when rebuilding a message.

Dang, you're right of course. This would mean we'd have to add a
storage_read() method, and I think we'd only need to hook this into
Generator._handle_text(), but it might get tricky when the payload
needs to be body_encode()'d because the message has a charset.

I'm beginning to think that we don't have enough time to adequately
prototype and test this for Python 2.4. In the meantime, I've put a
patch on SF that you can start playing with. It supports the write
protocol but not the read protocol, so it's incomplete, but it might be
enough to take further if you're interested.

http://sourceforge.net/tracker/index.php?func=detail&aid=1043706&group_id=5470&atid=305470

> > So you could use something like this (from the unit test):
> >
> > class ExternalStorageMessage(Message):
> >     def storage_open(self):
> >         fd, self._path = tempfile.mkstemp()
> >         self._fp = os.fdopen(fd, 'w')
> >
> >     def storage_write(self, data):
> >         self._fp.write(data)
> >
> >     def storage_close(self):
> >         self._fp.close()
> >         fp = open(self._path)
> >         payload = fp.read()
> >         self.set_payload(payload)
>             ^^^^^^^^
> Looks good except why do you read the whole payload into memory
> here? Isn't it the goal not to do this? Am I missing something?

Just that this is only an example. In a real application you wouldn't
want to do it this way.

> An alternative solution I've been thinking of... what if we abstract
> message payloads to a "Payload" class? We could have MemoryPayload for
> in-memory storage (the default), TmpFilePayload for temporary disk
> storage etc etc.
> The read/write interface to the payload would always be the same and
> all Message methods would only ever access the payload via the API.
> Each Message instance would have exactly one MessagePayload instance
> internally. I realise this would be a big change and probably isn't
> suited for Python 2.4 but do you think this is useful?

It might be the right way to do it, much like headers can be strings or
instances of Header. I don't think we can really do either for Python
2.4, but we can continue to pursue this for email 3.1 / Python 2.5.

-Barry

From matt at mondoinfo.com Sun Oct 10 04:23:04 2004
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Sun Oct 10 04:33:02 2004
Subject: [Email-SIG] Handling large emails: DiskMessage and DiskFeedParser
In-Reply-To: <1097354173.29448.30.camel@presto.wooz.org>
References: <40B29E3D.5010004@netbox.biz> <1096849991.21012.27.camel@geddy.wooz.org> <41610768.4090403@NetBoxBlue.com> <1097354173.29448.30.camel@presto.wooz.org>
Message-ID: <1097356913.61.2691@mint-julep.mondoinfo.com>

> I'm beginning to think that we don't have enough time to adequately
> prototype and test this for Python 2.4.

I agree with the idea of waiting. When Barry first suggested the new
protocol, I thought I'd prototype something and see how it might work
in practice, partly because I was one of the first people to advocate
disk-based message storage as an option.

But I'm having some trouble thinking of a practical use case these
days. My original thought some time ago was someone running a
canonicalization or filtering proxy-like-thing on a smallish server.
But it seems that RAM sizes have gone up faster than email message
sizes, at least where I look.
I leave my Postfix servers with their default max message size of
around 10 MB and I don't hear many complaints. I can't think of a
server that I've spec'd at all recently with less than 1 GB of RAM. In
the case of an MUA, I'd say that if someone swamps their VM by opening
too many messages, it's their own problem.

I'm perfectly willing to believe that someone has a use case. But
without one that I can test my ideas against, I don't trust my own
intuition about what's a good idea and what's not.

Regards,
Matt

From andrewm at object-craft.com.au Thu Oct 14 15:45:26 2004
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu Oct 14 15:45:29 2004
Subject: [Email-SIG] email module - getting unicode string from header?
Message-ID: <20041014134526.B775A3C24C@coffee.object-craft.com.au>

Given an email.Message.Message() object, what is the canonical way to
obtain a list of address headers decoded to unicode? This is the effect
I'm trying to achieve, but it seems somewhat cumbersome:

>>> from email.Header import make_header, decode_header
>>> from email.Utils import formataddr, getaddresses
>>> [unicode(make_header(decode_header(formataddr(t))))
...  for t in getaddresses(m.get_all('to')+m.get_all('cc'))]

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From bkirsch at osafoundation.org Tue Oct 19 01:51:09 2004
From: bkirsch at osafoundation.org (Brian Kirsch)
Date: Tue Oct 19 01:52:30 2004
Subject: [Email-SIG] Email Address Validator
In-Reply-To: <1093550140.82.2408@mint-julep.mondoinfo.com>
References: <1093111617.36.2300@mint-julep.mondoinfo.com> <412E1A5C.8090203@interlink.com.au> <1093550140.82.2408@mint-julep.mondoinfo.com>
Message-ID: <956BF44A-2160-11D9-8B6E-000A95CA1ECC@osafoundation.org>

Hello,

I am looking for a robust email address validator written in Python.
Can anyone point me in the right direction? Most of the address
validators I have found on the web are buggy. They allow some bad
addresses and reject some valid ones.
Brian Kirsch - Email Framework Engineer
Open Source Applications Foundation
543 Howard St. 5th Floor
San Francisco, CA 94105
(415) 946-3056

From barry at python.org Tue Oct 19 04:12:21 2004
From: barry at python.org (Barry Warsaw)
Date: Tue Oct 19 04:12:28 2004
Subject: [Email-SIG] Email Address Validator
In-Reply-To: <956BF44A-2160-11D9-8B6E-000A95CA1ECC@osafoundation.org>
References: <1093111617.36.2300@mint-julep.mondoinfo.com> <412E1A5C.8090203@interlink.com.au> <1093550140.82.2408@mint-julep.mondoinfo.com> <956BF44A-2160-11D9-8B6E-000A95CA1ECC@osafoundation.org>
Message-ID: <1098151941.8233.9.camel@presto.wooz.org>

On Mon, 2004-10-18 at 19:51, Brian Kirsch wrote:
> Hello,
> I am looking for a robust email address validator written in Python.
> Can anyone point me in the right direction. Most
> of the address validators I have found on the web are buggy. They allow
> some bad addresses and reject some valid ones.

Mailman has some email validation code in it, but it's not perfect. I
think this would be a great thing to add to the library. Care to post
some links to what you've found? Maybe we can combine them into a
canonical validator module?

-Barry

From bkirsch at osafoundation.org Tue Oct 19 19:52:45 2004
From: bkirsch at osafoundation.org (Brian Kirsch)
Date: Tue Oct 19 19:54:15 2004
Subject: [Email-SIG] Email Address Validator
In-Reply-To: <1098151941.8233.9.camel@presto.wooz.org>
References: <1093111617.36.2300@mint-julep.mondoinfo.com> <412E1A5C.8090203@interlink.com.au> <1093550140.82.2408@mint-julep.mondoinfo.com> <956BF44A-2160-11D9-8B6E-000A95CA1ECC@osafoundation.org> <1098151941.8233.9.camel@presto.wooz.org>
Message-ID:

Hi Barry,

I think it would be great to get a real-world-tested Email Address
Validator into the library. I have attached a Python program that
condenses all the validator examples I found on the web. This is a good
starting place, although none of the examples returns accurate results
100% of the time. So there is work to be done :)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: emailAddressValidationTest.py
Type: application/octet-stream
Size: 3631 bytes
Desc: not available
Url : http://mail.python.org/pipermail/email-sig/attachments/20041019/4357acb4/emailAddressValidationTest.obj
-------------- next part --------------

Brian Kirsch - Email Framework Engineer
Open Source Applications Foundation
543 Howard St. 5th Floor
San Francisco, CA 94105
(415) 946-3056

On Oct 18, 2004, at 7:12 PM, Barry Warsaw wrote:

> On Mon, 2004-10-18 at 19:51, Brian Kirsch wrote:
>> Hello,
>> I am looking for a robust email address validator written in Python.
>> Can anyone point me in the right direction. Most
>> of the address validators I have found on the web are buggy. They
>> allow some bad addresses and reject some valid ones.
>
> Mailman has some email validation code in it, but it's not perfect. I
> think this would be a great thing to add to the library.
> Care to post some links to what you've found? Maybe we can combine
> them into a canonical validator module?
>
> -Barry
>

From indrek at inversion.ee Tue Oct 19 21:28:26 2004
From: indrek at inversion.ee (Indrek Järve)
Date: Tue Oct 19 21:27:07 2004
Subject: [Email-SIG] Modifying messages loaded with message_from_string() with 3.0
Message-ID: <1098214106.5818.14.camel@hercules.dustbite.org>

Hi,

While testing our webmail client code with email 3.0, I found that
modifying message objects loaded with message_from_string() (attach()ing
new files) breaks boundaries: the added file becomes inaccessible after
the next reload with message_from_string().

I've attached a testcase and the output I got on Suse 9.1, Python 2.3.3,
email 3.0 from the Python 2.4b1 package. The second run in
testcase1.output is with the default Python 2.3.3 email library.

Is this something that shouldn't be done this way, or has something
just gone a bit broken?

Best regards,
Indrek

-- 
Indrek Järve
Inversion Software OÜ
Cell: +372 58058966
Fax: +372 623 8818

-------------- next part --------------
A non-text attachment was scrubbed...
Name: testcase1.py
Type: application/x-python
Size: 1066 bytes
Desc: not available
Url : http://mail.python.org/pipermail/email-sig/attachments/20041019/9b4abb10/testcase1.bin
-------------- next part --------------
hercules:/home/incx/lightdev/jykala # python testcase1.py
X after attach [, ]
Y after attach [, ]
X after rload [, ]
Y after rload []
hercules:/home/incx/lightdev/jykala # mv email/ email.2
hercules:/home/incx/lightdev/jykala # python testcase1.py
X after attach [, ]
Y after attach [, ]
X after rload [, ]
Y after rload [, ]

From matt at mondoinfo.com Tue Oct 19 21:56:01 2004
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Tue Oct 19 21:56:39 2004
Subject: [Email-SIG] Email Address Validator
In-Reply-To:
References: <1093111617.36.2300@mint-julep.mondoinfo.com> <412E1A5C.8090203@interlink.com.au> <1093550140.82.2408@mint-julep.mondoinfo.com> <956BF44A-2160-11D9-8B6E-000A95CA1ECC@osafoundation.org> <1098151941.8233.9.camel@presto.wooz.org>
Message-ID: <1098214244.83.7736@mint-julep.mondoinfo.com>

Dear Brian,

> I think it would be great to get a real-world-tested Email Address
> Validator into the library. I have attached a python program that
> condenses all the Validator examples I found on the web. This is a
> good starting place although none of the examples return accurate
> results 100% of the time. So there is work to be done :)

A parser that fully implements RFC 2822's rules is pretty non-trivial
to write, I think. When I first read RFC 822, I remember wondering if
the authors were frustrated parser designers.
Even considering only the actual address ("addr-spec" in RFC 2822-ese),
we have these rules:

addr-spec       = local-part "@" domain
local-part      = dot-atom / quoted-string / obs-local-part
domain          = dot-atom / domain-literal / obs-domain
domain-literal  = [CFWS] "[" *([FWS] dcontent) [FWS] "]" [CFWS]
dcontent        = dtext / quoted-pair
dtext           = NO-WS-CTL /  ; Non white space controls
                  %d33-90 /    ; The rest of the US-ASCII
                  %d94-126     ;  characters not including "[",
                               ;  "]", or "\"

I think that a function that will check an address for correct syntax
would be somewhat convenient to have, but I think that one that's
right only 99% of the time would probably be worse than none at all.

In particular, I'd be rather surprised if someone could implement all
those rules with just a regular expression. Two of the four examples
you sent along don't like the (I'm pretty sure) legal address
"fred&barney"@example.com. And none of them like matt@[127.0.0.1]
which isn't used much but I'm not aware of its having been declared
illegal recently.

In your list of bad addresses, I'm pretty sure that "brian" and
"brian@localhost" are both legal.

I suspect that the lack of email address validators out there stems
both from the difficulty of writing one and the fact that they're not
all that critical. I've always thought that if the MTA can deliver a
message, that's fine. And if it can't, the user will get it back.
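A minimal sketch of a syntax-only check built from a subset of the
addr-spec rules quoted above. It is an illustration, not a full RFC 2822
parser: comments and folding whitespace (CFWS/FWS) and the obs-* forms
are left out, NO-WS-CTL characters are not accepted, and the names
ATEXT, DOT_ATOM, QUOTED_STRING, DOMAIN_LITERAL and looks_like_addr_spec
are invented for this example.

```python
import re

# atext: the characters allowed in an unquoted atom.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
# dot-atom: atoms separated by single dots, with no leading or trailing dot.
DOT_ATOM = ATEXT + r"+(?:\." + ATEXT + r"+)*"
# quoted-string: printable ASCII except '"' and '\', blanks, or a
# backslash followed by any ASCII character (a quoted-pair).
QUOTED_STRING = r'"(?:[\x21\x23-\x5b\x5d-\x7e \t]|\\[\x00-\x7f])*"'
# domain-literal: dtext (%d33-90 / %d94-126) or a quoted-pair, in brackets.
DOMAIN_LITERAL = r"\[(?:[\x21-\x5a\x5e-\x7e]|\\[\x00-\x7f])*\]"

ADDR_SPEC = re.compile(
    "(?:" + DOT_ATOM + "|" + QUOTED_STRING + ")"
    + "@"
    + "(?:" + DOT_ATOM + "|" + DOMAIN_LITERAL + ")"
)


def looks_like_addr_spec(address):
    """Return True if the whole string matches the simplified grammar."""
    return ADDR_SPEC.fullmatch(address) is not None


# Addresses mentioned in this thread that the simplified grammar accepts.
accepted = [
    a
    for a in ('"fred&barney"@example.com', "matt@[127.0.0.1]", "brian@localhost")
    if looks_like_addr_spec(a)
]
```

A check like this only says that a string has the shape of an address.
It says nothing about whether the domain exists or whether mail can be
delivered there.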
Regards,
Matt

From phd at phd.pp.ru Tue Oct 19 22:28:40 2004
From: phd at phd.pp.ru (Oleg Broytmann)
Date: Tue Oct 19 22:28:42 2004
Subject: [Email-SIG] Email Address Validator
In-Reply-To: <1098214244.83.7736@mint-julep.mondoinfo.com>
References: <1093111617.36.2300@mint-julep.mondoinfo.com> <412E1A5C.8090203@interlink.com.au> <1093550140.82.2408@mint-julep.mondoinfo.com> <956BF44A-2160-11D9-8B6E-000A95CA1ECC@osafoundation.org> <1098151941.8233.9.camel@presto.wooz.org> <1098214244.83.7736@mint-julep.mondoinfo.com>
Message-ID: <20041019202840.GA25658@phd.pp.ru>

On Tue, Oct 19, 2004 at 02:56:01PM -0500, Matthew Dixon Cowles wrote:
> In particular, I'd be rather surprised if someone could implement all
> those rules with just a regular expression. Two of the four examples
> you sent along don't like the (I'm pretty sure) legal address
> "fred&barney"@example.com. And none of them like matt@[127.0.0.1]
> which isn't used much but I'm not aware of its having been declared
> illegal recently.

Dare to test these? The first is simpler,

http://www.breakingpar.com/bkp/home.nsf/Doc?OpenNavigator&U=87256B280015193F87256C40004CC8C6

and the second is much more complex, but probably much closer...

http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

Oleg.
-- 
Oleg Broytmann http://phd.pp.ru/ phd@phd.pp.ru
Programmers don't die, they just GOSUB without RETURN.
From barry at python.org Tue Oct 19 22:31:56 2004
From: barry at python.org (Barry Warsaw)
Date: Tue Oct 19 22:32:06 2004
Subject: [Email-SIG] Email Address Validator
In-Reply-To: <1098214244.83.7736@mint-julep.mondoinfo.com>
References: <1093111617.36.2300@mint-julep.mondoinfo.com> <412E1A5C.8090203@interlink.com.au> <1093550140.82.2408@mint-julep.mondoinfo.com> <956BF44A-2160-11D9-8B6E-000A95CA1ECC@osafoundation.org> <1098151941.8233.9.camel@presto.wooz.org> <1098214244.83.7736@mint-julep.mondoinfo.com>
Message-ID: <1098217916.22522.44.camel@geddy.wooz.org>

On Tue, 2004-10-19 at 15:56, Matthew Dixon Cowles wrote:
> In particular, I'd be rather surprised if someone could implement all
> those rules with just a regular expression.

I wouldn't necessarily expect it to be doable with regexps. One of the
weirdest approaches I've ever seen is Emacs's mail-extr.el which IIRC
used a special major-mode and syntax tables to parse an address.

I would sort of expect that anything we include in the stdlib would
support both the parsing and the validation use cases with as much
common code as possible. This probably can't be done for Python
2.4/email 3.0 but let's put it on the list for email 3.1 (which may
have a different release schedule than Python 2.5).

-Barry

From barry at python.org Tue Oct 19 22:37:27 2004
From: barry at python.org (Barry Warsaw)
Date: Tue Oct 19 22:37:32 2004
Subject: [Email-SIG] Email Address Validator
In-Reply-To: <20041019202840.GA25658@phd.pp.ru>
References: <1093111617.36.2300@mint-julep.mondoinfo.com> <412E1A5C.8090203@interlink.com.au> <1093550140.82.2408@mint-julep.mondoinfo.com> <956BF44A-2160-11D9-8B6E-000A95CA1ECC@osafoundation.org> <1098151941.8233.9.camel@presto.wooz.org> <1098214244.83.7736@mint-julep.mondoinfo.com> <20041019202840.GA25658@phd.pp.ru>
Message-ID: <1098218247.22516.50.camel@geddy.wooz.org>

On Tue, 2004-10-19 at 16:28, Oleg Broytmann wrote:
> and the second is much more complex, but probably much closer...
> http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

I like my brain way too much to ever force, trick, beg, plead, cajole,
tempt, offer, or tease it into maintaining something like that.

-Barry

From bkirsch at osafoundation.org Tue Oct 19 23:13:56 2004
From: bkirsch at osafoundation.org (Brian Kirsch)
Date: Tue Oct 19 23:15:19 2004
Subject: [Email-SIG] Email Address Validator
In-Reply-To: <1098218247.22516.50.camel@geddy.wooz.org>
References: <1093111617.36.2300@mint-julep.mondoinfo.com> <412E1A5C.8090203@interlink.com.au> <1093550140.82.2408@mint-julep.mondoinfo.com> <956BF44A-2160-11D9-8B6E-000A95CA1ECC@osafoundation.org> <1098151941.8233.9.camel@presto.wooz.org> <1098214244.83.7736@mint-julep.mondoinfo.com> <20041019202840.GA25658@phd.pp.ru> <1098218247.22516.50.camel@geddy.wooz.org>
Message-ID:

Yeah, that expression looks like a nightmare to work with.

I have attached an updated Python email test that includes the regular
expression at
http://www.breakingpar.com/bkp/home.nsf/Doc?OpenNavigator&U=87256B280015193F87256C40004CC8C6
as well as more valid and invalid addresses.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: emailAddressValidator.py
Type: application/octet-stream
Size: 4191 bytes
Desc: not available
Url : http://mail.python.org/pipermail/email-sig/attachments/20041019/ccb7d8c4/emailAddressValidator-0001.obj
-------------- next part --------------

Enjoy,

Brian Kirsch - Email Framework Engineer
Open Source Applications Foundation
543 Howard St. 5th Floor
San Francisco, CA 94105
(415) 946-3056

On Oct 19, 2004, at 1:37 PM, Barry Warsaw wrote:

> On Tue, 2004-10-19 at 16:28, Oleg Broytmann wrote:
>
>> and the second is much more complex, but probably much closer...
>> http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html
>
> I like my brain way too much to ever force, trick, beg, plead, cajole,
> tempt, offer, or tease it into maintaining something like that.
>
> -Barry

From stuart at stuartbishop.net Wed Oct 20 11:30:13 2004
From: stuart at stuartbishop.net (Stuart Bishop)
Date: Wed Oct 20 11:31:19 2004
Subject: [Email-SIG] Email Address Validator
In-Reply-To: <1098214244.83.7736@mint-julep.mondoinfo.com>
References: <1093111617.36.2300@mint-julep.mondoinfo.com> <412E1A5C.8090203@interlink.com.au> <1093550140.82.2408@mint-julep.mondoinfo.com> <956BF44A-2160-11D9-8B6E-000A95CA1ECC@osafoundation.org> <1098151941.8233.9.camel@presto.wooz.org> <1098214244.83.7736@mint-julep.mondoinfo.com>
Message-ID: <41763025.3040307@stuartbishop.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Matthew Dixon Cowles wrote:
| I think that a function that will check an address for correct syntax
| would be somewhat convenient to have, but I think that one that's
| right only 99% of the time would probably be worse than none at all.
|
| In particular, I'd be rather surprised if someone could implement all
| those rules with just a regular expression. Two of the four examples
| you sent along don't like the (I'm pretty sure) legal address
| "fred&barney"@example.com. And none of them like matt@[127.0.0.1]
| which isn't used much but I'm not aware of its having been declared
| illegal recently.
|
| In your list of bad addresses, I'm pretty sure that "brian" and
| "brian@localhost" are both legal.

I'm not sure I entirely agree. Whilst many odd strings of characters
might be valid email addresses, I wouldn't want to let them get into my
systems, because if you see them in the real world they are certainly
erroneous, malicious or test data. When I want an email address, I want
something much more limited (foo@bar.com), in some cases allowing
brackets to encode the name in there as well.
It's like bang notation - plenty of systems will refuse to deal with it now because of its use in relaying spam, but it was still technically legal last time I looked. It would be possible to validate the domain using DNS, or at least confirm the TLD is valid, if the tool is for Internet addresses rather than something only meaningful to the local network (foo@mail.intranet, bar@localhost). | I suspect that the lack of email address validators out there stems | both from the difficulty of writing one and the fact that they're not | all that critical. I've always thought that if the MTA can deliver a | message, that's fine. And if it can't, the user will get it back. Mmm... a few checks to catch typos or obviously illegal addresses (to catch people attempting to fit their postal address into an email address form field for example), but I've never worried about proper domain validation and all the icky MX record checking that is involved (which seems rather pointless anyway since the address might suddenly become valid again in 5 minutes' time when some router that was just rebooted comes back online).
- -- Stuart Bishop http://www.stuartbishop.net/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) iD8DBQFBdjAkAfqZj7rGN0oRAv/1AKCAyRBqhV4d36pQ0byQqVMG0jPVkQCbBDt3 631qxjZoZJrS1DHCztxAg8U= =CRhu -----END PGP SIGNATURE----- From matt at mondoinfo.com Wed Oct 20 22:19:25 2004 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Wed Oct 20 22:26:02 2004 Subject: [Email-SIG] Email Address Validator In-Reply-To: <41763025.3040307@stuartbishop.net> References: <1093111617.36.2300@mint-julep.mondoinfo.com> <412E1A5C.8090203@interlink.com.au> <1093550140.82.2408@mint-julep.mondoinfo.com> <956BF44A-2160-11D9-8B6E-000A95CA1ECC@osafoundation.org> <1098151941.8233.9.camel@presto.wooz.org> <1098214244.83.7736@mint-julep.mondoinfo.com> <41763025.3040307@stuartbishop.net> Message-ID: <1098302837.05.8555@mint-julep.mondoinfo.com> Dear Stuart, > I'm not sure I entirely agree. Whilst many odd strings of > characters might be valid email addresses, I wouldn't want to let > them get into my systems as if you see them in the real world they > are certainly erroneous, malicious or test data. When I want an > email address, I want something much more limited (foo@bar.com), in > some cases allowing brackets to encode the name in there as well. > Its like bang notation - plenty of systems will refuse to deal with > it now because of its use in relaying spam, but it was still > technically legal last time I looked. It's perfectly reasonable for you to limit the email addresses that you want your software to accept. But the original poster was posting from an organization that's writing an email client and subsequent discussion has been about something that might end up in Python's standard library. In those cases, I suspect that it would be bad to disallow any legal addresses. 
> It would be possible to validate the domain using DNS, or at least > confirm the TLD is valid, if the tool is for Internet addresses > rather than something only meaningful to the local network > (foo@mail.intranet, bar@localhost). The problem with doing that is that DNS would have to be available and functioning correctly before an address would be accepted. If my laptop and I were in a coffee-shop or a conference room without 802.11x service, I'd be annoyed if I couldn't enter an email address into my addressbook. > Mmm... a few checks to catch typos or obviously illegal addresses > (to catch people attempting to fit their postal address into an > email address form field for example), Something like that could well be useful, but I generally think that messy heuristics (which I routinely write) more properly belong in examples and user code than in the standard library. > but I've never worried about proper domain validation and all the > 'icky MX record checking that is involved (which seems rather > pointless anyway since the address might suddenly become valid > again in 5 minutes time when some router that was just rebooted > comes back online). I agree. I'm a sysadmin when I have another hat on and I know too well that connectivity can fail at just the wrong moment. Regards, Matt From barry at python.org Wed Oct 20 23:24:31 2004 From: barry at python.org (Barry Warsaw) Date: Wed Oct 20 23:24:36 2004 Subject: [Email-SIG] Email Address Validator In-Reply-To: <41763025.3040307@stuartbishop.net> References: <1093111617.36.2300@mint-julep.mondoinfo.com> <412E1A5C.8090203@interlink.com.au> <1093550140.82.2408@mint-julep.mondoinfo.com> <956BF44A-2160-11D9-8B6E-000A95CA1ECC@osafoundation.org> <1098151941.8233.9.camel@presto.wooz.org> <1098214244.83.7736@mint-julep.mondoinfo.com> <41763025.3040307@stuartbishop.net> Message-ID: <1098307471.17351.78.camel@geddy.wooz.org> On Wed, 2004-10-20 at 05:30, Stuart Bishop wrote: > I'm not sure I entirely agree. 
Whilst many odd strings of characters > might be valid email addresses, I wouldn't want to let them get into my > systems as if you see them in the real world they are certainly > erroneous, malicious or test data. When I want an email address, I want > something much more limited (foo@bar.com), in some cases allowing > brackets to encode the name in there as well. Its like bang notation - > plenty of systems will refuse to deal with it now because of its use in > relaying spam, but it was still technically legal last time I looked. I actually don't think bang-addresses are legal in RFC 2822. > It would be possible to validate the domain using DNS, or at least > confirm the TLD is valid, if the tool is for Internet addresses rather > than something only meaningful to the local network (foo@mail.intranet, > bar@localhost). This is not something our email package parser should do. What our parser should do is validate the syntax of addresses according to RFC 2822, and probably also be able to split addr-specs into local-parts and domains (in 2822-speak). Higher level tools can then do stricter validation of email addresses based on the requirements of the application. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 307 bytes Desc: This is a digitally signed message part Url : http://mail.python.org/pipermail/email-sig/attachments/20041020/fb5d0afa/attachment.pgp From sbjiii at comcast.net Fri Oct 22 23:33:03 2004 From: sbjiii at comcast.net (Broadus Jones) Date: Fri Oct 22 23:29:19 2004 Subject: [Email-SIG] Demo code for mbox message tests In-Reply-To: Message-ID: <20041022212919.2AB831E4002@bag.python.org> I tried a slightly modified copy of your code (both are below) with varying results. Using a mailbox from Imail, tried this under Python 2.3.4 in Cygwin and Win32. The results are below. Does anyone have any idea why it worked this way? 
Broadus

----- cygwin results start -----
bjones@bjones-2k01 ~/mail_testing
$ python check_mbox.py
message count 483
message count 481
message count 482
message count 482
message count 482
message count 482
message count 482
message count 482
message count 482

bjones@bjones-2k01 ~/mail_testing
$ md5sum mbox-?
539798a7459815366f826312697d99d0 *mbox-0
d0bdb56dd8b6377959b49765caf04205 *mbox-1
fe339e2ff213736090a063c1f61b67d8 *mbox-2
fe339e2ff213736090a063c1f61b67d8 *mbox-3
fe339e2ff213736090a063c1f61b67d8 *mbox-4
fe339e2ff213736090a063c1f61b67d8 *mbox-5
fe339e2ff213736090a063c1f61b67d8 *mbox-6
fe339e2ff213736090a063c1f61b67d8 *mbox-7
fe339e2ff213736090a063c1f61b67d8 *mbox-8
fe339e2ff213736090a063c1f61b67d8 *mbox-9

bjones@bjones-2k01 ~/mail_testing
$ ls -al mbox-?
-rwxr-xr-x 1 bjones Enterpri 7248444 Oct 22 16:11 mbox-0
-rw-r--r-- 1 bjones Enterpri 7235445 Oct 22 16:19 mbox-1
-rw-r--r-- 1 bjones Enterpri 7235457 Oct 22 16:19 mbox-2
-rw-r--r-- 1 bjones Enterpri 7235457 Oct 22 16:19 mbox-3
-rw-r--r-- 1 bjones Enterpri 7235457 Oct 22 16:20 mbox-4
-rw-r--r-- 1 bjones Enterpri 7235457 Oct 22 16:20 mbox-5
-rw-r--r-- 1 bjones Enterpri 7235457 Oct 22 16:20 mbox-6
-rw-r--r-- 1 bjones Enterpri 7235457 Oct 22 16:21 mbox-7
-rw-r--r-- 1 bjones Enterpri 7235457 Oct 22 16:21 mbox-8
-rw-r--r-- 1 bjones Enterpri 7235457 Oct 22 16:21 mbox-9

bjones@bjones-2k01 ~/mail_testing
$
----- cygwin results end -----

----- win32 results start -----
C:\Python23>python check_mbox.py
message count 483
message count 114
message count 75
message count 44
message count 25
message count 17
message count 12
message count 6
message count 2

C:\Python23>md5sum mbox-?
539798a7459815366f826312697d99d0 *mbox-0
e3c74093172c66328e585f1ff6594325 *mbox-1
569c47edd5781c5684060d558372162a *mbox-2
3fcab1ee8a155749bb998ee907490eaf *mbox-3
e90eccfc8c82a31aa4b553910696f160 *mbox-4
8b1754eb3ad2126bccdbd502b2a5baf1 *mbox-5
fbbb48c336644a8611aadf9cfcab4099 *mbox-6
775a646a0730211db669eeceaef8a69e *mbox-7
d33194f510207ec44541be50ec33dde2 *mbox-8
c4067ab7c6a5f2df7ac2241fc932d0b0 *mbox-9

C:\Python23>dir mbox-?
 Volume in drive C is BJones_C
 Volume Serial Number is 3CB6-8D83

 Directory of C:\Python23

10/22/2004 04:22p 7,248,444 mbox-0
10/22/2004 04:24p 7,337,591 mbox-1
10/22/2004 04:24p 7,444,674 mbox-2
10/22/2004 04:24p 7,553,703 mbox-3
10/22/2004 04:25p 7,673,618 mbox-4
10/22/2004 04:25p 7,782,220 mbox-5
10/22/2004 04:25p 7,893,373 mbox-6
10/22/2004 04:25p 8,007,114 mbox-7
10/22/2004 04:25p 8,123,874 mbox-8
10/22/2004 04:26p 8,243,674 mbox-9
10 File(s) 77,308,285 bytes
0 Dir(s) 1,722,507,264 bytes free

C:\Python23>
----- win32 results end -----

Code used in check_mbox.py
----- code start -----
#!/usr/bin/env python
#Given the mbox-format file "mbox-in", it writes "mbox-out" as normalized data.
#It then reads this file and writes "mbox-out2".
#mbox-out and mbox-out2 should be identical, but aren't.
import email
# import email.Iterators
import mailbox
# import datetime
from sys import exc_info

#Error-catching replacement of email.message_from_file. See mailbox docs.
def msgfactory(fp):
    try:
        return email.message_from_file(fp)
    except email.Errors.MessageParseError:
        s="From MailerDaemon %s\n"%email.Utils.formatdate(localtime=True)
        s+="From: MailerDaemon\n"
        s+="Subject: Error: %s\n\n"%exc_info()[1]
        s+='Sorry, couldn\'t parse message due to error:\n"%s"\n\n'%exc_info()[1]
        return email.message_from_string(s)

def readmbox(mboxin,mboxout):
    fp=open(mboxin)
    f=open(mboxout,"w")
    mbox=mailbox.UnixMailbox(fp,msgfactory)
    msg_count=0
    for msg in mbox:
        f.write(str(msg))
        msg_count += 1
    fp.close()
    f.close()
    print "message count ", msg_count

for i in range(1,10):
    readmbox("mbox-" + str(i - 1),"mbox-" + str(i))
----- code end -----

-----Original Message----- From: Python Email sig [mailto:email-sig@shopip.com] Sent: Monday, June 14, 2004 5:55 AM To: email-sig@python.org Subject: [Email-SIG] Demo code for mbox message tests One would expect that reading an mbox file of messages and writing it out would produce an identical file, at least if it was previously written by the same Python code. This is important in my case since I generate an MD5 hash of each message. In *almost* every case the file does not change, however I have seen a few cases where spurious spaces get appended to the end of header lines. Use this code to verify that these Python mail functions are working correctly. Copy your favorite mbox file to "mbox-in", then run this code.

#!/usr/bin/env python
#Given the mbox-format file "mbox-in", it writes "mbox-out" as normalized data.
#It then reads this file and writes "mbox-out2".
#mbox-out and mbox-out2 should be identical, but aren't.
import email
import mailbox
from sys import exc_info

#Error-catching replacement of email.message_from_file. See mailbox docs.
def msgfactory(fp):
    try:
        return email.message_from_file(fp)
    except email.Errors.MessageParseError:
        s="From MailerDaemon %s\n"%email.Utils.formatdate(localtime=True)
        s+="From: MailerDaemon\n"
        s+="Subject: Error: %s\n\n"%exc_info()[1]
        s+='Sorry, couldn\'t parse message due to error:\n"%s"\n\n'%exc_info()[1]
        return email.message_from_string(s)

def readmbox(mboxin,mboxout):
    fp=open(mboxin)
    f=open(mboxout,"w")
    mbox=mailbox.UnixMailbox(fp,msgfactory)
    for msg in mbox:
        f.write(str(msg))
    fp.close()
    f.close()

readmbox("mbox-in","mbox-out")
readmbox("mbox-out","mbox-out2")

From stuart at stuartbishop.net Sun Oct 24 06:47:05 2004 From: stuart at stuartbishop.net (Stuart Bishop) Date: Sun Oct 24 06:47:16 2004 Subject: [Email-SIG] Email Address Validator In-Reply-To: <1098302837.05.8555@mint-julep.mondoinfo.com> References: <1093111617.36.2300@mint-julep.mondoinfo.com> <412E1A5C.8090203@interlink.com.au> <1093550140.82.2408@mint-julep.mondoinfo.com> <956BF44A-2160-11D9-8B6E-000A95CA1ECC@osafoundation.org> <1098151941.8233.9.camel@presto.wooz.org> <1098214244.83.7736@mint-julep.mondoinfo.com> <41763025.3040307@stuartbishop.net> <1098302837.05.8555@mint-julep.mondoinfo.com> Message-ID: <417B33C9.7030808@stuartbishop.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Matthew Dixon Cowles wrote: | It's perfectly reasonable for you to limit the email addresses that | you want your software to accept. But the original poster was posting | from an organization that's writing an email client and subsequent | discussion has been about something that might end up in Python's | standard library. In those cases, I suspect that it would be bad to | disallow any legal addresses.
I know if I was writing an email client, I would consider it a desirable feature to handle insane email addresses as any other spam (and if the authors decide that it is desirable, I suspect they wouldn't want strict RFC checking anyway since non-RFC-compliant email addresses you might want to handle, such as iso-8859-1 encoded usernames, are much more common than root@[127.0.0.1] etc. It's a practicality versus purity debate.) - -- Stuart Bishop http://www.stuartbishop.net/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) iD8DBQFBezPJAfqZj7rGN0oRAjASAKCc0HfkvJGGvH3cChNbaHqFL1NqHACfZRHL Q7OAvo05g9WhbfX3WbpXlPk= =I0SR -----END PGP SIGNATURE----- From stuart at stuartbishop.net Sun Oct 24 06:54:46 2004 From: stuart at stuartbishop.net (Stuart Bishop) Date: Sun Oct 24 06:54:59 2004 Subject: [Email-SIG] Email Address Validator In-Reply-To: <1098307471.17351.78.camel@geddy.wooz.org> References: <1093111617.36.2300@mint-julep.mondoinfo.com> <412E1A5C.8090203@interlink.com.au> <1093550140.82.2408@mint-julep.mondoinfo.com> <956BF44A-2160-11D9-8B6E-000A95CA1ECC@osafoundation.org> <1098151941.8233.9.camel@presto.wooz.org> <1098214244.83.7736@mint-julep.mondoinfo.com> <41763025.3040307@stuartbishop.net> <1098307471.17351.78.camel@geddy.wooz.org> Message-ID: <417B3596.7030705@stuartbishop.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Barry Warsaw wrote: | This is not something our email package parser should do. What our | parser should do is validate the syntax of addresses according to RFC | 2822, and probably also be able to split addr-specs into local-parts and | domains (in 2822-speak). Higher level tools can then do stricter | validation of email addresses based on the requirements of the | application.
This also ties into support for Unicode email addresses, which currently has a patch with a single (negative) review on SF and still has had 0 comments in this forum despite being brought up twice before: http://www.stuartbishop.net/Software/EmailAddress/ I suspect this already handles most of this, except for better splitting and validation of the domain part. If nothing else, any address validation work should use it as an example of how to translate between Unicode and ASCII email address representations. - -- Stuart Bishop http://www.stuartbishop.net/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) iD8DBQFBezWWAfqZj7rGN0oRAlJoAKCGh+6u8/uL7wsDr7GQODZgPQZC6QCcCIRz xZf1RB8nNIyqilNqTW76y7U= =K4Dk -----END PGP SIGNATURE-----
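
As a footnote to the thread, here is a minimal sketch of the two operations discussed above: splitting an addr-spec into local-part and domain, and translating the domain between its Unicode and ASCII forms. It is not the EmailAddress module linked above and not part of the email package. The function names are hypothetical, it is written for Python 3 rather than the 2.3/2.4 releases under discussion, it relies on the standard library's built-in "idna" codec, and it does no RFC 2822 syntax validation. It also assumes that only the domain has an ASCII-compatible encoding, so the local part is passed through untouched.

```python
def split_addr_spec(addr_spec):
    """Split 'local-part@domain' at the LAST '@'.

    Splitting at the last '@' keeps a quoted local part such as
    "a@b"@example.com in one piece. No syntax checking is done.
    """
    local_part, at, domain = addr_spec.rpartition("@")
    if not at or not local_part or not domain:
        raise ValueError("not of the form local-part@domain: %r" % (addr_spec,))
    return local_part, domain


def to_ascii(addr_spec):
    """Return the address with its domain in ASCII (IDNA) form."""
    local_part, domain = split_addr_spec(addr_spec)
    # Domain literals such as [127.0.0.1] are not host names; leave them alone.
    if not domain.startswith("["):
        domain = domain.encode("idna").decode("ascii")
    return local_part + "@" + domain


def to_unicode(addr_spec):
    """Return the address with its domain decoded from ASCII (IDNA) form.

    Expects an all-ASCII domain; a domain that is already Unicode
    raises UnicodeEncodeError.
    """
    local_part, domain = split_addr_spec(addr_spec)
    if not domain.startswith("["):
        domain = domain.encode("ascii").decode("idna")
    return local_part + "@" + domain


if __name__ == "__main__":
    print(split_addr_spec('"fred&barney"@example.com'))
    print(to_ascii("stuart@b\u00fccher.example"))
    print(to_unicode("stuart@xn--bcher-kva.example"))
```

An application wanting the stricter checks argued for earlier in the thread (rejecting domain literals, bang paths, or addresses without a dot in the domain) could layer them on top of split_addr_spec rather than inside it, which matches the split between parser and higher-level tools suggested above.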