From janssen at parc.com  Thu Jan  5 18:55:41 2012
From: janssen at parc.com (Bill Janssen)
Date: Thu, 5 Jan 2012 09:55:41 PST
Subject: [Email-SIG] API for email threading library?
Message-ID: <722.1325786141@parc.com>

Folks, I'm working on an implementation of RFC 5256 email threading,
designed so that it could fit as a submodule in the "email" package, if
such a thing was ever seen to be useful.

I'd like to ask "the wisdom of the crowd" what they think an appropriate
interface to such a thing would be?  The basic operation is that you
create a collection (type C) of email threads (type T) by passing a set
of messages (type M) to the constructor.

* Should M be required to be "email.message.Message", or perhaps some
  less restrictive type, say "ThreadableMessageAPI"?  All that's
  strictly required is the ability to retrieve the Message-ID, Subject,
  Date, References, and In-Reply-To fields.

* What operations should be possible on C?  Some that come to mind:

  * retrieve_thread (M or message-id) => T
  * add_message (M) => T
  * add_messages (set of M) => None
  * remove_message (M or message-id) => T (or None) ?

* What's the interface for T?  It's a tree with possible dummy nodes, so
  a tuple of messages plus nested tuples would do it.  What should the
  nodes in the tree be?  Normalized (see RFC 5256) Message-IDs?
  email.message.Message instances?

* For large sets of threads (millions of messages) a persistence
  mechanism would be useful.  Should there be a standard interface to
  such a mechanism, perhaps as class methods on C?  If so, what should
  it look like?  Should the implementation contain a default persistent
  subclass of C, based on sqlite3?  What side-effects would persistence
  requirements have on the other design considerations?  For instance,
  would you have to save the entire text of a message for each node?
  Just the headers?  Just some of the headers?  Just the Message-ID?

Have at it!  Advise away!

Bill From barry at python.org Fri Jan 6 02:21:08 2012 From: barry at python.org (Barry Warsaw) Date: Thu, 5 Jan 2012 20:21:08 -0500 Subject: [Email-SIG] API for email threading library? In-Reply-To: <722.1325786141@parc.com> References: <722.1325786141@parc.com> Message-ID: <20120105202108.15db9629@resist.wooz.org> On Jan 05, 2012, at 09:55 AM, Bill Janssen wrote: >Folks, I'm working on an implementation of RFC 5256 email threading, >designed so that it could fit as a submodule in the "email" package, if >such a think was ever seen to be useful. I really like the idea of threading support being included in the email package. (I admit that I don't have time right now to read the RFC.) My general thoughts are that the actual messages needn't be included in the thread collection, but perhaps just Message-IDs. That would allow an application to store the actual message objects anywhere they want, and would reduce space requirements of the thread collection. >I'd like to ask "the wisdom of the crowd" what they think an appropriate >interface to such a thing would be? The basic operation is that you >create a collection (type C) of email threads (type T) by passing a set >of messages (type M) to the constructor. > >* Should M be required to be "email.message.Message", or perhaps some > less restrictive type, say "ThreadableMessageAPI"? All that's > strictly required is the ability to retrieve the Message-ID, Subject, > Date, References, and In-Reply-To fields. I think it would be fine then to allow duck-typing of the input objects. I don't have a sense of whether it needs a formal (as in Python's ABCs) interface type. >* What operations should be possible on C? Some that come to mind: > > * retrieve_thread (M or message-id) => T Message-ID as input. > * add_message (M) => T Duck-typed message. > * add_messages (set of M) => None > * remove_message (M or message-id) => T (or None) ? Probably Message-ID as the input. 
I guess the rule would be that if you need all the headers you mention above, a duck-typed message would be required. For operations that only need the Message-ID, just accept that. And you probably want the full Message-ID header value, e.g. it would include the angle brackets. >* What's the interface for T? It's a tree with possible dummy nodes, so > a tuple of messages plus nested tuples would do it. What should the > nodes in the tree be? Normalized (see RFC 5256) Message-IDs? > email.message.Message instances? Will the tree get mutated when a message is added in the middle of a thread, or will you generate a new tree? That would make a difference for tuple-of-tuples or list-of-lists. I think the nodes would be Message-IDs, but you'd need a public API for normalizing them, and my application would have to make sure that my messages are normalized (or at least the lookup keys are) or I might not be able to find a message given its normalized id. OTOH, maybe the message parser or message object itself should provide an API for normalizing ids? Let's think about some use cases. - given any message, find the entire thread it's a part of - given a message, find all children - given a message, find a path to the root of the thread - find the parts of the thread that fall within a date range - find the parts of a thread with a matching subject >* For large sets of threads (millions of messages) a persistence > mechanism would be useful. Should there be a standard interface to > such a mechanism, perhaps as class methods on C? If so, what should > it look like? Should the implementation contain a default persistent > subclass of C, based on sqlite3? What side-effects would persistence > requirements have on the other design considerations? For instance, > would you have to save the entire text of a message for each node? > Just the headers? Just some of the headers? Just the Message-ID? Great questions. 
We've long talked about a persistence mechanism for message parts (e.g. store the big binary parts on disk instead of in memory). Some consistency of design would be good here. But I agree that persistence should definitely be part of the story, and it needs to be plugable. Have to think more about this, but a big +1 for the idea. It would serve as a very good component for the ideas I have about a next generation email archiver. -Barry From rdmurray at bitdance.com Fri Jan 6 02:30:22 2012 From: rdmurray at bitdance.com (R. David Murray) Date: Thu, 05 Jan 2012 20:30:22 -0500 Subject: [Email-SIG] API for email threading library? In-Reply-To: <20120105202108.15db9629@resist.wooz.org> References: <722.1325786141@parc.com> <20120105202108.15db9629@resist.wooz.org> Message-ID: <20120106015213.9F9EA2500E5@webabinitio.net> On Thu, 05 Jan 2012 20:21:08 -0500, Barry Warsaw wrote: > On Jan 05, 2012, at 09:55 AM, Bill Janssen wrote: > > >Folks, I'm working on an implementation of RFC 5256 email threading, > >designed so that it could fit as a submodule in the "email" package, if > >such a think was ever seen to be useful. > > I really like the idea of threading support being included in the email > package. (I admit that I don't have time right now to read the RFC.) My > general thoughts are that the actual messages needn't be included in the > thread collection, but perhaps just Message-IDs. That would allow an > application to store the actual message objects anywhere they want, and would > reduce space requirements of the thread collection. I don't have time to read the RFC either :(. 
But from a skim of the first bits, my immediate reaction is that the best thing to do is to break everything down into as many discrete components as practical (pluggable thread storage, thread construction (which presumably takes duck typed Message objects containing at least the relevant headers) with different subclasses or plugins for the different sorting algorithms, thread query, etc) and keep them as decoupled as possible. That would give a server implementer the greatest flexibility. You'll probably want to noodle on the various APIs and make some concrete (but not fully fleshed out) proposals for discussion. That's the procedure that seemed to work best when we were working on the email6 API. On a possibly related note, it has become clear to me through work I've done recently that the parser/generator classes need some non-trivial refactoring to make using external (not in-the-object-in-memory) storage of all or parts of the message possible. I'm not at all sure when I'll have time to work on that, but I've got a bunch of relevant notes for use when I do :) --David PS: If you implement the 'base subject' algorithm I bet we can get agreement to check that right in to email.utils before 3.3 :) From janssen at parc.com Fri Jan 6 03:49:00 2012 From: janssen at parc.com (Bill Janssen) Date: Thu, 5 Jan 2012 18:49:00 PST Subject: [Email-SIG] API for email threading library? In-Reply-To: <20120105202108.15db9629@resist.wooz.org> References: <722.1325786141@parc.com> <20120105202108.15db9629@resist.wooz.org> Message-ID: <11955.1325818140@parc.com> Thanks for the feedback, Barry. Barry Warsaw wrote: > On Jan 05, 2012, at 09:55 AM, Bill Janssen wrote: > > >Folks, I'm working on an implementation of RFC 5256 email threading, > >designed so that it could fit as a submodule in the "email" package, if > >such a think was ever seen to be useful. > > I really like the idea of threading support being included in the email > package. 
(I admit that I don't have time right now to read the RFC.) It basically defines two kinds of threading for IMAP: ORDEREDSUBJECT, which is "poor man's threading" using "Subject" and "Date" headers, and REFERENCES, which is JWS threading a la Netscape using "References" and "In-Reply-To" headers. I intend to support both. > My general thoughts are that the actual messages needn't be included in the > thread collection, but perhaps just Message-IDs. That would allow an > application to store the actual message objects anywhere they want, and would > reduce space requirements of the thread collection. We need "Subject", "Date", and either "References" or "In-Reply-To", in addition to "Message-ID", in order to add a new message to the thread DB. I was planning to use a struct with slots containing hashes of the value of each of these as the internal node structure in the thread-set instance. If the message objects were available on-demand (perhaps via a weakref or via a Message-ID to message mapping), we could save only a pointer to the message. Perhaps a "retrieve-message-by-message-id" callback object should be a parameter to the constructor. > >I'd like to ask "the wisdom of the crowd" what they think an appropriate > >interface to such a thing would be? The basic operation is that you > >create a collection (type C) of email threads (type T) by passing a set > >of messages (type M) to the constructor. > > > >* Should M be required to be "email.message.Message", or perhaps some > > less restrictive type, say "ThreadableMessageAPI"? All that's > > strictly required is the ability to retrieve the Message-ID, Subject, > > Date, References, and In-Reply-To fields. > > I think it would be fine then to allow duck-typing of the input objects. I > don't have a sense of whether it needs a formal (as in Python's ABCs) > interface type. I prefer an ABC as documentation of the duck-typing requirements. 
I'm thinking a subtype of email.message.Message would be good -- basically adding the contraint that the "Message-ID", "Subject", "Date", and "References" (or "In-Reply-To" headers) be set, but not requiring any payload. > >* What operations should be possible on C? Some that come to mind: > > > > * retrieve_thread (M or message-id) => T > > Message-ID as input. > > > * add_message (M) => T > > Duck-typed message. > > > * add_messages (set of M) => None > > * remove_message (M or message-id) => T (or None) ? > > Probably Message-ID as the input. I guess the rule would be that if you need > all the headers you mention above, a duck-typed message would be required. For "add", but not "remove". > For operations that only need the Message-ID, just accept that. Sure. Either/or. > And you probably want the full Message-ID header value, e.g. it would include > the angle brackets. Easier to get at. > >* What's the interface for T? It's a tree with possible dummy nodes, so > > a tuple of messages plus nested tuples would do it. What should the > > nodes in the tree be? Normalized (see RFC 5256) Message-IDs? > > email.message.Message instances? > > Will the tree get mutated when a message is added in the middle of a thread, > or will you generate a new tree? That would make a difference for > tuple-of-tuples or list-of-lists. It would be mutated internally, but the thread given back to callers would be an immutable copy of the internal tree. I was thinking that the returned thread would be a fresh tuple containing ( ...), where is a message ID, and each is a fresh tuple of the same form as . 
Thus the tree

    A --+-- B
        |
        +-- C --+-- D
                |
                +-- E
                |
                +-- F

would look like this:

    ('A@example.com'
      ('B@example.com')
      ('C@example.com'
        ('D@example.com')
        ('E@example.com')
        ('F@example.com')))

though perhaps

    ('A@example.com'
      'B@example.com'
      ('C@example.com'
        'D@example.com'
        'E@example.com'
        'F@example.com'))

would be more efficient -- each child is either a singleton represented
by a string message-id, or a tuple of a reply plus its children.

> I think the nodes would be Message-IDs, but you'd need a public API for
> normalizing them, and my application would have to make sure that my messages
> are normalized (or at least the lookup keys are) or I might not be able to
> find a message given its normalized id.  OTOH, maybe the message parser or
> message object itself should provide an API for normalizing ids?

The normalization of the Message-ID in RFC 5256 refers to the optional
quoting allowed in RFC 2822, in which

    '<"01KF8JCEOCBS0045PS"@xxx.yyy.com>'
    '<01KF8JCEOCBS0045PS@xxx.yyy.com>'
    '<"01KF8JCEOCBS0045PS"@[xxx.yyy.com]>'
    '<01KF8JCEOCBS0045PS@[xxx.yyy.com]>'

are all the same message ID, the normalized form of which is
'01KF8JCEOCBS0045PS@xxx.yyy.com'.  Might be useful to have a method or
property on email.message.Message to retrieve this value.

I'd certainly want to normalize any message-IDs passed in as keys.

> Let's think about some use cases.
>
> - given any message, find the entire thread it's a part of
> - given a message, find all children
> - given a message, find a path to the root of the thread
> - find the parts of the thread that fall within a date range

Interesting, hadn't thought about that one.  Good idea.

> - find the parts of a thread with a matching subject

Hmmm.  Using ORDEREDSUBJECT, all of the parts of a thread have the same
"base subject" -- which is another thing defined in RFC 5256.  It's
basically the subject of the message with any "Re:" or "Fwd:" or
"[mailman-listname]" stuff trimmed off.
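As a rough illustration of the "base subject" idea described above, here
is a minimal Python sketch.  It is a simplification and not the full
RFC 5256 extraction algorithm: it only strips leading "Re:"/"Fw:"/"Fwd:"
markers, leading "[listname]" tags, and a trailing "(fwd)".  The function
name base_subject and the regular expressions are assumptions made for
this example; they are not code from the thread.

```python
import re

# Leading reply/forward marker such as "Re:", "Fw:", "Fwd:" or "Re[2]:".
_PREFIX = re.compile(r"^(?:re|fwd?)\s*(?:\[[^\]]*\])?\s*:\s*", re.IGNORECASE)
# Leading bracketed tag such as "[Email-SIG]".
_BLOB = re.compile(r"^\[[^\]]*\]\s*")
# Trailing "(fwd)" marker.
_TRAILER = re.compile(r"\s*\(fwd\)\s*$", re.IGNORECASE)


def base_subject(subject):
    """Return a simplified 'base subject' (sketch, not the full RFC rules)."""
    s = " ".join(subject.split())  # collapse runs of whitespace
    while True:
        before = s
        s = _TRAILER.sub("", s)
        s = _PREFIX.sub("", s)
        stripped = _BLOB.sub("", s)
        if stripped:  # keep the tag if removing it would leave nothing
            s = stripped
        if s == before:  # nothing more to trim
            return s


print(base_subject("Re: [Email-SIG] API for email threading library?"))
# prints: API for email threading library?
```

Grouping messages by this value and sorting each group by date gives a
subject-based view of a mailbox in the spirit of ORDEREDSUBJECT.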
The ORDEREDSUBJECT algorithm basically collects all messages with the same "base subject" and sorts them by date. "Base subject" would be another good thing to add to email.util or email.message.Message, by the way. In the REFERENCES algorithm, threads with the same base subject are merged, but I suppose threads where someone replied to an earlier message, but with a different subject line, would allow multiple base subjects per thread. Perhaps such threads should be split apart? > >* For large sets of threads (millions of messages) a persistence > > mechanism would be useful. Should there be a standard interface to > > such a mechanism, perhaps as class methods on C? If so, what should > > it look like? Should the implementation contain a default persistent > > subclass of C, based on sqlite3? What side-effects would persistence > > requirements have on the other design considerations? For instance, > > would you have to save the entire text of a message for each node? > > Just the headers? Just some of the headers? Just the Message-ID? > > Great questions. We've long talked about a persistence mechanism for message > parts (e.g. store the big binary parts on disk instead of in memory). Some > consistency of design would be good here. But I agree that persistence should > definitely be part of the story, and it needs to be plugable. > > Have to think more about this, but a big +1 for the idea. It would serve as a > very good component for the ideas I have about a next generation email > archiver. Yes, I intend to use it for UpLib (http://uplib.parc.com/), which is what I use to archive my many years of email. But I thought it would be more generally useful for others if I wrote it to work with the more stdlib email package. Bill From janssen at parc.com Fri Jan 6 04:22:10 2012 From: janssen at parc.com (Bill Janssen) Date: Thu, 5 Jan 2012 19:22:10 PST Subject: [Email-SIG] API for email threading library? 
In-Reply-To: <20120106015213.9F9EA2500E5@webabinitio.net> References: <722.1325786141@parc.com> <20120105202108.15db9629@resist.wooz.org> <20120106015213.9F9EA2500E5@webabinitio.net> Message-ID: <12717.1325820130@parc.com> David, thanks for the follow-up. R. David Murray wrote: > On Thu, 05 Jan 2012 20:21:08 -0500, Barry Warsaw wrote: > > On Jan 05, 2012, at 09:55 AM, Bill Janssen wrote: > > > > >Folks, I'm working on an implementation of RFC 5256 email threading, > > >designed so that it could fit as a submodule in the "email" package, if > > >such a think was ever seen to be useful. > > > > I really like the idea of threading support being included in the email > > package. (I admit that I don't have time right now to read the RFC.) My > > general thoughts are that the actual messages needn't be included in the > > thread collection, but perhaps just Message-IDs. That would allow an > > application to store the actual message objects anywhere they want, and would > > reduce space requirements of the thread collection. > > I don't have time to read the RFC either :(. But from a skim of the > first bits, my immediate reaction is that the best thing to do is to break > everything down into as many discrete components as practical (pluggable > thread storage, thread construction (which presumably takes duck typed > Message objects containing at least the relevant headers) with different > subclasses or plugins for the different sorting algorithms, thread query, > etc) and keep them as decoupled as possible. That would give a server > implementer the greatest flexibility. That sounds good to me, too. Let me think about pluggable thread persistence a bit more -- pluggable might work better than subtypes there, which is the path I've been going down. The key question is what would we want to be able to do with a re-vivified thread store. 
If we want to be able to add new messages to it, we need to have access to the "five headers" of each of the messages, either by saving them, or by having access to the message store. If not, we can just save the message-IDs. (It would be nice if we could use fixed-size hashes of the message IDs instead of strings, but that would require a message store which understood that concept.) On the other hand, if we're adding a message, presumably we also have access to the message store, and could retrieve the "five headers" therefrom given the message-id -- though that might be an expensive operations for large message stores. Interesting set of metadata requirements on the pluggable design, both for the thread store and the message store. > You'll probably want to noodle on the various APIs and make some > concrete (but not fully fleshed out) proposals for discussion. That's > the procedure that seemed to work best when we were working on the > email6 API. Think of this as the noodling :-). > PS: If you implement the 'base subject' algorithm I bet we can get > agreement to check that right in to email.utils before 3.3 :) I have working code for all of this; right now I'm expanding the test suite and looking at performance and API optimizations, not to mention PEP8-ification. Bill From matt at mondoinfo.com Fri Jan 6 05:42:29 2012 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Thu, 5 Jan 2012 22:42:29 -0600 (CST) Subject: [Email-SIG] API for email threading library? In-Reply-To: <722.1325786141@parc.com> References: <722.1325786141@parc.com> Message-ID: <1325824809.39.11029@mint-julep.mondoinfo.com> Bill, > Folks, I'm working on an implementation of RFC 5256 email > threading, designed so that it could fit as a submodule in the > "email" package, if such a think was ever seen to be useful. 
If you find it at all useful, you're very welcome to use anything you like from: http://www.mondoinfo.com/threadMessages.tar.gz Regards, Matt From janssen at parc.com Fri Jan 6 18:04:01 2012 From: janssen at parc.com (Bill Janssen) Date: Fri, 6 Jan 2012 09:04:01 PST Subject: [Email-SIG] API for email threading library? In-Reply-To: <1325824809.39.11029@mint-julep.mondoinfo.com> References: <722.1325786141@parc.com> <1325824809.39.11029@mint-julep.mondoinfo.com> Message-ID: <19088.1325869441@parc.com> Matthew Dixon Cowles wrote: > Bill, > > > Folks, I'm working on an implementation of RFC 5256 email > > threading, designed so that it could fit as a submodule in the > > "email" package, if such a think was ever seen to be useful. > > If you find it at all useful, you're very welcome to use anything you > like from: > > http://www.mondoinfo.com/threadMessages.tar.gz > > Regards, > Matt Thanks, Matt. Bill From stephen at xemacs.org Sat Jan 7 05:17:01 2012 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sat, 07 Jan 2012 13:17:01 +0900 Subject: [Email-SIG] API for email threading library? In-Reply-To: <722.1325786141@parc.com> References: <722.1325786141@parc.com> Message-ID: <87vcoo2lky.fsf@uwakimon.sk.tsukuba.ac.jp> Bill Janssen writes: > Folks, I'm working on an implementation of RFC 5256 email threading, > designed so that it could fit as a submodule in the "email" package, if > such a think was ever seen to be useful. I don't know if it belongs there, although that's the obvious place. There are a other threaded message structures that aren't email (or netnews, which is obviously basically the same thing). For example, issue trackers. > * Should M be required to be "email.message.Message", -1 > or perhaps some less restrictive type, say > "ThreadableMessageAPI"? All that's strictly required is the > ability to retrieve the Message-ID, Subject, Date, References, > and In-Reply-To fields. 
If a variety of existing apps are to be able to plug this in, the API shouldn't be bound to email.message.Message. +1 for duck-typing. > * What operations should be possible on C? Some that come to mind: > > * retrieve_thread (M or message-id) => T > * add_message (M) => T > * add_messages (set of M) => None > * remove_message (M or message-id) => T (or None) ? * Reparent message (this will actually merge threads). > * What's the interface for T? It's a tree with possible dummy nodes, so > a tuple of messages plus nested tuples would do it. What should the > nodes in the tree be? Normalized (see RFC 5256) Message-IDs? In a Lisp implementation of http://www.jwz.org/doc/threading.html I'm working on, I just use symbols named by the message IDs themselves; I'm not familiar with the normalization yet. > email.message.Message instances? I think it should be more abstract than that. From janssen at parc.com Mon Jan 9 22:05:59 2012 From: janssen at parc.com (Bill Janssen) Date: Mon, 9 Jan 2012 13:05:59 PST Subject: [Email-SIG] API for email threading library? In-Reply-To: <87vcoo2lky.fsf@uwakimon.sk.tsukuba.ac.jp> References: <722.1325786141@parc.com> <87vcoo2lky.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <64473.1326143159@parc.com> Thanks for the feedback, Stephen. Stephen J. Turnbull wrote: > Bill Janssen writes: > > > Folks, I'm working on an implementation of RFC 5256 email threading, > > designed so that it could fit as a submodule in the "email" package, if > > such a think was ever seen to be useful. > > I don't know if it belongs there, although that's the obvious place. > There are a other threaded message structures that aren't email (or > netnews, which is obviously basically the same thing). For example, > issue trackers. > > > * Should M be required to be "email.message.Message", > > -1 > > > or perhaps some less restrictive type, say > > "ThreadableMessageAPI"? 
All that's strictly required is the > > ability to retrieve the Message-ID, Subject, Date, References, > > and In-Reply-To fields. > > If a variety of existing apps are to be able to plug this in, the API > shouldn't be bound to email.message.Message. +1 for duck-typing. I think I'll finesse this issue with another (appropriate) layer of indirection. > > * What operations should be possible on C? Some that come to mind: > > > > * retrieve_thread (M or message-id) => T > > * add_message (M) => T > > * add_messages (set of M) => None > > * remove_message (M or message-id) => T (or None) ? > > * Reparent message (this will actually merge threads). > > > * What's the interface for T? It's a tree with possible dummy nodes, so > > a tuple of messages plus nested tuples would do it. What should the > > nodes in the tree be? Normalized (see RFC 5256) Message-IDs? > > In a Lisp implementation of http://www.jwz.org/doc/threading.html I'm > working on, I just use symbols named by the message IDs themselves; Yes, that works well for a static persistent representation. Lisp message threading? What's that in aid of, if you can say? > I'm not familiar with the normalization yet. RFC 5256 mentions it, but I had to go back to 2822 to figure it out. Referencing section 3.6.4 of RFC 2822: The IMAP guys seem to be implying that the DQUOTEs in "no-fold-quote" and the "[" and "]" brackets in "no-fold-literal" should be removed before comparing message-ids. I'll send a note to the IMAP list to verify that. Bill From stephen at xemacs.org Tue Jan 10 02:12:01 2012 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Tue, 10 Jan 2012 10:12:01 +0900 Subject: [Email-SIG] API for email threading library? 
In-Reply-To: <64473.1326143159@parc.com> References: <722.1325786141@parc.com> <87vcoo2lky.fsf@uwakimon.sk.tsukuba.ac.jp> <64473.1326143159@parc.com> Message-ID: <87pqes9x9a.fsf@uwakimon.sk.tsukuba.ac.jp> Bill Janssen writes: > I think I'll finesse this issue with another (appropriate) layer of > indirection. OK by me (can't bring myself to +1 on a thoughtful finesse. :) > > In a Lisp implementation of http://www.jwz.org/doc/threading.html I'm > > working on, I just use symbols named by the message IDs themselves; > > Yes, that works well for a static persistent representation. > > Lisp message threading? What's that in aid of, if you can say? The "VM" MUA for Emacs and XEmacs. > RFC 5256 mentions it, but I had to go back to 2822 to figure it out. Tee-hee-hee! The wild, wonderful world of RFCs: "You are in a twisty maze of ABNF, all alike ...." From janssen at parc.com Tue Jan 10 02:32:00 2012 From: janssen at parc.com (Bill Janssen) Date: Mon, 9 Jan 2012 17:32:00 PST Subject: [Email-SIG] API for email threading library? In-Reply-To: <87pqes9x9a.fsf@uwakimon.sk.tsukuba.ac.jp> References: <722.1325786141@parc.com> <87vcoo2lky.fsf@uwakimon.sk.tsukuba.ac.jp> <64473.1326143159@parc.com> <87pqes9x9a.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <69863.1326159120@parc.com> Stephen J. Turnbull wrote: > > Lisp message threading? What's that in aid of, if you can say? > > The "VM" MUA for Emacs and XEmacs. Ah! I use MH-E, myself. Bill From janssen at parc.com Tue Jan 10 02:36:49 2012 From: janssen at parc.com (Bill Janssen) Date: Mon, 9 Jan 2012 17:36:49 PST Subject: [Email-SIG] API for email threading library? 
In-Reply-To: <87vcoo2lky.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <722.1325786141@parc.com>
 <87vcoo2lky.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <69887.1326159409@parc.com>

Some input from Mark Crispin (who wrote that bit about message-ID
normalization in RFC 5256):

> no-fold-quote does not exist in the current specification (RFC 5322)
> [which obsoletes 2822 - wcj].
>
> I don't know why you think that the brackets should be removed in
> no-fold-literal.  The brackets indicate that the contents are a literal IP
> address as opposed to a domain.  The fact that 10.20.30.40, as opposed to
> [10.20.30.40], is parsed by some people as an IP address does not
> necessarily mean that it is (I'll laugh when the first all-numeric TLD is
> created!).  Now, in the modern day of RFC 5322, this isn't a domain at all
> but rather an id-right.
>
> People can flame at some length whether bloop@10.20.30.40 and
> bloop@[10.20.30.40] are the same message-ID.  My guess is "no".
>
> The bottom line here is whether that text about normalized message ID has
> any particular meaning in the context of RFC 5322 as opposed to earlier
> versions of header syntax that used local-part@domain for message-id.
> IMHO (and I wrote that text!) I would treat it as advice on how to treat
> warts from the past rather than how to move forward.
>
> That is, once upon a time, it was necessary to treat:
>
>       Message-ID: <"bloop"@grok.this>
> and
>       Message-ID:
>
> as the same thing.  This was a protocol wart and I'm glad to see it
> declared obsolete.  I wouldn't flame anyone who decided that strcmp() is
> the one and only way to compare Message-IDs.  I daresay that's what most
> implementations did anyway even when RFC 822 was king.

So, stripping double-quotes on the left side stays, stripping brackets
on the right side is a no-no.

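The rule just stated can be sketched in a few lines of Python.  This is
only an illustration under stated assumptions: the function name
normalize_message_id is hypothetical, the angle brackets are dropped as in
the normalized form shown earlier in the thread, and nothing beyond the
surrounding double quotes on the left side is touched.

```python
def normalize_message_id(value):
    """Strip <...> and left-side double quotes; keep [..] on the right."""
    s = value.strip()
    if s.startswith("<") and s.endswith(">"):
        s = s[1:-1]
    # Split on the last "@": everything before it is the left side.
    left, sep, right = s.rpartition("@")
    if not sep:  # no "@" at all: nothing more to normalize
        return s
    if len(left) >= 2 and left.startswith('"') and left.endswith('"'):
        left = left[1:-1]
    # The right side is returned untouched, so a bracketed literal
    # stays distinct from the same text without brackets.
    return left + "@" + right


print(normalize_message_id('<"01KF8JCEOCBS0045PS"@xxx.yyy.com>'))
# prints: 01KF8JCEOCBS0045PS@xxx.yyy.com
```

With this sketch, '<bloop@[10.20.30.40]>' and '<bloop@10.20.30.40>'
normalize to different strings, matching the guess quoted above that they
are not the same message-ID.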
Bill

From janssen at parc.com  Tue Jan 10 03:37:25 2012
From: janssen at parc.com (Bill Janssen)
Date: Mon, 9 Jan 2012 18:37:25 PST
Subject: [Email-SIG] API for email threading library?
In-Reply-To: <722.1325786141@parc.com>
References: <722.1325786141@parc.com>
Message-ID: <71141.1326163045@parc.com>

Thanks for all the feedback, folks.

After musing about all of this, it seems to me that threading makes no
sense outside the context of a message store (or forum post store, or
netnews article store, or...).  So I'm going to push all the decisions
about "what's a message anyway" into an abstraction of such a message
store:

  class ThreadableObjectStore:

      @abstractmethod
      get_message_id(msg) => message-ID (string)

      @abstractmethod
      get_subject(msg) => subject (string)

      @abstractmethod
      get_date(msg) => timestamp (float, seconds past epoch)

      @abstractmethod
      get_references(msg) => sequence of message-ID (list of string)

So your particular instantiation of ThreadableObjectStore can decide
what a 'msg' is, what a message-ID is, whether they're normalized, etc.

An instance of a ThreadableObjectStore will be required to create an
instance of a threadset.  I'll provide such a class for mailbox.Mailbox,
for testing.

Also, I think that persistence of the threading analysis is really a
function of the message store, not the threadset.  So what the threadset
requires is simply (1) a way to externalize its threads in a meaningful
way, which a forest of tuple trees with message IDs at the nodes works
perfectly well for, and (2) a way to take such a representation and
revivify it, given a message store.

Bill

From stephen at xemacs.org  Tue Jan 10 04:23:02 2012
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 10 Jan 2012 12:23:02 +0900
Subject: [Email-SIG] API for email threading library?
In-Reply-To: <69887.1326159409@parc.com> References: <722.1325786141@parc.com> <87vcoo2lky.fsf@uwakimon.sk.tsukuba.ac.jp> <69887.1326159409@parc.com> Message-ID: <87lipg9r6x.fsf@uwakimon.sk.tsukuba.ac.jp> Bill Janssen writes: > So, stripping double-quotes on the left side stays, stripping brackets > on the right side is a no-no. Hrm. How about interpreting quoted-pairs? (Not that you should ever see them, but ....) That is, <"b\l\o\op"@grok.this> and <"bloop"@grok.this> should compare equal, no? Or yes? Which leads me to ... I wonder if the way the Postel Principle applies here isn't "you're better unifying too many message IDs because the user will immediately recognize thread content skew, while unifying too few will result in different parts of the thread being widely separated in the presentation of the message set, and possibly premature ejaculation of responses".[1] So, (without having thought about it *too* much) I would advocate unifying message IDs that are likely to be (mistakenly?) "normalized" by some implementations. And of course, you should never see such message IDs in practice; I don't think I've ever seen a mailbox, let alone the LHS of a message ID, in quotes outside of an RFC. Although I *have* seen whole addresses in quotes. BTW, although I'm working with VM myself, my intent is to make jwz-thread.el usable with any Emacsen-based MUA. (I'm really sick of how crappy *all* of the MUA code is in Emacs -- I can understand why one would use MH-E since the MUA is actually implemented elsewhere!) Footnotes: [1] Which is why I'm implementing a threading engine.... From janssen at parc.com Tue Jan 10 19:09:48 2012 From: janssen at parc.com (Bill Janssen) Date: Tue, 10 Jan 2012 10:09:48 PST Subject: [Email-SIG] API for email threading library? 
In-Reply-To: <87lipg9r6x.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <722.1325786141@parc.com> <87vcoo2lky.fsf@uwakimon.sk.tsukuba.ac.jp> <69887.1326159409@parc.com> <87lipg9r6x.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <79664.1326218988@parc.com>

Stephen J. Turnbull wrote:

> BTW, although I'm working with VM myself, my intent is to make
> jwz-thread.el usable with any Emacsen-based MUA.

Great!

> (I'm really sick of
> how crappy *all* of the MUA code is in Emacs -- I can understand why
> one would use MH-E since the MUA is actually implemented elsewhere!)

Completely agree -- when I shifted over to it, I had to re-write a third
of the MH-E code to get it into some shape I could live with.

Bill

From janssen at parc.com Tue Jan 10 19:12:36 2012
From: janssen at parc.com (Bill Janssen)
Date: Tue, 10 Jan 2012 10:12:36 PST
Subject: [Email-SIG] API for email threading library?
In-Reply-To: <87lipg9r6x.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <722.1325786141@parc.com> <87vcoo2lky.fsf@uwakimon.sk.tsukuba.ac.jp> <69887.1326159409@parc.com> <87lipg9r6x.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <79735.1326219156@parc.com>

Stephen J. Turnbull wrote:

> Bill Janssen writes:
>
> > So, stripping double-quotes on the left side stays, stripping brackets
> > on the right side is a no-no.
>
> Hrm. How about interpreting quoted-pairs? (Not that you should ever
> see them, but ....)
>
> That is, <"b\l\o\op"@grok.this> and <"bloop"@grok.this> should compare
> equal, no? Or yes?

Yes, I think.

> Which leads me to ...
I wonder if the way the Postel Principle applies
> here isn't "you're better off unifying too many message IDs because the
> user will immediately recognize thread content skew, while unifying
> too few will result in different parts of the thread being widely
> separated in the presentation of the message set, and possibly
> premature ejaculation of responses".[1] So, (without having thought
> about it *too* much) I would advocate unifying message IDs that
> are likely to be (mistakenly?) "normalized" by some implementations.
>
> And of course, you should never see such message IDs in practice; I
> don't think I've ever seen a mailbox, let alone the LHS of a message
> ID, in quotes outside of an RFC. Although I *have* seen whole
> addresses in quotes.

That's probably right, too. And Mark Crispin says as much.

Bill

From janssen at parc.com Wed Jan 11 20:00:39 2012
From: janssen at parc.com (Bill Janssen)
Date: Wed, 11 Jan 2012 11:00:39 PST
Subject: [Email-SIG] API for email threading library?
In-Reply-To: <71141.1326163045@parc.com>
References: <722.1325786141@parc.com> <71141.1326163045@parc.com>
Message-ID: <98700.1326308439@parc.com>

Here's what I've got so far. Comments would be appreciated.

Bill
======================================================================

This module implements email threading per RFC 5256. It provides four
classes: ThreadableObjectStore, MailboxStore, ReferencesSet, and
OrderedSubjectSet.

To use it, you need to provide it with a "mailstore", and a set of
messages to thread. The mailstore must be an instance of a subclass of
the abstract class ThreadableObjectStore; an implementation of a
ThreadableObjectStore for mailbox.Mailbox is provided, as the class
MailboxStore.
Four methods must be implemented for a new ThreadableObjectStore
subclass:

  tos_get_message_id(msg or message ID) => message ID

    where the message ID is an immutable value that must be unique in
    that ThreadableObjectStore context, and the msg can be whatever
    that ThreadableObjectStore considers a message.

  tos_get_subject(msg or message ID) => subject

    where the subject is the subject of the message, or None

  tos_get_date(msg or message ID) => timestamp

    where the timestamp is the date and time of the message, expressed
    as a standard Python time.time() value

  tos_get_references(msg or message ID) => sequence of message ID

    where the references are a sequence of message IDs, arranged in
    order as per RFC 5322. These message IDs must be in the same
    format as the message ID returned by tos_get_message_id().

The base ThreadableObjectStore class also provides a class method to
compute the RFC 5256 "base subject":

  ThreadableObjectStore.tos_base_subject(subject text) => \
      subject, is_reply_or_forward

    Takes a standard Subject: header value, and returns the "base
    subject" for it, along with a boolean flag indicating whether the
    supplied subject indicated a reply to or forward of the original
    subject

To develop a set of threads, you then instantiate either ReferencesSet
(the JWZ algorithm from Netscape, formalized in RFC 5256), or
OrderedSubjectSet (the "same subjects" algorithm, aka "poor man's
threading"), both subclasses of the abstract class ThreadSet. Each
constructor takes a ThreadableObjectStore instance and optionally a set
of messages to use for the initial threads. If provided, those
messages are analyzed into a set of threads.

The threadset is iterable; the iteration is over the threads it
contains.
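
As a rough illustration of the four store methods listed above, here
is a minimal sketch of what such a store might look like in Python. It
is not the module's actual code: the abstract base class is only one
plausible rendering of the description, and DictStore, the dict keys
("message_id", "subject", "date", "references"), and the _resolve
helper are all assumptions made up for the example. The message IDs
are taken from this thread.

```python
from abc import ABC, abstractmethod


class ThreadableObjectStore(ABC):
    """Assumed rendering of the abstract store described above."""

    @abstractmethod
    def tos_get_message_id(self, msg):
        """Return the unique, immutable message ID for msg."""

    @abstractmethod
    def tos_get_subject(self, msg):
        """Return the subject of msg, or None."""

    @abstractmethod
    def tos_get_date(self, msg):
        """Return the message date as a time.time()-style float."""

    @abstractmethod
    def tos_get_references(self, msg):
        """Return the referenced message IDs, in header order."""


class DictStore(ThreadableObjectStore):
    """Hypothetical store whose messages are plain dicts."""

    def __init__(self, messages):
        self._by_id = {m["message_id"]: m for m in messages}

    def _resolve(self, msg):
        # Each method accepts either a message (dict) or a message ID (str).
        return self._by_id[msg] if isinstance(msg, str) else msg

    def tos_get_message_id(self, msg):
        return self._resolve(msg)["message_id"]

    def tos_get_subject(self, msg):
        return self._resolve(msg).get("subject")

    def tos_get_date(self, msg):
        return self._resolve(msg)["date"]

    def tos_get_references(self, msg):
        return list(self._resolve(msg).get("references", ()))


root = {
    "message_id": "722.1325786141@parc.com",
    "subject": "[Email-SIG] API for email threading library?",
    "date": 1325786141.0,  # illustrative timestamp
}
reply = {
    "message_id": "71141.1326163045@parc.com",
    "subject": "[Email-SIG] API for email threading library?",
    "date": 1326163045.0,  # illustrative timestamp
    "references": ["722.1325786141@parc.com"],
}
store = DictStore([root, reply])
```

Because every method accepts either a message or its ID, a thread set
built on such a store could keep only message IDs in its nodes and
still look up subjects and dates on demand.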
An instance of ThreadSet provides the following methods:

  add(msg or message ID) => thread

    add another message from the mailstore to the thread set, where
    "thread" is an object which has the attributes "message_id" (a
    string) and "children" (an ordered list of sub-threads), and is
    the root of the thread tree for that msg.

  remove(msg or message ID) => thread

    remove a message from the thread set, where thread is as for
    "add()", but may additionally be 'None' if the message was not in
    a thread, or was the only message in the thread.

  thread(msg or message ID) => thread

    obtain the thread containing the specified message, if any, where
    "thread" is as for "add()", or 'None' if no thread for that
    message exists.

  subject_threads(subject regexp) => set of thread

    obtain the threads where the base subject of the thread contains
    the specified regular expression, where "regexp" is a textual or
    compiled regular expression, and the return value is a set of
    threads. Note that subject comparisons are case-insensitive;
    compiled regexps must use the re.IGNORECASE flag.

  date_threads(starting time, ending time, root_only=False) => set of thread

    obtain the set of threads containing any messages between the two
    timestamps. Timestamps are time.time() timestamps; either may be
    specified as 'None' to mean either the start of time, or the
    distant future, respectively. If "root_only" is specified, will
    only consider the dates of the roots of each thread; threads with
    no root message (a subject forest) will always fail to match in
    this case.

  __contains__(msg or message ID) => boolean

    Present to support the "in" operator.

Support for persistence is provided with an instance method
"to_external_form" and a class method "from_external_form" on thread
sets. Calling "to_external_form" on a thread set instance will
generate a set of tree structured nested tuples, where each tuple
consists of an optional message ID followed by zero or more child
tuples.
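
To make the nested-tuple external form described above concrete, here
is a small sketch under stated assumptions. The Thread class and the
to_external/from_external helpers are hypothetical stand-ins, not the
module's real methods, and the sketch assumes a dummy node (one with
no message) is written with None in the message ID slot; the
description above leaves that detail open. The message IDs come from
this thread.

```python
class Thread:
    """Hypothetical thread node with the two attributes described above."""

    def __init__(self, message_id, children=()):
        self.message_id = message_id      # str, or None for a dummy node
        self.children = list(children)    # ordered list of sub-threads


def to_external(thread):
    """Flatten one thread tree into (message_id, child_tuple, ...)."""
    return (thread.message_id,) + tuple(to_external(c) for c in thread.children)


def from_external(ext):
    """Rebuild a thread tree from the nested-tuple form."""
    return Thread(ext[0], [from_external(child) for child in ext[1:]])


tree = Thread("722.1325786141@parc.com", [
    Thread("20120105202108.15db9629@resist.wooz.org"),
    Thread("71141.1326163045@parc.com", [
        Thread("98700.1326308439@parc.com"),
    ]),
])

external = to_external(tree)
# external == ("722.1325786141@parc.com",
#              ("20120105202108.15db9629@resist.wooz.org",),
#              ("71141.1326163045@parc.com",
#               ("98700.1326308439@parc.com",)))

restored = from_external(external)
```

Since the form holds only strings, None, and tuples, it can be handed
to whatever serializer the message store already uses, which fits the
idea that persistence belongs to the store rather than the thread set.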
ReferencesSet and OrderedSubjectSet also provide a class method,
"from_external_form", which given a ThreadableObjectStore instance and
an externalized thread set value, will create and return a new thread
set instance initialized to that set of threads.

MailboxStore is a subclass of ThreadableObjectStore designed to wrap
mailboxes (subclasses of mailbox.Mailbox). For instance,

  >>> import mailbox
  >>> mbox = mailbox.mbox("foo.mbox")
  >>> mboxstore = MailboxStore(mbox)
  >>> threadset = ReferencesSet(mboxstore, mbox.itervalues())

will produce a thread set for all the messages in the mbox-format
mailbox 'foo.mbox', using the REFERENCES threading algorithm.

MailboxStore also provides a static method to compute the normalized
form of a message ID (the message ID stripped of <> angle brackets, and
various quoted parts unquoted):

  MailboxStore.normalize_message_id(message ID) => message ID

    Take a standard RFC 5322 message ID string and return the
    normalized form of it.

From janssen at parc.com Wed Jan 11 20:29:25 2012
From: janssen at parc.com (Bill Janssen)
Date: Wed, 11 Jan 2012 11:29:25 PST
Subject: [Email-SIG] API for email threading library?
In-Reply-To: <87pqes9x9a.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <722.1325786141@parc.com> <87vcoo2lky.fsf@uwakimon.sk.tsukuba.ac.jp> <64473.1326143159@parc.com> <87pqes9x9a.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <99819.1326310165@parc.com>

Stephen J. Turnbull wrote:

> > Lisp message threading? What's that in aid of, if you can say?
>
> The "VM" MUA for Emacs and XEmacs.

Incidentally, I'm using the Nov 2011 Python-dev archive as a test mbox.
If you were to try it with your software, too, we could test the
implementations against each other.

Bill

From stephen at xemacs.org Thu Jan 12 06:28:27 2012
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Thu, 12 Jan 2012 14:28:27 +0900
Subject: [Email-SIG] API for email threading library?
In-Reply-To: <99819.1326310165@parc.com>
References: <722.1325786141@parc.com> <87vcoo2lky.fsf@uwakimon.sk.tsukuba.ac.jp> <64473.1326143159@parc.com> <87pqes9x9a.fsf@uwakimon.sk.tsukuba.ac.jp> <99819.1326310165@parc.com>
Message-ID: <87mx9t8p6s.fsf@uwakimon.sk.tsukuba.ac.jp>

Bill Janssen writes:

 > Stephen J. Turnbull wrote:
 >
 > > > Lisp message threading? What's that in aid of, if you can say?
 > >
 > > The "VM" MUA for Emacs and XEmacs.
 >
 > Incidentally, I'm using the Nov 2011 Python-dev archive as a test mbox.
 > If you were to try it with your software, too, we could test the
 > implementations against each other.

Of course I've been using an archive of my own, but I'm more than
happy to switch to something publicly available. I've got a big
meeting coming up on Saturday; I'll get back to you on this after
that.

Steve