From janssen at parc.com Sun Oct 17 23:01:12 2010
From: janssen at parc.com (Bill Janssen)
Date: Sun, 17 Oct 2010 14:01:12 PDT
Subject: [Archiver-dev] UpLib and archiving
Message-ID: <83790.1287349272@parc.com>

Just noticed this list, and thought I'd sign up.

I build the UpLib archive system, at http://uplib.parc.com/.

The latest release includes new support for building very large
archives. UpLib has some support for email archiving already, including
thread analysis and a built-in IMAP server, but that support needs to be
re-worked for efficiency to support large archives. So I'm thinking
about that just now.

Some topics:

1. An email thread analysis library which works on a mixin, say
   ThreadableEmail, so that different email packages could use it.

2. Support for multipart/related parsing.

3. Indexing for search. UpLib currently indexes email into PyLucene
   with the following fields:

     date (untokenized)
     contents (tokenized -- just the body text, not the headers)
     email-message-id (untokenized)
     email-guid (untokenized -- a hash of the message-id)
     email-subject (tokenized)
     email-from-name (tokenized, only used if present)
     email-from-address (untokenized)
     email-attachment-to (untokenized, for attachments, guid of message)
     email-thread-index (untokenized, thread ID)
     email-references (untokenized, zero or more email-guids)
     email-in-reply-to (untokenized, zero or more email-guids)
     email-recipient-names (untokenized [should be tokenized])
     email-recipients (untokenized -- who the message was sent to)

   Attachments are extracted, and indexed separately, with links from the
   attachment to the message, and links from the message to its
   attachments. This is a nice feature of UpLib over more specifically
   mail-archiving systems -- it can also archive images, Word, PDF, etc.,
   and do proper metadata indexing on all of the various types.
It also tries to leverage Lucene's multi-language support, by running a language guesser over the text of the email, and selecting the Lucene Analyzer which most closely matches that language. So, is this a good list of indexing fields? Bad list? Where does the Dublin Core factor into this? 4. Archive server frameworks. My IMAP server is currently built on top of Medusa, like the rest of UpLib. No one's working on Medusa. Bill From barry at python.org Mon Oct 18 20:28:59 2010 From: barry at python.org (Barry Warsaw) Date: Mon, 18 Oct 2010 14:28:59 -0400 Subject: [Archiver-dev] UpLib and archiving In-Reply-To: <83790.1287349272@parc.com> References: <83790.1287349272@parc.com> Message-ID: <20101018142859.4553f808@mission> On Oct 17, 2010, at 02:01 PM, Bill Janssen wrote: >I build the UpLib archive system, at http://uplib.parc.com/. > >The latest release includes new support for building very large >archives. UpLib has some support for email archiving already, including >thread analysis and a built-in IMAP server, but that support needs to be >re-worked for efficiency to support large archives. So I'm thinking >about that just now. Very cool! The state of the art in open source email archivers has been stagnant for years. I think a huge number of people would like to see a new offering, but getting things off the ground has always been too daunting. Maybe uplib will be the platform to build a nextgen email archiver on top of. >1. An email thread analysis library which works on a mixin, say > ThreadableEmail, so that different email packages could use it. Which different email packages do you mean? Is that "different versions of the stdlib email package" or something else? >2. Support for multipart/related parsing. That would be nice. >3. Indexing for search. 
UpLib currently indexes email into PyLucene > with the following fields: > > date (untokenized) > contents (tokenized -- just the body text, not the headers) How does it (or how do you envision it) working with non-text/plain parts? > email-message-id (untokenized) > email-guid (untokenized -- a hash of the message-id) There's also this: http://wiki.list.org/display/DEV/Stable+URLs Stable URLs on archive regeneration is absolutely critical and predictable URLs without communication between the MLM and archiver is highly desirable. The algorithm is simple, but I don't know how that works with uplib's notion of an email message's canonical URL. > email-subject (tokenized) > email-from-name (tokenized, only used if present) > email-from-address (untokenized) > email-attachment-to (untokenized, for attachments, guid of message) > email-thread-index (untokenized, thread ID) > email-references (untokenized, zero or more email-guids) > email-in-reply-to (untokenized, zero or more email-guids) > email-recipient-names (untokenized [should be tokenized]) > email-recipients (untokenized -- who the message was sent to) > > Attachments are extracted, and indexed separately, with links from the > attachment to the message, and links from the message to its > attachments. This is a nice feature of UpLib over more specifically > mail-archiving systems -- it can also archive images, Word, PDF, etc., > and do proper metadata indexing on all of the various types. And that is *very* cool. How do you handle security issues, i.e. html parts with evil content (javascript) or Content-Disposition filenames that lie about their type? > It also tries to leverage Lucene's multi-language support, by > running a language guesser over the text of the email, and selecting > the Lucene Analyzer which most closely matches that language. Wow, neat! > So, is this a good list of indexing fields? Bad list? Where does > the Dublin Core factor into this? It seems like a reasonably good start. 
Dunno about Dublin Core. >4. Archive server frameworks. My IMAP server is currently built on top > of Medusa, like the rest of UpLib. No one's working on Medusa. How hard would it be to slot in Twisted? Something I've always wanted to see was an archiver that supported IMAP, NNTP, and web access. Twisted seems like the obvious choice. What about plugins? There are a few areas that come to mind about things I'd like to have pluggable: * Content hyperlinks. Let's say you've got an archive for a -commits list. It would be nice to be able to dig out things like bug numbers and vcs revisions and hyperlink them to tracker or viewvcs pages. * Take-down support. If a list admin wants to remove a posting, she should be able to do that without disrupting email threads or breaking URLs. One way I've thought about doing that is a dynamic rendering plugin that checked the to-be-displayed message against a blacklist, and if there's a hit, it would substitute the body of the message with something like "Content unavailable due to take-down notice. Contact postmaster at python.org for detail." * Email address obfuscation. Obviously we'd want to support that, but using what algorithm? xxx'ing out the domain? Using a central forwarding service? How do we recognize email addresses? * Send-me-this-message button. I do a Google search and find a message in an archive from 4 years ago. It's relevant to the problem I'm now having and I'd like to respond to it in my normal email reader. Maybe IMAP/NNTP is the right way to go, or there could be a button to allow the user to forward the message to herself. Interested to hear your thoughts. This would be a cool project to work on, and maybe we should also engage mailman-developers. Thanks for releasing it under the GPL[*]. -Barry [*] While GPLv2 would be incompatible with Mailman 3's GPLv3, I don't think it matters. The two systems will be lightly connected, though we would have to think about the integration points. 
MM3 has a plugin system for archivers. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: not available URL: From janssen at parc.com Tue Oct 19 03:13:26 2010 From: janssen at parc.com (Bill Janssen) Date: Mon, 18 Oct 2010 18:13:26 PDT Subject: [Archiver-dev] UpLib and archiving In-Reply-To: <20101018142859.4553f808@mission> References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission> Message-ID: <92627.1287450806@parc.com> Barry Warsaw wrote: > On Oct 17, 2010, at 02:01 PM, Bill Janssen wrote: > > >I build the UpLib archive system, at http://uplib.parc.com/. > > > >The latest release includes new support for building very large > >archives. UpLib has some support for email archiving already, including > >thread analysis and a built-in IMAP server, but that support needs to be > >re-worked for efficiency to support large archives. So I'm thinking > >about that just now. > > Very cool! The state of the art in open source email archivers has been > stagnant for years. I think a huge number of people would like to see a new > offering, but getting things off the ground has always been too daunting. > Maybe uplib will be the platform to build a nextgen email archiver on top of. UpLib is not specifically about email -- it's a general purpose digital document archiver. (But most email is not just about email, either. See for instance http://www.parc.com/content/attachments/email_habitat_exploration_4360_parc.pdf.) I pour my email into it, so I've added some features to it to help with the process of reading and finding email, like the threading code. It knows how to parse emails (using the email package) and render them, and do threading, etc. > > >1. An email thread analysis library which works on a mixin, say > > ThreadableEmail, so that different email packages could use it. > > Which different email packages do you mean? 
Is that "different versions of > the stdlib email package" or something else? Different email archival implementations, I was thinking. For instance, I'm re-writing the thread support in UpLib to handle email thread forests backed by an SQLite DB. I'd like to be able to use a library to deduce the threads, but keep them in my own format. To implement the two separate IMAP threading algorithms in RFC 5256, you need "Subject", "Date", "Message-ID" (the *normalized* Message-ID), "References", and "In-Reply-To". So such a library would need an email Message type which provides these fields in some fashion. In UpLib, I'd want these to be either subtypes of Document, or perhaps a separate record type created solely for the purposes of threading. > > >2. Support for multipart/related parsing. > > That would be nice. > > >3. Indexing for search. UpLib currently indexes email into PyLucene > > with the following fields: > > > > date (untokenized) > > contents (tokenized -- just the body text, not the headers) > > How does it (or how do you envision it) working with non-text/plain parts? It's got some code to deal with that. If the Content-Type is non-text, it just processes it as it would any other document of that content-type. Otherwise, if it's text/* or multipart, the email parsing code picks one part, either text/plain or text/html (or a series of such), and makes it the "main" part. The text extracted from that is the "contents". Other parts are typically classified as "attachments", broken off, and separately indexed. So text in them is also indexed, but it's indexed in the role of an attachment to a particular email message. Attachments show up as little icons in the visual rendering of the document. 
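A minimal sketch of the part-selection step described above, written against the stdlib email package. This is not UpLib's code: the helper name split_main_and_attachments is hypothetical, and the rule it applies (prefer text/plain, fall back to text/html, treat every other leaf part as an attachment) is a simplification of what the message describes.

```python
from email.message import EmailMessage


def split_main_and_attachments(msg):
    """Return (main_part, attachments) for a parsed message.

    Hypothetical helper, not UpLib code. The "main" part is the first
    text/plain leaf that is not marked as an attachment, or failing that
    the first such text/html leaf. All other leaves count as attachments.
    """
    leaves = [part for part in msg.walk() if not part.is_multipart()]
    main = None
    for wanted in ("text/plain", "text/html"):
        for part in leaves:
            marked_attachment = part.get_content_disposition() == "attachment"
            if part.get_content_type() == wanted and not marked_attachment:
                main = part
                break
        if main is not None:
            break
    attachments = [part for part in leaves if part is not main]
    return main, attachments


# Usage: a plain-text body with one PDF attached.
demo = EmailMessage()
demo["Subject"] = "report"
demo.set_content("See the attached file.")
demo.add_attachment(b"%PDF-1.4", maintype="application",
                    subtype="pdf", filename="report.pdf")
main, attachments = split_main_and_attachments(demo)
print(main.get_content_type())
print([part.get_filename() for part in attachments])
```

If the message has no text part at all, main comes back as None, which corresponds to the case above where the whole message is handled like any other document of its content type.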
> > email-message-id (untokenized) > > email-guid (untokenized -- a hash of the message-id) > > There's also this: http://wiki.list.org/display/DEV/Stable+URLs > > Stable URLs on archive regeneration is absolutely critical and predictable > URLs without communication between the MLM and archiver is highly desirable. > The algorithm is simple, but I don't know how that works with uplib's notion > of an email message's canonical URL. > > > email-subject (tokenized) > > email-from-name (tokenized, only used if present) > > email-from-address (untokenized) > > email-attachment-to (untokenized, for attachments, guid of message) > > email-thread-index (untokenized, thread ID) > > email-references (untokenized, zero or more email-guids) > > email-in-reply-to (untokenized, zero or more email-guids) > > email-recipient-names (untokenized [should be tokenized]) > > email-recipients (untokenized -- who the message was sent to) > > > > Attachments are extracted, and indexed separately, with links from the > > attachment to the message, and links from the message to its > > attachments. This is a nice feature of UpLib over more specifically > > mail-archiving systems -- it can also archive images, Word, PDF, etc., > > and do proper metadata indexing on all of the various types. > > And that is *very* cool. How do you handle security issues, i.e. html parts > with evil content (javascript) or Content-Disposition filenames that lie about > their type? I don't run Javascript, so evil parts get to stick around and potentially do damage in the future. UpLib has an extensible system of data analysis engines called "rippers" which are automatically run on each document; if you were concerned about the possibility of lingering malware a malware-detector ripper could be added to flag and/or remove such content. I actually do process Javascript lightly to remove some irritating pieces, like "capture" redirects. 
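As a rough illustration of what a script-flagging pass could look for, here is a sketch built on the stdlib html.parser module. It is an assumption-laden example, not an UpLib ripper: the names ScriptFinder and find_active_content are hypothetical, hooking it into UpLib's ripper interface is not shown, and the checks are heuristics that will miss obfuscated input.

```python
from html.parser import HTMLParser


class ScriptFinder(HTMLParser):
    """Collect places where an HTML part could carry active content."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "script":
            self.findings.append("script element")
        for name, value in attrs:
            if name.startswith("on"):
                # Heuristic: onclick, onload, onerror, ...
                self.findings.append("event handler attribute: " + name)
            elif value and value.strip().lower().startswith("javascript:"):
                self.findings.append("javascript: URL in " + name)


def find_active_content(html_text):
    """Return a list of findings; an empty list means nothing was flagged."""
    finder = ScriptFinder()
    finder.feed(html_text)
    finder.close()
    return finder.findings


sample = ('<p onclick="x()">hi</p><script>alert(1)</script>'
          '<a href="javascript:void(0)">link</a>')
for finding in find_active_content(sample):
    print(finding)
```

This only flags content. Removing it safely is a harder problem and is better left to a maintained HTML sanitizer than to a hand-rolled filter.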
As for the Content-Disposition filenames: UpLib runs its own content-type determiner over the content to try to see what it is rather than just relying on the filename, though it will fall back to the filename if it can't figure it out. And I've hardcoded some typical situations. > > It also tries to leverage Lucene's multi-language support, by > > running a language guesser over the text of the email, and selecting > > the Lucene Analyzer which most closely matches that language. > > Wow, neat! > > > So, is this a good list of indexing fields? Bad list? Where does > > the Dublin Core factor into this? > > It seems like a reasonably good start. Dunno about Dublin Core. > > >4. Archive server frameworks. My IMAP server is currently built on top > > of Medusa, like the rest of UpLib. No one's working on Medusa. > > How hard would it be to slot in Twisted? Something I've always wanted to see > was an archiver that supported IMAP, NNTP, and web access. Twisted seems like > the obvious choice. I've got the Web access and the IMAP support, but not NNTP -- never had the need for it. Twisted seems to have support going forward that Medusa no longer has, and at some point I plan to port UpLib from Medusa to Twisted. > What about plugins? There are a few areas that come to mind about things I'd > like to have pluggable: > > * Content hyperlinks. Let's say you've got an archive for a -commits list. > It would be nice to be able to dig out things like bug numbers and vcs > revisions and hyperlink them to tracker or viewvcs pages. UpLib automatically recognizes and stores hyperlinks found in Word, PDF, Powerpoint, etc. as part of the standard metadata extraction process. There's also a ripper which recognizes URLs and stores them as links. In house, we have some support for entity-finding: person or corporation or location names, dates, etc. They are also automatically turned into links, and show up as hyperlinks in the Web and Java UI tools. 
What you're suggesting is more of that.

> * Take-down support. If a list admin wants to remove a posting, she should be
>   able to do that without disrupting email threads or breaking URLs. One way
>   I've thought about doing that is a dynamic rendering plugin that checked the
>   to-be-displayed message against a blacklist, and if there's a hit, it would
>   substitute the body of the message with something like "Content unavailable
>   due to take-down notice. Contact postmaster at python.org for detail."

Yeah, kind of provide a stand-in for the real message. UpLib re-threads
if you modify the corpus, so removal is automatic. It also includes a
capability to "replace" the content of an existing document, which
sounds like what you'd want for the above.

> * Email address obfuscation. Obviously we'd want to support that, but using
>   what algorithm? xxx'ing out the domain? Using a central forwarding
>   service? How do we recognize email addresses?

I don't obfuscate anything, really. But this is an issue for a public
Web UI design, I think.

> * Send-me-this-message button. I do a Google search and find a message in an
>   archive from 4 years ago. It's relevant to the problem I'm now having and
>   I'd like to respond to it in my normal email reader. Maybe IMAP/NNTP is the
>   right way to go, or there could be a button to allow the user to forward the
>   message to herself.

Nice idea. I've got an extension which (sort of) supports this (you can
email a copy of any document to anyone), and the user can define new
buttons to add to the UI in her config file. I, for instance, added a
button which shows me all the email threads which have been updated
today:

Today's Mail, /action/basic/email_threads?query=date:today+$email, _blank

Of course, the normal UpLib Web UI puts appropriate "mailto:" links
around people's names, and adds "Reply-To" and "Reply-To-All" links to
the message. Just click on that and it opens up in your MUA.

> Interested to hear your thoughts.
> This would be a cool project to work on,
> and maybe we should also engage mailman-developers. Thanks for releasing it
> under the GPL[*].

Well, there's lots to do :-). The current IMAP server, for instance, is
more about getting the IMAP protocol right than it is efficiency. When
you go into python-dev size archives without breaking it into chunks
(like the per-month view in Mailman archives), it poops out. Shouldn't
do that.

My normal development process is to write any new code as an UpLib
extension, then if it works I eventually fold it into the codebase.
Extensions are easy to add (just plunk them in a directory, and point
the repository at that directory), and there are a number of examples
included with the source code. The IMAP server is an extension, for
instance.

Bill

>
> -Barry
>
> [*] While GPLv2 would be incompatible with Mailman 3's GPLv3, I don't think it
> matters. The two systems will be lightly connected, though we would have to
> think about the integration points. MM3 has a plugin system for archivers.
> _______________________________________________
> Archiver-dev mailing list
> Archiver-dev at python.org
> http://mail.python.org/mailman/listinfo/archiver-dev

From earl at earlhood.com Tue Oct 19 03:56:46 2010
From: earl at earlhood.com (Earl Hood)
Date: Mon, 18 Oct 2010 20:56:46 -0500
Subject: [Archiver-dev] UpLib and archiving
In-Reply-To: <92627.1287450806@parc.com>
References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission> <92627.1287450806@parc.com>
Message-ID:

On Mon, Oct 18, 2010 at 8:13 PM, Bill Janssen wrote:
>> And that is *very* cool. How do you handle security issues, i.e. html parts
>> with evil content (javascript) or Content-Disposition filenames that lie about
>> their type?
>
> I don't run Javascript, so evil parts get to stick around and
> potentially do damage in the future.
> UpLib has an extensible system of
> data analysis engines called "rippers" which are automatically run on
> each document; if you were concerned about the possibility of lingering
> malware a malware-detector ripper could be added to flag and/or remove
> such content.

Sanitizing javascript should be the default behavior. This is a
major XSS exploit, and if you want others to utilize your software
for their sites, they will open their site to XSS if this
is not done.

> As for the Content-Disposition filenames: UpLib runs its own
> content-type determiner over the content to try to see what it is rather
> than just relying on the filename, though it will fall back to the
> filename if it can't figure it out. And I've hardcoded some typical
> situations.

Falling back to filename should be a configurable option, and
it should be disabled by default. If you are really paranoid about
security, you should have a whitelist of filename extensions that
you allow. At a minimum, at least have a list of extensions
that are forbidden (e.g. .shtml, .cgi).

IMO, the content-type should be the authoritative source of what
the type of file is, but scanning the data is reasonable depending
on how robust it is. Attackers are known to give incorrect values
in an attempt to fool email processors, but such attempts are
usually done with the content-disposition filename parameter since
some popular MUAs display it to the user, which can mislead them
on its true contents.

I recommend that all attachments be saved into an attachments
area so you can place restrictive web server configuration
settings on it. This approach assumes you serve up attachment
data directly via the file system via standard HTTP server
retrieval. If you serve up attachments via custom
web service (e.g. servlet, CGI), then filenaming concerns of
attachments are not as critical.

>> * Email address obfuscation. Obviously we'd want to support that, but using
>>   what algorithm? xxx'ing out the domain?
>>   Using a central forwarding
>>   service? How do we recognize email addresses?
>
> I don't obfuscate anything, really. But this is an issue for a public
> Web UI design, I think.

If your plugin architecture has the support for data filtering
step before archiving, this could be done with plugins.

Or, if you have plugins that allow the filter of content on
retrieval, that may be better. This way the stored data
still has the email addresses intact, but they get obfuscated
on rendering.

--ewh

From janssen at parc.com Tue Oct 19 08:18:39 2010
From: janssen at parc.com (Bill Janssen)
Date: Mon, 18 Oct 2010 23:18:39 PDT
Subject: [Archiver-dev] UpLib and archiving
In-Reply-To:
References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission> <92627.1287450806@parc.com>
Message-ID: <96270.1287469119@parc.com>

Earl, good input.

Earl Hood wrote:
> Sanitizing javascript should be the default behavior. This is a
> major XSS exploit, and if you want others to utilize your software
> for their sites, they will open their site to XSS if this
> is not done.

Probably not, because this isn't a typical Web server.

> > As for the Content-Disposition filenames: UpLib runs its own
> > content-type determiner over the content to try to see what it is rather
> > than just relying on the filename, though it will fall back to the
> > filename if it can't figure it out. And I've hardcoded some typical
> > situations.
>
> Falling back to filename should be a configurable option, and
> it should be disabled by default.

That could easily be done.

> I recommend that all attachments be saved into an attachments
> area so you can place restrictive web server configuration
> settings on it. This approach assumes you serve up attachment
> data directly via the file system via standard HTTP server
> retrieval. If you serve up attachments via custom
> web service (e.g. servlet, CGI), then filenaming concerns of
> attachments are not as critical.

Right.
The HTTP interface to UpLib is not a typical Web server -- it doesn't
allow direct access to files. All access is mediated through my code.

> >> * Email address obfuscation. Obviously we'd want to support that, but using
> >>   what algorithm? xxx'ing out the domain? Using a central forwarding
> >>   service? How do we recognize email addresses?
> >
> > I don't obfuscate anything, really. But this is an issue for a public
> > Web UI design, I think.
>
> If your plugin architecture has the support for data filtering
> step before archiving, this could be done with plugins.
>
> Or, if you have plugins that allow the filter of content on
> retrieval, that may be better. This way the stored data
> still has the email addresses intact, but they get obfuscated
> on rendering.

Right, that's how I'd do it. Though I do think input obfuscation could
be done without losing much, if anything. UpLib typically serves up a
rendering of a document, rather than the actual bits (though there is an
API to retrieve the actual bits). The current rendering for email
messages doesn't obfuscate addresses by design, but it does typically
obfuscate them a bit for clarity -- email addresses were never designed
to be shown to the user.

Bill

From barry at python.org Tue Oct 19 20:46:34 2010
From: barry at python.org (Barry Warsaw)
Date: Tue, 19 Oct 2010 14:46:34 -0400
Subject: [Archiver-dev] UpLib and archiving
In-Reply-To: <92627.1287450806@parc.com>
References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission> <92627.1287450806@parc.com>
Message-ID: <20101019144634.2400f766@mission>

Leading question: is the source available in a vcs? I could only find tarball
downloads. To really get good collaboration, you really should make the vcs
publicly available. If it is available and I missed it on the download page,
sorry! Otherwise, please do add a link from that page.
If it's not publicly available, I'd of course recommend Bazaar and would be happy to help in making the code available on Launchpad and in Ubuntu. On Oct 18, 2010, at 06:13 PM, Bill Janssen wrote: >I've got the Web access and the IMAP support, but not NNTP -- never had >the need for it. Twisted seems to have support going forward that >Medusa no longer has, and at some point I plan to port UpLib from Medusa >to Twisted. NNTP does seem to be a dying art, probably for the dumb reason that Usenet is useless :). Gmane keeps it alive and as I mentioned, I'd love for an archiver to provide it (my mail reader supports NNTP). It may be that really solid read-only IMAP support solves that use case though. I think a Twisted port would open up lots of possible avenues for vending the content. >> What about plugins? There are a few areas that come to mind about things I'd >> like to have pluggable: >> >> * Content hyperlinks. Let's say you've got an archive for a -commits list. >> It would be nice to be able to dig out things like bug numbers and vcs >> revisions and hyperlink them to tracker or viewvcs pages. > >UpLib automatically recognizes and stores hyperlinks found in Word, PDF, >Powerpoint, etc. as part of the standard metadata extraction process. >There's also a ripper which recognizes URLs and stores them as links. > >In house, we have some support for entity-finding: person or corporation >or location names, dates, etc. They are also automatically turned into >links, and show up as hyperlinks in the Web and Java UI tools. What >you're suggesting is more of that. Yep. Pluggable entity finders is what I'm thinking about. E.g. for a python-dev archive, you'd probably search for strings like "issue XYZ" and "bug XYZ" and hyperlink them to the tracker issue. Architecturally, I've gone back and forth about where these types of transformations should go. Pipermail always had the view that these happened at message input time. 
What I mean is, when Mailman sends a message to the subscribers, it also "sends" the message to the archiver. The archiver immediately works out where in the thread it should go, and statically creates the rendered view (HTML) with any transformations done at that time. Pipermail is of course ancient, and 12 years ago it made sense to do all the processing upfront so that when someone wanted to view the page, it was super cheap. I'd always thought that it would be better to stitch together the final rendered page (with caching) at the time the page was requested. This would allow a site administer to invalidate the cache as an easy way to update the rendering rules (add, modify entity filters, dynamic take downs, updated style sheets, new obfuscation rules, etc.). >> * Take-down support. If a list admin wants to remove a posting, she should >> be able to do that without disrupting email threads or breaking URLs. >> One way I've thought about doing that is a dynamic rendering plugin that >> checked the to-be-displayed message against a blacklist, and if there's a >> hit, it would substitute the body of the message with something like >> "Content unavailable due to take-down notice. Contact >> postmaster at python.org for detail." > >Yeah, kind of provide a stand-in for the real message. UpLib re-threads >if you modify the corpus, so removal is automatic. It also includes a >capability to "replace" the content of an existing document, which >sounds like what you'd want for the above. Yep, that's what I'm thinking. There's a strong preference by postmasters not to modify the original message backing whatever is used to generate the displayed page. So a take-down is more like marking a message-id for wholesale replacement. >Nice idea. I've got an extension which (sort of) supports this (you can >email a copy of any document to anyone), and the user can define new >buttons to add to the UI in her config file. 
>I, for instance, added a
>button which shows me all the email threads which have been updated
>today:
>
>Today's Mail, /action/basic/email_threads?query=date:today+$email, _blank
>
>Of course, the normal UpLib Web UI puts appropriate "mailto:" links
>around people's names, and adds "Reply-To" and "Reply-To-All" links to
>the message. Just click on that and it opens up in your MUA.

Nice.

>> Interested to hear your thoughts. This would be a cool project to work on,
>> and maybe we should also engage mailman-developers. Thanks for releasing it
>> under the GPL[*].
>
>Well, there's lots to do :-). The current IMAP server, for instance, is
>more about getting the IMAP protocol right than it is efficiency. When
>you go into python-dev size archives without breaking it into chunks
>(like the per-month view in Mailman archives), it poops out. Shouldn't
>do that.
>
>My normal development process is to write any new code as an UpLib
>extension, then if it works I eventually fold it into the codebase.
>Extensions are easy to add (just plunk them in a directory, and point
>the repository at that directory), and there are a number of examples
>included with the source code. The IMAP server is an extension, for
>instance.

Cool. I think the first step is to make the vcs available. I'd like
to grab the source and start taking a look.

Cheers,
-Barry

From janssen at parc.com Tue Oct 19 21:38:37 2010
From: janssen at parc.com (Bill Janssen)
Date: Tue, 19 Oct 2010 12:38:37 PDT
Subject: [Archiver-dev] UpLib and archiving
In-Reply-To: <20101019144634.2400f766@mission>
References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission> <92627.1287450806@parc.com> <20101019144634.2400f766@mission>
Message-ID: <99375.1287517117@parc.com>

Barry Warsaw wrote:
> >I've got the Web access and the IMAP support, but not NNTP -- never had
> >the need for it. Twisted seems to have support going forward that
> >Medusa no longer has, and at some point I plan to port UpLib from Medusa
> >to Twisted.
>
> NNTP does seem to be a dying art, probably for the dumb reason that Usenet is
> useless :). Gmane keeps it alive and as I mentioned, I'd love for an archiver
> to provide it (my mail reader supports NNTP). It may be that really solid
> read-only IMAP support solves that use case though. I think a Twisted port
> would open up lots of possible avenues for vending the content.

To implement Twisted support, two modules would have to change:
uplib.angelHandler and uplib.startAngel. I believe all the Medusa code
is localized there. So two new modules (or an extension which
strategically replaces some of the functions and classes in those
modules using a before_repository_instantiation function) would be
necessary to run this under Twisted instead of Medusa.

The IMAP server has its own submodule, medusaHandler, which localizes
all the dependencies on Medusa. To port that to Twisted, a Twisted
version of that submodule would be needed.

> >> What about plugins? There are a few areas that come to mind about things I'd
> >> like to have pluggable:
> >>
> >> * Content hyperlinks. Let's say you've got an archive for a -commits list.
> >> It would be nice to be able to dig out things like bug numbers and vcs > >> revisions and hyperlink them to tracker or viewvcs pages. > > > >UpLib automatically recognizes and stores hyperlinks found in Word, PDF, > >Powerpoint, etc. as part of the standard metadata extraction process. > >There's also a ripper which recognizes URLs and stores them as links. > > > >In house, we have some support for entity-finding: person or corporation > >or location names, dates, etc. They are also automatically turned into > >links, and show up as hyperlinks in the Web and Java UI tools. What > >you're suggesting is more of that. > > Yep. Pluggable entity finders is what I'm thinking about. E.g. for a > python-dev archive, you'd probably search for strings like "issue XYZ" and > "bug XYZ" and hyperlink them to the tracker issue. Yep, that's the kind of thing the ripper architecture combined with the extensions architecture supports really well. The supplied NYTimes extension (extensions/NYTimes.py) is a good example. It provides both a custom DocumentParser specialized for NY Times Web articles, and a ripper that updates the standard metadata information with info gleaned from comments in the HTML of the article. > Architecturally, I've gone back and forth about where these types of > transformations should go. Pipermail always had the view that these happened > at message input time. What I mean is, when Mailman sends a message to the > subscribers, it also "sends" the message to the archiver. The archiver > immediately works out where in the thread it should go, and statically creates > the rendered view (HTML) with any transformations done at that time. > > Pipermail is of course ancient, and 12 years ago it made sense to do all the > processing upfront so that when someone wanted to view the page, it was super > cheap. I'd always thought that it would be better to stitch together the > final rendered page (with caching) at the time the page was requested. 
This > would allow a site administrator to invalidate the cache as an easy way to update > the rendering rules (add, modify entity filters, dynamic take downs, updated > style sheets, new obfuscation rules, etc.). Yep, I'm completely in the same mindset. UpLib originally built renderings of the document at ingestion time, to save time on the UI. For some docs, that's still appropriate, but for things like email thread display, that's given way to dynamically constructed views built when asked for. > >> * Take-down support. If a list admin wants to remove a posting, she should > >> be able to do that without disrupting email threads or breaking URLs. > >> One way I've thought about doing that is a dynamic rendering plugin that > >> checked the to-be-displayed message against a blacklist, and if there's a > >> hit, it would substitute the body of the message with something like > >> "Content unavailable due to take-down notice. Contact > >> postmaster at python.org for details." > > > >Yeah, kind of provide a stand-in for the real message. UpLib re-threads > >if you modify the corpus, so removal is automatic. It also includes a > >capability to "replace" the content of an existing document, which > >sounds like what you'd want for the above. > > Yep, that's what I'm thinking. There's a strong preference by postmasters not > to modify the original message backing whatever is used to generate the > displayed page. So a take-down is more like marking a message-id for > wholesale replacement. In UpLib, you just upload the replacement version with a specified metadata field, "replacement-contents-for", set to a doc ID, specifying that the current contents of that doc should be replaced by this new doc. That way the UpLib doc ID doesn't change, and structures built around that doc ID are still good. > Cool. I think the first step is to make the vcs available. I'd like > to grab the source and start taking a look. Why is a VCS necessary for that?
I usually just visit the tarball with Emacs :-). I'm not set up for that right now (and not sure I want to be), so I don't think that will happen in the near future. I still like using CVS (I know, I know, but I like the annotations on individual files) with point-release tarballs. I suppose each point-release tarball could be used to update a Mercurial VCS, right? Clearly needs more discussion. But the source is up there -- feel free to unpack it into something you're more comfortable with to look at it. Bill From jeff at jab.org Wed Oct 20 06:45:34 2010 From: jeff at jab.org (Jeff Breidenbach) Date: Tue, 19 Oct 2010 21:45:34 -0700 Subject: [Archiver-dev] UpLib and archiving In-Reply-To: <99375.1287517117@parc.com> References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission> <92627.1287450806@parc.com> <20101019144634.2400f766@mission> <99375.1287517117@parc.com> Message-ID: >The latest release includes new support for building very large archives. How large? (I'm imagining you holding your arms wide and saying "This large!") From janssen at parc.com Wed Oct 20 07:25:31 2010 From: janssen at parc.com (Bill Janssen) Date: Tue, 19 Oct 2010 22:25:31 PDT Subject: [Archiver-dev] UpLib and archiving In-Reply-To: References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission> <92627.1287450806@parc.com> <20101019144634.2400f766@mission> <99375.1287517117@parc.com> Message-ID: <1140.1287552331@parc.com> Jeff Breidenbach wrote: > >The latest release includes new support for building very large archives. > > How large? (I'm imagining you holding your arms wide and saying "This > large!") Exactly! The target is 100 million docs in one UpLib. So far, I've built repos with about 500K docs in them. CiteSeer, for instance, says that they have a 1.6 million document archive. How many mail messages do you have in mail-archive.com, 85 million?
Bill From jeff at jab.org Wed Oct 20 19:28:46 2010 From: jeff at jab.org (Jeff Breidenbach) Date: Wed, 20 Oct 2010 10:28:46 -0700 Subject: [Archiver-dev] UpLib and archiving In-Reply-To: <1140.1287552331@parc.com> References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission> <92627.1287450806@parc.com> <20101019144634.2400f766@mission> <99375.1287517117@parc.com> <1140.1287552331@parc.com> Message-ID: Correct, about 85 million messages and ~150 page views per second when things are busy. -Jeff From barry at python.org Wed Oct 20 19:51:13 2010 From: barry at python.org (Barry Warsaw) Date: Wed, 20 Oct 2010 13:51:13 -0400 Subject: [Archiver-dev] UpLib and archiving In-Reply-To: <99375.1287517117@parc.com> References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission> <92627.1287450806@parc.com> <20101019144634.2400f766@mission> <99375.1287517117@parc.com> Message-ID: <20101020135113.14f84290@mission> Hi Bill, thanks for the response. Let me address one issue first. On Oct 19, 2010, at 12:38 PM, Bill Janssen wrote: >Why is a VCS necessary for that? I usually just visit the tarball with >Emacs :-). Good thing you didn't say "vim" :) >I'm not set up for that right now (and not sure I want to be), so I >don't think that will happen in the near future. I still like using CVS >(I know, I know, but I like the annotations on individual files) with >point-release tarballs. I suppose each point-release tarball could be >used to update a Mercurial VCS, right? Clearly needs more discussion. > >But the source is up there -- feel free to unpack it into something >you're more comfortable with to look at it. If you're more comfortable with CVS, that's totally cool. One advantage of making the CVS (read-only) publicly available is that we can set up imports into other vcs's. For example, we could point Launchpad at your CVS repository and import the code into Bazaar.
I'll bet bitbucket can do something similar for Mercurial. The advantage of that then is that folks can use the vcs they are comfortable with but have access to all the revision history, more easily generate patches that won't conflict when you try to merge them, and more easily follow your development. The alternative is to manually import the tarballs into the vcs, and hope there are no conflicts, file moves, or other changes that are near impossible to make work manually. Another problem of course is that everyone will have to do that manually for the vcs of their choice. Or if we don't use a vcs and just hack the tarball, it will be pretty difficult to get you reliable patches that you can review and integrate. Not fun! If you made your CVS repo (read-only) available, I'd create a project on Launchpad for uplib, link it to your repo and bug tracker, and work to get the package into Debian and Ubuntu. Just something to think about. :) -Barry From janssen at parc.com Wed Oct 20 20:14:44 2010 From: janssen at parc.com (Bill Janssen) Date: Wed, 20 Oct 2010 11:14:44 PDT Subject: [Archiver-dev] UpLib and archiving In-Reply-To: References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission> <92627.1287450806@parc.com> <20101019144634.2400f766@mission> <99375.1287517117@parc.com> <1140.1287552331@parc.com> Message-ID: <4927.1287598484@parc.com> Jeff Breidenbach wrote: > Correct, about 85 million messages and ~150 page views per second when > things are busy. Any stats on new messages per day? I'd like to at least design a Hadoop-based UpLib deployment that would support something on the order of 100K adds per day.
Bill From jeff at jab.org Thu Oct 21 01:16:42 2010 From: jeff at jab.org (Jeff Breidenbach) Date: Wed, 20 Oct 2010 16:16:42 -0700 Subject: [Archiver-dev] UpLib and archiving In-Reply-To: <4927.1287598484@parc.com> References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission> <92627.1287450806@parc.com> <20101019144634.2400f766@mission> <99375.1287517117@parc.com> <1140.1287552331@parc.com> <4927.1287598484@parc.com> Message-ID: >Any stats on new messages per day? 60K yesterday. From janssen at parc.com Thu Oct 21 01:28:21 2010 From: janssen at parc.com (Bill Janssen) Date: Wed, 20 Oct 2010 16:28:21 PDT Subject: [Archiver-dev] UpLib and archiving In-Reply-To: References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission> <92627.1287450806@parc.com> <20101019144634.2400f766@mission> <99375.1287517117@parc.com> <1140.1287552331@parc.com> <4927.1287598484@parc.com> Message-ID: <12681.1287617301@parc.com> Jeff Breidenbach wrote: > >Any stats on new messages per day? > > 60K yesterday. OK, so I'm on target. Good to know. Bill
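Editorial sketch: two ideas from earlier in this thread, pluggable entity finders ("issue XYZ" / "bug XYZ" hyperlinked to a tracker) and take-downs applied at render time without touching the stored message, might look roughly like the Python 3 below. Everything here is an assumption made for illustration. The names render_body, link_issues, TRACKER_URL and the example.org URL are hypothetical, and none of this is UpLib's or Mailman's actual API.

```python
import html
import re

# Hypothetical tracker URL template; not taken from the thread.
TRACKER_URL = "https://bugs.example.org/issue{number}"

# Matches "issue 1234", "bug 1234" and "issue #1234", case-insensitively.
ISSUE_RE = re.compile(r"\b(issue|bug)\s+#?(\d+)\b", re.IGNORECASE)

TAKEDOWN_NOTICE = "Content unavailable due to take-down notice."


def link_issues(escaped_text):
    """Entity filter: wrap issue/bug references in links to the tracker."""
    def repl(match):
        url = TRACKER_URL.format(number=match.group(2))
        return '<a href="{}">{}</a>'.format(url, match.group(0))
    return ISSUE_RE.sub(repl, escaped_text)


def render_body(message_id, body, taken_down, filters=(link_issues,)):
    """Render a message body at request time.

    The stored message is never modified; a take-down only changes
    what is displayed for that message-id.
    """
    if message_id in taken_down:
        return html.escape(TAKEDOWN_NOTICE)
    rendered = html.escape(body)  # escape first, then add our own markup
    for entity_filter in filters:
        rendered = entity_filter(rendered)
    return rendered
```

Because the filters run when the page is requested, changing the filter list or the take-down set only requires invalidating whatever cache sits in front of render_body, which is the property discussed above.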