From janssen at parc.com  Sun Oct 17 23:01:12 2010
From: janssen at parc.com (Bill Janssen)
Date: Sun, 17 Oct 2010 14:01:12 PDT
Subject: [Archiver-dev] UpLib and archiving
Message-ID: <83790.1287349272@parc.com>

Just noticed this list, and thought I'd sign up.

I build the UpLib archive system, at http://uplib.parc.com/.

The latest release includes new support for building very large
archives.  UpLib has some support for email archiving already, including
thread analysis and a built-in IMAP server, but that support needs to be
re-worked for efficiency to support large archives.  So I'm thinking
about that just now.

Some topics:

1.  An email thread analysis library which works on a mixin, say
    ThreadableEmail, so that different email packages could use it.

2.  Support for multipart/related parsing.

3.  Indexing for search.  UpLib currently indexes email into PyLucene
    with the following fields:

      date (untokenized)
      contents (tokenized -- just the body text, not the headers)
      email-message-id (untokenized)
      email-guid (untokenized -- a hash of the message-id)
      email-subject (tokenized)
      email-from-name (tokenized, only used if present)
      email-from-address (untokenized)
      email-attachment-to (untokenized, for attachments, guid of message)
      email-thread-index (untokenized, thread ID)
      email-references (untokenized, zero or more email-guids)
      email-in-reply-to (untokenized, zero or more email-guids)
      email-recipient-names (untokenized [should be tokenized])
      email-recipients (untokenized -- who the message was sent to)

    Attachments are extracted, and indexed separately, with links from the
    attachment to the message, and links from the message to its
    attachments.  This is a nice feature of UpLib over more specifically
    mail-archiving systems -- it can also archive images, Word, PDF, etc.,
    and do proper metadata indexing on all of the various types.

    It also tries to leverage Lucene's multi-language support, by
    running a language guesser over the text of the email, and selecting
    the Lucene Analyzer which most closely matches that language.

    So, is this a good list of indexing fields?  Bad list?  Where does
    the Dublin Core factor into this?

4.  Archive server frameworks.  My IMAP server is currently built on top
    of Medusa, like the rest of UpLib.  No one's working on Medusa.

Bill

From barry at python.org  Mon Oct 18 20:28:59 2010
From: barry at python.org (Barry Warsaw)
Date: Mon, 18 Oct 2010 14:28:59 -0400
Subject: [Archiver-dev] UpLib and archiving
In-Reply-To: <83790.1287349272@parc.com>
References: <83790.1287349272@parc.com>
Message-ID: <20101018142859.4553f808@mission>

On Oct 17, 2010, at 02:01 PM, Bill Janssen wrote:

>I build the UpLib archive system, at http://uplib.parc.com/.
>
>The latest release includes new support for building very large
>archives.  UpLib has some support for email archiving already, including
>thread analysis and a built-in IMAP server, but that support needs to be
>re-worked for efficiency to support large archives.  So I'm thinking
>about that just now.

Very cool!  The state of the art in open source email archivers has been
stagnant for years.  I think a huge number of people would like to see a new
offering, but getting things off the ground has always been too daunting.
Maybe uplib will be the platform to build a nextgen email archiver on top of.

>1.  An email thread analysis library which works on a mixin, say
>    ThreadableEmail, so that different email packages could use it.

Which different email packages do you mean?  Is that "different versions of
the stdlib email package" or something else?

>2.  Support for multipart/related parsing.

That would be nice.

>3.  Indexing for search.  UpLib currently indexes email into PyLucene
>    with the following fields:
>
>      date (untokenized)
>      contents (tokenized -- just the body text, not the headers)

How does it (or how do you envision it) working with non-text/plain parts?

>      email-message-id (untokenized)
>      email-guid (untokenized -- a hash of the message-id)

There's also this: http://wiki.list.org/display/DEV/Stable+URLs

Stable URLs on archive regeneration is absolutely critical and predictable
URLs without communication between the MLM and archiver is highly desirable.
The algorithm is simple, but I don't know how that works with uplib's notion
of an email message's canonical URL.
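
For reference, the scheme on that page, as I understand it, is to take the
Message-ID, strip the angle brackets, SHA-1 it, and base32-encode the digest.
A minimal sketch follows.  The function names and the URL layout are made up
for illustration, and nothing here says how uplib computes its own email-guid.

```python
import base64
import hashlib


def stable_message_hash(message_id: str) -> str:
    """Return a base32-encoded SHA-1 digest of a Message-ID.

    Angle brackets and surrounding whitespace are stripped first, so
    "<abc@example.com>" and "abc@example.com" hash identically.
    """
    cleaned = message_id.strip()
    if cleaned.startswith("<") and cleaned.endswith(">"):
        cleaned = cleaned[1:-1]
    digest = hashlib.sha1(cleaned.encode("utf-8")).digest()
    # A SHA-1 digest is 20 bytes, which base32-encodes to exactly
    # 32 characters with no padding.
    return base64.b32encode(digest).decode("ascii")


def stable_url(archive_base: str, message_id: str) -> str:
    """Hypothetical URL layout: <archive_base>/message/<hash>."""
    return archive_base.rstrip("/") + "/message/" + stable_message_hash(message_id)
```

Because the hash depends only on the Message-ID, the MLM and the archiver can
each compute the same URL without talking to one another, and regenerating
the archive does not change it.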

>      email-subject (tokenized)
>      email-from-name (tokenized, only used if present)
>      email-from-address (untokenized)
>      email-attachment-to (untokenized, for attachments, guid of message)
>      email-thread-index (untokenized, thread ID)
>      email-references (untokenized, zero or more email-guids)
>      email-in-reply-to (untokenized, zero or more email-guids)
>      email-recipient-names (untokenized [should be tokenized])
>      email-recipients (untokenized -- who the message was sent to)
>
>    Attachments are extracted, and indexed separately, with links from the
>    attachment to the message, and links from the message to its
>    attachments.  This is a nice feature of UpLib over more specifically
>    mail-archiving systems -- it can also archive images, Word, PDF, etc.,
>    and do proper metadata indexing on all of the various types.

And that is *very* cool.  How do you handle security issues, i.e. html parts
with evil content (javascript) or Content-Disposition filenames that lie about
their type?

>    It also tries to leverage Lucene's multi-language support, by
>    running a language guesser over the text of the email, and selecting
>    the Lucene Analyzer which most closely matches that language.

Wow, neat!

>    So, is this a good list of indexing fields?  Bad list?  Where does
>    the Dublin Core factor into this?

It seems like a reasonably good start.  Dunno about Dublin Core.

>4.  Archive server frameworks.  My IMAP server is currently built on top
>    of Medusa, like the rest of UpLib.  No one's working on Medusa.

How hard would it be to slot in Twisted?  Something I've always wanted to see
was an archiver that supported IMAP, NNTP, and web access.  Twisted seems like
the obvious choice.

What about plugins?  There are a few areas that come to mind about things I'd
like to have pluggable:

* Content hyperlinks.  Let's say you've got an archive for a -commits list.
  It would be nice to be able to dig out things like bug numbers and vcs
  revisions and hyperlink them to tracker or viewvcs pages.

* Take-down support.  If a list admin wants to remove a posting, she should be
  able to do that without disrupting email threads or breaking URLs.  One way
  I've thought about doing that is a dynamic rendering plugin that checked the
  to-be-displayed message against a blacklist, and if there's a hit, it would
  substitute the body of the message with something like "Content unavailable
  due to take-down notice.  Contact postmaster at python.org for detail."

* Email address obfuscation.  Obviously we'd want to support that, but using
  what algorithm?  xxx'ing out the domain?  Using a central forwarding
  service?  How do we recognize email addresses?

* Send-me-this-message button.  I do a Google search and find a message in an
  archive from 4 years ago.  It's relevant to the problem I'm now having and
  I'd like to respond to it in my normal email reader.  Maybe IMAP/NNTP is the
  right way to go, or there could be a button to allow the user to forward the
  message to herself.

Interested to hear your thoughts.  This would be a cool project to work on,
and maybe we should also engage mailman-developers.  Thanks for releasing it
under the GPL[*].

-Barry

[*] While GPLv2 would be incompatible with Mailman 3's GPLv3, I don't think it
matters.  The two systems will be lightly connected, though we would have to
think about the integration points.  MM3 has a plugin system for archivers.

From janssen at parc.com  Tue Oct 19 03:13:26 2010
From: janssen at parc.com (Bill Janssen)
Date: Mon, 18 Oct 2010 18:13:26 PDT
Subject: [Archiver-dev] UpLib and archiving
In-Reply-To: <20101018142859.4553f808@mission>
References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission>
Message-ID: <92627.1287450806@parc.com>

Barry Warsaw <barry at python.org> wrote:

> On Oct 17, 2010, at 02:01 PM, Bill Janssen wrote:
> 
> >I build the UpLib archive system, at http://uplib.parc.com/.
> >
> >The latest release includes new support for building very large
> >archives.  UpLib has some support for email archiving already, including
> >thread analysis and a built-in IMAP server, but that support needs to be
> >re-worked for efficiency to support large archives.  So I'm thinking
> >about that just now.
>
> Very cool!  The state of the art in open source email archivers has been
> stagnant for years.  I think a huge number of people would like to see a new
> offering, but getting things off the ground has always been too daunting.
> Maybe uplib will be the platform to build a nextgen email archiver on top of.

UpLib is not specifically about email -- it's a general purpose digital
document archiver.  (But most email is not just about email, either.
See for instance
http://www.parc.com/content/attachments/email_habitat_exploration_4360_parc.pdf.)
I pour my email into it, so I've added some features to it to help with
the process of reading and finding email, like the threading code.  It
knows how to parse emails (using the email package) and render them, and
do threading, etc.

> 
> >1.  An email thread analysis library which works on a mixin, say
> >    ThreadableEmail, so that different email packages could use it.
> 
> Which different email packages do you mean?  Is that "different versions of
> the stdlib email package" or something else?

Different email archival implementations, I was thinking.  For instance,
I'm re-writing the thread support in UpLib to handle email thread
forests backed by an SQLite DB.  I'd like to be able to use a library to
deduce the threads, but keep them in my own format.  To implement the
two separate IMAP threading algorithms in RFC 5256, you need "Subject",
"Date", "Message-ID" (the *normalized* Message-ID), "References", and
"In-Reply-To".  So such a library would need an email Message type
which provides these fields in some fashion.
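
A rough sketch of what such a mixin might look like is below.  Everything in
it is hypothetical (the method names, the get_header hook, the DictMessage
demo class); it is not UpLib code.  The subject handling is only a crude
approximation of the base-subject extraction that RFC 5256 specifies.

```python
import re
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import List, Optional

_MSGID_RE = re.compile(r"<([^<>\s]+)>")
_REPLY_PREFIX_RE = re.compile(r"^\s*(?:(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)


class ThreadableEmail:
    """Mixin exposing the five fields a threading library would need.

    The host class only has to implement get_header(name).
    """

    def get_header(self, name: str) -> Optional[str]:
        raise NotImplementedError

    def thread_message_id(self) -> Optional[str]:
        # Normalized form: the id without its angle brackets.
        ids = _MSGID_RE.findall(self.get_header("Message-ID") or "")
        return ids[0] if ids else None

    def thread_parent_ids(self) -> List[str]:
        # References first, then In-Reply-To, de-duplicated in order.
        seen: List[str] = []
        for name in ("References", "In-Reply-To"):
            for msgid in _MSGID_RE.findall(self.get_header(name) or ""):
                if msgid not in seen:
                    seen.append(msgid)
        return seen

    def thread_base_subject(self) -> str:
        # Strip leading "Re:"/"Fwd:" runs, collapse whitespace, lowercase.
        subject = _REPLY_PREFIX_RE.sub("", self.get_header("Subject") or "")
        return " ".join(subject.split()).lower()

    def thread_date(self) -> Optional[datetime]:
        raw = self.get_header("Date")
        if not raw:
            return None
        try:
            return parsedate_to_datetime(raw)
        except (TypeError, ValueError):
            return None


class DictMessage(ThreadableEmail):
    """Tiny demo host class backed by a dict of headers."""

    def __init__(self, headers):
        self._headers = {k.lower(): v for k, v in headers.items()}

    def get_header(self, name):
        return self._headers.get(name.lower())
```

An archiver could mix ThreadableEmail into its own message or record type and
hand those objects to the threading library, while keeping the resulting
threads in whatever storage format it likes.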

In UpLib, I'd want these to be either subtypes of Document, or perhaps a
separate record type created solely for the purposes of threading.

> 
> >2.  Support for multipart/related parsing.
> 
> That would be nice.
> 
> >3.  Indexing for search.  UpLib currently indexes email into PyLucene
> >    with the following fields:
> >
> >      date (untokenized)
> >      contents (tokenized -- just the body text, not the headers)
> 
> How does it (or how do you envision it) working with non-text/plain parts?

It's got some code to deal with that.  If the Content-Type is non-text,
it just processes it as it would any other document of that
content-type.  Otherwise, if it's text/* or multipart, the email parsing
code picks one part, either text/plain or text/html (or a series of
such), and makes it the "main" part.  The text extracted from that is
the "contents".

Other parts are typically classified as "attachments", broken off, and
separately indexed.  So text in them is also indexed, but it's indexed
in the role of an attachment to a particular email message.  Attachments
show up as little icons in the visual rendering of the document.
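
The main-part/attachment split can be illustrated with the current stdlib
email API.  This is only a sketch of the idea, not UpLib's parser: the
function names are invented, and get_body() with a plain-then-html preference
is an assumption about which part should count as "main".

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def split_message(raw: bytes):
    """Return (contents, attachments) for a raw RFC 5322 message.

    contents is the text of the preferred body part ("" if none);
    attachments is a list of (filename, content_type, payload_bytes).
    """
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    contents = body.get_content() if body is not None else ""
    attachments = []
    for part in msg.iter_attachments():
        # decode=True undoes base64/quoted-printable; it returns None
        # for multipart containers, hence the fallback to b"".
        payload = part.get_payload(decode=True) or b""
        attachments.append((part.get_filename(), part.get_content_type(), payload))
    return contents, attachments


def example_message() -> bytes:
    """Build a small text message with one PDF-like attachment."""
    msg = EmailMessage()
    msg["Subject"] = "example"
    msg["From"] = "a@example.com"
    msg["To"] = "b@example.com"
    msg.set_content("hello body")
    msg.add_attachment(b"%PDF-1.4 fake", maintype="application",
                       subtype="pdf", filename="paper.pdf")
    return msg.as_bytes()
```

Each attachment tuple could then be indexed as its own document, carrying a
link back to the guid of the message it came from.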

> >      email-message-id (untokenized)
> >      email-guid (untokenized -- a hash of the message-id)
> 
> There's also this: http://wiki.list.org/display/DEV/Stable+URLs
> 
> Stable URLs on archive regeneration is absolutely critical and predictable
> URLs without communication between the MLM and archiver is highly desirable.
> The algorithm is simple, but I don't know how that works with uplib's notion
> of an email message's canonical URL.
> 
> >      email-subject (tokenized)
> >      email-from-name (tokenized, only used if present)
> >      email-from-address (untokenized)
> >      email-attachment-to (untokenized, for attachments, guid of message)
> >      email-thread-index (untokenized, thread ID)
> >      email-references (untokenized, zero or more email-guids)
> >      email-in-reply-to (untokenized, zero or more email-guids)
> >      email-recipient-names (untokenized [should be tokenized])
> >      email-recipients (untokenized -- who the message was sent to)
> >
> >    Attachments are extracted, and indexed separately, with links from the
> >    attachment to the message, and links from the message to its
> >    attachments.  This is a nice feature of UpLib over more specifically
> >    mail-archiving systems -- it can also archive images, Word, PDF, etc.,
> >    and do proper metadata indexing on all of the various types.
>
> And that is *very* cool.  How do you handle security issues, i.e. html parts
> with evil content (javascript) or Content-Disposition filenames that lie about
> their type?

I don't run Javascript, so evil parts get to stick around and
potentially do damage in the future.  UpLib has an extensible system of
data analysis engines called "rippers" which are automatically run on
each document; if you were concerned about the possibility of lingering
malware a malware-detector ripper could be added to flag and/or remove
such content.  I actually do process Javascript lightly to remove some
irritating pieces, like "capture" redirects.

As for the Content-Disposition filenames: UpLib runs its own
content-type determiner over the content to try to see what it is rather
than just relying on the filename, though it will fall back to the
filename if it can't figure it out.  And I've hardcoded some typical
situations.

> >    It also tries to leverage Lucene's multi-language support, by
> >    running a language guesser over the text of the email, and selecting
> >    the Lucene Analyzer which most closely matches that language.
> 
> Wow, neat!
> 
> >    So, is this a good list of indexing fields?  Bad list?  Where does
> >    the Dublin Core factor into this?
> 
> It seems like a reasonably good start.  Dunno about Dublin Core.
> 
> >4.  Archive server frameworks.  My IMAP server is currently built on top
> >    of Medusa, like the rest of UpLib.  No one's working on Medusa.
> 
> How hard would it be to slot in Twisted?  Something I've always wanted to see
> was an archiver that supported IMAP, NNTP, and web access.  Twisted seems like
> the obvious choice.

I've got the Web access and the IMAP support, but not NNTP -- never had
the need for it.  Twisted seems to have support going forward that
Medusa no longer has, and at some point I plan to port UpLib from Medusa
to Twisted.

> What about plugins?  There are a few areas that come to mind about things I'd
> like to have pluggable:
> 
> * Content hyperlinks.  Let's say you've got an archive for a -commits list.
>   It would be nice to be able to dig out things like bug numbers and vcs
>   revisions and hyperlink them to tracker or viewvcs pages.

UpLib automatically recognizes and stores hyperlinks found in Word, PDF,
Powerpoint, etc. as part of the standard metadata extraction process.
There's also a ripper which recognizes URLs and stores them as links.

In house, we have some support for entity-finding: person or corporation
or location names, dates, etc.  They are also automatically turned into
links, and show up as hyperlinks in the Web and Java UI tools.  What
you're suggesting is more of that.
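
For the bug-number case, a pattern-based finder might be enough.  The sketch
below is an illustration only: the tracker URL is a placeholder, the function
name is invented, and it is not written against UpLib's ripper interface.

```python
import html
import re

# Matches "issue 1234", "Issue #1234", "bug 42", and so on.
_ISSUE_RE = re.compile(r"\b(?:issue|bug)\s+#?(\d+)\b", re.IGNORECASE)

# Placeholder tracker location; a real deployment would configure this.
TRACKER_URL = "https://tracker.example.org/issue{number}"


def link_issues(text: str, tracker_url: str = TRACKER_URL) -> str:
    """Escape text for HTML, then wrap issue references in links."""
    escaped = html.escape(text)

    def _link(match):
        url = tracker_url.format(number=match.group(1))
        return '<a href="%s">%s</a>' % (url, match.group(0))

    return _ISSUE_RE.sub(_link, escaped)
```

Escaping before substitution keeps any markup in the original message inert
while leaving the generated anchors intact.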

> * Take-down support.  If a list admin wants to remove a posting, she should be
>   able to do that without disrupting email threads or breaking URLs.  One way
>   I've thought about doing that is a dynamic rendering plugin that checked the
>   to-be-displayed message against a blacklist, and if there's a hit, it would
>   substitute the body of the message with something like "Content unavailable
>   due to take-down notice.  Contact postmaster at python.org for detail."

Yeah, kind of provide a stand-in for the real message.  UpLib re-threads
if you modify the corpus, so removal is automatic.  It also includes a
capability to "replace" the content of an existing document, which
sounds like what you'd want for the above.

> * Email address obfuscation.  Obviously we'd want to support that, but using
>   what algorithm?  xxx'ing out the domain?  Using a central forwarding
>   service?  How do we recognize email addresses?

I don't obfuscate anything, really.  But this is an issue for a public
Web UI design, I think.

> * Send-me-this-message button.  I do a Google search and find a message in an
>   archive from 4 years ago.  It's relevant to the problem I'm now having and
>   I'd like to respond to it in my normal email reader.  Maybe IMAP/NNTP is the
>   right way to go, or there could be a button to allow the user to forward the
>   message to herself.

Nice idea.  I've got an extension which (sort of) supports this (you can
email a copy of any document to anyone), and the user can define new
buttons to add to the UI in her config file.  I, for instance, added a
button which shows me all the email threads which have been updated
today:

Today's Mail, /action/basic/email_threads?query=date:today+$email, _blank

Of course, the normal UpLib Web UI puts appropriate "mailto:" links
around people's names, and adds "Reply-To" and "Reply-To-All" links to
the message.  Just click on that and it opens up in your MUA.

> Interested to hear your thoughts.  This would be a cool project to work on,
> and maybe we should also engage mailman-developers.  Thanks for releasing it
> under the GPL[*].

Well, there's lots to do :-).  The current IMAP server, for instance, is
more about getting the IMAP protocol right than it is efficiency.  When
you go into python-dev size archives without breaking it into chunks
(like the per-month view in Mailman archives), it poops out.  Shouldn't
do that.

My normal development process is to write any new code as an UpLib
extension, then if it works I eventually fold it into the codebase.
Extensions are easy to add (just plunk them in a directory, and point
the repository at that directory), and there are a number of examples
included with the source code.  The IMAP server is an extension, for
instance.

Bill

> 
> -Barry
> 
> [*] While GPLv2 would be incompatible with Mailman 3's GPLv3, I don't think it
> matters.  The two systems will be lightly connected, though we would have to
> think about the integration points.  MM3 has a plugin system for archivers.

From earl at earlhood.com  Tue Oct 19 03:56:46 2010
From: earl at earlhood.com (Earl Hood)
Date: Mon, 18 Oct 2010 20:56:46 -0500
Subject: [Archiver-dev] UpLib and archiving
In-Reply-To: <92627.1287450806@parc.com>
References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission>
	<92627.1287450806@parc.com>
Message-ID: <AANLkTimoM7pVOjgEx5d0pWZfyfwBaPWQNFj=qcQQSJta@mail.gmail.com>

On Mon, Oct 18, 2010 at 8:13 PM, Bill Janssen <janssen at parc.com> wrote:

>> And that is *very* cool.  How do you handle security issues, i.e. html parts
>> with evil content (javascript) or Content-Disposition filenames that lie about
>> their type?
>
> I don't run Javascript, so evil parts get to stick around and
> potentially do damage in the future.  UpLib has an extensible system of
> data analysis engines called "rippers" which are automatically run on
> each document; if you were concerned about the possibility of lingering
> malware a malware-detector ripper could be added to flag and/or remove
> such content.

Sanitizing javascript should be the default behavior.  This is a
major XSS exploit, and if you want others to utilize your software
for their sites, they will open their site to XSS if this
is not done.

> As for the Content-Disposition filenames: UpLib runs its own
> content-type determiner over the content to try to see what it is rather
> than just relying on the filename, though it will fall back to the
> filename if it can't figure it out.  And I've hardcoded some typical
> situations.

Falling back to filename should be a configurable option, and
it should be disabled by default.

If you are really paranoid about security, you should have
a whitelist of filename extensions that you allow.
At a minimum, at least have a list of extensions that are
forbidden (e.g. .shtml, .cgi).
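
A minimal sketch of that policy is below: sniff the leading bytes first, and
consult the filename only when a flag (off by default) allows it and the
extension is not on a forbidden list.  The magic-number table is deliberately
tiny, and the function name and flag are hypothetical, not UpLib's.

```python
import mimetypes
import os

# A few well-known file signatures; a real detector would know many more.
_MAGIC = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
]

FORBIDDEN_EXTENSIONS = {".shtml", ".cgi"}
UNKNOWN = "application/octet-stream"


def guess_content_type(data: bytes, filename=None,
                       allow_filename_fallback: bool = False) -> str:
    """Guess a content type from the bytes, optionally from the name."""
    for magic, content_type in _MAGIC:
        if data.startswith(magic):
            return content_type
    if allow_filename_fallback and filename:
        extension = os.path.splitext(filename)[1].lower()
        if extension not in FORBIDDEN_EXTENSIONS:
            guessed, _encoding = mimetypes.guess_type(filename)
            if guessed:
                return guessed
    # Nothing trustworthy: treat the content as opaque bytes.
    return UNKNOWN
```

With the default settings a PDF named invoice.txt is still reported as
application/pdf, and unrecognized content stays application/octet-stream no
matter what its filename claims.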

IMO, the content-type should be the authoritative source
of what the type of file is, but scanning the data is
reasonable depending on how robust it is.  Attackers are known to
give incorrect values in an attempt to fool email processors,
but such attempts are usually done with the content-disposition
filename parameter since some popular MUAs display it to
the user, which can mislead them on its true contents.

I recommend that all attachments be saved into an attachments
area so you can place restrictive web server configuration
settings on it.  This approach assumes you serve up attachment
data directly via the file system via standard HTTP server
retrieval.  If you serve up attachments via custom
web service (e.g. servlet, CGI), then filenaming concerns of
attachments are not as critical.

>> * Email address obfuscation.  Obviously we'd want to support that, but using
>>   what algorithm?  xxx'ing out the domain?  Using a central forwarding
>>   service?  How do we recognize email addresses?
>
> I don't obfuscate anything, really.  But this is an issue for a public
> Web UI design, I think.

If your plugin architecture has the support for data filtering
step before archiving, this could be done with plugins.

Or, if you have plugins that allow the filter of content on
retrieval, that may be better.  This way the stored data
still has the email addresses intact, but they get obfuscated
on rendering.
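
As a sketch of the render-time approach, assuming the "xxx out the domain"
style mentioned earlier in the thread: the pattern below is a crude address
recognizer, not an RFC 5322 parser, and it will also catch Message-IDs that
happen to look like addresses.  The function names are hypothetical.

```python
import html
import re

# Crude recognizer: local part, "@", dotted domain.
_EMAIL_RE = re.compile(
    r"\b([A-Za-z0-9._%+-]+)@([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*)\b"
)


def obfuscate_addresses(text: str) -> str:
    """Replace the domain of anything that looks like an address."""
    return _EMAIL_RE.sub(lambda match: match.group(1) + "@xxx", text)


def render_body(stored_text: str) -> str:
    """Produce display HTML; the stored text itself is never modified."""
    return html.escape(obfuscate_addresses(stored_text))
```

Since only the rendering changes, the obfuscation rule can be swapped later
without touching the archived messages.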

--ewh

From janssen at parc.com  Tue Oct 19 08:18:39 2010
From: janssen at parc.com (Bill Janssen)
Date: Mon, 18 Oct 2010 23:18:39 PDT
Subject: [Archiver-dev] UpLib and archiving
In-Reply-To: <AANLkTimoM7pVOjgEx5d0pWZfyfwBaPWQNFj=qcQQSJta@mail.gmail.com>
References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission>
	<92627.1287450806@parc.com>
	<AANLkTimoM7pVOjgEx5d0pWZfyfwBaPWQNFj=qcQQSJta@mail.gmail.com>
Message-ID: <96270.1287469119@parc.com>

Earl, good input.

Earl Hood <earl at earlhood.com> wrote:

> Sanitizing javascript should be the default behavior.  This is a
> major XSS exploit, and if you want others to utilize your software
> for their sites, they will open their site to XSS if this
> is not done.

Probably not, because this isn't a typical Web server.

> > As for the Content-Disposition filenames: UpLib runs its own
> > content-type determiner over the content to try to see what it is rather
> > than just relying on the filename, though it will fall back to the
> > filename if it can't figure it out.  And I've hardcoded some typical
> > situations.
> 
> Falling back to filename should be a configurable option, and
> it should be disabled by default.

That could easily be done.

> I recommend that all attachments be saved into an attachments
> area so you can place restrictive web server configuration
> settings on it.  This approach assumes you serve up attachment
> data directly via the file system via standard HTTP server
> retrieval.  If you serve up attachments via custom
> web service (e.g. servlet, CGI), then filenaming concerns of
> attachments are not as critical.

Right.  The HTTP interface to UpLib is not a typical Web server -- it
doesn't allow direct access to files.  All access is mediated through my
code.

> >> * Email address obfuscation.  Obviously we'd want to support that, but using
> >>   what algorithm?  xxx'ing out the domain?  Using a central forwarding
> >>   service?  How do we recognize email addresses?
> >
> > I don't obfuscate anything, really.  But this is an issue for a public
> > Web UI design, I think.
> 
> If your plugin architecture has the support for data filtering
> step before archiving, this could be done with plugins.
> 
> Or, if you have plugins that allow the filter of content on
> retrieval, that may be better.  This way the stored data
> still has the email addresses intact, but they get obfuscated
> on rendering.

Right, that's how I'd do it.  Though I do think input obfuscation could
be done without losing much, if anything.  UpLib typically serves up a
rendering of a document, rather than the actual bits (though there is an
API to retrieve the actual bits).  The current rendering for email
messages doesn't obfuscate addresses by design, but it does typically
obfuscate them a bit for clarity -- email addresses were never designed
to be shown to the user.

Bill

From barry at python.org  Tue Oct 19 20:46:34 2010
From: barry at python.org (Barry Warsaw)
Date: Tue, 19 Oct 2010 14:46:34 -0400
Subject: [Archiver-dev] UpLib and archiving
In-Reply-To: <92627.1287450806@parc.com>
References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission>
	<92627.1287450806@parc.com>
Message-ID: <20101019144634.2400f766@mission>

Leading question: is the source available in a vcs?  I could only find tarball
downloads.  To really get good collaboration, you really should make the vcs
publicly available.

If it is available and I missed it on the download page, sorry!  Otherwise,
please do add a link from that page.  If it's not publicly available, I'd of
course recommend Bazaar and would be happy to help in making the code
available on Launchpad and in Ubuntu.

On Oct 18, 2010, at 06:13 PM, Bill Janssen wrote:

>I've got the Web access and the IMAP support, but not NNTP -- never had
>the need for it.  Twisted seems to have support going forward that
>Medusa no longer has, and at some point I plan to port UpLib from Medusa
>to Twisted.

NNTP does seem to be a dying art, probably for the dumb reason that Usenet is
useless :).  Gmane keeps it alive and as I mentioned, I'd love for an archiver
to provide it (my mail reader supports NNTP).  It may be that really solid
read-only IMAP support solves that use case though.  I think a Twisted port
would open up lots of possible avenues for vending the content.

>> What about plugins?  There are a few areas that come to mind about things I'd
>> like to have pluggable:
>> 
>> * Content hyperlinks.  Let's say you've got an archive for a -commits list.
>>   It would be nice to be able to dig out things like bug numbers and vcs
>>   revisions and hyperlink them to tracker or viewvcs pages.
>
>UpLib automatically recognizes and stores hyperlinks found in Word, PDF,
>Powerpoint, etc. as part of the standard metadata extraction process.
>There's also a ripper which recognizes URLs and stores them as links.
>
>In house, we have some support for entity-finding: person or corporation
>or location names, dates, etc.  They are also automatically turned into
>links, and show up as hyperlinks in the Web and Java UI tools.  What
>you're suggesting is more of that.

Yep.  Pluggable entity finders is what I'm thinking about.  E.g. for a
python-dev archive, you'd probably search for strings like "issue XYZ" and
"bug XYZ" and hyperlink them to the tracker issue.

Architecturally, I've gone back and forth about where these types of
transformations should go.  Pipermail always had the view that these happened
at message input time.  What I mean is, when Mailman sends a message to the
subscribers, it also "sends" the message to the archiver.  The archiver
immediately works out where in the thread it should go, and statically creates
the rendered view (HTML) with any transformations done at that time.

Pipermail is of course ancient, and 12 years ago it made sense to do all the
processing upfront so that when someone wanted to view the page, it was super
cheap.  I'd always thought that it would be better to stitch together the
final rendered page (with caching) at the time the page was requested.  This
would allow a site administrator to invalidate the cache as an easy way to update
the rendering rules (add, modify entity filters, dynamic take downs, updated
style sheets, new obfuscation rules, etc.).
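
A minimal sketch of that idea is below: pages are rendered on first request,
cached by message-id, and thrown away wholesale when the rules change.  The
class and method names are invented for illustration, and a real
implementation would want cache size limits and persistence.

```python
from typing import Callable, Dict


class RenderCache:
    """Cache rendered pages by message-id; rule changes flush everything."""

    def __init__(self, render: Callable[[str], str]) -> None:
        self._render = render
        self._pages: Dict[str, str] = {}

    def page_for(self, message_id: str, stored_text: str) -> str:
        # Render lazily, at request time, and remember the result.
        if message_id not in self._pages:
            self._pages[message_id] = self._render(stored_text)
        return self._pages[message_id]

    def invalidate(self) -> None:
        # Drop every cached page; each is re-rendered on next request.
        self._pages.clear()

    def set_renderer(self, render: Callable[[str], str]) -> None:
        # New rendering rules make all cached pages stale.
        self._render = render
        self.invalidate()
```

The stored messages are never rewritten; only the cached output is discarded,
so a rules update costs one re-render per page actually requested.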

>> * Take-down support.  If a list admin wants to remove a posting, she should
>>   be able to do that without disrupting email threads or breaking URLs.
>>   One way I've thought about doing that is a dynamic rendering plugin that
>>   checked the to-be-displayed message against a blacklist, and if there's a
>>   hit, it would substitute the body of the message with something like
>>   "Content unavailable due to take-down notice.  Contact
>>   postmaster at python.org for detail."
>
>Yeah, kind of provide a stand-in for the real message.  UpLib re-threads
>if you modify the corpus, so removal is automatic.  It also includes a
>capability to "replace" the content of an existing document, which
>sounds like what you'd want for the above.

Yep, that's what I'm thinking.  There's a strong preference by postmasters not
to modify the original message backing whatever is used to generate the
displayed page.  So a take-down is more like marking a message-id for
wholesale replacement.
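
In code, the marking approach might look like the sketch below.  The class
name and its methods are hypothetical; the point is only that the stored
message is left alone and the substitution happens when the page is built.

```python
TAKEDOWN_NOTICE = "Content unavailable due to take-down notice."


class TakedownList:
    """Set of message-ids whose bodies are replaced at display time."""

    def __init__(self) -> None:
        self._blocked = set()

    def take_down(self, message_id: str) -> None:
        self._blocked.add(message_id)

    def restore(self, message_id: str) -> None:
        # discard() is a no-op if the id was never blocked.
        self._blocked.discard(message_id)

    def body_for_display(self, message_id: str, stored_body: str) -> str:
        # Threading and URLs are untouched; only the body is swapped.
        if message_id in self._blocked:
            return TAKEDOWN_NOTICE
        return stored_body
```

Because the message keeps its place in the thread and its message-id, replies
still attach to it and existing URLs keep resolving.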

>Nice idea.  I've got an extension which (sort of) supports this (you can
>email a copy of any document to anyone), and the user can define new
>buttons to add to the UI in her config file.  I, for instance, added a
>button which shows me all the email threads which have been updated
>today:
>
>Today's Mail, /action/basic/email_threads?query=date:today+$email, _blank
>
>Of course, the normal UpLib Web UI puts appropriate "mailto:" links
>around people's names, and adds "Reply-To" and "Reply-To-All" links to
>the message.  Just click on that and it opens up in your MUA.

Nice.

>> Interested to hear your thoughts.  This would be a cool project to work on,
>> and maybe we should also engage mailman-developers.  Thanks for releasing it
>> under the GPL[*].
>
>Well, there's lots to do :-).  The current IMAP server, for instance, is
>more about getting the IMAP protocol right than it is efficiency.  When
>you go into python-dev size archives without breaking it into chunks
>(like the per-month view in Mailman archives), it poops out.  Shouldn't
>do that.
>
>My normal development process is to write any new code as an UpLib
>extension, then if it works I eventually fold it into the codebase.
>Extensions are easy to add (just plunk them in a directory, and point
>the repository at that directory), and there are a number of examples
>included with the source code.  The IMAP server is an extension, for
>instance.

Cool.  I think the first step is to make the vcs available.  I'd like to grab
the source and start taking a look.

Cheers,
-Barry

From janssen at parc.com  Tue Oct 19 21:38:37 2010
From: janssen at parc.com (Bill Janssen)
Date: Tue, 19 Oct 2010 12:38:37 PDT
Subject: [Archiver-dev] UpLib and archiving
In-Reply-To: <20101019144634.2400f766@mission>
References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission>
	<92627.1287450806@parc.com> <20101019144634.2400f766@mission>
Message-ID: <99375.1287517117@parc.com>

Barry Warsaw <barry at python.org> wrote:

> >I've got the Web access and the IMAP support, but not NNTP -- never had
> >the need for it.  Twisted seems to have support going forward that
> >Medusa no longer has, and at some point I plan to port UpLib from Medusa
> >to Twisted.
> 
> NNTP does seem to be a dying art, probably for the dumb reason that Usenet is
> useless :).  Gmane keeps it alive and as I mentioned, I'd love for an archiver
> to provide it (my mail reader supports NNTP).  It may be that really solid
> read-only IMAP support solves that use case though.  I think a Twisted port
> would open up lots of possible avenues for vending the content.

To implement Twisted support, two modules would have to change:
uplib.angelHandler and uplib.startAngel.  I believe all the Medusa code
is localized there.  So two new modules (or an extension which
strategically replaces some of the functions and classes in those
modules using a before_repository_instantiation function) would be
necessary to run this under Twisted instead of Medusa.

The IMAP server has its own submodule, medusaHandler, which localizes
all the dependencies on Medusa.  To port that to Twisted, a Twisted
version of that submodule would be needed.

> >> What about plugins?  There are a few areas that come to mind about things I'd
> >> like to have pluggable:
> >> 
> >> * Content hyperlinks.  Let's say you've got an archive for a -commits list.
> >>   It would be nice to be able to dig out things like bug numbers and vcs
> >>   revisions and hyperlink them to tracker or viewvcs pages.
> >
> >UpLib automatically recognizes and stores hyperlinks found in Word, PDF,
> >Powerpoint, etc. as part of the standard metadata extraction process.
> >There's also a ripper which recognizes URLs and stores them as links.
> >
> >In house, we have some support for entity-finding: person or corporation
> >or location names, dates, etc.  They are also automatically turned into
> >links, and show up as hyperlinks in the Web and Java UI tools.  What
> >you're suggesting is more of that.
> 
> Yep.  Pluggable entity finders is what I'm thinking about.  E.g. for a
> python-dev archive, you'd probably search for strings like "issue XYZ" and
> "bug XYZ" and hyperlink them to the tracker issue.

Yep, that's the kind of thing the ripper architecture combined with the
extensions architecture supports really well.
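
A minimal sketch of the kind of entity finder being discussed, in plain
Python.  This is not UpLib's ripper API; the function name link_issues,
the pattern, and the tracker URL template are all assumptions made up
for illustration.  It finds "issue NNN" / "bug NNN" mentions and wraps
them in hyperlinks, escaping everything else for HTML.

```python
import html
import re

# Hypothetical tracker URL template; substitute whatever tracker the list uses.
TRACKER_URL = "https://tracker.example.org/issue{number}"

# Matches mentions such as "issue 1234" or "bug #1234", case-insensitively.
ISSUE_RE = re.compile(r"\b(issue|bug)\s+#?(\d+)\b", re.IGNORECASE)


def link_issues(text, url_template=TRACKER_URL):
    """Return HTML for `text` with issue/bug mentions wrapped in links."""
    parts = []
    last = 0
    for match in ISSUE_RE.finditer(text):
        # Escape the plain text between matches so it is safe inside HTML.
        parts.append(html.escape(text[last:match.start()]))
        url = url_template.format(number=match.group(2))
        parts.append('<a href="%s">%s</a>' % (
            html.escape(url, quote=True),
            html.escape(match.group(0)),
        ))
        last = match.end()
    # Escape whatever follows the final match (or the whole text if none).
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

A per-list plugin would presumably supply its own pattern and URL
template; a python-dev archive and a -commits archive would differ only
in those two values.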

The supplied NYTimes extension (extensions/NYTimes.py) is a good
example.  It provides both a custom DocumentParser specialized for NY
Times Web articles, and a ripper that updates the standard metadata
information with info gleaned from comments in the HTML of the article.

> Architecturally, I've gone back and forth about where these types of
> transformations should go.  Pipermail always had the view that these happened
> at message input time.  What I mean is, when Mailman sends a message to the
> subscribers, it also "sends" the message to the archiver.  The archiver
> immediately works out where in the thread it should go, and statically creates
> the rendered view (HTML) with any transformations done at that time.
> 
> Pipermail is of course ancient, and 12 years ago it made sense to do all the
> processing upfront so that when someone wanted to view the page, it was super
> cheap.  I'd always thought that it would be better to stitch together the
> final rendered page (with caching) at the time the page was requested.  This
> would allow a site administrator to invalidate the cache as an easy way to update
> the rendering rules (add, modify entity filters, dynamic take downs, updated
> style sheets, new obfuscation rules, etc.).

Yep, I'm completely in the same mindset.  UpLib originally built
renderings of a document at ingestion time, to save time in the UI.  For
some docs that's still appropriate, but for things like email thread
display, that's given way to views constructed dynamically on request.
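
As a rough illustration of render-on-request with caching (a sketch
only, not how UpLib or Pipermail is implemented; the RenderCache class
and its method names are hypothetical), pages are rendered the first
time they are asked for, and changing the rendering rules throws the
cached pages away:

```python
class RenderCache:
    """Render pages on demand and cache them until the rules change."""

    def __init__(self, render):
        self._render = render   # callable: (message, rules) -> rendered str
        self._rules = {}        # current rendering rules (filters, styles...)
        self._cache = {}        # message id -> rendered page

    def set_rules(self, rules):
        # New rules make every cached page stale, so drop them all.
        self._rules = dict(rules)
        self._cache.clear()

    def invalidate(self, message_id):
        # Drop a single page, e.g. after a take-down of that message.
        self._cache.pop(message_id, None)

    def page(self, message_id, message):
        # Render only on a cache miss; otherwise serve the stored page.
        if message_id not in self._cache:
            self._cache[message_id] = self._render(message, self._rules)
        return self._cache[message_id]
```

The cost is one render on the first view after each rule change, in
exchange for never having to regenerate a static archive.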

> >> * Take-down support.  If a list admin wants to remove a posting, she should
> >>   be able to do that without disrupting email threads or breaking URLs.
> >>   One way I've thought about doing that is a dynamic rendering plugin that
> >>   checked the to-be-displayed message against a blacklist, and if there's a
> >>   hit, it would substitute the body of the message with something like
> >>   "Content unavailable due to take-down notice.  Contact
> >>   postmaster at python.org for details."
> >
> >Yeah, kind of provide a stand-in for the real message.  UpLib re-threads
> >if you modify the corpus, so removal is automatic.  It also includes a
> >capability to "replace" the content of an existing document, which
> >sounds like what you'd want for the above.
> 
> Yep, that's what I'm thinking.  There's a strong preference by postmasters not
> to modify the original message backing whatever is used to generate the
> displayed page.  So a take-down is more like marking a message-id for
> wholesale replacement.

In UpLib, you just upload the replacement version with a specified
metadata field, "replacement-contents-for", set to a doc ID, specifying
that the current contents of that doc should be replaced by this new
doc.  That way the UpLib doc ID doesn't change, and structures built
around that doc ID are still good.
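
To make the ID-stability point concrete, here is a toy in-memory store
(a sketch under stated assumptions: the DocStore class and its methods
are hypothetical and are not UpLib's upload interface; only the
"replacement-contents-for" field name comes from the description
above).  An upload carrying that field overwrites the target's contents
and returns the same ID, so anything that refers to the ID keeps
working:

```python
import itertools

# Metadata field name taken from the description above.
REPLACEMENT_FIELD = "replacement-contents-for"


class DocStore:
    """Toy document store in which a replacement keeps the original ID."""

    def __init__(self):
        self._docs = {}                  # doc id -> contents
        self._ids = itertools.count(1)   # source of new doc ids

    def upload(self, contents, metadata=None):
        metadata = dict(metadata or {})
        target = metadata.pop(REPLACEMENT_FIELD, None)
        if target is not None:
            # Replacement: overwrite the contents, keep the existing ID.
            if target not in self._docs:
                raise KeyError("no such document: %r" % (target,))
            self._docs[target] = contents
            return target
        # Ordinary upload: allocate a fresh ID.
        doc_id = "doc-%d" % next(self._ids)
        self._docs[doc_id] = contents
        return doc_id

    def contents(self, doc_id):
        return self._docs[doc_id]
```

For a take-down, the replacement contents would be the stand-in notice,
uploaded with the field set to the ID of the message being withdrawn.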

> Cool.  I think the first step is to make the vcs available.  I'd like
> to grab the source and start taking a look.

Why is a VCS necessary for that?  I usually just visit the tarball with
Emacs :-).

I'm not set up for that right now (and not sure I want to be), so I
don't think that will happen in the near future.  I still like using CVS
(I know, I know, but I like the annotations on individual files) with
point-release tarballs.  I suppose each point-release tarball could be
used to update a Mercurial VCS, right?  Clearly needs more discussion.

But the source is up there -- feel free to unpack it into something
you're more comfortable with to look at it.

Bill

From jeff at jab.org  Wed Oct 20 06:45:34 2010
From: jeff at jab.org (Jeff Breidenbach)
Date: Tue, 19 Oct 2010 21:45:34 -0700
Subject: [Archiver-dev] UpLib and archiving
In-Reply-To: <99375.1287517117@parc.com>
References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission>
	<92627.1287450806@parc.com> <20101019144634.2400f766@mission>
	<99375.1287517117@parc.com>
Message-ID: <AANLkTikBUcXWruXLm1QNpVaZMM2o5oXp2qAT=RW7E3AE@mail.gmail.com>

>The latest release includes new support for building very large archives.

How large?  (I'm imagining you holding your arms wide and saying "This
large!")

From janssen at parc.com  Wed Oct 20 07:25:31 2010
From: janssen at parc.com (Bill Janssen)
Date: Tue, 19 Oct 2010 22:25:31 PDT
Subject: [Archiver-dev] UpLib and archiving
In-Reply-To: <AANLkTikBUcXWruXLm1QNpVaZMM2o5oXp2qAT=RW7E3AE@mail.gmail.com>
References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission>
	<92627.1287450806@parc.com> <20101019144634.2400f766@mission>
	<99375.1287517117@parc.com>
	<AANLkTikBUcXWruXLm1QNpVaZMM2o5oXp2qAT=RW7E3AE@mail.gmail.com>
Message-ID: <1140.1287552331@parc.com>

Jeff Breidenbach <jeff at jab.org> wrote:

> >The latest release includes new support for building very large archives.
> 
> How large?  (I'm imagining you holding your arms wide and saying "This
> large!")

Exactly!

The target is 100 million docs in one UpLib.  So far, I've built repos
with about 500K docs in them.  CiteSeer, for instance, says that they
have a 1.6 million document archive.

How many mail messages do you have in mail-archive.com, 85 million?

Bill

From jeff at jab.org  Wed Oct 20 19:28:46 2010
From: jeff at jab.org (Jeff Breidenbach)
Date: Wed, 20 Oct 2010 10:28:46 -0700
Subject: [Archiver-dev] UpLib and archiving
In-Reply-To: <1140.1287552331@parc.com>
References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission>
	<92627.1287450806@parc.com> <20101019144634.2400f766@mission>
	<99375.1287517117@parc.com>
	<AANLkTikBUcXWruXLm1QNpVaZMM2o5oXp2qAT=RW7E3AE@mail.gmail.com>
	<1140.1287552331@parc.com>
Message-ID: <AANLkTinBhMTBg7XDyi9_MC8iSeyf6OxiPr-Ha8vt6=qP@mail.gmail.com>

Correct, about 85 million messages and ~150 page views per second when
things are busy.

-Jeff

From barry at python.org  Wed Oct 20 19:51:13 2010
From: barry at python.org (Barry Warsaw)
Date: Wed, 20 Oct 2010 13:51:13 -0400
Subject: [Archiver-dev] UpLib and archiving
In-Reply-To: <99375.1287517117@parc.com>
References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission>
	<92627.1287450806@parc.com> <20101019144634.2400f766@mission>
	<99375.1287517117@parc.com>
Message-ID: <20101020135113.14f84290@mission>

Hi Bill, thanks for the response.  Let me address one issue first.

On Oct 19, 2010, at 12:38 PM, Bill Janssen wrote:

>Why is a VCS necessary for that?  I usually just visit the tarball with
>Emacs :-).

Good thing you didn't say "vim" :)

>I'm not set up for that right now (and not sure I want to be), so I
>don't think that will happen in the near future.  I still like using CVS
>(I know, I know, but I like the annotations on individual files) with
>point-release tarballs.  I suppose each point-release tarball could be
>used to update a Mercurial VCS, right?  Clearly needs more discussion.
>
>But the source is up there -- feel free to unpack it into something
>you're more comfortable with to look at it.

If you're more comfortable with CVS, that's totally cool.  One advantage of
making the CVS (read-only) publicly available is that we can set up imports
into other vcs's.  For example, we could point Launchpad at your CVS
repository and import the code into Bazaar.  I'll bet bitbucket can do
something similar for Mercurial.

The advantage of that then is that folks can use the vcs they are comfortable
with but have access to all the revision history, more easily generate patches
that won't conflict when you try to merge them, and more easily follow your
development.

The alternative is to manually import the tarballs into the vcs, and hope
there are no conflicts, file moves, or other changes that are near impossible
to make work manually.  Another problem of course is that everyone will have
to do that manually for the vcs of their choice.  Or if we don't use a vcs and
just hack the tarball, it will be pretty difficult to get you reliable patches
that you can review and integrate.  Not fun!

If you made your CVS repo (read-only) available, I'd create a project on
Launchpad for uplib, link it to your repo and bug tracker, and work to get the
package into Debian and Ubuntu.

Just something to think about. :)

-Barry

From janssen at parc.com  Wed Oct 20 20:14:44 2010
From: janssen at parc.com (Bill Janssen)
Date: Wed, 20 Oct 2010 11:14:44 PDT
Subject: [Archiver-dev] UpLib and archiving
In-Reply-To: <AANLkTinBhMTBg7XDyi9_MC8iSeyf6OxiPr-Ha8vt6=qP@mail.gmail.com>
References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission>
	<92627.1287450806@parc.com> <20101019144634.2400f766@mission>
	<99375.1287517117@parc.com>
	<AANLkTikBUcXWruXLm1QNpVaZMM2o5oXp2qAT=RW7E3AE@mail.gmail.com>
	<1140.1287552331@parc.com>
	<AANLkTinBhMTBg7XDyi9_MC8iSeyf6OxiPr-Ha8vt6=qP@mail.gmail.com>
Message-ID: <4927.1287598484@parc.com>

Jeff Breidenbach <jeff at jab.org> wrote:

> Correct, about 85 million messages and ~150 page views per second when
> things are busy.

Any stats on new messages per day?  I'd like to at least design a
Hadoop-based UpLib deployment that would support something on the order
of 100K adds per day.

Bill

From jeff at jab.org  Thu Oct 21 01:16:42 2010
From: jeff at jab.org (Jeff Breidenbach)
Date: Wed, 20 Oct 2010 16:16:42 -0700
Subject: [Archiver-dev] UpLib and archiving
In-Reply-To: <4927.1287598484@parc.com>
References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission>
	<92627.1287450806@parc.com> <20101019144634.2400f766@mission>
	<99375.1287517117@parc.com>
	<AANLkTikBUcXWruXLm1QNpVaZMM2o5oXp2qAT=RW7E3AE@mail.gmail.com>
	<1140.1287552331@parc.com>
	<AANLkTinBhMTBg7XDyi9_MC8iSeyf6OxiPr-Ha8vt6=qP@mail.gmail.com>
	<4927.1287598484@parc.com>
Message-ID: <AANLkTikSbZZNJuFjRs3Lim1J6M0afXjMjdMjGTyHenEt@mail.gmail.com>

>Any stats on new messages per day?

60K yesterday.

From janssen at parc.com  Thu Oct 21 01:28:21 2010
From: janssen at parc.com (Bill Janssen)
Date: Wed, 20 Oct 2010 16:28:21 PDT
Subject: [Archiver-dev] UpLib and archiving
In-Reply-To: <AANLkTikSbZZNJuFjRs3Lim1J6M0afXjMjdMjGTyHenEt@mail.gmail.com>
References: <83790.1287349272@parc.com> <20101018142859.4553f808@mission>
	<92627.1287450806@parc.com> <20101019144634.2400f766@mission>
	<99375.1287517117@parc.com>
	<AANLkTikBUcXWruXLm1QNpVaZMM2o5oXp2qAT=RW7E3AE@mail.gmail.com>
	<1140.1287552331@parc.com>
	<AANLkTinBhMTBg7XDyi9_MC8iSeyf6OxiPr-Ha8vt6=qP@mail.gmail.com>
	<4927.1287598484@parc.com>
	<AANLkTikSbZZNJuFjRs3Lim1J6M0afXjMjdMjGTyHenEt@mail.gmail.com>
Message-ID: <12681.1287617301@parc.com>

Jeff Breidenbach <jeff at jab.org> wrote:

> >Any stats on new messages per day?
> 
> 60K yesterday.

OK, so I'm on target.  Good to know.
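
A back-of-the-envelope check of those figures, assuming adds are spread
evenly over the day (real traffic is burstier, so this gives only
average rates; the helper names are made up for the calculation):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def adds_per_second(adds_per_day):
    """Average ingest rate, assuming adds are spread evenly over a day."""
    return adds_per_day / SECONDS_PER_DAY


def headroom(design_adds_per_day, observed_adds_per_day):
    """How many times the observed load the design target can absorb."""
    return design_adds_per_day / observed_adds_per_day


observed = 60_000    # messages added "yesterday", per the figure above
design = 100_000     # the design target of adds per day

print("observed: %.2f adds/sec" % adds_per_second(observed))
print("design:   %.2f adds/sec" % adds_per_second(design))
print("headroom: %.2fx" % headroom(design, observed))
```

So the 100K/day target averages a little over one add per second, and
covers the observed 60K/day with a margin of about two thirds.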

Bill