[Archiver-dev] UpLib and archiving

Tue Oct 19 21:38:37 CEST 2010

Barry Warsaw <barry at python.org> wrote:

> >I've got the Web access and the IMAP support, but not NNTP -- never had
> >the need for it.  Twisted seems to have support going forward that
> >Medusa no longer has, and at some point I plan to port UpLib from Medusa
> >to Twisted.
> 
> NNTP does seem to be a dying art, probably for the dumb reason that Usenet is
> useless :).  Gmane keeps it alive and as I mentioned, I'd love for an archiver
> to provide it (my mail reader supports NNTP).  It may be that really solid
> read-only IMAP support solves that use case though.  I think a Twisted port
> would open up lots of possible avenues for vending the content.

To implement Twisted support, two modules would have to change:
uplib.angelHandler and uplib.startAngel.  I believe all the Medusa code
is localized there.  So two new modules (or an extension which
strategically replaces some of the functions and classes in those
modules using a before_repository_instantiation function) would be
necessary to run this under Twisted instead of Medusa.

The IMAP server has its own submodule, medusaHandler, which localizes
all the dependencies on Medusa.  To port that to Twisted, a Twisted
version of that submodule would be needed.

> >> What about plugins?  There are a few areas that come to mind about things I'd
> >> like to have pluggable:
> >> 
> >> * Content hyperlinks.  Let's say you've got an archive for a -commits list.
> >>   It would be nice to be able to dig out things like bug numbers and vcs
> >>   revisions and hyperlink them to tracker or viewvcs pages.
> >
> >UpLib automatically recognizes and stores hyperlinks found in Word, PDF,
> >Powerpoint, etc. as part of the standard metadata extraction process.
> >There's also a ripper which recognizes URLs and stores them as links.
> >
> >In house, we have some support for entity-finding: person or corporation
> >or location names, dates, etc.  They are also automatically turned into
> >links, and show up as hyperlinks in the Web and Java UI tools.  What
> >you're suggesting is more of that.
> 
> Yep.  Pluggable entity finders is what I'm thinking about.  E.g. for a
> python-dev archive, you'd probably search for strings like "issue XYZ" and
> "bug XYZ" and hyperlink them to the tracker issue.

Yep, that's the kind of thing the ripper architecture combined with the
extensions architecture supports really well.
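
As a rough illustration of the "issue XYZ" / "bug XYZ" idea, here is a
minimal standalone sketch of a pluggable entity finder.  It does not use
the UpLib ripper or extension API; TRACKER_URL, find_issue_links and
render_with_links are hypothetical names made up for this example.

```python
import re
from html import escape

# Hypothetical tracker URL template; not taken from UpLib or Mailman.
TRACKER_URL = "https://bugs.example.org/issue{number}"

# Matches "issue 1234" or "bug 1234" (optionally "#1234"), case-insensitively.
ISSUE_RE = re.compile(r"\b(issue|bug)\s+#?(\d+)\b", re.IGNORECASE)


def find_issue_links(text):
    """Return (start, end, url) spans for each issue or bug reference."""
    return [
        (m.start(), m.end(), TRACKER_URL.format(number=m.group(2)))
        for m in ISSUE_RE.finditer(text)
    ]


def render_with_links(text, finders):
    """Escape text as HTML and wrap every span reported by the finders in a link."""
    spans = sorted(span for finder in finders for span in finder(text))
    out, pos = [], 0
    for start, end, url in spans:
        if start < pos:
            continue  # skip a span that overlaps one already linked
        out.append(escape(text[pos:start]))
        out.append('<a href="%s">%s</a>'
                   % (escape(url, quote=True), escape(text[start:end])))
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)
```

Each finder is just a callable returning spans, so a per-list
configuration could supply its own list of finders (vcs revisions, CVE
numbers, and so on) without touching the renderer.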

The supplied NYTimes extension (extensions/NYTimes.py) is a good
example.  It provides both a custom DocumentParser specialized for NY
Times Web articles, and a ripper that updates the standard metadata
information with info gleaned from comments in the HTML of the article.

> Architecturally, I've gone back and forth about where these types of
> transformations should go.  Pipermail always had the view that these happened
> at message input time.  What I mean is, when Mailman sends a message to the
> subscribers, it also "sends" the message to the archiver.  The archiver
> immediately works out where in the thread it should go, and statically creates
> the rendered view (HTML) with any transformations done at that time.
> 
> Pipermail is of course ancient, and 12 years ago it made sense to do all the
> processing upfront so that when someone wanted to view the page, it was super
> cheap.  I'd always thought that it would be better to stitch together the
> final rendered page (with caching) at the time the page was requested.  This
> would allow a site administrator to invalidate the cache as an easy way to update
> the rendering rules (add, modify entity filters, dynamic take downs, updated
> style sheets, new obfuscation rules, etc.).

Yep, I'm completely in the same mindset.  UpLib built renderings of the
document at ingestion time, to save time on the UI.  For some docs,
that's still appropriate, but for things like email thread display,
that's given way to dynamically constructed views built when asked for.
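
The render-on-request-with-caching approach can be sketched roughly as
follows.  This is an illustration only, not UpLib or Pipermail code; the
RenderCache class and the shape of the render callable are assumptions
made for the example.

```python
class RenderCache:
    """Cache pages rendered on request; changing the rules drops the cache."""

    def __init__(self, render):
        self._render = render   # hypothetical callable: (message, rules) -> str
        self._rules = {}
        self._pages = {}        # message_id -> rendered page

    def set_rules(self, rules):
        """Install new rendering rules and invalidate every cached page."""
        self._rules = dict(rules)
        self._pages.clear()

    def page(self, message_id, message):
        """Return the rendered page, building it only on a cache miss."""
        if message_id not in self._pages:
            self._pages[message_id] = self._render(message, self._rules)
        return self._pages[message_id]
```

The stored message is never touched; updating entity filters, style
sheets or obfuscation rules amounts to a set_rules() call, and pages are
rebuilt lazily the next time someone asks for them.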

> >> * Take-down support.  If a list admin wants to remove a posting, she should
> >>   be able to do that without disrupting email threads or breaking URLs.
> >>   One way I've thought about doing that is a dynamic rendering plugin that
> >>   checked the to-be-displayed message against a blacklist, and if there's a
> >>   hit, it would substitute the body of the message with something like
> >>   "Content unavailable due to take-down notice.  Contact
> >>   postmaster at python.org for detail."
> >
> >Yeah, kind of provide a stand-in for the real message.  UpLib re-threads
> >if you modify the corpus, so removal is automatic.  It also includes a
> >capability to "replace" the content of an existing document, which
> >sounds like what you'd want for the above.
> 
> Yep, that's what I'm thinking.  There's a strong preference by postmasters not
> to modify the original message backing whatever is used to generate the
> displayed page.  So a take-down is more like marking a message-id for
> wholesale replacement.

In UpLib, you just upload the replacement version with a specified
metadata field, "replacement-contents-for", set to a doc ID, specifying
that the current contents of that doc should be replaced by this new
doc.  That way the UpLib doc ID doesn't change, and structures built
around that doc ID are still good.
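
To make the idea concrete, here is a toy in-memory model of that
behavior.  It is not UpLib's repository code or upload API: the DocStore
class and its methods are hypothetical, and only the
"replacement-contents-for" field name comes from the description above.

```python
import itertools


class DocStore:
    """Toy in-memory document store; NOT UpLib's actual repository class."""

    def __init__(self):
        self._docs = {}
        self._ids = itertools.count(1)

    def upload(self, contents, metadata=None):
        """Store a document and return its ID.

        If metadata names an existing doc in "replacement-contents-for",
        swap that doc's contents in place and return the same ID.
        """
        metadata = dict(metadata or {})
        target = metadata.pop("replacement-contents-for", None)
        if target is not None:
            if target not in self._docs:
                raise KeyError("no such document: %s" % target)
            self._docs[target]["contents"] = contents
            return target
        doc_id = "doc-%d" % next(self._ids)
        self._docs[doc_id] = {"contents": contents, "metadata": metadata}
        return doc_id

    def contents(self, doc_id):
        return self._docs[doc_id]["contents"]
```

Anything that recorded the original ID (a thread index, a URL) keeps
pointing at the same document and simply sees the stand-in text.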

> Cool.  I think the first step is to make the vcs available.  I'd like
> to grab the source and start taking a look.

Why is a VCS necessary for that?  I usually just visit the tarball with
Emacs :-).

I'm not set up for that right now (and not sure I want to be), so I
don't think that will happen in the near future.  I still like using CVS
(I know, I know, but I like the annotations on individual files) with
point-release tarballs.  I suppose each point-release tarball could be
used to update a Mercurial VCS, right?  Clearly needs more discussion.

But the source is up there -- feel free to unpack it into something
you're more comfortable with to look at it.

Bill