[Archiver-dev] UpLib and archiving

Tue Oct 19 20:46:34 CEST 2010

Leading question: is the source available in a vcs?  I could only find tarball
downloads.  To get good collaboration, you really should make the vcs
publicly available.

If it is available and I missed it on the download page, sorry!  Otherwise,
please do add a link from that page.  If it's not publicly available, I'd of
course recommend Bazaar and would be happy to help in making the code
available on Launchpad and in Ubuntu.

On Oct 18, 2010, at 06:13 PM, Bill Janssen wrote:

>I've got the Web access and the IMAP support, but not NNTP -- never had
>the need for it.  Twisted seems to have support going forward that
>Medusa no longer has, and at some point I plan to port UpLib from Medusa
>to Twisted.

NNTP does seem to be a dying art, probably for the dumb reason that Usenet is
useless :).  Gmane keeps it alive and as I mentioned, I'd love for an archiver
to provide it (my mail reader supports NNTP).  It may be that really solid
read-only IMAP support solves that use case though.  I think a Twisted port
would open up lots of possible avenues for vending the content.

>> What about plugins?  There are a few areas that come to mind about things I'd
>> like to have pluggable:
>> 
>> * Content hyperlinks.  Let's say you've got an archive for a -commits list.
>>   It would be nice to be able to dig out things like bug numbers and vcs
>>   revisions and hyperlink them to tracker or viewvcs pages.
>
>UpLib automatically recognizes and stores hyperlinks found in Word, PDF,
>Powerpoint, etc. as part of the standard metadata extraction process.
>There's also a ripper which recognizes URLs and stores them as links.
>
>In house, we have some support for entity-finding: person or corporation
>or location names, dates, etc.  They are also automatically turned into
>links, and show up as hyperlinks in the Web and Java UI tools.  What
>you're suggesting is more of that.

Yep.  Pluggable entity finders is what I'm thinking about.  E.g. for a
python-dev archive, you'd probably search for strings like "issue XYZ" and
"bug XYZ" and hyperlink them to the tracker issue.

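To make that concrete, here's a rough sketch of what a pluggable finder might
look like.  Everything in it is made up for illustration (the tracker URL, the
function names, the idea that finders run on already-escaped text); it is not
UpLib or Mailman API.

```python
import html
import re

# Hypothetical tracker URL template; substitute the real tracker for a list.
TRACKER_URL = "https://bugs.example.org/issue{number}"

# Matches "issue 1234", "bug 1234", or "bug #1234", case-insensitively.
ISSUE_RE = re.compile(r"\b(issue|bug)\s+#?(\d+)\b", re.IGNORECASE)


def link_issues(escaped_text):
    """Wrap issue references in already HTML-escaped text with tracker links."""
    def replace(match):
        url = TRACKER_URL.format(number=match.group(2))
        return '<a href="{}">{}</a>'.format(url, match.group(0))

    return ISSUE_RE.sub(replace, escaped_text)


def render_body(text, finders):
    """Escape the raw body once, then run each configured finder over it."""
    rendered = html.escape(text)
    for finder in finders:
        rendered = finder(rendered)
    return rendered


print(render_body("See issue 1234 & bug #42", [link_issues]))
```

A site would then configure the list of finders per archive, so a -commits
list could add a vcs revision finder alongside this one.
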
Architecturally, I've gone back and forth about where these types of
transformations should go.  Pipermail always had the view that these happened
at message input time.  What I mean is, when Mailman sends a message to the
subscribers, it also "sends" the message to the archiver.  The archiver
immediately works out where in the thread it should go, and statically creates
the rendered view (HTML) with any transformations done at that time.

Pipermail is of course ancient, and 12 years ago it made sense to do all the
processing upfront so that when someone wanted to view the page, it was super
cheap.  I'd always thought that it would be better to stitch together the
final rendered page (with caching) at the time the page was requested.  This
would allow a site administrator to invalidate the cache as an easy way to update
the rendering rules (add, modify entity filters, dynamic take downs, updated
style sheets, new obfuscation rules, etc.).
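
A minimal sketch of the render-on-request idea, assuming an in-memory cache
keyed by message-id.  The class and method names are hypothetical, and a real
archiver would want persistence and finer-grained invalidation:

```python
class RenderCache:
    """Render pages lazily on first request and cache them by message-id."""

    def __init__(self, load_message, render):
        self._load_message = load_message  # callable: message-id -> raw text
        self._render = render              # callable: raw text -> HTML
        self._pages = {}

    def get_page(self, message_id):
        """Return the cached page, rendering it first if necessary."""
        if message_id not in self._pages:
            raw = self._load_message(message_id)
            self._pages[message_id] = self._render(raw)
        return self._pages[message_id]

    def set_renderer(self, render):
        """Swap in new rendering rules and drop every cached page."""
        self._render = render
        self._pages.clear()
```

The original messages are never touched; changing the rules just means the
next request for each page pays the rendering cost once.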

>> * Take-down support.  If a list admin wants to remove a posting, she should
>>   be able to do that without disrupting email threads or breaking URLs.
>>   One way I've thought about doing that is a dynamic rendering plugin that
>>   checked the to-be-displayed message against a blacklist, and if there's a
>>   hit, it would substitute the body of the message with something like
>>   "Content unavailable due to take-down notice.  Contact
>>   postmaster at python.org for details."
>
>Yeah, kind of provide a stand-in for the real message.  UpLib re-threads
>if you modify the corpus, so removal is automatic.  It also includes a
>capability to "replace" the content of an existing document, which
>sounds like what you'd want for the above.

Yep, that's what I'm thinking.  There's a strong preference by postmasters not
to modify the original message backing whatever is used to generate the
displayed page.  So a take-down is more like marking a message-id for
wholesale replacement.
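
Roughly, and only as a sketch: the function name, the notice text, and the idea
of keeping the take-down list as a plain set of message-ids are all assumptions
for illustration.

```python
from email import message_from_string

TAKEDOWN_NOTICE = "Content unavailable due to take-down notice."


def display_body(raw_message, taken_down):
    """Return the body to display; the stored raw message is never modified.

    `taken_down` is a set of Message-ID header values marked for replacement.
    """
    msg = message_from_string(raw_message)
    if msg["Message-ID"] in taken_down:
        return TAKEDOWN_NOTICE
    return msg.get_payload()
```

Because the headers are still read from the untouched original, threading and
URLs keep working, and lifting the take-down is just removing the message-id
from the set.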

>Nice idea.  I've got an extension which (sort of) supports this (you can
>email a copy of any document to anyone), and the user can define new
>buttons to add to the UI in her config file.  I, for instance, added a
>button which shows me all the email threads which have been updated
>today:
>
>Today's Mail, /action/basic/email_threads?query=date:today+$email, _blank
>
>Of course, the normal UpLib Web UI puts appropriate "mailto:" links
>around people's names, and adds "Reply-To" and "Reply-To-All" links to
>the message.  Just click on that and it opens up in your MUA.

Nice.

>> Interested to hear your thoughts.  This would be a cool project to work on,
>> and maybe we should also engage mailman-developers.  Thanks for releasing it
>> under the GPL[*].
>
>Well, there's lots to do :-).  The current IMAP server, for instance, is
>more about getting the IMAP protocol right than it is about efficiency.  When
>you go into python-dev size archives without breaking it into chunks
>(like the per-month view in Mailman archives), it poops out.  Shouldn't
>do that.
>
>My normal development process is to write any new code as an UpLib
>extension, then if it works I eventually fold it into the codebase.
>Extensions are easy to add (just plunk them in a directory, and point
>the repository at that directory), and there are a number of examples
>included with the source code.  The IMAP server is an extension, for
>instance.

Cool.  I think the first step is to make the vcs available.  I'd like to grab
the source and start taking a look.

Cheers,
-Barry