[Mailman-Users] Integration with external search engine

Mark Sapiro mark at msapiro.net
Sat Dec 18 04:31:05 CET 2010


On 12/17/2010 4:55 AM, Lukáš Vlček wrote:
> 
> I am looking for a best-practice way to integrate Mailman with an external
> search engine. I found the following wiki page [1], which links to the
> Ext_Arch.py template, the brainchild of Mark Sapiro and Cedric Jeanneret
> [2]. Cedric wanted to index emails using Xapian, and his implementation of
> Ext_Arch.py can be found here [3]. This all looks very promising, but I
> have a few questions/concerns:
> 
> It seems to me that the PUBLIC_EXTERNAL_ARCHIVER and
> PRIVATE_EXTERNAL_ARCHIVER commands (both set in mm_cfg.py) are executed
> only when a new message arrives; they are not executed when bin/arch is
> run. So if a mailing list has been running on Mailman for a few years and
> I now want to make its content searchable via a new external search engine
> (like Xapian), it is not enough to add an external archiver and restart
> Mailman, because that would index only newly added messages. Am I right?


Yes, you are right. The design intent of external archivers is that they
provide a hook to use an external process for both archiving and
searching of the external archive. External archivers were never
intended to be used to index the built-in pipermail archive. Thus, the
Ext_Arch.py template is just a kludge which is admittedly incomplete in
this respect.
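
For reference, a minimal sketch of how such a hook is typically set in
mm_cfg.py. The script path /usr/local/bin/index-message is hypothetical.
As far as I know, Mailman 2.1 interpolates %(listname)s and %(hostname)s
into the command string and pipes each new message to the command on
stdin, so the last line below only illustrates that substitution.

```python
# mm_cfg.py fragment (a sketch, not a complete configuration).
# /usr/local/bin/index-message is a hypothetical script that reads one
# message from stdin and hands it to the search engine's indexer.
PUBLIC_EXTERNAL_ARCHIVER = '/usr/local/bin/index-message %(listname)s'
PRIVATE_EXTERNAL_ARCHIVER = '/usr/local/bin/index-message %(listname)s'

# Illustration only: the kind of substitution applied before the command
# is run. The list name and host name here are made-up examples.
command = PUBLIC_EXTERNAL_ARCHIVER % {
    'listname': 'mylist',
    'hostname': 'www.example.com',
}
```

Note that the command runs only as each new post is archived, which is
why messages already in the archive never reach it.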


> How can I then get the old content of that mailing list indexed in Xapian as
> well? bin/arch <maillist> does not do that, as it does not execute external
> archivers. Moreover, running bin/arch can change the URLs of individual public
> emails (renumbering), which is probably unacceptable. So is there any way
> to iterate over the existing emails, parse them, and get the existing URL
> for each? (That information could then be used to index old content
> in the external search server without needing to run bin/arch.)


find /path/to/archives/private/LISTNAME -type f \
 | grep -E '/[0-9]{6}\.html$' \
 | sed 's;.*archives/private;http://www.example.com/pipermail;'

with the obvious modifications (archive path, list name and host name)
will get the URLs. Will that be enough?
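
If the URLs are going to be fed to an indexer from a script, the same
mapping can be sketched in Python. This is only an illustration under the
same assumptions as the pipeline above: message pages are named with six
digits plus .html, and the public URL mirrors the directory layout under
archives/private. The function name archive_urls and its parameters are
made up for this example and are not part of Mailman.

```python
import os
import re

# Pipermail message pages are named with a six-digit sequence number.
MESSAGE_PAGE = re.compile(r"^[0-9]{6}\.html$")


def archive_urls(archive_root, listname, base_url):
    """Yield one URL per existing message page of the given list.

    archive_root is the .../archives/private directory, and base_url is
    the prefix that replaces it (e.g. http://www.example.com/pipermail).
    """
    list_dir = os.path.join(archive_root, listname)
    for dirpath, dirnames, filenames in os.walk(list_dir):
        dirnames.sort()  # walk the period directories in a stable order
        for name in sorted(filenames):
            if MESSAGE_PAGE.match(name):
                # Path relative to archive_root, e.g.
                # LISTNAME/2010-December/000123.html
                rel = os.path.relpath(os.path.join(dirpath, name), archive_root)
                yield base_url.rstrip("/") + "/" + rel.replace(os.sep, "/")


# Example use (paths and host name are placeholders):
# for url in archive_urls("/path/to/archives/private", "LISTNAME",
#                         "http://www.example.com/pipermail"):
#     print(url)
```

Since this only reads the files already on disk, nothing is renumbered
and bin/arch does not have to be run.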

> 
> [1]
> http://wiki.list.org/display/DOC/4.87+How+do+I+invoke+some+process+on+messages+as+they+are+added+to+the+pipermail+archive
> [2] http://www.mail-archive.com/mailman-users@python.org/msg56679.html
> [3]
> https://bugs.launchpad.net/mailman/+bug/531942/+attachment/1199211/+files/archive-and-index.py

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


