[Mailman-Users] Integration with external search engine

Lukáš Vlček lukas.vlcek at gmail.com
Sat Dec 18 14:45:42 CET 2010


Hi Mark,

First of all, thanks for your time. Your help is very valuable.

See below for my comments and questions:

On Sat, Dec 18, 2010 at 4:31 AM, Mark Sapiro <mark at msapiro.net> wrote:

> On 12/17/2010 4:55 AM, Lukáš Vlček wrote:
> >
> > I am looking for a best-practice way to integrate Mailman with an
> > external search engine. I found the following wiki page [1], which
> > contains a link to the Ext_Arch.py template, the brainchild of Mark
> > Sapiro and Cedric Jeanneret [2]. Cedric was after indexing emails using
> > Xapian, and his implementation of Ext_Arch.py can be found here [3].
> > This all looks very promising, but I have a few questions/concerns:
> >
> > It seems to me that the PUBLIC_EXTERNAL_ARCHIVER and
> > PRIVATE_EXTERNAL_ARCHIVER commands (both set in mm_cfg.py) are
> > executed only when a new message arrives; that is, they are not
> > executed when bin/arch is run. This means that if a mailing list has
> > been running on Mailman for a few years and I now want to make its
> > content searchable via a new external search engine (like Xapian), it
> > is simply not enough to add an external archiver and restart Mailman,
> > because this would index only newly added messages. Am I right?
>
>
> Yes, you are right. The design intent of external archivers is that they
> provide a hook to use an external process for both archiving and
> searching of the external archive. External archivers were never
> intended to be used to index the built-in pipermail archive. Thus, the
> Ext_Arch.py template is just a kludge which is admittedly incomplete in
> this respect.
>
>
> > How can I then have the old content of that mailing list reindexed
> > into Xapian as well? bin/arch <maillist> does not do that, as it does
> > not execute external archivers. Moreover, running bin/arch can change
> > the URLs of individual public emails (renumbering), and that is
> > probably unacceptable. So is there any way to iterate over existing
> > emails, parse them, and get the existing URL for each? (Such
> > information could then be used to re-index old content into the
> > external search server without needing to run bin/arch.)
>
>
> find /path/to/archives/private/LISTNAME \
>  | egrep "[0-9]{6}.html" \
>  | sed "s;.*archives/private;http://www.example.com/pipermail;"
>
> with the obvious modification will get the URLs. Will that be enough?
>

Not exactly. I need to index the mailing list content with an external
search server, and for each indexed mail I need to know its working public
Mailman URL. Ext_Arch.py lets me hook into the archiving process, so I can
index the content of newly added mails and also get their public URLs.
That is nice, but it gives me no way to learn the URLs of existing mails
that are already in the mbox file.
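
For reference, the find/egrep/sed pipeline quoted above could be sketched
in Python roughly as follows. The function name archive_urls, the list name
"mylist" and the base URL are placeholders for illustration only, and the
pattern here is anchored and escapes the dot, which is slightly stricter
than the egrep expression. Like the pipeline, it only lists URLs of HTML
pages that already exist; it does not relate them to messages in the mbox.

```python
import os
import re

# Pipermail message pages are named with six digits, e.g. 000123.html.
MESSAGE_PAGE = re.compile(r"^[0-9]{6}\.html$")


def archive_urls(archive_root, listname, base_url):
    """Return sorted URLs for the message pages found under
    archive_root/listname.

    archive_root plays the role of /path/to/archives/private and
    base_url the role of http://www.example.com/pipermail in the
    shell pipeline; both are assumptions to adapt to the real site.
    """
    list_dir = os.path.join(archive_root, listname)
    urls = []
    for dirpath, _dirnames, filenames in os.walk(list_dir):
        for name in filenames:
            if MESSAGE_PAGE.match(name):
                # Path relative to the archive root, e.g.
                # mylist/2010-December/000123.html
                rel = os.path.relpath(os.path.join(dirpath, name), archive_root)
                urls.append(base_url.rstrip("/") + "/" + rel.replace(os.sep, "/"))
    return sorted(urls)


if __name__ == "__main__":
    for url in archive_urls("/path/to/archives/private", "mylist",
                            "http://www.example.com/pipermail"):
        print(url)
```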

My question is: if I take the <list-name>.mbox file, is there any way to
deduce the working URL of each individual email?
Say I split the mbox file into individual emails using:
csplit -s -b %06d.mbox -z <list-name>.mbox '/^From /' {*}
Would the numbering then be the same as the one produced by Mailman?
(Assuming Mailman numbering starts from zero.)
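
The same split can be sketched with Python's standard mailbox module, which
also gives access to the full headers of each message. This is only a
sketch: the names enumerate_mbox and describe are made up for illustration,
and whether the sequence numbers line up with the numbers Mailman uses in
its archive URLs is exactly the open question here, so nothing below
assumes that they do.

```python
import mailbox


def enumerate_mbox(path, start=0):
    """Yield (sequence_number, message) for each message in an mbox
    file, in file order. 'start' is the assumed first number."""
    box = mailbox.mbox(path, create=False)
    try:
        for offset, key in enumerate(box.iterkeys()):
            yield start + offset, box[key]
    finally:
        box.close()


def describe(path):
    """Collect the headers an indexer is likely to need, keyed by a
    six-digit number formatted like csplit's %06d."""
    rows = []
    for number, msg in enumerate_mbox(path):
        rows.append((
            "%06d" % number,
            msg.get("Message-ID"),
            msg.get_content_type(),     # defaults to text/plain
            msg.get_content_charset(),  # None if no charset is declared
        ))
    return rows


if __name__ == "__main__":
    import sys
    for row in describe(sys.argv[1]):
        print(row)
```

Like the csplit pattern, the mbox reader treats every line starting with
"From " as a message separator, so both approaches rely on such lines
being escaped inside message bodies.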

I learned that if I use this csplit technique on the public archives, the
numbering is not guaranteed to match (the order in which the mails are
stored in the public archives does not match the numbering of the HTML
files Mailman produces). Moreover, the public archive files do not contain
all the email headers (charset, encoding, content-type, ...), and I don't
want to index the generated HTML files for now.

Thanks a lot,
Lukas


>
> >
> > [1] http://wiki.list.org/display/DOC/4.87+How+do+I+invoke+some+process+on+messages+as+they+are+added+to+the+pipermail+archive
> > [2] http://www.mail-archive.com/mailman-users@python.org/msg56679.html
> > [3] https://bugs.launchpad.net/mailman/+bug/531942/+attachment/1199211/+files/archive-and-index.py
>
> --
> Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
> San Francisco Bay Area, California    better use your sense - B. Dylan
>
>

