[Mailman-Users] Integration with external search engine

Wed Dec 22 14:26:39 CET 2010

Hi Mark,

again, thanks for your time. See my comments and questions below:

On Tue, Dec 21, 2010 at 11:33 PM, Mark Sapiro <mark at msapiro.net> wrote:

> On 12/18/2010 5:45 AM, Lukáš Vlček wrote:
> >
> > On Sat, Dec 18, 2010 at 4:31 AM, Mark Sapiro <mark at msapiro.net
> > <mailto:mark at msapiro.net>> wrote:
> >
> >     find /path/to/archives/private/LISTNAME \
> >      | egrep "[0-9]{6}.html" \
> >      | sed "s;.*archives/private;http://www.example.com/pipermail;"
> >
> >     with the obvious modification will get the URLs. Will that be enough?
> >
> >
> > Not exactly. I need to index the mailing list content with an external
> > search server, and for each indexed mail I need to know the working
> > public Mailman URL of that mail.
>
>
> The above shell command will get you a list of the URLs. If you are
> saying you need to know the message content together with the URL, you
> could still do this easily from the existing pipermail archive. The
> point is that each individual message in the archive is in a file of the
> form archives/private/LISTNAME/yyyy-Month/nnnnnn.html and the
> LISTNAME/yyyy-Month/nnnnnn.html portion of that path is also the
> variable part of the URL used to access the message.
>

Yes, I need the email content and its valid public URL. However, by
content I do not mean the HTML rendered by Mailman. Technically it is
possible to parse the HTML and extract the content from it, but that is
extra work when the raw mail source is already available in mbox format.
Besides, the HTML does not contain all the metadata I would like to
extract, and it contains extra content which has to be identified and
stripped out (thread navigation, for example). Moreover, Mailman can be
configured to treat multipart/alternative in a specific way (putting the
alternative part into the attachments folder), which means that to get
all alternative representations of the content I would have to crawl at
least one more file per message.
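
As a rough illustration of why the mbox is the more convenient source, the
sketch below lists each message's zero-based position in the mbox together
with its Message-ID. The function name list_mbox_msgids is made up for this
example. It assumes a clean mbox (for instance one already checked with
bin/cleanarch) in which every line starting with "From " really is a message
separator. It does not handle a Message-ID whose value is folded onto the
following line, and messages without a Message-ID are simply skipped.

```sh
# Print "position<TAB>message-id" for every message in an mbox, in file order.
# ASSUMPTION: every line starting with "From " is a real message separator.
# The position is only the order in the mbox; it is NOT guaranteed to equal
# the number pipermail assigned to the message.
list_mbox_msgids() {
    awk '
        # A "From " line starts a new message and its header block.
        /^From / { n++; in_headers = 1; seen = 0; next }
        # The first empty line ends the header block.
        in_headers && /^$/ { in_headers = 0; next }
        # Take the first Message-ID header only (case-insensitive match).
        in_headers && !seen && tolower($0) ~ /^message-id:/ {
            id = $0
            sub(/^[^:]*:[ \t]*/, "", id)   # drop the header name
            gsub(/[<> \t\r]/, "", id)      # drop brackets and whitespace
            printf "%06d\t%s\n", n - 1, id
            seen = 1
        }
    ' "$1"
}
```

Calling list_mbox_msgids <list-name>.mbox prints one tab-separated line per
message, which can then be used as input for matching against the archive.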

>
>
> > My question is: if I take the <list-name>.mbox file, is there any way
> > to deduce the working URL of individual emails?
> > Say I split the mbox file into individual emails using:
> > csplit -s -b %06d.mbox -z <list-name>.mbox '/^From /' {*}
> > Would the numbering be the same as the one produced by Mailman in this
> > case? (Provided that Mailman numbering starts from zero.)
>
>
> It will be the same as the numbering produced by running bin/arch
> --wipe. As you note below, this is not guaranteed to be the same as that
> in the existing archive.

OK, let me ask the fundamental question:

What algorithm does Mailman use to number the individual HTML
representations of mails?

My understanding was that once a new mail is received by Mailman, it is
processed, appended to the cumulative mbox file and put into the
private/public archive folder (i.e. the HTML representation is rendered
and stored on disk). If the flow is that smooth, then the numbering would
match the order of the individual messages in the cumulative mbox file.
Maybe a message that has to undergo admin moderation can influence the
resulting numbering (causing gaps?), but I am just speculating here...

Do you think you could shed more light on the numbering process?
It seems unfortunate to me that there is no simple way to determine the
valid URL for an individual mail in the mbox file.
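
The speculation about gaps can at least be checked empirically without
touching the archive. The following is only a sketch: the function names are
hypothetical, and it assumes that article files are named nnnnnn.html (six
digits, as in the egrep pattern quoted above) and that the numbers form one
sequence across all yyyy-Month directories of the list, which is an
assumption here rather than something confirmed in this thread.

```sh
# Count messages in an mbox by counting "From " separator lines.
# ASSUMPTION: the mbox is clean, i.e. no unescaped "From " lines in bodies.
count_mbox_messages() {
    grep -c '^From ' "$1"
}

# Print the numbers of all nnnnnn.html files below a list's archive
# directory, sorted numerically.
list_article_numbers() {
    find "$1" -type f -name '[0-9][0-9][0-9][0-9][0-9][0-9].html' \
        | sed 's;.*/\([0-9]*\)\.html$;\1;' \
        | sort -n
}

# Count the article files found by list_article_numbers.
count_articles() {
    list_article_numbers "$1" | wc -l | tr -d ' '
}

# Print every number missing between the lowest and highest article number.
find_numbering_gaps() {
    list_article_numbers "$1" | awk '
        NR > 1 { for (n = prev + 1; n < $1 + 0; n++) printf "%06d\n", n }
        { prev = $1 + 0 }'
}
```

If count_mbox_messages and count_articles agree and find_numbering_gaps
prints nothing, the mbox order and the archive numbering at least could
match; it does not prove that they do.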

> > I learned that if I use this csplit technique with public archives then
> > the numbering is not guaranteed to match (the order in which the mails
> > are stored in public archives does not match the numbering order of
> > mailman produced HTML files). Moreover public archive files do not
> > contain all the email headers (charset, encoding, content-type, ...) and
> > I don't want to index generated HTML files for now.
>
>
> If you really need information from the cumulative .mbox which is not
> available in the existing pipermail html files, I see two choices.
>
> If you don't want to rebuild the pipermail archive and possibly renumber
> messages, you will need to develop some script to go through the .mbox
> and parse the archive period (year/month or whatever the period is in
> your case) from the messages and search the nnnn.html files in that
> directory for a match.

Search for the match using the Message-ID value?
The Message-ID is not always present in the HTML version, is it? All I can
see is that the Message-ID value is encoded into the mailto: link as an
In-Reply-To value. Other than that, some more advanced heuristics would
have to be used...

>
> If you don't mind possibly renumbering messages, you could first check
> the .mbox with bin/cleanarch and then rebuild the archive from the .mbox
> with bin/arch --wipe, and then your csplit above will give the correct
> new numbers.
>

Renumbering is really not an option for me.

> Before rebuilding the archive however, you might check if the numbering
> in the mbox really doesn't match. While it is not guaranteed to match,
> it often does, particularly if the archive is not too old - i.e., if the
> oldest messages were archived by Mailman 2.1.x and not 2.0.x or older.
>
> --
> Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
> San Francisco Bay Area, California    better use your sense - B. Dylan
>
>
Thanks,
Lukas