[Mailman-Users] Integration with external search engine

Wed Dec 22 19:49:12 CET 2010

Thanks a lot Mark. Appreciate this!

Regards,
Lukas
On 22.12.2010 18:35, "Mark Sapiro" <mark at msapiro.net> wrote:
> On 12/22/2010 5:26 AM, Lukáš Vlček wrote:
>>
>> What is the Mailman algorithm to number individual HTML representations
>> of mails?
>
>
> Sequential in order of arrival.
>
>
>> My understanding was that once a new mail is received by Mailman, it
>> is processed, appended to the accumulated mbox file and put into the
>> private/public archive folder (i.e. the HTML representation is rendered
>> and stored on disk). If the flow is that smooth, then the numbering would
>> really match the order of individual messages in the accumulated mbox file.
>
>
> This is correct. Further, the list is locked during this process so even
> with "simultaneous" arrival of two messages to be archived, the order in
> the .mbox should match the sequence in the pipermail archive.
>
>
>> Maybe if the new message has to undergo admin moderation, then this can
>> influence the resulting numbering (producing numbering gaps?), but I am
>> just speculating here...
>
>
> No. It is not archived until after moderator approval.
>
>
>> Do you think you could shed more light on the numbering process?
>> To me it seems unfortunate that there is really no simple way to
>> determine a valid URL for individual mails in the mbox file.
>
>
> The number in the archive *should* match the sequence in the .mbox.
> Reasons why it might not include manual editing of the .mbox file, running
> bin/arch to add messages to the archive without adding them in the same
> sequence to the .mbox file, and messages with embedded, unescaped "^From
> " lines in the body.
>
>
>
>> If you don't want to rebuild the pipermail archive and possibly renumber
>> messages, you will need to develop some script to go through the .mbox
>> and parse the archive period (year/month or whatever the period is in
>> your case) from the messages and search the nnnn.html files in that
>> directory for a match.
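
As a rough illustration of the "parse the archive period" step, here is a minimal Python sketch. It assumes a monthly archive whose directories are named like 2010-December, and it assumes the Date: header is what decides the period; depending on configuration the archiver may use the arrival time instead, so treat the result as a first guess. The function name is made up for this example.

```python
from email.utils import parsedate_tz

# English month names, spelled out so the result does not depend on locale.
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def monthly_period_dir(date_header):
    """Guess the monthly archive directory name from a Date: header value.

    Returns e.g. "2010-December", or None if the date cannot be parsed.
    The time zone offset is ignored, so messages sent close to a month
    boundary may land in the neighbouring directory.
    """
    parsed = parsedate_tz(date_header)
    if parsed is None:
        return None
    year, month = parsed[0], parsed[1]
    return "%04d-%s" % (year, MONTHS[month - 1])
```

The returned name is only the directory to look in; the nnnn.html files inside it still have to be matched against the message by some other means.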
>>
>>
>> Search for the match using the Message-ID value?
>> Message-ID is not always present in the HTML version, is it? All I can
>> see is that the Message-ID value is encoded into the mailto: link as an
>> In-Reply-To value. Other than that, some advanced heuristics would have
>> to be used...
>
>
> In Mailman 2.1.10 and later, the mailto: always contains the message-id
> of this message in the In-Reply-To fragment. Prior to 2.1.10 there was
> not always a message-id in the mailto: and if there was, it was not the
> message-id of this message but rather the in-reply-to of this message.
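
For archives generated by 2.1.10 or later, pulling that message-id back out of an nnnn.html page could look roughly like the sketch below. It assumes the link appears as an HREF="mailto:...?...&In-Reply-To=..." attribute with URL-quoted values; the exact markup is an assumption, so check it against one of your own pages first. The function name is made up for this example.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches HREF="mailto:..." attributes, regardless of letter case.
MAILTO_RE = re.compile(r'href="(mailto:[^"]*)"', re.IGNORECASE)


def message_id_from_archive_html(page):
    """Return the In-Reply-To value of the first mailto: link, or None.

    Assumption: on pages written by Mailman 2.1.10 or later this value
    is the Message-ID of the archived message itself.
    """
    for match in MAILTO_RE.finditer(page):
        # Undo HTML escaping of the parameter separator, if present.
        url = match.group(1).replace("&amp;", "&")
        query = urlsplit(url).query
        # Header names are case-insensitive, so normalise the keys.
        params = {k.lower(): v for k, v in parse_qs(query).items()}
        values = params.get("in-reply-to")
        if values:
            return values[0].strip()
    return None
```

Note that parse_qs turns a literal "+" into a space, so a message-id with an unquoted "+" in the link would come back altered.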
>
>
> I suggest you simply test your .mbox file to see if the sequence numbers
> you generate from the From_ lines match those in the archive. As long as
> you have not manually manipulated the .mbox or merged separate .mbox
> files, there's a good chance this will be OK. You don't have to check
> every single message. If the numbering is off, there will be places
> where the numbering jumps from being correct to "off by one" and then to
> "off by two", etc. I.e., I don't think you have to worry about things
> like an mbox sequence of n, n+1, n+2, n+3, ... corresponding to an
> archive sequence of n, n+2, n+1, n+3, ... See the FAQ at
> <http://wiki.list.org/x/RIA9> for a description of what happened to this
> list when the archive was rebuilt in 2006.
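
A minimal Python sketch of that test: number the messages by their From_ lines and record each Message-ID, so a few entries can be spot-checked against the nnnn.html files. It treats every line starting with "From " as a separator, which is also why an unescaped one in a body shifts everything after it. The starting number 0 is an assumption to verify against your own archive, and folded Message-ID headers are not handled. The function name is made up for this example.

```python
def number_mbox_messages(lines, start=0):
    """Return a list of (sequence_number, message_id) from raw mbox lines.

    `lines` is any iterable of bytes lines, e.g. a file opened in "rb"
    mode. message_id is None when no Message-ID header was found.
    """
    results = []
    seq = start - 1
    in_headers = False
    for raw in lines:
        line = raw.rstrip(b"\r\n")
        if line.startswith(b"From "):
            # Every From_ line starts a new message, wanted or not.
            seq += 1
            results.append([seq, None])
            in_headers = True
        elif in_headers:
            if not line:
                # First blank line ends the header block.
                in_headers = False
            elif (line.lower().startswith(b"message-id:")
                  and results[-1][1] is None):
                value = line.split(b":", 1)[1].strip()
                results[-1][1] = value.decode("ascii", "replace")
    return [tuple(item) for item in results]
```

To use it, open the list's .mbox file in binary mode, pass the file object to number_mbox_messages(), and compare a handful of the resulting numbers (formatted as "%06d.html") with the pages in the archive, concentrating on the oldest and newest messages to find where any drift begins.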
>
> --
> Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
> San Francisco Bay Area, California    better use your sense - B. Dylan
>