[Mailman-Users] Mailman + htdig integration howto

Odhiambo Washington wash at wananchi.com
Wed Aug 28 07:46:16 CEST 2002


* curt brune <curt at acm.org> [20020828 03:29]: wrote:
> Is there a HOWTO for integrating htdig with Mailman ?

Yes, there is, although I don't remember where it resides, but I have attached
it.



        cheers
       - wash 
+----------------------------------+-----------------------------------------+
Odhiambo Washington, wash at wananchi.com	. WANANCHI ONLINE LTD (Nairobi, KE)  |
http://ns2.wananchi.com/~wash/		. 1ere Etage, Loita Hse, Loita St.,  |
GSM: (+254) 722 743 223			. # 10286, 00100 NAIROBI             |
+---------------------------------+------------------------------------------+
"Oh My God! They killed init! You Bastards!"  
						 --from a /. post
-------------- next part --------------
Installing and Using the Mailman-htdig Integration
==================================================

This patch:

http://sourceforge.net/tracker/index.php?func=detail&aid=444884&group_id=103&atid=300103

Contents
========

Prerequisites
Compatibility
History
Introduction
Installing and Building Mailman with this patch
What is Installed by the Patch
Configuration of Mailman-htdig Integration
    Health Warning on the packet!
    Starting from Scratch (Again)
    General
    htdig Permissions Considerations
    Local htdig Configuration
    Remote htdig Configuration
    Upgrading an Existing Standard Mailman Installation
    Changing from local to remote htdig or vice versa
    Coping with htdig Upgrades
    Changing the Addressing Scheme of your web_page_url
Operational Information
Notes and Warnings
Contributors
Appendices
    Appendix 1 - Technique for htdigging when Mailman's DEFAULT_URL uses the
    https 

Prerequisites
=============

Prior to installing this patch you should also have installed the patch that
provides enhanced indexing of Mailman archives; see:

http://sourceforge.net/tracker/index.php?func=detail&aid=444879&group_id=103&atid=300103

You must have a working installation of htdig with htsearch available via CGI on 
your HTTP server installed on either the machine on which you are running 
Mailman or on another machine which has access to Mailman list archives via NFS 
or some similarly competent network file sharing scheme.

Regardless of how you configure things to provide Mailman's Web UI, if it gives
normal operation of the /mailman/private CGI script for providing access to 
private list archives, it should also support access to htdig search results via 
the /mailman/htdig CGI script.

Compatibility
=============

htdig-2.0.12-0.1.patch - Mailman 2.0.12

htdig-2.0.11-0.1.patch - Mailman 2.0.11

htdig-2.0.10-0.2.patch - Mailman 2.0.10

htdig-2.0.10-0.1.patch - Mailman 2.0.10

htdig-2.0.9-0.1.patch - Mailman 2.0.9

htdig-2.0.8-0.1.patch - Mailman 2.0.8, 2.0.7, 2.0.6 and probably 2.0.3, 2.0.4
and 2.0.5

History
=======

Previous versions - original versions of this patch provided most of the
features described here, the main exception being support for remote htdig,
that is, running htdig on a different system to Mailman. They also had some
configuration assumptions baked in, which are now configurable.

htdig-2.0.12-0.1.patch - latest version:

    1. Rebuilt patch to get no-comment application on Mailman 2.0.12

    2. Added HTDIG_EXTRAS facility to allow arbitrary htdig configuration
       parameters to be specified for addition to every htdig.conf file
       created i.e. site wide additions. See comments below on the use of
       HTDIG_EXTRAS.

htdig-2.0.11-0.1.patch:

    1. No substantive change. Simply rebuilt patch to get no-comment application
       on Mailman 2.0.11

htdig-2.0.10-0.2.patch:

    1. Python 2.2 compatibility fixes to nightly_htdig cron script and its
       relatives. Doing import * inside a function removed.

    2. Added note on potential problems with htdig and file permissions.

htdig-2.0.10-0.1.patch:

    1. change in src/Makefile.in to get clean patch application to MM 2.0.10

htdig-2.0.9-0.1.patch:

    1. minor cosmetic changes to get clean patch application to MM 2.0.9

htdig-2.0.8-0.1.patch:

    1. resolves a problem with the integration of htdig when the web_page_url
       for a list, which is usually the same as DEFAULT_URL from either
       $prefix/Mailman/Defaults.py or $prefix/Mailman/mm_cfg.py, doesn't use
       the http addressing scheme. This arises because htdig will only build
       indices if the URLs for pages use the http addressing scheme. There is
       a work-around for this problem posted in htdig's mail archives - see
       the copy in Appendix 1 to this document.

    2. This patch revision implements the solution documented in that e-mail.
       If non-http URLs are used by the web_page_url of a list an additional
       htdig configuration file for use by htsearch is generated.

    3. In all other respects the operation of the Mailman-htdig integration
       remains unchanged. There is no benefit in upgrading to this revised
       patch unless you need to use other than http addressing in your
       DEFAULT_URL or set other than http addressing in the web_page_url
       configuration of any of your lists.

    4. If changing to or from a non-http addressing scheme then the per list
       htdig config files of the lists affected and their associated htdig
       indices must be reconstructed. See the section below entitled 'Changing
       the Addressing Scheme of your web_page_url' for details of how to do
       this.
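
The condition described in items 1 and 2 can be illustrated with a short
Python sketch. This is not the patch's own code: the function name is
hypothetical, the host name is a placeholder, and the sketch is written for
Python 3 rather than the Python release Mailman 2.0 runs on. It only shows how
the addressing scheme of a web_page_url can be tested.

```python
from urllib.parse import urlsplit


def uses_non_http_scheme(web_page_url):
    """Return True when the URL's addressing scheme is anything other than http."""
    return urlsplit(web_page_url).scheme.lower() != 'http'


# Hypothetical list URLs; only the scheme matters here.
print(uses_non_http_scheme('http://lists.example.com/mailman/'))   # False
print(uses_non_http_scheme('https://lists.example.com/mailman/'))  # True
```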

htdig-2.0.6-0.3.patch:

    1. adds support for remote htdig, that is: running htdig on a different
       system to Mailman.

    2. enhances the configurability of the integration. Some of the programmed
       assumptions made in previous versions are now configurable in mm_cfg.py.
       The configuration variables concerned default to the previous fixed
       values so that this version is backwards compatible with earlier
       versions.

    3. does some minor cosmetic code changes.

    4. extends the associated documentation.

Introduction
============

This integration enables use of the htdig (http://www.htdig.org) search engine 
for searching mail list archives produced by pipermail, Mailman's built-in 
archiver.

You can use htdig without applying these patches to Mailman but you may find it 
awkward to achieve some of the features offered by this patch.

The main features of the patch are:

    1. per list search facility with a search form on each list's TOC page. 

    2. maintenance of privacy of private archives. The user has to establish 
their credentials via the normal private archive access mechanism before any 
access via htdig is allowed. 

    3. a common base URL for both public and private archive access via htsearch 
results. This means that htdig indices are unaffected by changing an archive 
from private to public and vice versa. All access to archives via htdig is 
controlled by a wrapped CGI script called htdig.py.

    4. Choice of running htdig on the machine running Mailman (aka local htdig) 
or running htdig on another machine which has access to Mailman's archives 
via NFS or some similarly competent network file sharing scheme (aka remote 
htdig).

    5. cron activated scripts and crontab entry to run htdig regularly to 
maintain the per list search indices. 

    6. automatic creation, deletion and maintenance of htdig configuration files 
and such. Beyond installing htdig and telling Mailman where it is via mm_cfg 
you do not have to do much other setup.

Installing and Building Mailman with this patch
===============================================

Create your Mailman build directory in the normal way.

You can apply the patch to either a fresh expansion of the Mailman source 
distribution or the one you used to build a currently working Mailman 
installation.

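Execute the following command in the Mailman build directory, substituting the
name of the patch file that matches your Mailman version (see Compatibility
above):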

    patch -p1 < htdig-2.0.8-0.1.patch

Follow the configure and make procedures for regular Mailman as given in the
$build/INSTALL file.

Then follow the Mailman-htdig configuration instructions given below.

What is Installed by the Patch
==============================

The patch amends:
-----------------

$prefix/Mailman/Archiver/HyperArch.py

    the changes in this file set up the per list htdig stuff such as config 
    files and adds the search forms to the list TOC pages.

$build/Mailman/Defaults.py.in

    adds the default configuration variables needed to support the mailman-htdig 
    integration

$build/cron/crontab.in.in

    adds the nightly_htdig cron script to the default crontab

$build/Makefile.in
$build/cron/Makefile.in
$build/src/Makefile.in
$build/bin/Makefile.in

    necessary changes to Makefiles used for installing Mailman

The patch adds:
---------------

$prefix/cgi-bin/htdig
$prefix/Mailman/Cgi/htdig.py

    these are a CGI script and its wrapper. The script is always on the path of
    URLs returned from searches of htdig indices and provides secure access to
    such URLs in the same way that $prefix/cgi-bin/private and
    $prefix/Mailman/Cgi/private.py do. htdig.py ensures private archives are
    kept private, applying the same criteria for permitting access as
    private.py, and delivers material from public archives without demanding
    any authentication.

$prefix/bin/blow_away_htdig

    this is a utility script for removing per list htdig data, e.g. the config 
    file and indices/db files. This is necessary when:

        a. ceasing use of the Mailman-htdig integration

        b. moving from local to remote htdig or vice-versa

        c. upgrading to a version of htdig which has an incompatible
           index/db file format
        
        d. changing the addressing scheme (http versus https) in the
           web_page_url configuration variable of a list

$prefix/cron/nightly_htdig
$prefix/cron/remote_nightly_htdig
$prefix/cron/remote_nightly_htdig_noshare
$prefix/cron/remote_nightly_htdig.pl

    These scripts all do the same thing; they can be installed as a cron task 
    and run regularly to invoke htdig's rundig script to update mailing list 
    search indices. Only one of these scripts is used, the choice of which 
    depending on your system configuration.

    nightly_htdig is used where Mailman and htdig run on the same system.

    the remote_... scripts are used where Mailman and htdig live on different 
    systems. You choose which one suits your needs best:

        remote_nightly_htdig uses the same python files on both systems, that is 
        the same .py and .pyc files are accessed, and it hence depends on 
        compatible bytecode between the Mailman system and htdig system. It also 
        accesses Mailman data files and depends on compatibility of data files 
        contents, for example pickled python values. This should work OK if the
        same version of python is being run on both systems even where the
        systems are heterogeneous, for example one is Sun/Solaris and the
        other is PC/Linux.

        remote_nightly_htdig_noshare shares no python files between the two
        systems. While it is still written in python, it acquires information
        from the file system using directory listings and stat operations.

        remote_nightly_htdig.pl is a rewrite of remote_nightly_htdig_noshare in 
        Perl. It is for use where the htdig system does not have python 
        available on it: in which case, shame on you.
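
    The "directory listings" approach can be sketched as follows. This is an
    illustration only, not the code of remote_nightly_htdig_noshare: the
    function name is hypothetical and the sketch is written for Python 3. It
    assumes the per list config file location given elsewhere in this document,
    <private archive dir>/<listname>/htdig/<listname>.conf.

```python
import os


def find_htdig_conf_files(private_archive_dir):
    """Map each list name to its htdig config file, found by directory listing.

    Directories without an htdig/<name>.conf file are skipped.
    """
    found = {}
    for name in sorted(os.listdir(private_archive_dir)):
        conf = os.path.join(private_archive_dir, name, 'htdig', name + '.conf')
        if os.path.isfile(conf):
            found[name] = conf
    return found
```

    Called with the private archive directory as seen from the htdig machine,
    the function returns only those lists for which the patched Mailman code
    has already created an htdig config file.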

$prefix/cgi-bin/updateTOC
$prefix/Mailman/Cgi/updateTOC.py

    these are a CGI script and its wrapper, for use where Mailman and htdig
    live on different systems. The script works around a limitation of
    remote_nightly_htdig, remote_nightly_htdig_noshare and
    remote_nightly_htdig.pl: these scripts cannot directly update the TOC page
    of each archived list. Instead, they call this CGI script to do that for
    them. This CGI script will not operate when entered as a URL from a browser.

Configuration of Mailman-htdig Integration
==========================================

Configuration of the Mailman-htdig integration is carried out on the Mailman 
side. While you must have to hand some information about your htdig 
installation, you should not have to tinker with htdig for the integration to 
work.

Most of the configuration of the integration is done by values assigned to 
python variables in either $prefix/Mailman/Defaults.py or 
$prefix/Mailman/mm_cfg.py.

If you opt to run htdig on a different machine or under a different HTTP server 
to the one running the HTTP server which provides Mailman's Web UI you will also 
have to edit whichever of the patch's three htdig related cron scripts you opt 
to run (remote_nightly_htdig, remote_nightly_htdig_noshare, or 
remote_nightly_htdig.pl) to add a small amount of configuration information.

Health Warning on the packet!
-----------------------------

Be careful when editing configuration information in $prefix/Mailman/mm_cfg.py,
the only Mailman config file you should be editing. Check, double check and then
recheck before going ahead. If you get either variable names or their values 
wrong a lot of confusion in the operation of both Mailman and htdig can result. 
You (and others supporting you) can spend hours trying to identify problems and 
looking for non-existent bugs as a consequence of such editing errors. Expect to 
find errors in these instructions; compensate for them and tell me when you do 
(r.barrett at ftel.co.uk).

Also do read the htdig documentation, release notes etc. This patch integrates a 
working htdig with htsearch available through CGI. These notes are about Mailman 
and integrating it with that working htdig. It is up to you to sort out the 
htdig end of things.

Starting from Scratch (Again)
-----------------------------

This is getting ahead of things but some of you may already be asking "What if
I've already been using an older version of this patch and want to start
afresh?", or "I want to change from local to remote htdig or vice versa."

In these cases your friend will be the $prefix/bin/blow_away_htdig script. It 
removes existing htdig related stuff out of your Mailman installation to the 
extent that it was added by this patch and added to by the normal operation of 
pipermail and nightly_htdig. With that removed and a revised Mailman 
configuration, the patched code will start rebuilding the htdig data.

But before you get carried away with blow_away_htdig, read the rest of these 
notes.
 
General
-------
This patch adds a number of default variables to the file 
$prefix/Mailman/Defaults.py that affect operation of the Mailman-htdig 
integration. These are in addition to the standard Mailman defaults in that 
file. If, in the light of what is said below, you decide any of these are 
incorrect, you can override them in $prefix/Mailman/mm_cfg.py [NOT IN 
Defaults.py! See the comments in Defaults.py for details].

By default the Mailman-htdig integration is NOT ENABLED by the installation of 
this patch; a default variable in Defaults.py turns off the operation of the 
integration. You have to actively override that default in mm_cfg.py to turn on 
operation of the integration.
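
The relationship between the two files can be modelled with a small Python
sketch. It is a simplified stand-in, not Mailman code: in a real installation
mm_cfg.py starts with "from Defaults import *" and later assignments in that
file replace the defaults. The dictionary and function names below are
hypothetical; the default values are those documented in this file.

```python
# Defaults supplied by the patch in Defaults.py (values as documented here).
PATCH_DEFAULTS = {
    'USE_HTDIG': 0,
    'REMOTE_HTDIG': 0,
    'HTDIG_ARCHIVE_URL': '/mailman/htdig',
    'HTDIG_SEARCH_URL': '/cgi-bin/htsearch',
}


def effective_settings(site_overrides):
    """Return the defaults with the site's mm_cfg.py style overrides applied."""
    merged = dict(PATCH_DEFAULTS)
    merged.update(site_overrides)
    return merged


# A site that only turns the integration on keeps every other default.
settings = effective_settings({'USE_HTDIG': 1})
print(settings['USE_HTDIG'])         # 1
print(settings['HTDIG_SEARCH_URL'])  # /cgi-bin/htsearch
```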

Once a list is created, changing most of these variables will have either no 
effect or a bad effect. You will need to run $prefix/bin/blow_away_htdig script 
and/or $prefix/bin/arch to rebuild the archive pages if you make significant 
changes to the Mailman-htdig integration configuration variables.

The install process will not overwrite an existing mm_cfg.py file so you can 
freely make changes to this file. If you are re-installing a later version of 
this patch you may have to change what is already configured in the existing 
file and, if necessary, add extra configuration variables to it.
     
Most of the Mailman-htdig control variables default to sensible values which you 
will not need to change, especially if you are using local htdig. The semantics 
of most variables apply to both local and remote htdig operation but with some 
the values assigned will depend on whether htdig is viewing things from the same 
or a remote machine.

The first two variables control what is indexed by htdig. The values assigned 
are both embedded in the HTML generated by pipermail in the list archives and 
added. Changing the values of these variables will mean that all previously 
generated HTML pages in list archives will be out of date and you will probably 
want to rebuild existing archives using $prefix/bin/arch:

ARCHIVE_INDEXING_ENABLE

    defines a string telling htdig that it should look at the following material
    when building its indices.

    Default: ARCHIVE_INDEXING_ENABLE = '<!--/htdig_noindex-->'

ARCHIVE_INDEXING_DISABLE

    defines a string telling htdig that it should not look at the following
    material when building its indices.

    Default: ARCHIVE_INDEXING_DISABLE = '<!--htdig_noindex-->'
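
    How the two markers bracket material that should stay out of the index can
    be sketched in Python. This is a rough model of the effect only, not
    htdig's parser and not the patch's code; the helper names are hypothetical
    and the sketch is written for Python 3.

```python
import re

ARCHIVE_INDEXING_ENABLE = '<!--/htdig_noindex-->'
ARCHIVE_INDEXING_DISABLE = '<!--htdig_noindex-->'


def exclude_from_index(html_fragment):
    """Wrap a fragment so that indexing is off before it and back on after it."""
    return ARCHIVE_INDEXING_DISABLE + html_fragment + ARCHIVE_INDEXING_ENABLE


def indexable_text(page):
    """Approximate what remains to be indexed once bracketed regions are dropped."""
    pattern = (re.escape(ARCHIVE_INDEXING_DISABLE) + '.*?'
               + re.escape(ARCHIVE_INDEXING_ENABLE))
    return re.sub(pattern, '', page, flags=re.DOTALL)


# Navigation links are bracketed; the message body is left indexable.
page = exclude_from_index('<a href="index.html">Back</a>') + '<p>Message body</p>'
print(indexable_text(page))  # <p>Message body</p>
```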

USE_HTDIG - Semantics 0 - don't use integrated htdig, 1 - use it

    turns Mailman-htdig integration on or off.

    Default: USE_HTDIG = 0

    Notes:

    1. when USE_HTDIG is turned on the patched code in Mailman will start adding 
       htdig stuff for any archiving-enabled mail lists as new posts for each
       list are handled by Mailman. Until a new post is made after enabling with 
       USE_HTDIG an existing mail list's archive will not be htdig searchable.
       When the new post is handled:

        a. the list's personalised htdig config file is created 

        b. necessary links to the htdig config file are created 
        
        c. a search form is added to the TOC page for the list
    
       Even with this done, htdig searches only become available when htdig
       indices are constructed. This is done when one or other of the patch's
       htdig related cron scripts are run (nightly_htdig, remote_nightly_htdig, 
       remote_nightly_htdig_noshare, or remote_nightly_htdig.pl, depending on
       how you configure your system). These can be run from the command line
       ahead of their scheduled cron time to get htdig searches operational.

    2. Turning USE_HTDIG off will not remove htdig indices or search forms from 
       existing archive-enabled lists. It will however stop htdig features from 
       being added to newly created lists. If you want to eliminate htdig from
       your existing lists then use the $prefix/bin/blow_away_htdig script.

HTDIG_ARCHIVE_URL

    this is the URL path that equates to the wrapper $prefix/cgi-bin/htdig which 
    controls access to the $prefix/Mailman/Cgi/htdig.py script.

    Default: HTDIG_ARCHIVE_URL = '/mailman/htdig'

    It is highly unlikely that you will want to change from the default value 
    unless you are also changing other variables such as PRIVATE_ARCHIVE_URL 
    because of some non-standard installation decisions on your part.

HTDIG_SEARCH_URL

    this is the URL of the htsearch CGI program part of the htdig package. 

    Default: HTDIG_SEARCH_URL = '/cgi-bin/htsearch'

    The default assumes that a single HTTP server on the Mailman machine
    provides access to both htdig and Mailman's web UI, and that htsearch has
    been installed in the HTTP server's cgi-bin directory. This value will
    depend on your htdig installation decisions and HTTP server configuration
    files (typically /etc/httpd/httpd.conf on a late model Apache installation),
    i.e. the ScriptAlias through which the htsearch CGI program is reached.

HTDIG_FILES_URL

    this is the URL of the directory containing various HTML and Graphics files 
    installed by htdig; files such as buttonr.gif, buttonl.gif and 
    button1-10.gif. The URL must end with a '/'.

    Default: HTDIG_FILES_URL = '/htdig/'

    The default assumes the HTTP servers providing access to htdig and to 
    Mailman's web UI are on the same machine and a symbolic link called 'htdig' 
    has been put into your HTTP server's top level HTML directory which points 
    to the directory your htdig install has put the actual files into; this link 
    is often to /usr/share/htdig. This value will depend on your htdig
    installation decisions and HTTP server's configuration files (typically
    /etc/httpd/httpd.conf on a late model Apache installation), i.e. the Alias
    through which the link to the htdig files is reached.

HTDIG_CONF_LINK_DIR

    this is the name of a directory in which links to list specific htdig config 
    files are placed. 

    Default: HTDIG_CONF_LINK_DIR = os.path.join(VAR_PREFIX, 'archives', 'htdig')

    The VAR_PREFIX of the default is resolved to an actual file system path
    when Mailman's 'make install' is run. The 'os.path.join' creates a full file
    system path by gluing together the three pieces when Mailman is run. This 
    definition puts the directory alongside the default PUBLIC_ARCHIVE_FILE_DIR 
    and PRIVATE_ARCHIVE_FILE_DIR. Unless you are changing the value of these 
    variables you probably do not want to change HTDIG_CONF_LINK_DIR.
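
    As a small illustration of the "gluing together", the sketch below joins
    the three pieces for an assumed VAR_PREFIX of /home/mailman; your own
    prefix will differ. posixpath is used so that the result is the same on
    any platform; on a Unix system os.path.join behaves identically.

```python
import posixpath

VAR_PREFIX = '/home/mailman'  # assumption for illustration only

HTDIG_CONF_LINK_DIR = posixpath.join(VAR_PREFIX, 'archives', 'htdig')
PRIVATE_ARCHIVE_FILE_DIR = posixpath.join(VAR_PREFIX, 'archives', 'private')

print(HTDIG_CONF_LINK_DIR)       # /home/mailman/archives/htdig
print(PRIVATE_ARCHIVE_FILE_DIR)  # /home/mailman/archives/private
```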

HTDIG_RUNDIG_PATH

    this is the path in your file system to the rundig shell script that is
    installed as part of htdig. This tells one or other of the patch's htdig 
    related cron scripts (nightly_htdig and remote_nightly_htdig) where to find 
    rundig in order that they can execute it.

    Default: HTDIG_RUNDIG_PATH = '/usr/local/bin/rundig'

HTDIG_MAILMAN_LINK

    the value of this is the name of a symbolic link you must create in the 
    directory where htdig expects to find its configuration files. The target of 
    this link is the directory whose path is the value of HTDIG_CONF_LINK_DIR. 
    The value of this variable is embedded in the per list search forms in each 
    list's TOC page generated by the patched code, where it tells htsearch where 
    to find the list's htdig config file.

    Default: HTDIG_MAILMAN_LINK = 'htdig-mailman'

REMOTE_HTDIG - Semantics 0 - htdig runs on local machine, 1 - on remote machine

    says whether htdig is run on the same machine as Mailman or on another 
    machine. 

    Default: REMOTE_HTDIG = 0

REMOTE_PRIVATE_ARCHIVE_FILE_DIR

    only relevant if REMOTE_HTDIG = 1. It is the file system path to the 
    directory in which Mailman stores private archives, as seen by the machine 
    running htdig. 
        
    Default: REMOTE_PRIVATE_ARCHIVE_FILE_DIR = os.path.join(VAR_PREFIX,
    'archives', 'private')

    The VAR_PREFIX of the default is resolved to an actual file system path
    when Mailman's 'make install' is run. The 'os.path.join' creates a full file
    system path by gluing together the three pieces when Mailman is run. If you
    assign a value to this in mm_cfg.py, just put the relevant explicit file
    system path in.

HTDIG_EXTRAS

    You can assign a string value to this config variable and that string will
    be included in all of your site's list specific htdig configuration files
    when they are created. The value of the string can be any attribute 
    declarations as defined at http://www.htdig.org/confindex.html.

    Be cautious in what you do with this. Most sites will not need to use
    this at all. But if you have some idiosyncratic htdig installation it
    might help overcome problems in integrating with Mailman. If you think
    you need to use it I suggest:
    
    1. You try creating a test list without assigning a value to HTDIG_EXTRAS
       in $prefix/Mailman/mm_cfg.py
    
    2. Enable archiving for that test list.

    3. Send a message to the test list so that its archive is created 
       together with its htdig configuration file.

    4. Review the content of the list's htdig conf file in 
       $prefix/archives/private/<listname>/htdig/<listname>.conf. 
    
    5. You will see where the default value of HTDIG_EXTRAS from 
       $prefix/Mailman/Defaults.py has been inserted. This value is only
       an htdig comment and does nothing.
    
    6. Consider whether what you will assign to HTDIG_EXTRAS in
       $prefix/Mailman/mm_cfg.py will make sense in the context of the rest
       of the htdig conf file's contents.

htdig Permissions Considerations
--------------------------------

Python scripts added by this patch (nightly_htdig and its relatives) run the 
htdig rundig script identified by HTDIG_RUNDIG_PATH to build search indices
for Mailman archives. Code added by this patch generates per list htdig 
configuration files which are passed as a parameter to the rundig script. 
These configuration files identify a list specific directory 
($prefix/archives/private/<listname>/htdig) in which list specific data files 
generated by and used by htdig are to be placed.
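
The way a per list configuration file might be handed to rundig can be sketched
in Python. This is an illustration, not the code of nightly_htdig: the function
name is hypothetical, the archive path assumes Mailman is installed under
/home/mailman, and the use of rundig's -c option to name the config file is an
assumption about how the cron scripts call it. The sketch only builds the
command; it does not run it.

```python
import posixpath

HTDIG_RUNDIG_PATH = '/usr/local/bin/rundig'                  # documented default
PRIVATE_ARCHIVE_FILE_DIR = '/home/mailman/archives/private'  # assumed location


def rundig_command(listname):
    """Build the argument list that runs rundig against one list's config file."""
    conf = posixpath.join(PRIVATE_ARCHIVE_FILE_DIR, listname, 'htdig',
                          listname + '.conf')
    return [HTDIG_RUNDIG_PATH, '-c', conf]


print(rundig_command('test-list'))
```

A list built this way could be passed to subprocess.run() by a script running
under the mailman UID.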

However, the rundig script identified by HTDIG_RUNDIG_PATH may attempt to
generate some files in htdig's COMMON_DIR when it is first run by nightly_htdig;
the files concerned are likely to be root2word.db, word2root.db, synonyms.db and
possibly some others generated by htdig's htfuzzy program. The standard rundig
script generates these files selectively if they do not already exist. Depending
on how you have installed htdig and how the rundig script is first run, there
may be a permissions problem when nightly_htdig executes rundig under the
mailman UID if it tries to generate these files.

You may need to either give the mailman UID write permission over htdig's
COMMON_DIR or, before the nightly_htdig script is first run, run htdig's htfuzzy
executable with a sufficiently privileged UID in the manner that the rundig script 
would run htfuzzy, to create any necessary files in COMMON_DIR. 

See htdig's documentation for further information on this topic.
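
A quick way to find out in advance whether the permissions problem will arise
is to test write access to COMMON_DIR while logged in as the mailman UID. The
Python 3 sketch below is a suggestion, not part of the patch; the function name
is hypothetical and you must supply the COMMON_DIR path of your own htdig
installation.

```python
import os


def can_create_files_in(directory):
    """Return True when the current UID may create files in `directory`."""
    # Creating a file needs both write and search (execute) permission.
    return os.path.isdir(directory) and os.access(directory, os.W_OK | os.X_OK)
```

If the function returns False for your COMMON_DIR when run as the mailman UID,
one of the two remedies described above will be needed before nightly_htdig is
first run.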


Local htdig Configuration
-------------------------

This configuration is for when you are running Mailman, htdig, the HTTP server 
used to provide Mailman's web UI and htdig's htsearch CGI script, on the same 
machine.

You will need to:  

    1. Set up a symbolic link in the directory where htdig expects to find its 
       configuration files; this depends on how you configured and installed
       htdig but it is usually the directory containing htdig's default
       htdig.conf file. The target of this link is the directory whose path is
       assigned as the value of HTDIG_CONF_LINK_DIR. The name of the link must
       be the same as the value you assign to HTDIG_MAILMAN_LINK. For example,
       use the command:
    
        ln -s /home/mailman/archives/htdig /etc/htdig-mailman

    2. If different to the default value, add the definition of 
       HTDIG_MAILMAN_LINK to file $prefix/Mailman/mm_cfg.py

    3. If different to the default value, add the definition of 
       HTDIG_RUNDIG_PATH to file $prefix/Mailman/mm_cfg.py.

    4. Add the definition of USE_HTDIG with the value 1 to 
       $prefix/Mailman/mm_cfg.py.

        USE_HTDIG = 1
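
The symbolic link from step 1 is easy to get wrong, so it can be worth checking
it before enabling the integration. The Python 3 sketch below is a suggestion
only; the function name is hypothetical and the paths in the usage note are the
ones from the example command in step 1.

```python
import os


def link_points_to(link_path, expected_dir):
    """Return True when link_path is a symbolic link resolving to expected_dir."""
    return (os.path.islink(link_path)
            and os.path.realpath(link_path) == os.path.realpath(expected_dir))
```

For the example above the check would be
link_points_to('/etc/htdig-mailman', '/home/mailman/archives/htdig').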


If necessary you can override the values of any of the other configuration 
variables in file $prefix/Mailman/mm_cfg.py. In particular you might need to 
change the following URL variables from their defaults: HTDIG_SEARCH_URL and 
HTDIG_FILES_URL.
    
These URLs can be just the path i.e. absolute URL on the same server as that 
which serves Mailman's Web UI, or a full URL identifying the protocol (http), 
server, server port and path, for example 
http://mailer.your.com:8080/cgi-bin/htdig/htsearch.
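
Put together, the additions to $prefix/Mailman/mm_cfg.py for a local htdig
might look like the sketch below. Only USE_HTDIG has to be set; the other lines
simply restate the documented defaults and are needed only where your
installation differs. In the real file these lines follow the existing
"from Defaults import *" line.

```python
# Sketch of local htdig settings for mm_cfg.py (values are the documented
# defaults, apart from USE_HTDIG which turns the integration on).
USE_HTDIG = 1
HTDIG_MAILMAN_LINK = 'htdig-mailman'
HTDIG_RUNDIG_PATH = '/usr/local/bin/rundig'

# Path-only URLs are resolved on the server that serves Mailman's Web UI.
HTDIG_SEARCH_URL = '/cgi-bin/htsearch'
HTDIG_FILES_URL = '/htdig/'  # must end with '/'
```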

Remote htdig Configuration
--------------------------

This configuration is for when you are running htdig and an HTTP server 
providing access to htsearch on a different machine to that running Mailman and 
the HTTP server used to provide Mailman's web interface.

For this configuration to work, htdig's programs, both those run from command 
lines such as rundig and those run via CGI such as htsearch, must be able to see 
Mailman archives through NFS. In the examples below we'll assume that 
/mnt/mailman-archives on the htdig machine maps to $prefix/mailman/archives on 
the Mailman machine.

You should also arrange for the mailman UID and its GID to be common to both
machines. Remember that when rundig is called on the htdig machine to produce 
search indices for each list it will be trying to write those files via NFS in 
Mailman's archive area and will thus need to run with an appropriate identity 
and permissions.

The differences between the local and remote configuration are:

    1. configuration values telling htdig where to find files are as viewed from 
       the remote machine.

    2. configuration values giving URLs that refer to htdiggy things have to be 
       as viewed from the Mailman machine.
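
The first of these differences amounts to translating paths from the Mailman
machine's view to the htdig machine's view. The Python 3 sketch below shows the
idea; it is not part of the patch, the function name is hypothetical, and the
usage note assumes the archives live in /home/mailman/archives on the Mailman
machine and are mounted at /mnt/mailman-archives on the htdig machine.

```python
import posixpath


def as_seen_from_htdig(local_path, local_root, remote_root):
    """Translate a path under local_root into the same path under remote_root."""
    rel = posixpath.relpath(local_path, local_root)
    if rel == '..' or rel.startswith('../'):
        raise ValueError('%s is not under %s' % (local_path, local_root))
    return posixpath.normpath(posixpath.join(remote_root, rel))


print(as_seen_from_htdig('/home/mailman/archives/private',
                         '/home/mailman/archives',
                         '/mnt/mailman-archives'))
# /mnt/mailman-archives/private
```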

You will need to:  

    1. Set up a symbolic link in the directory where htdig expects to find its 
       configuration files; this depends on how you configured and installed
       htdig but it is usually the directory containing htdig's default
       htdig.conf file. The target of this link is the directory whose path is
       assigned as the value of HTDIG_CONF_LINK_DIR as seen from the remote
       machine running htdig. The name of the link must be the same as the value
       you assign to HTDIG_MAILMAN_LINK. For example, use the command:

        ln -s /mnt/mailman-archives/htdig /etc/htdig-mailman

    2. Add the definition of HTDIG_MAILMAN_LINK to file 
       $prefix/Mailman/mm_cfg.py. For example:

        HTDIG_MAILMAN_LINK = 'htdig-mailman'

    3. Add the definition of HTDIG_RUNDIG_PATH to file
       $prefix/Mailman/mm_cfg.py. This is the path to rundig on the remote
       machine running htdig. For example:

        HTDIG_RUNDIG_PATH = '/usr/local/bin/rundig'

    4. Add the definition of HTDIG_SEARCH_URL to file $prefix/Mailman/mm_cfg.py. 
       This must be a full URL referring to the htsearch CGI program on the
       remote htdig machine, as seen from the Mailman local machine. For
       example:
    
        HTDIG_SEARCH_URL = 'http://htdiggy.your.com/cgi-bin/htsearch'

    5. Add the definition of HTDIG_FILES_URL to file $prefix/Mailman/mm_cfg.py. 
       This must be a full URL referring to the directory containing htdig files
       on the remote htdig machine as seen from the Mailman local machine. This
       URL must end with a '/'. For example:

        HTDIG_FILES_URL = 'http://htdiggy.your.com/htdig/'
    
    6. Add the definition of REMOTE_PRIVATE_ARCHIVE_FILE_DIR to 
       $prefix/Mailman/mm_cfg.py. This must be the absolute file system path to
       the directory in which Mailman stores private archives as seen by the
       machine running htdig. For example:

        REMOTE_PRIVATE_ARCHIVE_FILE_DIR = '/mnt/mailman-archives/private'

    7. Add the definition of USE_HTDIG with the value 1 to 
       $prefix/Mailman/mm_cfg.py.

        USE_HTDIG = 1

    8. Add the definition of REMOTE_HTDIG with the value 1 to 
       $prefix/Mailman/mm_cfg.py.

        REMOTE_HTDIG = 1
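
The example values from steps 2 to 8 can be collected and sanity-checked before
they go into mm_cfg.py. The Python 3 sketch below is a suggestion only: the
dictionary and the checking function are hypothetical helpers, not part of the
patch, and the checks cover just the rules stated in the steps above (full
URLs, a trailing '/' on HTDIG_FILES_URL, an absolute archive path).

```python
from urllib.parse import urlsplit

# Example values taken from the steps above.
remote_settings = {
    'USE_HTDIG': 1,
    'REMOTE_HTDIG': 1,
    'HTDIG_MAILMAN_LINK': 'htdig-mailman',
    'HTDIG_RUNDIG_PATH': '/usr/local/bin/rundig',
    'HTDIG_SEARCH_URL': 'http://htdiggy.your.com/cgi-bin/htsearch',
    'HTDIG_FILES_URL': 'http://htdiggy.your.com/htdig/',
    'REMOTE_PRIVATE_ARCHIVE_FILE_DIR': '/mnt/mailman-archives/private',
}


def remote_config_problems(cfg):
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    for name in ('HTDIG_SEARCH_URL', 'HTDIG_FILES_URL'):
        parts = urlsplit(cfg[name])
        if not (parts.scheme and parts.netloc):
            problems.append(name + ' must be a full URL')
    if not cfg['HTDIG_FILES_URL'].endswith('/'):
        problems.append("HTDIG_FILES_URL must end with '/'")
    if not cfg['REMOTE_PRIVATE_ARCHIVE_FILE_DIR'].startswith('/'):
        problems.append('REMOTE_PRIVATE_ARCHIVE_FILE_DIR must be an absolute path')
    return problems


print(remote_config_problems(remote_settings))  # []
```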

You have to choose one of the three remote_nightly_htdig scripts found in 
$prefix/cron - remote_nightly_htdig, remote_nightly_htdig_noshare and 
remote_nightly_htdig.pl - and transfer it to the htdig machine. See "The patch
adds" under the heading "What is Installed by the Patch" above for an
explanation of the differences between these scripts, which all do the same
basic job. You
should add the script to the crontab for the mailman UID on the htdig machine. 
But first you need to edit the selected script to add some configuration 
information. What has to be added depends on which script you opt to use. In 
each case the variables concerned are declared near the top of the script and 
you just have to enter the appropriate values:

    remote_nightly_htdig
        you only need to set the value of the python variable MAILMAN_PATH to be 
        the directory $prefix as seen from the htdig machine. The whole Mailman 
        installation must be accessible via NFS in order to use this script.

    remote_nightly_htdig_noshare
        you need to copy the values for the following configuration 
        variables from either $prefix/Mailman/mm_cfg.py or 
        $prefix/Mailman/Defaults.py to the script: DEFAULT_URL, 
        REMOTE_PRIVATE_ARCHIVE_FILE_DIR, HTDIG_RUNDIG_PATH. The variables 
        declared in remote_nightly_htdig_noshare use the same names. This script 
        only requires that the archives directory of the Mailman installation be 
        accessible via NFS.

        Note: DEFAULT_URL is not a Mailman-htdig integration specific
        configuration variable. In most installations DEFAULT_URL is set up
        automatically by the 'make install' in $prefix/Mailman/Defaults.py and
        not usually overridden in $prefix/Mailman/mm_cfg.py. You should find it
        defined near the top of Defaults.py.

    remote_nightly_htdig.pl
        you need to copy the values for the following configuration 
        variables from either $prefix/Mailman/mm_cfg.py or 
        $prefix/Mailman/Defaults.py to the script: DEFAULT_URL, 
        REMOTE_PRIVATE_ARCHIVE_FILE_DIR, HTDIG_RUNDIG_PATH. Being a Perl script, 
        the variables in remote_nightly_htdig.pl use the same names but prefixed 
        with the '$' character. This script only requires that the archives 
        directory of the Mailman installation be accessible via NFS.

        Note 1: DEFAULT_URL is not a Mailman-htdig integration specific
        configuration variable. In most installations DEFAULT_URL is set up
        automatically by the 'make install' in $prefix/Mailman/Defaults.py and
        not usually overridden in $prefix/Mailman/mm_cfg.py. You should find it
        defined near the top of Defaults.py.

        Note 2: You may need to change the '#! /usr/bin/env perl' on the first
        line of this script if that doesn't find your Perl executable. You may
        also need to verify the Perl packages used by this script are installed
        on your system.
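
When filling in remote_nightly_htdig_noshare or remote_nightly_htdig.pl, the
value to copy for each variable is the one in mm_cfg.py if it is set there,
otherwise the one in Defaults.py. A minimal sketch of that lookup follows. The
function name and the sample values are hypothetical, and plain dictionaries
stand in for the two Mailman files so the example runs without a Mailman
installation.

```python
# Names of the variables the noshare and Perl scripts need, per the text above.
NEEDED = ('DEFAULT_URL', 'REMOTE_PRIVATE_ARCHIVE_FILE_DIR', 'HTDIG_RUNDIG_PATH')


def effective_settings(defaults, overrides, names=NEEDED):
    """Return {name: value}, preferring overrides (mm_cfg.py) over defaults."""
    result = {}
    for name in names:
        if name in overrides:
            result[name] = overrides[name]
        elif name in defaults:
            result[name] = defaults[name]
        else:
            raise KeyError('%s is not defined in either file' % name)
    return result


# Hypothetical sample values standing in for Defaults.py and mm_cfg.py.
defaults = {
    'DEFAULT_URL': 'http://lists.example.com/mailman/',
    'HTDIG_RUNDIG_PATH': '/usr/bin/rundig',
}
overrides = {
    'REMOTE_PRIVATE_ARCHIVE_FILE_DIR': '/mnt/mailman-archives/private',
}

settings = effective_settings(defaults, overrides)
for name in NEEDED:
    # Print each value in a form that can be pasted into the chosen script.
    print('%s = %r' % (name, settings[name]))
```

For the Perl script the same values apply, with each variable name prefixed by
'$' as noted above.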

As with the nightly_htdig script when running with local htdig, these scripts 
can be run from the command line using the mailman UID in order to get htdig to 
construct an initial set of indices.

Upgrading an Existing Standard Mailman Installation
---------------------------------------------------

You will want to suspend operation of Mailman while doing the upgrade. Consider 
doing a shutdown of the MTA delivering mail to Mailman and removing Mailman's 
crontab.

Configure and install as described above.

Restart Mailman's crontab and restart your MTA's delivery to Mailman.

If your installation already has archives:

    1. Send a message to each of your archive-enabled lists. This will stimulate 
       the setup of the new per list htdig config files in the Mailman archives.

    2. Consider rebuilding your existing archives with $prefix/bin/arch. This 
       will embed the ARCHIVE_INDEXING_ENABLE and ARCHIVE_INDEXING_DISABLE
       markers in the regenerated archive pages and, after nightly_htdig has
       been run, give improved search results.

    3. Run the nightly_htdig script from the command line to generate a new set 
       of per list htdig search indices.

Changing from local to remote htdig or vice versa
-------------------------------------------------

You will want to suspend operation of Mailman while making this change. Consider 
doing a shutdown of the MTA delivering mail to Mailman and removing Mailman's 
crontab.

Run the $prefix/bin/blow_away_htdig script to remove all existing per list htdig 
config files and htdig indices/db files.

Configure per the instructions above for the local or remote target.

Restart Mailman's crontab and restart your MTA's delivery to Mailman.

Send a message to each of your archive-enabled lists. This will stimulate the 
set up of the new per list htdig config files in Mailman archives.

Run the nightly_htdig script from the command line to generate a new set of per 
list htdig search indices.

Coping with htdig Upgrades
--------------------------

If you change the version of htdig you run, you may find that the indices built 
with the earlier version are not compatible with the newer version of htdig's 
programs. In that case do the following:

    1. You will want to suspend operation of Mailman while making this change. 
       Consider doing a shutdown of the MTA delivering mail to Mailman and
       removing Mailman's crontab.

    2. Run the $prefix/bin/blow_away_htdig script with the -i flag to remove all 
       existing per list htdig indices/db files.

    3. Restart Mailman's crontab and restart your MTA's delivery to Mailman.

    4. Run the nightly_htdig script from the command line to generate new sets
       of per list htdig search indices.
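
The two commands in steps 2 and 4 can be assembled as in the sketch below,
which only builds and prints them rather than running anything. The prefix
value is hypothetical, and the location of nightly_htdig in $prefix/cron is an
assumption based on where the remote_nightly_htdig scripts are installed;
check your own installation.

```python
import os


def htdig_upgrade_commands(prefix):
    """Build the commands for steps 2 and 4 as argument lists."""
    return [
        # Step 2: remove only the per list htdig indices/db files (-i flag).
        [os.path.join(prefix, 'bin', 'blow_away_htdig'), '-i'],
        # Step 4: rebuild the indices (script location is an assumption).
        [os.path.join(prefix, 'cron', 'nightly_htdig')],
    ]


# '/usr/local/mailman' is a hypothetical $prefix; use your own.
for argv in htdig_upgrade_commands('/usr/local/mailman'):
    print(' '.join(argv))
```

Both commands should be run using the mailman UID, with step 3 carried out
between them.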

Changing the Addressing Scheme of your web_page_url
---------------------------------------------------

If you change the addressing scheme of the web_page_url for a list, for example
from http to https or vice versa, then you will need to rebuild the list's htdig
configuration file(s) and the related htdig indices. Do the following:

    1. You may want to suspend operation of Mailman while making this change. 
       Consider doing a shutdown of the MTA delivering mail to Mailman and
       removing Mailman's crontab.

    2. Run the $prefix/bin/blow_away_htdig script to remove all existing per
       list htdig material for the list(s) concerned.

    3. Restart Mailman's crontab and restart your MTA's delivery to Mailman.

    4. Send a message to each affected list to provoke reconstruction of the
       list's htdig config file(s).

    5. Run the nightly_htdig script from the command line to generate new sets
       of per list htdig search indices.
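
Whether a change to web_page_url calls for this procedure depends on whether
the scheme part of the URL differs. A minimal sketch of that comparison
follows; the function name and the example URLs are hypothetical. It uses
Python 3's urllib.parse; on the Python 2 releases that Mailman of this era
runs under, the equivalent module is urlparse.

```python
from urllib.parse import urlsplit


def scheme_changed(old_url, new_url):
    """Return True if the two URLs use different addressing schemes."""
    old_scheme = urlsplit(old_url).scheme.lower()
    new_scheme = urlsplit(new_url).scheme.lower()
    return old_scheme != new_scheme


# Hypothetical old and new web_page_url values for one list.
print(scheme_changed('http://lists.example.com/mailman/',
                     'https://lists.example.com/mailman/'))
```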


Operational Information
=======================

If you have just turned USE_HTDIG on or just used $prefix/bin/blow_away_htdig 
(without the -i flag) there will initially be no per list htdig information 
saved in the archives.

When the first post to each archive-enabled list is archived by pipermail, the 
per list htdig config file will be constructed and some directories and links 
added to your Mailman archive directories. The htdig search form will be added 
to list's TOC page. 

However, until one of the nightly_htdig scripts is run no htdig indices will be 
constructed. You can either wait for the script to run as a cron job or run it 
(while using the mailman UID) from the command line.

Notes and Warnings
==================

Redhat 7.1 and 7.2 installations:

    If you install htdig from the htdig-3.2.0 binary rpm of RH7.1/2 Binary CD 1 
    of 2 you also have to install the htdig-web-3.2.0 binary rpm. This may be 
    from RH 7.1/2 Binary CD 2 of 2 or CD 1 of 2 depending on whether you are 
    using actual CDs or downloaded CD images.

Apache/htdig issues

    The htsearch CGI script part of htdig and some associated HTML and graphics
    files must be accessible via your web server and the Mailman configuration
    variables HTDIG_SEARCH_URL and HTDIG_FILES_URL set up accordingly. Depending
    on how you install htdig and Apache you may need to add Alias and/or
    ScriptAlias directives to your Apache configuration file to make the htdig
    components accessible. Check the Apache and htdig documentation.

Contributors
============

Original author and maintainer: Richard Barrett - r.barrett at ftel.co.uk

Past bug fixes: Nigel Metheringham <Nigel.Metheringham at VData.co.uk>

Testers: Mark T. Valites <valites at geneseo.edu>,
         Rehan van der Merwe <Rehan at nha.co.za>

Appendices
==========

Appendix 1 - Technique for htdigging when Mailman's DEFAULT_URL uses https
--------------------------------------------------------------------------

A technique for htdigging when Mailman's DEFAULT_URL uses the https addressing
scheme is described in this archived e-mail:

http://www.htdig.org/mail/1999/10/0187.html

The text of that e-mail is as follows:

[htdig] Re: Help about htdig indexing https files

--------------------------------------------------------------------------------

Gilles Detillieux (grdetil at scrc.umanitoba.ca)
Wed, 27 Oct 1999 10:18:31 -0500 (CDT) 



--------------------------------------------------------------------------------

According to Edouard DESSIOUX: 
> >Currently, htdig will not support URLs that begin with https://, even when 
> >using local_urls to bypass the server. A trick that might work would be 
> >to index using http:// instead, but use local_urls to point to the directory 
> >that contains the contents of the secure server. 
> 
> I used that, and now, when i use htsearch, it work, except the fact 
> that all my URL are http://x.y.z/ instead of https://x.y.z/ 
> 
> >You'd need to use separate 
> >configuration files for digging and searching, and use url_part_aliases in 
> >each of these configuration files to rewrite the http:// into https:// in the 
> >search results. 
> 
> This is the part i dont understand, and i would like you to explain. 


It basically works as a search and replace. One url_part_aliases in the 
configuration file used by htdig maps the http://x.y.z/ into some special 
code like "*site", and another url_part_aliases in the configuration file 
used by htsearch maps the "*site" back into the value you want, i.e. 
https://x.y.z/. The substitution is left to right in htdig, and right to 
left in htsearch. So, if you use the same config file for both, or the 
same setting for both, you get back what you started with (but saved some 
space in the database because of the encoding). However, if you use two 
separate config files with different url_part_aliases setting for htdig 
and htsearch, you can remap parts of URLs from one substring to another. 


I hope this makes things clearer. I thought the current description 
at http://www.htdig.org/attrs.html#url_part_aliases was already quite clear. 



-- 
Gilles R. Detillieux              E-mail: <grdetil at scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
------------------------------------
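
The search-and-replace behaviour of url_part_aliases described in the e-mail
above can be modelled with the sketch below. It is a simplified illustration
of the idea only, not htdig's actual implementation: the function names are
hypothetical, the host x.y.z and the "*site" code are the placeholders used in
the e-mail, and the example path is made up.

```python
def htdig_encode(url, aliases):
    """Model of the htdig side: replace each plain string with its code."""
    for plain, code in aliases:
        url = url.replace(plain, code)
    return url


def htsearch_decode(stored, aliases):
    """Model of the htsearch side: replace each code with its plain string."""
    for plain, code in aliases:
        stored = stored.replace(code, plain)
    return stored


# Separate alias settings for digging and searching, as the e-mail suggests.
DIG_ALIASES = [('http://x.y.z/', '*site')]
SEARCH_ALIASES = [('https://x.y.z/', '*site')]

# htdig indexes via http:// and stores the encoded form of the URL ...
stored = htdig_encode('http://x.y.z/pipermail/mylist/', DIG_ALIASES)
# ... and htsearch maps the code back to the https:// form for display.
shown = htsearch_decode(stored, SEARCH_ALIASES)

print(stored)
print(shown)
```

With the same alias list on both sides the URL comes back unchanged, which is
the single config file case the e-mail mentions.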

