[Mailman-Developers] Indexing pipermail archives

Thu, 22 Jun 2000 13:51:20 +0100

I've just been looking at the problem of indexing/searching a large 
Mailman/pipermail archive.

htdig will do the job, but the (htdig) indexes get bloated by the 
pipermail index pages (which are *very* rarely what you want when 
searching for something).

Indexing can be controlled by meta tags.  [See
http://info.webcrawler.com/mak/projects/robots/meta-user.html ].  The
pipermail HTML generation is hard coded into the archiver code.

As a short term fix, would people be happy with me adding the following 
meta tags to the pipermail HTML generation:-

  On top level (i.e. list of weeks/months etc.) and by-date index pages:-
     <meta name="robots" content="noindex,follow">
	[i.e. do not index the page, but follow links down to the articles]

  On thread/subject/author indexes
     <meta name="robots" content="noindex,nofollow">
        [skip page and linked pages - nofollow is superfluous since the
	indexing robot should realise that the pages are already included
	but it doesn't hurt much]

  On article pages
     <meta name="robots" content="index,nofollow">
	[you may disagree with the nofollow, but I think there is no general
	requirement for the indexer to follow links off the list]
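
As a rough sketch only, the mapping above could be kept in one small
table rather than scattered through the HTML templates.  The page-type
names and the robots_meta() helper below are made up for illustration;
they are not the identifiers the pipermail code actually uses.

```python
# Hypothetical page-type keys; pipermail's real generator does not
# necessarily name its pages this way.
ROBOTS_POLICY = {
    "toplevel": "noindex,follow",    # list of weeks/months etc.
    "date":     "noindex,follow",    # by-date index
    "thread":   "noindex,nofollow",
    "subject":  "noindex,nofollow",
    "author":   "noindex,nofollow",
    "article":  "index,nofollow",
}


def robots_meta(page_type):
    """Return the robots meta tag for one kind of archive page."""
    content = ROBOTS_POLICY[page_type]
    return '<meta name="robots" content="%s">' % content


if __name__ == "__main__":
    for page_type in ("toplevel", "thread", "article"):
        print(page_type, robots_meta(page_type))
```

The archiver would then write the returned string into the <head> of
each page it generates, so changing the policy later means editing one
table.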

It's a hack for now, but it will make htdig and other indexing robots 
behave better.

Comments?

	Nigel.
-- 
[ - Opinions expressed are personal and may not be shared by VData - ]
[ Nigel Metheringham                  Nigel.Metheringham@VData.co.uk ]
[ Phone: +44 1423 850000                         Fax +44 1423 858866 ]