[Mailman-Developers] Requirements for a new archiver

Wed Oct 29 22:37:32 EST 2003

At 7:00 PM -0800 2003/10/29, Chuq Von Rospach wrote:

>  you could, but is it worth doing it yourself when MySQL is building
>  it for you?
>
>  http://www.mysql.com/doc/en/Fulltext_Search.html

	From the top of this page:

6.8  MySQL Full-text Search

As of Version 3.23.23, MySQL has support for full-text indexing and 
searching.  Full-text indexes in MySQL are an index of type FULLTEXT. 
FULLTEXT indexes are used with MyISAM tables only and can be created 
from CHAR, VARCHAR, or TEXT columns at CREATE TABLE time or added 
later with ALTER TABLE or CREATE INDEX.  For large datasets, it will 
be much faster to load your data into a table that has no FULLTEXT 
index, then create the index with ALTER TABLE (or CREATE INDEX). 
Loading data into a table that already has a FULLTEXT index could be 
significantly slower.

	Moreover, mail messages will be a undetermined variable length. 
Can MySQL support a 32-bit VARCHAR?  What about type TEXT?  Or 8-bit 
or even 16-bit character sets?  Since you might be storing a lot of 
MIME bodypart types, can it handle BLOBs, and can it handle them 
well?  Or, do you do parsing within your archive application and 
store the entire message somewhere outside of the database, while 
storing a FULLTEXT index of only the bodypart types you declare to be 
human-readable?

	What if you want to do a case-sensitive search?  In that case, it 
doesn't look like FULLTEXT or MATCH will do you any good, since MATCH 
is declared to be case-insensitive.  Or what if you want to search 
for hyphenated literals?  It seems that MATCH considers them to be 
word breaks even within literal searches.

>  If you were just storing into a TEXT and then doing SELECT LIKE into it,
>  I'd agree with you. But MySQL is doing interesting things here. Why not
>  leverage it?

	I'm not sure it really helps in this case.  I'm not sure it can 
handle the amounts of data that might need to be stored into a field, 
or the different character sets that might need to be used.  I'm also 
concerned about what using this function might do to the overall 
speed and size of the database.

	On the page quoted above, look for benchmark data reported by Jim 
Nguyen  and John Takacs.  Two million rows with text and multiple 
word searches (three or more) taking 30-seconds to a minute to 
complete, is not good performance.  Three to five million rows, with 
searches taking 50 seconds or more for single words, is not good 
performance.

	Now, consider how many words might be in a single message 
(hundreds to thousands or even tens of thousands), and how many 
messages might be in a single archive (thousands to millions).  If 
each message was contained within a row, this would be dead-Universe 
slow.

-- 
Brad Knowles, <brad.knowles at skynet.be>

"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety."
     -Benjamin Franklin, Historical Review of Pennsylvania.

GCS/IT d+(-) s:+(++)>: a C++(+++)$ UMBSHI++++$ P+>++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP>+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+(++++) DI+(++++) D+(++) G+(++++) e++>++++ h--- r---(+++)* z(+++)