[Mailman-Developers] Requirements for a new archiver
Brad Knowles
brad.knowles at skynet.be
Wed Oct 29 22:37:32 EST 2003
At 7:00 PM -0800 2003/10/29, Chuq Von Rospach wrote:
> you could, but is it worth doing it yourself when MySQL is building
> it for you?
>
> http://www.mysql.com/doc/en/Fulltext_Search.html
From the top of this page:
6.8 MySQL Full-text Search
As of Version 3.23.23, MySQL has support for full-text indexing and
searching. Full-text indexes in MySQL are an index of type FULLTEXT.
FULLTEXT indexes are used with MyISAM tables only and can be created
from CHAR, VARCHAR, or TEXT columns at CREATE TABLE time or added
later with ALTER TABLE or CREATE INDEX. For large datasets, it will
be much faster to load your data into a table that has no FULLTEXT
index, then create the index with ALTER TABLE (or CREATE INDEX).
Loading data into a table that already has a FULLTEXT index could be
significantly slower.
Moreover, mail messages will be a undetermined variable length.
Can MySQL support a 32-bit VARCHAR? What about type TEXT? Or 8-bit
or even 16-bit character sets? Since you might be storing a lot of
MIME bodypart types, can it handle BLOBs, and can it handle them
well? Or, do you do parsing within your archive application and
store the entire message somewhere outside of the database, while
storing a FULLTEXT index of only the bodypart types you declare to be
human-readable?
What if you want to do a case-sensitive search? In that case, it
doesn't look like FULLTEXT or MATCH will do you any good, since MATCH
is declared to be case-insensitive. Or what if you want to search
for hyphenated literals? It seems that MATCH considers them to be
word breaks even within literal searches.
> If you were just storing into a TEXT and then doing SELECT LIKE into it,
> I'd agree with you. But MySQL is doing interesting things here. Why not
> leverage it?
I'm not sure it really helps in this case. I'm not sure it can
handle the amounts of data that might need to be stored into a field,
or the different character sets that might need to be used. I'm also
concerned about what using this function might do to the overall
speed and size of the database.
On the page quoted above, look for benchmark data reported by Jim
Nguyen and John Takacs. Two million rows with text and multiple
word searches (three or more) taking 30-seconds to a minute to
complete, is not good performance. Three to five million rows, with
searches taking 50 seconds or more for single words, is not good
performance.
Now, consider how many words might be in a single message
(hundreds to thousands or even tens of thousands), and how many
messages might be in a single archive (thousands to millions). If
each message was contained within a row, this would be dead-Universe
slow.
--
Brad Knowles, <brad.knowles at skynet.be>
"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety."
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++)>: a C++(+++)$ UMBSHI++++$ P+>++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP>+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+(++++) DI+(++++) D+(++) G+(++++) e++>++++ h--- r---(+++)* z(+++)
More information about the Mailman-Developers
mailing list