Clustering text-documents in bundles

Wed Sep 26 02:51:42 EDT 2007

On Sep 25, 7:52 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> "exhuma.twn" <exh... at gmail.com> writes:
> > Is it possible to calculate a distance between two chunks of text? I
> > suppose one could simply do a simple word-count on the chunks
> > (removing common noise words of course). And then go from there. Maybe
> > even assigning different weighting to words. But maybe there is a well-
> > tested and useful algorithm already available?
>
> There's a huge field of text mining that attempts to do things like
> this.  See http://en.wikipedia.org/wiki/Latent_semantic_analysis for some
> info about one approach.  Manning & Schütze's book "Foundations of Statistical
> Natural Language Processing" (http://nlp.stanford.edu/fsnlp/) is
> a standard reference about text processing.  They also have a
> new one about information retrieval (downloadable as a pdf) that
> looks very good: <http://informationretrieval.org>.

Thanks a lot. This gives me some bed-time reading.
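
For reference, the simple word-count idea from the quoted question might be
sketched in Python roughly as follows. This is only a minimal sketch, not the
latent semantic analysis approach from the links above. STOP_WORDS,
word_counts, cosine_distance and the weights parameter are hypothetical names
made up for illustration, and cosine distance is just one assumed choice of
measure among many.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny noise-word list; a real one would be longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "on"}


def word_counts(text):
    """Lower-case the text, split it into words, and count the non-noise words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine_distance(text_a, text_b, weights=None):
    """Return 1 - cosine similarity of the (optionally weighted) count vectors.

    weights is an assumed optional dict mapping a word to a multiplier;
    words that are not listed keep a weight of 1.0.
    """
    weights = weights or {}
    a = {w: c * weights.get(w, 1.0) for w, c in word_counts(text_a).items()}
    b = {w: c * weights.get(w, 1.0) for w, c in word_counts(text_b).items()}
    # Only words present in both chunks contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty chunk as maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

Two chunks with the same words come out at a distance of about 0, and two
chunks with no words in common come out at 1. Passing something like
weights={"python": 2.0} makes matches on that word count for more.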