Clustering text-documents in bundles

exhuma.twn exhuma at gmail.com
Tue Sep 25 11:11:33 EDT 2007


Hi,

This *is* off-topic, but with Python being a language with a somewhat
scientific audience, I might get lucky ;)
I have a set of documents (helpdesk tickets, in fact) and I would like
to automatically collect them into bundles so I can visualise some
statistics depending on content.

A while ago I wrote a very simple clustering library which can cluster
just about anything for which you can calculate some form of distance.
Meaning: you supply a function that calculates a numeric value given
two objects (helpdesk request text bodies in this case). The closer the
two objects are related, the smaller the returned value, with 0.0
meaning that the two objects are identical.
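To make the interface concrete, here is a toy example of the kind of
distance callback I mean (the `cluster` call in the comment is a
hypothetical API, and the length-based distance is deliberately silly):

```python
def length_distance(a, b):
    """Toy distance callback: 0.0 when the two texts have equal length."""
    return float(abs(len(a) - len(b)))

# The clustering library then only needs the callable, e.g.
# (hypothetical API):
# bundles = cluster(tickets, distance=length_distance)
```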

Is it possible to calculate a distance between two chunks of text? I
suppose one could do a simple word count on the chunks (removing
common noise words, of course) and go from there, maybe even assigning
different weights to words. But perhaps there is a well-tested and
useful algorithm already available?
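For what it's worth, a minimal sketch of the word-count idea using
only the standard library: count words, drop noise words, and take the
cosine distance between the two count vectors. The stop-word list and
the tokenising regex below are just placeholders:

```python
import math
import re
from collections import Counter

# placeholder noise-word list; a real one would be much longer
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "in"}

def word_vector(text):
    """Lowercase word counts with noise words removed."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine_distance(text_a, text_b):
    """0.0 = identical word distributions, 1.0 = no words in common."""
    va, vb = word_vector(text_a), word_vector(text_b)
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    if norm == 0:
        return 1.0  # at least one text had no usable words
    return 1.0 - dot / norm
```

This plugs straight into a distance-based clusterer, since identical
texts give 0.0 and completely unrelated texts give 1.0.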

Text processing is a very blurry area for me. I don't expect any
solutions for the problem right away. Maybe just some pointers as to
*what* I can google for. I'll pick the rest up from there.

Eventually I would like to be able to say: "This set of texts contains
20 requests dealing with email, 30 requests dealing with Office
applications, and 210 requests dealing with databases". I am aware
that labelling the different text bundles will have to be done
manually, but I will aim for no more than 10 bundles anyway, so that's
OK.



