in place list modification necessary? What's a better idiom?

Tue Apr 7 05:11:30 EDT 2009

Carl Banks wrote:

> On Apr 7, 12:38 am, Peter Otten <__pete... at web.de> wrote:
>> MooMaster wrote:
>> > Now we can't calculate a meaningful Euclidean distance for something
>> > like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
>> > distance or something overly complicated, so instead we'll use a
>> > simple quantization scheme of enumerating the set of values within the
>> > column domain and replacing the strings with numbers (e.g. Iris-setosa
>> > = 1, Iris-versicolor = 2).
>>
>> I'd calculate the distance as
>>
>> def string_dist(x, y, weight=1):
>>     # 0 when the two labels match, `weight` when they differ
>>     return weight * (x != y)
>>
>> You don't get a high resolution in that dimension, but you don't
>> introduce an element of randomness, either.
> 
> Does the algorithm require well-ordered data along the dimensions?
> Though I've never heard of it, the fact that it's called "bisecting
> Kmeans" suggests to me that it does, which means this wouldn't work.

I've read about K-Means in Segaran's "Programming Collective Intelligence",
which describes it as follows:

"K-Means clustering begins with k randomly placed centroids (points in space
that represent the center of the cluster), and assigns every item to the
nearest one. After the assignment, the centroids are moved to the average
location of all the nodes assigned to them, and the assignments are redone.
This process repeats until the assignments stop changing."
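
As a rough sketch of that description (my own code, not the book's): the
names kmeans, euclidean and mean_point are made up for illustration, and I
assume the initial centroids are picked from the data points rather than
placed anywhere in space.

```python
import random

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mean_point(points):
    """Component-wise average of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans(points, k, dist=euclidean, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Assumption: start from k distinct data points instead of arbitrary
    # random positions, so no centroid starts far away from all the data.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign every item to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # the assignments stopped changing
        assignment = new_assignment
        # Move each centroid to the average of the items assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = mean_point(members)
    return centroids, assignment

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, assignment = kmeans(points, k=2)
```

Note that the centroid update takes an average, so this version only makes
sense for numeric columns, whatever dist function is passed in.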

The book doesn't go into the theory, and "any distance would do" was my
assumption, which may be wrong.
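
To make that assumption concrete, here is a hedged sketch of a distance
that mixes numeric columns with a class label. The name mixed_dist and the
row layout (numbers first, label last) are my own choices for illustration.
Equal labels contribute 0, differing labels contribute weight.

```python
def mixed_dist(a, b, weight=1.0):
    """Distance between two rows whose last field is a string label.

    Numeric fields contribute squared differences as in the Euclidean
    distance; the label contributes 0 if both labels are equal and
    weight squared if they differ.
    """
    a_nums, a_label = a[:-1], a[-1]
    b_nums, b_label = b[:-1], b[-1]
    total = sum((x - y) ** 2 for x, y in zip(a_nums, b_nums))
    total += (weight * (a_label != b_label)) ** 2
    return total ** 0.5

same = mixed_dist((5.1, 3.5, "Iris-setosa"), (5.1, 3.5, "Iris-setosa"))
diff = mixed_dist((5.1, 3.5, "Iris-setosa"),
                  (5.1, 3.5, "Iris-versicolor"), weight=2)
```

This only covers the "assign to the nearest centroid" half. A centroid
cannot hold the average of a set of strings, so the update step would need
some other rule for the label column (the most common label, say), and
whether the result still behaves like K-Means is exactly the part I'm not
sure about.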

Peter