in place list modification necessary? What's a better idiom?

Thu Apr 9 01:09:49 EDT 2009

On Apr 7, 12:40 am, Peter Otten <__pete... at web.de> wrote:
> Peter Otten wrote:
> > MooMaster wrote:
>
> >> Now we can't calculate a meaningful Euclidean distance for something
> >> like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
> >> distance or something overly complicated, so instead we'll use a
> >> simple quantization scheme of enumerating the set of values within the
> >> column domain and replacing the strings with numbers (i.e. Iris-setosa
> >> = 1, iris-versicolor=2).
>
> > I'd calculate the distance as
>
> > def string_dist(x, y, weight=1):
> >     return weight * (x == y)
>
> oops, this must of course be (x != y).
>
> > You don't get a high resolution in that dimension, but you don't introduce
> > an element of randomness, either.
>
> > Peter
>
>

The randomness doesn't matter too much. All K-means cares about is the
distance between two points in a coordinate space, and as long as that
space stays fixed between iterations the particular numbers chosen are
unimportant (i.e. we don't want (1,1) becoming (3,1) on the next
iteration, or the value for a quantized column changing). With that in
mind, I was hoping to be lazy and just go with an enumeration approach...
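
For what it's worth, the enumeration I have in mind is roughly the
sketch below. The name enumerate_column is made up for illustration;
codes are handed out in order of first appearance, so the mapping stays
the same for as long as the input column does.

```python
def enumerate_column(values):
    """Map each distinct string to a stable integer code (1, 2, 3, ...)."""
    codes = {}
    for v in values:
        # Assign the next free code the first time a value is seen;
        # values already in the dict keep their existing code.
        codes.setdefault(v, len(codes) + 1)
    return [codes[v] for v in values], codes


species = ["Iris-setosa", "Iris-versicolor", "Iris-setosa", "Iris-virginica"]
encoded, codes = enumerate_column(species)
print(encoded)  # [1, 2, 1, 3]
```

This builds a new list rather than modifying the column in place, which
also sidesteps the in-place question from the subject line.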

Nevertheless, enumeration does introduce a subtle ordering for nominal
data: if Iris-setosa=1, Iris-versicolor=2, and Iris-virginica=3, then on
that scale Iris-versicolor is "closer" to Iris-virginica than
Iris-setosa is, when in fact such distances don't mean anything on a
nominal scale. I hadn't thought about a function like that, but it
makes a lot of sense. Thanks!
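
Here is a rough sketch of how I understand the suggestion would plug
into a distance calculation. string_dist is Peter's function with his
(x != y) correction; mixed_dist and the sample rows are my own
hypothetical additions, assuming each row is a tuple of numbers and
strings in matching positions.

```python
import math


def string_dist(x, y, weight=1):
    # 0 when the labels match, weight when they differ.
    return weight * (x != y)


def mixed_dist(a, b, weight=1):
    """Euclidean-style distance over rows that mix numbers and strings."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, str):
            d = string_dist(x, y, weight)  # nominal column: match/mismatch
        else:
            d = x - y                      # numeric column: plain difference
        total += d * d
    return math.sqrt(total)


setosa = (5.1, 3.5, "Iris-setosa")
versicolor = (5.1, 3.5, "Iris-versicolor")
virginica = (5.1, 3.5, "Iris-virginica")

print(mixed_dist(setosa, versicolor))  # 1.0
print(mixed_dist(setosa, virginica))   # 1.0
```

With the codes 1, 2, 3 the same two pairs would differ by 1 and 2 in
that column; here every pair of distinct labels is the same distance
apart, so no ordering sneaks in.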