[I18n-sig] Unicode strings: an alternative

Fredrik Lundh <effbot@telia.com>
Sun, 7 May 2000 17:49:02 +0200


Tom Emerson wrote:
> Just van Rossum writes:
>  > Good point. All this taken together still means to me that comparisons
>  > between wide and narrow strings should take place at the character level,
>  > which implies that coercion from narrow to wide is done at the character
>  > level, without looking at the encoding. (Which in my book in turn still
>  > implies that as long as we're talking about Unicode, narrow strings are
>  > effectively Latin-1.)
>
> Only true if "wide" strings are encoded in UCS-2 or UCS-4. If "wide
> characters" are Unicode, but stored in UTF-8 encoding, then you lose.

why?

if you're comparing byte arrays using different encodings, sure.

if you're comparing characters, it'll work.
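
a minimal sketch of the difference, assuming present-day Python 3, where
`bytes` plays the role of the byte array and `str` the role of the
character string (the variable names are only illustrative):

```python
# the same one-character text, stored under two different encodings
text = "\u00e9"                      # LATIN SMALL LETTER E WITH ACUTE
as_latin1 = text.encode("latin-1")   # b'\xe9'      (one byte)
as_utf8 = text.encode("utf-8")       # b'\xc3\xa9'  (two bytes)

# comparing the byte arrays directly: different encodings, different bytes
bytes_equal = as_latin1 == as_utf8

# comparing characters: decode each array with its own encoding first
chars_equal = as_latin1.decode("latin-1") == as_utf8.decode("utf-8")

print(bytes_equal, chars_equal)
```

the byte-level comparison fails, the character-level comparison succeeds,
and how the characters happen to be stored never enters into it.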

I find it amazing that you're still stuck at the "visible encoding" level,
despite everything that's been posted to these mailing lists over the
last weeks.  let's spell it out again: a "character" is NOT the same
thing as a C char.

> Hmmmm... how often do you expect to compare narrow vs. wide strings,
> using default comparison (i.e. = or !=)?

all the time -- much more often than I compare integers with long
integers or floating point numbers.

the idea of standardizing on strings of characters is to make narrow
and wide strings interchangeable.  just like you can mix standard and
long integers in today's python, *despite* the fact that they're not
using the same internal representation.
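
a rough illustration of the analogy, assuming present-day Python 3 (which
has since merged standard and long integers, so int and float stand in for
the mixed numeric types here). the "narrow" string is modelled as a
`bytes` object, which is an assumption of this sketch, not the proposal's
actual type:

```python
# numbers with different internal representations still compare by value
numbers_mix = (7 == 7.0) and (2**70 + 1 - 2**70 == 1)

# the string analogue: widen each narrow character code to the same code point
narrow = bytes([0x61, 0xE9])                      # hypothetical narrow string
widened = "".join(chr(code) for code in narrow)   # code-for-code coercion

# this code-for-code widening gives the same result as a latin-1 decode
same_as_latin1 = widened == narrow.decode("latin-1")

# and the widened string compares equal to a native wide string
wide_equal = widened == "a\u00e9"

print(numbers_mix, same_as_latin1, wide_equal)
```

this is the sense in which narrow strings are "effectively Latin-1" under
character-level coercion: no encoding is consulted, each character code is
simply kept as-is.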

> What if I'm using Latin 3 and use the byte comparison?

if you have a byte array containing latin 3 encoded data, that's a
byte array, not a string...

> I may very well have two strings (one narrow, one wide) that
> compare equal, even though they're not.

if you decode both byte arrays to real strings and compare them,
they will only compare equal if they are in fact equal...
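
a small sketch of the latin 3 case, again assuming present-day Python 3 and
its standard `iso8859_3` codec; the sample byte is chosen only because it
means different things in latin 1 and latin 3:

```python
# one byte of latin 3 (ISO 8859-3) encoded data: a byte array, not a string
raw = b"\xa1"

# decoded with the encoding it was written in, it is U+0126 (H WITH STROKE)
as_latin3 = raw.decode("iso8859_3")

# misread as latin 1, the same byte becomes U+00A1 (INVERTED EXCLAMATION MARK)
as_latin1 = raw.decode("latin-1")

wide = "\u0126"                        # the wide string we compare against

equal_when_decoded = as_latin3 == wide   # correctly decoded: equal
equal_when_misread = as_latin1 == wide   # wrong encoding assumed: not equal

print(equal_when_decoded, equal_when_misread)
```

the false match only shows up if the byte array is treated as if it were
already a string; once both sides are decoded to characters, equality
means what you expect.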

> Not exactly what I would expect.

I think you're still not getting what we're talking about here.  I suggest
reading the W3C paper (http://www.w3.org/TR/charmod) once again:

    "It should be clear, however, that characters and bytes
    are very different entities that SHOULD NOT be confused:
    in general, the relationship is many-to-many."

please follow their advice, and stop confusing characters and
bytes.

</F>