[I18n-sig] Re: [Python-Dev] Unicode debate

Neil Hodgson nhodgson@bigpond.net.au
Tue, 2 May 2000 21:40:44 +1000


>    u = aUnicodeStringFromSomewhere
>    s = an8bitStringFromSomewhere
>
>    DoSomething(s + u)

> in Guido's design, the first example may or may not result in
> an "UTF-8 decoding error: unexpected code byte" exception.

   I would say it is less surprising for most people if this follows the
silent widening of each byte, which is the Fredrik-Paul position. With UTF-8
code currently so scarce, very few people will expect an automatic UTF-8 to
UTF-16 conversion. While complete prohibition of automatic conversion has
some appeal, it will just be more noise to many.
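
   To make the two behaviours concrete, here is a minimal sketch in
present-day Python 3, not the 1.6-era interpreter under discussion. It
assumes `bytes` stands in for the 8-bit string and `str` for the Unicode
string; the helper names `widen`, `concat_widening` and `concat_utf8` are
hypothetical and exist only for this illustration.

```python
def widen(data: bytes) -> str:
    # Silent widening: every byte becomes the code point with the same
    # value. This is the same mapping as a Latin-1 decode and cannot fail.
    return "".join(chr(b) for b in data)


def concat_widening(s: bytes, u: str) -> str:
    # Concatenation under the widening rule.
    return widen(s) + u


def concat_utf8(s: bytes, u: str) -> str:
    # Concatenation under the implicit UTF-8 rule: the 8-bit string is
    # decoded as UTF-8 first, which raises on byte sequences that are
    # not valid UTF-8.
    return s.decode("utf-8") + u


s = b"caf\xe9"   # "cafe" with e-acute, stored as Latin-1 bytes
u = "\u20ac"     # a Unicode string holding the euro sign

widened = concat_widening(s, u)

try:
    concat_utf8(s, u)
    utf8_failed = False
except UnicodeDecodeError:
    # The lone 0xE9 byte is not a complete UTF-8 sequence.
    utf8_failed = True
```

   With the widening rule the result depends only on the byte values; with
the UTF-8 rule the same input can raise, depending on the contents of the
8-bit string.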

>    u = aUnicodeStringFromSomewhere
>    s = an8bitStringFromSomewhere
>
>    if len(u) + len(s) == len(u + s):
>        print "true"
>    else:
>        print "not true"

> the second example may result in a
> similar error, print "true", or print "not true", depending on the
> contents of the 8-bit string.

   I don't see this as important, as it is trying to take the idea that
Unicode strings are equivalent to 8-bit strings too far. How much further can
that go before something has to break? I always thought of len as measuring
the number of bytes rather than characters when applied to strings, the same
as strlen in C when you have a DBCS string.
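
   The byte-versus-character distinction can be sketched as follows, again
in present-day Python 3 rather than the interpreter of the time. Shift_JIS
is used here as one example of a DBCS encoding, and the helper
`length_identity_holds` is a hypothetical name for the test in the quoted
example.

```python
text = "\u65e5\u672c\u8a9e"        # three kanji characters
sjis = text.encode("shift_jis")    # DBCS storage: two bytes per kanji

n_chars = len(text)   # counts characters
n_bytes = len(sjis)   # counts bytes, as strlen would in C


def length_identity_holds(s: bytes, u: str, encoding: str) -> bool:
    # The quoted test: does len(u) + len(s) equal len(u + s) once the
    # 8-bit string has been converted using the given encoding?
    return len(u) + len(s) == len(u + s.decode(encoding))


u = "abc"
e_acute_utf8 = "\u00e9".encode("utf-8")   # one character, two bytes

# Pure ASCII: one byte per character, so the identity holds.
ascii_case = length_identity_holds(b"xyz", u, "utf-8")

# UTF-8 decoding collapses two bytes into one character.
utf8_case = length_identity_holds(e_acute_utf8, u, "utf-8")

# Byte widening (Latin-1) keeps one character per byte.
widening_case = length_identity_holds(e_acute_utf8, u, "latin-1")
```

   Under byte widening the identity always holds, because len keeps
counting bytes on the 8-bit side; under UTF-8 decoding it holds only when
the 8-bit string happens to be pure ASCII.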

   I should correct some of the stuff Mark wrote about me. At Fujitsu we did
a lot more DBCS work than Unicode because that's what Japanese code uses.
Even with Java most storage is still DBCS. I was more involved with Unicode
architecture at Reuters 6 or so years ago.

   Neil