[I18n-sig] Re: [Python-Dev] Unicode debate

M.-A. Lemburg mal@lemburg.com
Tue, 02 May 2000 17:24:24 +0200


Just van Rossum wrote:
> 
> At 10:36 AM +0200 02-05-2000, M.-A. Lemburg wrote:
> >Just a small note on the subject of a character being atomic,
> >which seems to have been forgotten by the discussing parties:
> >
> >Unicode itself can be understood as a multi-word character
> >encoding, just like UTF-8. The reason is that Unicode entities
> >can be combined to produce single display characters (e.g.
> >u"e"+u"\u0301" will print "é" in a Unicode-aware renderer).
> 
> Erm, are you sure Unicode prescribes this behavior, for this
> example? I know similar behaviors are specified for certain
> languages/scripts, but I didn't know it did that for Latin.

The details are on the www.unicode.org web site, buried
in some of the technical reports on normalization and
collation.
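
A minimal sketch of the e + U+0301 example, assuming Python 3
and its standard unicodedata module (an assumption; the u"..."
literals above are from a different Python version). It only
shows that normalization form C composes the two code points
into the single code point U+00E9:

```python
import unicodedata

# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT
decomposed = "e\u0301"

# NFC composes the pair into the precomposed character
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))           # number of code points before NFC
print(len(composed))             # number of code points after NFC
print(composed == "\u00e9")      # compares against U+00E9
```

Normalizing with "NFD" goes the other way and splits U+00E9
back into the two-code-point sequence.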
 
> >Slicing such a combined Unicode string will have the same
> >effect as slicing UTF-8 data.
> 
> Not true. As Fredrik noted: no exception will be raised.

Huh? You will always get an exception when you convert
a broken UTF-8 sequence to Unicode. This is by design:
UTF-8 itself uses the top bit of each byte to identify
multi-byte character encodings.

Or can you give an example (perhaps you've found a bug
that needs fixing)?
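
To make the two cases concrete, here is a small sketch,
assuming Python 3 (where str holds Unicode and bytes holds
encoded data). Slicing the combined string raises nothing and
simply loses the accent, while decoding a UTF-8 byte string cut
in the middle of a multi-byte sequence raises an exception:

```python
text = "e\u0301"                  # base letter + combining accent
utf8 = "\u00e9".encode("utf-8")   # two bytes: 0xC3 0xA9

# Slicing the Unicode string: no exception, the accent is dropped.
sliced_text = text[:1]

# Slicing the UTF-8 bytes leaves a lone lead byte; decoding it fails.
try:
    utf8[:1].decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(repr(sliced_text), decode_failed)
```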

> [ Speaking of exceptions,
> 
> after I sent off my previous post I realized Guido's
> non-utf8-strings-interpreted-as-utf8-will-often-raise-an-exception
> argument can easily be turned around, backfiring at utf-8:
> 
>     Defaulting to utf-8 when going from Unicode to 8-bit and
>     back only gives the *illusion* things "just work", since it
>     will *silently* "work", even if utf-8 is *not* the desired
>     8-bit encoding -- as shown by Fredrik's excellent "fun with
>     Unicode, part 1" example. Defaulting to Latin-1 will
>     warn the user *much* earlier, since it'll barf when
>     converting a Unicode string that contains any character
>     code > 255. So there.
> ]
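
The quoted argument can be sketched as follows, assuming
Python 3 and its built-in "latin-1" and "utf-8" codecs (the
variable names are made up for illustration). Encoding a code
point above 255 as Latin-1 fails immediately, whereas UTF-8
data misread as Latin-1 decodes without any error:

```python
euro = "\u20ac"   # EURO SIGN, code point 0x20AC, above 255

# Latin-1 cannot represent it, so the encoder raises right away.
try:
    euro.encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

# UTF-8 can encode any code point, so this never complains.
utf8_bytes = euro.encode("utf-8")

# UTF-8 bytes read back as Latin-1: no exception, but wrong text.
misread = "\u00e9".encode("utf-8").decode("latin-1")

print(latin1_failed, utf8_bytes, repr(misread))
```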
> 
> >Most Latin-1 proponents seem to have single display
> >characters in mind. While the same is true for
> >many Unicode entities, there are quite a few cases of
> >combining characters in Unicode 3.0 and the Unicode
> >normalization algorithm uses these as the basis for its
> >work.
> 
> Still, two combining characters are two input characters for
> the renderer! They may result in one *glyph*, but trust me,
> that's an entirely different can of worms.

No. Please see my other post on the subject...
 
> However, if you'd be talking about Unicode surrogates,
> you'd definitely have a point. How do Java/Perl/Tcl deal with
> surrogates?

Good question... does anybody know the answers?
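
For reference, a sketch of what a surrogate pair looks like,
assuming Python 3.3 or later and its "utf-16-be" codec. It says
nothing about how Java, Perl or Tcl behave; it only shows that
one code point outside the 16-bit range becomes two 16-bit
code units in UTF-16:

```python
clef = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, outside the 16-bit range

# Encode as big-endian UTF-16 and split into 16-bit code units.
utf16 = clef.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big")
         for i in range(0, len(utf16), 2)]

print(len(clef))                   # code points in the string
print([hex(u) for u in units])     # high and low surrogate
```

Cutting such a sequence between the two code units leaves a
lone surrogate, which is the UTF-16 analogue of a truncated
UTF-8 sequence.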

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/