Newbie question about text encoding

Fri Mar 6 11:26:02 EST 2015

random832 at fastmail.us wrote:

> My point is there are very few
> problems to which "count of Unicode code points" is the only right
> answer - that UTF-32 is good enough for but that are meaningfully
> impacted by a naive usage of UTF-16, to the point where UTF-16 is
> something you have to be "safe" from.

I'm not sure why you care about the "count of Unicode code points", although
that *is* a problem. Not for end-user reasons like "how long is my
password?", but because it makes your job as a programmer harder.

[steve at ando ~]$ python2.7 -c "print (len(u'\U00004444:\U00014445'))"
4
[steve at ando ~]$ python3.3 -c "print (len(u'\U00004444:\U00014445'))"
3

It's hard to reason about your code when something as fundamental as the
length of a string is implementation-dependent. (By the way, the right
answer should be 3, not 4.)

But an even more important problem is that broken-UTF-16 lets you create
invalid, impossible Unicode strings *by accident*. Naturally you can create
broken Unicode if you assemble strings of surrogates yourself, but
broken-UTF-16 means it can happen from otherwise innocuous operations like
reversing a string:

py> s = u'\U00004444:\U00014445'  # Python 2.7 narrow build
py> s[::-1]
u'\udc45\ud811:\u4444'

It's hard for me to demonstrate that the reversed string is broken because
the shell I am using does an amazingly good job of handling broken Unicode.
Even if I print it, the shell just prints missing-character glyphs instead
of crashing (fortunately for me!). But the first two code points are in
illegal order:

\udc45 is a high surrogate, and must follow a low surrogate;
\ud811 is a low surrogate, and must precede a high surrogate;

I'm not convinced you should be allowed to create Unicode strings containing
mismatched surrogates like this deliberately, but you certainly shouldn't
be able to do so by accident.

-- 
Steven