Newbie question about text encoding
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Fri Mar 6 11:26:02 EST 2015
random832 at fastmail.us wrote:
> My point is there are very few
> problems to which "count of Unicode code points" is the only right
> answer - that UTF-32 is good enough for but that are meaningfully
> impacted by a naive usage of UTF-16, to the point where UTF-16 is
> something you have to be "safe" from.
I'm not sure why you care about the "count of Unicode code points", although
that *is* a problem. Not for end-user reasons like "how long is my
password?", but because it makes your job as a programmer harder.
[steve@ando ~]$ python2.7 -c "print (len(u'\U00004444:\U00014445'))"
4
[steve@ando ~]$ python3.3 -c "print (len(u'\U00004444:\U00014445'))"
3
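It's hard to reason about your code when something as fundamental as the
length of a string is implementation-dependent. (By the way, the right
answer is 3, not 4: the string holds three code points, U+4444, a colon,
and U+14445. The narrow build counts U+14445 twice because it stores it as
a surrogate pair.)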
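As a rough illustration, assuming Python 3.3 or later, the two answers can be
reproduced from a single interpreter: len() counts code points, while encoding
to UTF-16 and counting 16-bit units gives the figure a narrow build reports.
The variable names here are only for illustration.

```python
# Sketch: compare code points with UTF-16 code units (Python 3.3+).
s = "\u4444:\U00014445"

code_points = len(s)                           # one per Unicode code point
utf16_units = len(s.encode("utf-16-le")) // 2  # two bytes per 16-bit code unit

print(code_points, utf16_units)
```

This prints "3 4": the astral character U+14445 is one code point but two
UTF-16 code units.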
But an even more important problem is that broken-UTF-16 lets you create
invalid, impossible Unicode strings *by accident*. Naturally you can create
broken Unicode if you assemble strings of surrogates yourself, but
broken-UTF-16 means it can happen from otherwise innocuous operations like
reversing a string:
py> s = u'\U00004444:\U00014445' # Python 2.7 narrow build
py> s[::-1]
u'\udc45\ud811:\u4444'
It's hard for me to demonstrate that the reversed string is broken because
the shell I am using does an amazingly good job of handling broken Unicode.
Even if I print it, the shell just prints missing-character glyphs instead
of crashing (fortunately for me!). But the first two code points are in
illegal order:
\udc45 is a low surrogate, and must follow a high surrogate;
\ud811 is a high surrogate, and must precede a low surrogate.
I'm not convinced you should be allowed to create Unicode strings containing
mismatched surrogates like this deliberately, but you certainly shouldn't
be able to do so by accident.
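Since the terminal hides the problem, here is a sketch (Python 3, standard
library only) that simulates what a narrow build does when it reverses the
string one 16-bit unit at a time, then checks whether the surrogates are still
correctly paired. The helper names utf16_units and is_well_formed are my own
inventions for this example, not library functions.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def is_well_formed(units):
    """True if every high surrogate is immediately followed by a low one
    and no low surrogate appears on its own."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:      # high (lead) surrogate
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2                        # skip the whole pair
        elif 0xDC00 <= unit <= 0xDFFF:    # low (trail) surrogate with no lead
            return False
        else:
            i += 1
    return True

s = "\u4444:\U00014445"
units = utf16_units(s)
reversed_units = units[::-1]              # what s[::-1] does on a narrow build

print([hex(u) for u in units], is_well_formed(units))
print([hex(u) for u in reversed_units], is_well_formed(reversed_units))
```

The original sequence passes the check; the reversed one starts with 0xdc45
followed by 0xd811, a low surrogate ahead of a high one, and fails it.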
--
Steven