Ah Python, you have spoiled me for all other languages

Steven D'Aprano steve at pearwood.info
Sat May 23 09:01:14 EDT 2015


On Sat, 23 May 2015 10:33 pm, Thomas 'PointedEars' Lahn wrote:

> If only characters were represented as sequences UTF-16 code units in
> ECMAScript implementations like JavaScript, there would not be a problem
> beyond the BMP;

Are you being sarcastic?

This is Rhino:

js> var c = String.fromCharCode(65535); // in the BMP
js> print(c.charCodeAt(0));
65535

So far so good.

js> var c = String.fromCharCode(65536); // astral character
js> print(c.charCodeAt(0));
0

Can you name any ECMAScript implementation which correctly handles code
points in the supplementary planes?
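
For what it's worth, here is a rough Python sketch of what I believe is
happening in the Rhino session above. It assumes String.fromCharCode keeps
only the low 16 bits of its argument, and it shows the surrogate pair that
UTF-16 would need for the same code point. The helper names to_uint16 and
utf16_code_units are made up for illustration; they are not part of any
library.

```python
def to_uint16(n):
    """Keep only the low 16 bits of n (assumed fromCharCode behaviour)."""
    return n & 0xFFFF


def utf16_code_units(code_point):
    """Return the UTF-16 code units for one code point, as a list of ints."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (code_point,))
    if code_point < 0x10000:
        # BMP code points fit in a single 16-bit code unit.
        return [code_point]
    # Supplementary code points are split into a high and a low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return [high, low]


truncated = to_uint16(65536)       # what a 16-bit truncation leaves behind
pair = utf16_code_units(0x10000)   # what UTF-16 actually needs
```

Under those assumptions, truncated comes out as 0, matching the output
above, while pair holds the two code units 0xD800 and 0xDC00.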


By the way, for many years (up to and including Python 3.2) the so-called
"narrow build" of Python implemented Unicode strings as UTF-16 code units:

py> c = unichr(65536)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Let's try again:

py> c = u'\U00010000'  # a single code point
py> len(c)
2
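
For comparison, a small sketch of the same check on Python 3.3 or later,
where the narrow/wide distinction no longer exists. The variable names are
only for illustration. The string is encoded as little-endian UTF-16
without a BOM so that the code units can be counted directly.

```python
c = '\U00010000'  # a single code point outside the BMP

# On Python 3.3+, len() counts code points, not code units.
code_points = len(c)

# Each UTF-16 code unit occupies two bytes; 'utf-16-le' adds no BOM.
utf16 = c.encode('utf-16-le')
code_units = len(utf16) // 2
```

Here code_points is 1 and code_units is 2, so the answer of 2 from the
narrow build is a count of code units rather than characters.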


I'm not saying that it is impossible to have a correct Unicode implementation
using UTF-16, but I've never seen one.
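
As a minimal sketch of what "correct" would involve for len() alone: a
UTF-16 implementation has to count code points instead of code units. The
function name code_point_length is hypothetical, and the sketch assumes
well-formed input, that is, no unpaired surrogates.

```python
def code_point_length(units):
    """Count code points in a sequence of UTF-16 code units (ints).

    Assumes well-formed UTF-16: every low surrogate follows a high one.
    """
    # Each code point contributes exactly one unit that is not a low
    # surrogate (0xDC00-0xDFFF), so counting those units counts code points.
    return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)


astral_only = code_point_length([0xD800, 0xDC00])            # U+10000
mixed = code_point_length([0x41, 0xD83D, 0xDE00, 0x42])      # A, U+1F600, B
```

Note that this walks the whole sequence, so the length is no longer a
constant-time lookup; indexing and slicing would need similar care.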



-- 
Steven



