PEP 393 vs UTF-8 Everywhere

Steve D'Aprano steve+python at pearwood.info
Sat Jan 21 21:42:57 EST 2017


On Sun, 22 Jan 2017 07:21 am, Pete Forman wrote:

> Marko Rauhamaa <marko at pacujo.net> writes:
> 
>>> py> low = '\uDC37'
>>
>> That should raise a SyntaxError exception.
> 
> Quite. My point was that with older Python on a narrow build (Windows
> and Mac) you need to understand that you are using UTF-16 rather than
> Unicode.

But you're *not* using UTF-16, at least not proper UTF-16, in older narrow
builds. If you were, then a surrogate pair in a Unicode string u'...' would
be treated as a single supplementary code point, but it isn't.

unichr() doesn't support supplementary code points in narrow builds:

[steve at ando ~]$ python2.7 -c "print len(unichr(0x10900))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)


and even if you sneak a supplementary code point in, it is treated wrongly,
as two characters rather than one:

[steve at ando ~]$ python2.7 -c "print len(u'\U00010900')"
2


So Python narrow builds are more like a bastard hybrid of UCS-2 and UTF-16.
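
To show where that length of 2 comes from, here is a rough sketch of the
standard UTF-16 surrogate arithmetic. It assumes Python 3.3 or later (PEP 393
strings), and the helper name to_surrogate_pair is made up for illustration,
not something from the standard library:

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

s = '\U00010900'
high, low = to_surrogate_pair(ord(s))

# Number of 16-bit code units in the UTF-16 encoding: this is what a
# narrow build reported as len().
utf16_units = len(s.encode('utf-16-le')) // 2

print(hex(high), hex(low))   # the two surrogates
print(utf16_units)           # code units, as counted by a narrow build
print(len(s))                # code points, as counted by Python 3.3+
```

A narrow build stored those two code units and counted them separately,
which is why len() gave 2 there, while a PEP 393 string counts the one code
point.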




-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



