[Python-ideas] Support Unicode code point notation

Andrew Barnert abarnert at yahoo.com
Sun Jul 28 02:30:25 CEST 2013


On Jul 28, 2013, at 1:18, Chris Angelico <rosuav at gmail.com> wrote:

> On Sun, Jul 28, 2013 at 12:14 AM, Greg Ewing
> <greg.ewing at canterbury.ac.nz> wrote:
>> Steven D'Aprano wrote:
>>> 
>>> Aside: you keep writing H..HHHHHH for Unicode code points. Unicode code
>>> points go up to hex 10FFFF,
>> 
>> They do *now*, but we can't be sure that they will stay that
>> way in the future.
> 
> They will for as long as UTF-16 is supported. Really, it would have
> been better all round if UTF-16 had never existed, and everyone just
> had to switch up to UTF-32; sure, memory would have been wasted, but
> concepts like PEP 393 would have been devised to deal with that, and
> we wouldn't have stupid bugs in 99% of programming languages.

UTF-16 wouldn't have been a problem if it weren't almost compatible with UCS-2, which allowed all kinds of Unicode 1.0 software to misleadingly claim Unicode 2.0 support. (For example, for a long time, both Windows and Java "supported" UTF-16 by treating a surrogate pair as two characters instead of one, which is like "supporting" UTF-8 by treating it as ASCII, except that the bugs are much less likely to hit developers early in the cycle.)

There are use cases for which UTF-16 is perfectly reasonable. For example, strings with lots of BMP CJK characters and an occasional non-BMP character aren't helped by PEP 393 or by UTF-8, but they are helped by UTF-16 (so long as you can rely on software not treating it as UCS-2). But anyway, this is pretty far off topic.
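To make the CJK case concrete, here is a rough sketch that compares encoded sizes for a made-up sample string (20 BMP CJK characters plus one non-BMP character). The sample text and variable names are just for illustration. It also shows the UCS-2-style miscount: counting 16-bit code units gives one "character" too many.

```python
# Hypothetical sample: "\u6f22\u5b57" (two BMP CJK characters) repeated
# ten times, followed by U+20000, a CJK character outside the BMP.
text = "\u6f22\u5b57" * 10 + "\U00020000"

# Number of actual code points in the string.
code_points = len(text)

# Encoded sizes in bytes (the "-le" codecs are used so no BOM is added).
utf8_bytes = len(text.encode("utf-8"))       # 3 bytes per BMP CJK char, 4 for U+20000
utf16_bytes = len(text.encode("utf-16-le"))  # 2 bytes per BMP char, 4 for U+20000
utf32_bytes = len(text.encode("utf-32-le"))  # 4 bytes per code point

# What UCS-2-style software would report as the "length":
# it counts 16-bit code units, so the surrogate pair counts twice.
utf16_units = utf16_bytes // 2

print(code_points, utf16_units)              # 21 22
print(utf8_bytes, utf16_bytes, utf32_bytes)  # 64 44 84
```

Under PEP 393 the single non-BMP character forces the whole string into the four-bytes-per-character representation, so the payload is roughly the UTF-32 figure (plus object overhead), which is why UTF-16 comes out ahead for this kind of text.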

Unicode could go past 10FFFF without dropping UTF-16, either by adding more surrogate pair ranges, or by adding surrogate triplets. It's really no different from extending UTF-8, which is no problem.
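For reference, the 10FFFF ceiling falls straight out of the current surrogate arithmetic: 1024 high surrogates times 1024 low surrogates cover exactly 0x100000 code points above the BMP. A minimal sketch (the function and constant names are mine, not from any library):

```python
HIGH_START = 0xD800  # first high (lead) surrogate
LOW_START = 0xDC00   # first low (trail) surrogate
PAIR_BASE = 0x10000  # first code point that needs a surrogate pair

def to_surrogates(cp):
    """Split a supplementary code point into a (high, low) surrogate pair."""
    if not PAIR_BASE <= cp <= 0x10FFFF:
        raise ValueError("code point not representable as a surrogate pair")
    offset = cp - PAIR_BASE  # a 20-bit value
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)

def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into a code point."""
    return PAIR_BASE + ((high - HIGH_START) << 10) + (low - LOW_START)

# 1024 * 1024 pairs on top of the BMP give the current maximum.
MAX_VIA_PAIRS = PAIR_BASE + 1024 * 1024 - 1

print(hex(MAX_VIA_PAIRS))                         # 0x10ffff
print([hex(u) for u in to_surrogates(0x1F600)])   # ['0xd83d', '0xde00']
```

Going past that limit would mean changing one of the constants above (more surrogate ranges) or adding a third unit (triplets); nothing in the scheme itself rules either out.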

The problem is that we have no way to predict how they will extend UTF-16, UTF-8, or code point notation if that ever happens. Assuming that the max length for a code point is six nibbles does sound like assuming nobody will ever need more than 640k characters.
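As a small illustration of the notation side, Python's existing \U escape already takes eight hex digits rather than six, while chr() enforces the current 10FFFF limit. The helper below is a hypothetical formatter (the name format_code_point is made up) that pads to at least four digits without assuming any maximum width:

```python
def format_code_point(cp):
    """Render an integer in U+ notation: at least four hex digits, no upper width limit."""
    return f"U+{cp:04X}"

print(format_code_point(0x41))       # U+0041
print(format_code_point(0x10FFFF))   # U+10FFFF
print(format_code_point(0x1234567))  # U+1234567 (not a valid code point today)

# The \U escape is eight hex digits wide, so the literal syntax has headroom.
last = "\U0010FFFF"
print(ord(last) == 0x10FFFF)         # True

# chr() rejects anything past the current limit.
try:
    chr(0x110000)
    rejected = False
except ValueError:
    rejected = True
print(rejected)                      # True
```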


