Ah Python, you have spoiled me for all other languages

Steven D'Aprano steve at pearwood.info
Sat May 23 10:09:42 EDT 2015


On Sat, 23 May 2015 11:35 pm, Ned Batchelder wrote:

> On Saturday, May 23, 2015 at 9:01:29 AM UTC-4, Steven D'Aprano wrote:
>> On Sat, 23 May 2015 10:33 pm, Thomas 'PointedEars' Lahn wrote:
>> 
>> > If only characters were represented as sequences of UTF-16 code units
>> > in ECMAScript implementations like JavaScript, there would not be a
>> > problem beyond the BMP;
>> 
>> Are you being sarcastic?
> 
> IIUC, Thomas' point is that *characters* should be sequences of
> codepoints, not that *strings* should be.

Like Python, JavaScript/ECMAScript doesn't have a distinct character type;
it has strings which happen to be of length one. So I'm not sure I
understand the point you are trying to make.
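
For what it's worth, here is a quick sketch of what "no distinct character
type" means on the Python side. The variable names are just for
illustration:

```python
s = "spam"
first = s[0]

# Indexing a str gives back another str, not a separate "char" type.
print(type(first) is str)

# That str simply happens to have length one...
print(len(first))

# ...and indexing it again hands you an equal length-one string.
print(first[0] == first)
```

JavaScript behaves much the same way in this respect: indexing a string
gives you a string.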

There's also a bit of a problem in deciding what counts as a character. Is
IJ a single character, or two? The answer depends on whether you are Dutch
or not. Unicode punts on that decision, and leaves it up to the
application.
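
To make the IJ case concrete, here is a small sketch using the standard
unicodedata module. I'm assuming Python 3 here. Unicode offers both a
single ligature code point (U+0132) and the plain two-letter spelling, and
does not decide for you which one is "the character":

```python
import unicodedata

ligature = "\u0132"   # LATIN CAPITAL LIGATURE IJ, a single code point
digraph = "IJ"        # two code points: U+0049 followed by U+004A

# Code point counts differ even though a Dutch reader sees one letter.
print(len(ligature), len(digraph))
print(unicodedata.name(ligature))

# Compatibility normalization (NFKC) folds the ligature into two letters.
nfkc = unicodedata.normalize("NFKC", ligature)
print(nfkc == digraph)

# Canonical normalization (NFC) leaves the ligature alone.
nfc = unicodedata.normalize("NFC", ligature)
print(nfc == ligature)
```

Whether either of those should count as one character or two is still up
to the application.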

Unicode only concerns itself with code points, which are complex enough, and
generally programming languages follow Unicode (usually imperfectly). Each
code point (a.k.a. "character" if we're being sloppy) requires either one
or two 16-bit code units in UTF-16: one for code points in the BMP, two (a
surrogate pair) for everything beyond it. I'm not sure that "1 or 2" counts
as a sequence.
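
A rough way to see the "one or two" for yourself, assuming Python 3.3 or
later, where len() counts code points. The helper utf16_code_units is my
own name for this sketch, not anything from the standard library:

```python
def utf16_code_units(s: str) -> int:
    """Return how many 16-bit code units s needs in UTF-16."""
    # "utf-16-le" is used because plain "utf-16" prepends a 2-byte BOM,
    # which would inflate the count by one code unit.
    return len(s.encode("utf-16-le")) // 2

bmp = "\u00e9"          # U+00E9, inside the BMP
astral = "\U0001F600"   # U+1F600, outside the BMP

# One code point each, but a different number of UTF-16 code units.
print(len(bmp), utf16_code_units(bmp))
print(len(astral), utf16_code_units(astral))
```

In a JavaScript engine, "\u{1F600}".length reports the code unit count for
that second string rather than the code point count.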


-- 
Steven



