Oh look, another language (ceylon)

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Nov 18 08:31:33 EST 2013


On Mon, 18 Nov 2013 21:04:41 +1100, Chris Angelico wrote:

> On Mon, Nov 18, 2013 at 8:44 PM,  <wxjmfauth at gmail.com> wrote:
>> string
>> Satisfied Interfaces: Category, Cloneable<List<Element>>,
>> Collection<Element>, Comparable<String>,
>> Correspondence<Integer,Element>, Iterable<Element,Null>,
>> List<Character>, Ranged<Integer,String>, Summable<String> A string of
>> characters. Each character in the string is a 32-bit Unicode character.
>> The internal UTF-16 encoding is hidden from clients. A string is a
>> Category of its Characters, and of its substrings:
> 
> I'm trying to figure this out. Reading the docs hasn't answered this. If
> each character in a string is a 32-bit Unicode character, and (as can be
> seen in the examples) string indexing and slicing are supported, then
> does string indexing mean counting from the beginning to see if there
> were any surrogate pairs?

I can't figure out what that means, since it contradicts itself. First it
says *every* character is 32 bits (presumably UTF-32), then it says that
internally it uses UTF-16. At least one of these statements is wrong.
(They could both be wrong, but they can't both be right.)
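
To make the mismatch concrete, here is a small Python sketch (Python 3.3
or later; the helper name utf16_units is made up for illustration, it is
not anything from Ceylon). It counts how many 16-bit code units UTF-16
needs for a character inside the Basic Multilingual Plane and for one in
a supplementary plane:

```python
def utf16_units(s):
    """Return the number of 16-bit code units needed to encode s as UTF-16."""
    # "utf-16-le" is used so that no byte order mark is prepended.
    return len(s.encode("utf-16-le")) // 2

bmp_char = "a"                # U+0061, inside the Basic Multilingual Plane
astral_char = "\U0001F600"    # U+1F600, in a supplementary plane

print(utf16_units(bmp_char))     # 1
print(utf16_units(astral_char))  # 2, i.e. a surrogate pair
```

So a UTF-16 buffer cannot hold every character in a single fixed-width
unit, whatever the documentation calls a "character".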

Unless they have done something *really* clever, the language designers 
lose a hundred million points for screwing up text strings. There is 
*absolutely no excuse* for a new, modern language with no backwards 
compatibility concerns to choose one of the three bad choices:

* choose UTF-16 or UTF-8, and have O(n) primitive string operations such
as indexing by character (like Haskell and, apparently, Ceylon);

* choose UTF-16 without support for the supplementary planes (which makes
it virtually UCS-2), like JavaScript;

* choose UTF-32, and use two or four times as much memory as needed.
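
A rough Python sketch of the first and third trade-offs, under the
assumption that a string is stored as a flat list of UTF-16 code units.
The names to_utf16_units, code_point_at and encoded_sizes are
hypothetical helpers written for this post, not part of any language's
standard library, and this is not a claim about how Ceylon actually
implements strings:

```python
def to_utf16_units(s):
    """Encode s as a list of 16-bit UTF-16 code units (no byte order mark)."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

def code_point_at(units, index):
    """Return the code point at character position index.

    Because a character may occupy one or two code units, the only way
    to find character number index is to walk from the start: O(n).
    """
    i = 0      # position in code units
    pos = 0    # position in characters
    while i < len(units):
        u = units[i]
        is_pair = (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if is_pair:
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            width = 2
        else:
            cp = u
            width = 1
        if pos == index:
            return cp
        pos += 1
        i += width
    raise IndexError(index)

def encoded_sizes(s):
    """Return the size in bytes of s in each of the three encodings."""
    return {name: len(s.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}

units = to_utf16_units("a\U0001F600b")
print(len(units))                      # 4 code units for 3 characters
print(chr(code_point_at(units, 2)))    # b
print(encoded_sizes("hello"))          # 5, 10 and 20 bytes respectively
```

For pure ASCII text UTF-32 takes four times the space of UTF-8 and twice
the space of UTF-16, which is where the "two or four times" figure comes
from.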


-- 
Steven


