Oh look, another language (ceylon)
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Mon Nov 18 08:31:33 EST 2013
On Mon, 18 Nov 2013 21:04:41 +1100, Chris Angelico wrote:
> On Mon, Nov 18, 2013 at 8:44 PM, <wxjmfauth at gmail.com> wrote:
>> string
>> Satisfied Interfaces: Category, Cloneable<List<Element>>,
>> Collection<Element>, Comparable<String>,
>> Correspondence<Integer,Element>, Iterable<Element,Null>,
>> List<Character>, Ranged<Integer,String>, Summable<String> A string of
>> characters. Each character in the string is a 32-bit Unicode character.
>> The internal UTF-16 encoding is hidden from clients. A string is a
>> Category of its Characters, and of its substrings:
>
> I'm trying to figure this out. Reading the docs hasn't answered this. If
> each character in a string is a 32-bit Unicode character, and (as can be
> seen in the examples) string indexing and slicing are supported, then
> does string indexing mean counting from the beginning to see if there
> were any surrogate pairs?
I can't figure out what that means, since it contradicts itself. First it
says *every* character is 32 bits (presumably UTF-32), then it says that
internally it uses UTF-16. At least one of these statements is wrong.
(They could both be wrong, but they can't both be right.)
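To make the contradiction concrete, here is a minimal Python sketch that counts how many code units one character needs in each encoding. The helper name code_units is made up for illustration; only the standard str.encode method is used. A character outside the Basic Multilingual Plane takes a single 32-bit unit in UTF-32 but two 16-bit units (a surrogate pair) in UTF-16, so a string cannot be stored as plain UTF-16 and also have every stored element be one 32-bit character.

```python
def code_units(ch):
    """Count the code units needed to store ch in each encoding.

    code_units is a hypothetical helper for illustration only.
    """
    return {
        "utf-8": len(ch.encode("utf-8")),           # 8-bit units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit units
    }

bmp = "\u00e9"         # inside the Basic Multilingual Plane
astral = "\U0001F600"  # in a supplementary plane

print(code_units(bmp))
print(code_units(astral))
```

The explicit "-le" codec variants are used so that no byte order mark is counted.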
Unless they have done something *really* clever, the language designers
lose a hundred million points for screwing up text strings. There is
*absolutely no excuse* for a new, modern language with no backwards
compatibility concerns to choose one of the three bad choices:
* choose UTF-16 or UTF-8, and have O(n) primitive string operations such
as indexing (like Haskell and, apparently, Ceylon);
* choose UTF-16 without support for the supplementary planes (which makes
it virtually UCS-2), like JavaScript;
* choose UTF-32, and use two or four times as much memory as needed.
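The first and third trade-offs in the list above can be sketched in Python. This is an illustration under stated assumptions, not how Ceylon or any other runtime actually stores strings: utf16_units and code_point_at are hypothetical helpers that treat a string as a flat list of 16-bit code units. Finding the character at a given position then means scanning from the start, because each earlier character may occupy one unit or two.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints (hypothetical helper)."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def code_point_at(units, index):
    """Return the code point at character position index.

    Runs in O(n): every earlier unit is inspected to see whether
    it starts a surrogate pair.
    """
    i = 0    # position in code units
    pos = 0  # position in characters
    while i < len(units):
        unit = units[i]
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            cp = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            width = 2
        else:
            cp = unit
            width = 1
        if pos == index:
            return cp
        pos += 1
        i += width
    raise IndexError(index)

text = "a\U0001F600b"
units = utf16_units(text)
print(len(text), len(units))
print(hex(code_point_at(units, 1)))
```

The memory cost of the UTF-32 choice shows up directly in the encoded sizes: for ASCII-only text such as "hello", len("hello".encode("utf-32-le")) is four times len("hello".encode("utf-8")) and twice len("hello".encode("utf-16-le")).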
--
Steven