Oh look, another language (ceylon)

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Nov 18 09:30:54 EST 2013


On Mon, 18 Nov 2013 13:31:33 +0000, Steven D'Aprano wrote:

> On Mon, 18 Nov 2013 21:04:41 +1100, Chris Angelico wrote:
> 
>> On Mon, Nov 18, 2013 at 8:44 PM,  <wxjmfauth at gmail.com> wrote:
>>> string
>>> Satisfied Interfaces: Category, Cloneable<List<Element>>,
>>> Collection<Element>, Comparable<String>,
>>> Correspondence<Integer,Element>, Iterable<Element,Null>,
>>> List<Character>, Ranged<Integer,String>, Summable<String> A string of
>>> characters. Each character in the string is a 32-bit Unicode
>>> character. The internal UTF-16 encoding is hidden from clients. A
>>> string is a Category of its Characters, and of its substrings:
>> 
>> I'm trying to figure this out. Reading the docs hasn't answered this.
>> If each character in a string is a 32-bit Unicode character, and (as
>> can be seen in the examples) string indexing and slicing are supported,
>> then does string indexing mean counting from the beginning to see if
>> there were any surrogate pairs?
> 
> I can't figure out what that means, since it contradicts itself. First
> it says *every* character is 32-bits (presumably UTF-32), then it says
> that internally it uses UTF-16. At least one of these statements is
> wrong. (They could both be wrong, but they can't both be right.)

Mystery solved: characters are only 32-bits in isolation, when plucked 
out of a string.

http://ceylon-lang.org/documentation/tour/language-module/
#characters_and_character_strings

Ceylon strings are arrays of UTF-16 code units. However, the language 
supports characters in the Supplementary Multilingual Plane by having 
primitive string operations walk the string a code point at a time. When 
you extract a character out of the string, Ceylon gives you four bytes. 
Presumably, if you do something like this:

# Python syntax, not Ceylon
mystring = "a\U0010FFFF"
c = mystring[0]
d = mystring[1]

c will consist of bytes 0000 0061 and d will consist of the surrogate 
pair DBFF DFFF (the UTF-16BE encoding of code point U+10FFFF, modulo big-
endian versus little-endian). Or possibly the UTF-32 encoding, 0010 FFFF.
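
To double-check that arithmetic (plain Python again, not Ceylon):

```python
# UTF-16 encodes code points above U+FFFF as a surrogate pair.
def utf16_surrogates(codepoint):
    """Return the (high, low) surrogate pair for a code point > U+FFFF."""
    assert codepoint > 0xFFFF
    offset = codepoint - 0x10000          # leaves 20 bits to split
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low

print([hex(u) for u in utf16_surrogates(0x10FFFF)])
# ['0xdbff', '0xdfff']
```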

I suppose that's not terrible, except for the O(n) string operations, 
which are just dumb. Yes, it's better than buggy, broken strings. But 
still dumb, because those aren't the only choices. For example, for the 
sake of an extra two bytes at the start of each string, they could store 
a flag and a length:

- one bit to flag whether the string contained any surrogate pairs or 
not; if not, string ops could assume two-bytes per char and be O(1), if 
the flag was set it could fall back to the slower technique;

- 15 bits for a length.

15 bits give you a maximum length of 32767. There are ways around that. 
E.g. a length of 0 through 32766 means exactly what it says; a length of 
32767 means that the real length follows in the next four bytes, giving 
you a maximum of 4294967295 characters per string. That's an 8GB string. 
Surely big enough for anyone :-)
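
Sketched in Python (my own illustration of the scheme, not anything 
Ceylon actually does):

```python
import struct

ESCAPE = 32767  # 15-bit field value meaning "real length follows"

def encode_header(has_surrogates, length):
    """Pack the hypothetical header: 1 surrogate-flag bit plus a 15-bit
    length, escaping to a full 32-bit length (hence the 4294967295
    maximum) for long strings."""
    flag = 0x8000 if has_surrogates else 0
    if length < ESCAPE:
        return struct.pack(">H", flag | length)       # 2-byte header
    return struct.pack(">HI", flag | ESCAPE, length)  # 6-byte header

print(len(encode_header(False, 100)))   # 2
print(len(encode_header(True, 10**6)))  # 6
```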

That gives you O(1) length for *any* string, and O(1) indexing operations 
for those that are entirely in the BMP, which will be most strings for 
most people. It's not 1970 anymore, it's time for strings to be treated 
more seriously and not just as dumb arrays of char. Even back in the 
1970s Pascal had a length byte. It astonishes me that hardly any low-
level language follows their lead.
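
The fast-path/slow-path split would look something like this (again my 
own sketch, in Python, operating on a list of UTF-16 code units):

```python
def char_at(units, has_surrogates, i):
    """Return the i-th code point from a list of UTF-16 code units.
    O(1) when the flag says the string is all-BMP, O(n) otherwise."""
    if not has_surrogates:
        return units[i]                 # one code unit per character
    # Slow path: walk the units one code point at a time.
    j = 0
    for _ in range(i):                  # skip i whole code points
        j += 2 if 0xD800 <= units[j] <= 0xDBFF else 1
    unit = units[j]
    if 0xD800 <= unit <= 0xDBFF:        # high surrogate: combine the pair
        return 0x10000 + ((unit - 0xD800) << 10) + (units[j + 1] - 0xDC00)
    return unit

# "a\U0010FFFF" as UTF-16 code units:
units = [0x0061, 0xDBFF, 0xDFFF]
print(hex(char_at(units, True, 1)))  # 0x10ffff
```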



-- 
Steven


