Ah Python, you have spoiled me for all other languages

Sun Jun 7 07:42:05 EDT 2015

On Sun, 7 Jun 2015 06:21 pm, Thomas 'PointedEars' Lahn wrote:

> Ned Batchelder wrote:
> 
>> On Saturday, May 23, 2015 at 9:01:29 AM UTC-4, Steven D'Aprano wrote:
>>> On Sat, 23 May 2015 10:33 pm, Thomas 'PointedEars' Lahn wrote:
>>> > If only characters were represented as sequences UTF-16 code units in
>>> > ECMAScript implementations like JavaScript, there would not be a
>>> > problem beyond the BMP;
>>> 
>>> Are you being sarcastic?
>> 
>> IIUC, Thomas' point is that *characters* should be sequences of
>> codepoints, not that *strings* should be.
> 
> No, my point is that one character should be a sequence of code _units_
> (for a code point value).  

I don't understand this sentence. "Code point value" doesn't appear to be
meaningful. "Code point" is a value in the Unicode codespace, informally "a
character" (but see below); code points can take on values in the range 0
to 1114111, usually written in hex as U+0000 to U+10FFFF.

"Code value" is an obsolete term for code unit, that is, the smallest chunk
of memory used to represent a code point. For example, UTF-8 uses 8-bit
code units and UTF-32 uses 32-bit code units.
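
To make the distinction concrete, here is a small Python 3 sketch that counts
how many code units each encoding form needs for a single code point. The
helper name code_units is made up for illustration, and the "-le" codec
variants are used only so that no byte order mark is counted.

```python
def code_units(s, encoding, unit_bytes):
    """Return how many code units of the given width encode the string s."""
    return len(s.encode(encoding)) // unit_bytes

# One ASCII letter, one Latin-1 letter, one BMP symbol, one astral code point.
for ch in ("A", "\u00fb", "\u20ac", "\U0001F600"):
    print("U+%04X" % ord(ch),
          "utf-8:", code_units(ch, "utf-8", 1),
          "utf-16:", code_units(ch, "utf-16-le", 2),
          "utf-32:", code_units(ch, "utf-32-le", 4))
```

Every one of these is a single code point, yet the number of code units
varies from one to four depending on the encoding form.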

But "code point value", I'm not sure what you mean by that. Consequently I
have no idea what you think a character should be. Is "Hello World" a
character? How about "Æ" or "û"?

The term "character" is problematic, because what counts as a character
depends on where you are and how the string is normalised. For example:

"ij" could be two characters, the letter i followed by j, or one, the 25th
letter of the Dutch alphabet [and not even the Dutch agree on this];
conversely, "ĳ" could be a single character, or a ligature of two
characters.

"Ḗ" (U+1E16 LATIN CAPITAL LETTER E WITH MACRON AND ACUTE) could be
considered one character, or three 'E\u0304\u0301', depending on whether it
is normalised or not.
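
A quick way to see this for yourself is with the standard unicodedata module.
This is only a sketch; the variable names are mine.

```python
import unicodedata

composed = "\u1e16"  # LATIN CAPITAL LETTER E WITH MACRON AND ACUTE
decomposed = unicodedata.normalize("NFD", composed)

# The same "character" is one code point in NFC form, three in NFD form.
print(len(composed), len(decomposed))
print([unicodedata.name(c) for c in decomposed])

# Normalising back to NFC recovers the single precomposed code point.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

The two strings look the same on screen, but compare unequal unless both are
normalised to the same form first.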

So I'm afraid I do not understand your sentence.

Code point: http://www.unicode.org/glossary/#code_point

Code unit: http://www.unicode.org/glossary/#code_unit

Code value: http://www.unicode.org/glossary/#code_value

See also http://unicode.org/faq/char_combmark.html

> But in ECMAScript implementations (so far), a *code 
> point value* equals a character, and that is a problem in ECMAScript
> because
> there the value range is limited to what can be encoded in 16 bit.  The
> problem starts beyond the BMP where 16 bit are no longer sufficient for a
> code sequence and code point value, and code sequence and code point value
> are no longer equal.

This is no clearer.

I *think* what you are trying to say is that ECMAScript assumes that one
code point is always represented by a single code unit. So a sequence of
code points ABCD will be correctly interpreted as four "characters" so long
as each of those code points is in the BMP (i.e. between U+0000 and U+FFFF
inclusive), but *not* if they are from one of the supplementary planes.
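
If my reading is right, the effect can be imitated from Python 3 by counting
UTF-16 code units instead of code points. This is a sketch under that
assumption; utf16_units is a hypothetical helper of mine, not anything taken
from ECMAScript itself.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

bmp = "ABCD"
astral = "A\U0001D11EB"  # U+1D11E MUSICAL SYMBOL G CLEF is outside the BMP

# Code points versus UTF-16 code units for each string.
print(len(bmp), len(utf16_units(bmp)))
print(len(astral), len(utf16_units(astral)))

# The supplementary code point shows up as a surrogate pair.
print(["%04X" % u for u in utf16_units(astral)])
```

The BMP-only string has the same length either way. The second string has
three code points but four code units, because U+1D11E is stored as the
surrogate pair D834 DD1E.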

This is the same problem that older Python "narrow builds" suffered from.
The solutions in Python were to use a wide build (each code point is
represented by a single UTF-32 code unit, that is, four bytes) or to
upgrade to Python 3.3 or later, which uses a flexible representation
(PEP 393) where strings are stored using 1 byte per code point, 2 bytes per
code point, or 4 bytes per code point, whichever is the minimum needed for
that particular string.
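
You can observe the three widths from inside CPython 3.3 or later with
sys.getsizeof. Treat this as a rough sketch: it relies on a CPython
implementation detail, other interpreters may report something quite
different, and bytes_per_code_point is a name I made up.

```python
import sys

def bytes_per_code_point(ch, n=1000):
    """Estimate per-code-point storage by comparing two string sizes.

    Subtracting the sizes cancels out the fixed object header.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

# ASCII, Latin-1, BMP and supplementary-plane examples.
for ch in ("a", "\u00fb", "\u0100", "\U0001F600"):
    print("U+%04X" % ord(ch), bytes_per_code_point(ch))
```

Whatever the storage width, len() reports the number of code points.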

My opinion is that a programming language like Python or ECMAScript should
operate on *code points*. If we want to call them "characters" informally,
that should be allowed, but whenever there is ambiguity we should remember
we're dealing with code points. The implementation shouldn't matter:
compliant Python interpreters might choose to use UTF-8 internally, or
UTF-16, or UTF-32, or something else, and still agree on how many
characters a string contains. Normalisation is still an issue, of course,
but any decent Unicode implementation will include a way to normalise or
denormalise strings.

The question of graphemes (what "ordinary people" consider letters and
characters, e.g. "ch" is two letters to an English speaker but one letter
to a Czech speaker) should be left to libraries. It's a much harder problem
to solve in the full general case, requires localisation, and is overkill
for many string-processing tasks.

-- 
Steven