[Python-Dev] UCS2/UCS4 default

Sat Jul 5 07:35:18 CEST 2008

>> The premise is the OP's idea that Python should switch to all UCS4 to
>> create a more pure ('ideal') situation or the idea that len(s) should
>> count codepoints (correct term?) for all builds as a matter of purity
>> even though it would be time-costly on 16-bit builds as a matter
>> of practicality.

> No, Terry did definitely mean Unicode scalar values.

True. However, using the word "code point" to refer to "Unicode scalar
values" is also correct. He (rather, the OP) wanted to count code
points (i.e. not count code units).

> Practical len() returns the number of code units of the internal storage format.

No, it returns the number of code units.

> Pure len() allegedly would return the number of Unicode scalar values (obviously
> a surrogate pair would be considered a single Unicode scalar value).

Perhaps-not-so-obviously-but-still-intended, a pure len counting
surrogate pairs as one would *also* count code points.
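
To make the two counts concrete, here is a minimal sketch in Python 3.
It assumes a string stored as UTF-16 code units, as on a 16-bit build.
The helper names (to_code_units, practical_len, pure_len) are made up
for this illustration and are not part of any Python API. Note that a
lone surrogate still counts as one code point here, even though it is
not a Unicode scalar value.

```python
import struct


def to_code_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def practical_len(units):
    """Count code units: constant-time on a real 16-bit build."""
    return len(units)


def pure_len(units):
    """Count code points: a high+low surrogate pair counts as one.

    This needs a full scan, which is the time cost mentioned above.
    """
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


units = to_code_units("a\U00010000b")
print(practical_len(units))  # 4 code units
print(pure_len(units))       # 3 code points
```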

> Please keep in mind that encodings encode Unicode scalar values.

A "coded character set" is "a character set in which each character is
assigned a numeric code point". So clearly, a character encoding form
encodes code points.

> Thus a UTF-8
> code unit sequence (or UTF-32 code unit) that would give a code point in the
> surrogate sections is technically in error. 

Sure, but this has nothing to do with Terry's terminology use.
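
For what it's worth, the distinction can be sketched in Python 3 (not
the 2.x builds under discussion). A str can hold a lone surrogate code
point, but the strict UTF-8 codec refuses to encode it. The function
is_scalar_value is a hypothetical helper written for this example.

```python
def is_scalar_value(cp):
    """True if cp is a Unicode scalar value: a code point outside D800..DFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


lone = "\ud800"  # a code point, but not a scalar value

try:
    lone.encode("utf-8")
    encoded_ok = True
except UnicodeEncodeError:
    # The strict codec rejects surrogate code points.
    encoded_ok = False

print(is_scalar_value(ord(lone)))  # False
print(encoded_ok)                  # False
```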

Regards,
Martin