Unicode is not UTF-32 [was Re: Cult-like behaviour]

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Jul 16 20:58:09 EDT 2018


On Mon, 16 Jul 2018 22:40:13 +0300, Marko Rauhamaa wrote:

> Terry Reedy <tjreedy at udel.edu>:
> 
>> On 7/15/2018 5:28 PM, Marko Rauhamaa wrote:
>>> if your new system used Python3's UTF-32 strings as a foundation,
>>
>> Since 3.3, Python's strings are not (always) UTF-32 strings.
> 
> You are right. Python's strings are a superset of UTF-32. More
> accurately, Python's strings are UTF-32 plus surrogate characters.

The first thing you are doing wrong is conflating the semantics of the 
data type with one possible implementation of that data type. UTF-32 is 
implementation, not semantics: it specifies how to represent Unicode code 
points as bytes in memory, not what Unicode code points are.

Python 3 strings are sequences of abstract characters ("code points") 
with no mandatory implementation. In CPython, some string objects are 
stored one byte per code point (Latin-1), some two bytes per code point 
(UCS-2), and some four (UCS-4, which is byte-for-byte identical to 
UTF-32). Some implementations (MicroPython) use UTF-8.
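
You can watch that Flexible String Representation at work. A minimal 
sketch, assuming CPython 3.3 or later (the absolute byte counts from 
sys.getsizeof vary by build and version; what matters is the growth 
per character):

    import sys

    # Each string uses the narrowest storage wide enough for its
    # largest code point (PEP 393).
    print(sys.getsizeof("a" * 100))           # 1 byte per code point
    print(sys.getsizeof("\u0100" * 100))      # 2 bytes per code point
    print(sys.getsizeof("\U00010000" * 100))  # 4 bytes per code point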

Your second error is a more minor point: it isn't clear (at least not to 
me) that "Unicode plus surrogates" is a superset of Unicode. Surrogates 
are part of Unicode. The only extension here is that Python strings are 
not necessarily well-formed surrogate-free Unicode strings, but they're 
still Unicode strings.
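
That looseness is easy to demonstrate in standard CPython, with no 
extra assumptions:

    # A lone surrogate is a legal element of a str, even though a
    # well-formed Unicode string would never contain one unpaired.
    s = "\ud800"
    print(len(s))            # 1 code point
    try:
        s.encode("utf-8")    # strict UTF-8 rejects lone surrogates
    except UnicodeEncodeError as err:
        print(err)
    print(s.encode("utf-8", errors="surrogatepass"))  # b'\xed\xa0\x80'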


>> Nor are they always UCS-2 (or partly UTF-16) strings. Nor are they
>> always Latin-1 or ASCII strings. Python's Flexible String
>> Representation uses the narrowest possible internal code for any
>> particular string. This is all transparent to the user except for
>> memory size.
> 
> How CPython chooses to represent its strings internally is not what I'm
> talking about.

Then why do you repeatedly talk about the internal storage representation?

UTF-32 is not a character set, it is an encoding: it specifies how to 
represent a sequence of abstract Unicode code points as concrete bytes.


>>> UTF-32, after all, is a variable-width encoding.
>>
>> Nope.  It is a fixed-width (32 bits, 4 bytes) encoding.
>>
>> Perhaps you should ask more questions before pontificating.
> 
> You mean each code point is one code point wide. But that's rather an
> irrelevant thing to state.

No, he means that each code point is one code unit wide.
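
The difference is easy to see if you encode a single astral code point 
with each of the three standard UTF codecs and count the code units:

    # U+1F600 GRINNING FACE lies outside the Basic Multilingual Plane.
    ch = "\U0001F600"
    print(len(ch.encode("utf-8")))           # 4 one-byte code units
    print(len(ch.encode("utf-16-le")) // 2)  # 2 two-byte code units
    print(len(ch.encode("utf-32-le")) // 4)  # 1 four-byte code unit

UTF-32 always takes exactly one code unit per code point; that is what 
makes it fixed-width.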


> The main point is that UTF-32 (aka Unicode)

UTF-32 is not a synonym for Unicode. Many legacy encodings don't 
distinguish between the character set and the mapping between bytes and 
characters, but Unicode is not one of those.
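
Python itself keeps the two concepts separate: one abstract string, one 
code point, but a different byte sequence for each encoding you choose:

    s = "\u03c0"                  # GREEK SMALL LETTER PI
    print(ord(s))                 # 960 -- a fact about Unicode
    print(s.encode("utf-8"))      # b'\xcf\x80'
    print(s.encode("utf-16-le"))  # b'\xc0\x03'
    print(s.encode("utf-32-le"))  # b'\xc0\x03\x00\x00'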


> uses one or more code points to represent what people would consider an
> individual character.

That's a reasonable observation to make (see the example below), but 
it's not what fixed- and variable-width refer to.

So does ASCII, and in both cases it is irrelevant, since the term of 
art defines fixed- and variable-width in terms of code units per *code 
point*, not per human-meaningful character. "Character" is context- and 
language-dependent and frequently ambiguous. "LL" or "CH" (for example) 
could be one character or two, depending on context and language.

Even in ASCII English, something as large as "ough" might be considered 
a single unit of language, which some people might choose to call a 
character. (But not a single letter, naturally.) If you don't like that 
example, "qu" is probably a better one: aside from acronyms and loan 
words, every Q in a modern English word is followed by a U.
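
For the record, here is Marko's observation expressed in code, using 
nothing but the standard library:

    import unicodedata

    # One user-perceived character, two code points:
    s = "e\u0301"                          # 'e' + COMBINING ACUTE ACCENT
    print(s, len(s))                       # prints the accented e, length 2
    nfc = unicodedata.normalize("NFC", s)  # compose to single U+00E9
    print(nfc, len(nfc))                   # same glyph, length 1

True, but it tells you about graphemes versus code points, not about 
fixed- versus variable-width encodings.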


> Code points are about as interesting as individual bytes in UTF-8.

That's your opinion. I see no justification for it.



-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson



