Cult-like behaviour [was Re: Kindness]

Chris Angelico rosuav at gmail.com
Mon Jul 16 16:07:33 EDT 2018


On Tue, Jul 17, 2018 at 5:40 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Terry Reedy <tjreedy at udel.edu>:
>
>> On 7/15/2018 5:28 PM, Marko Rauhamaa wrote:
>>> if your new system used Python3's UTF-32 strings as a foundation,
>>
>> Since 3.3, Python's strings are not (always) UTF-32 strings.
>
> You are right. Python's strings are a superset of UTF-32. More
> accurately, Python's strings are UTF-32 plus surrogate characters.
>
>> Nor are they always UCS-2 (or partly UTF-16) strings. Nor are they
>> always Latin-1 or ASCII strings. Python's Flexible String
>> Representation uses the narrowest possible internal code for any
>> particular string. This is all transparent to the user except for
>> memory size.
>
> How CPython chooses to represent its strings internally is not what I'm
> talking about.

Then don't talk about UTF-32, which is a representation format.
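
For what it's worth, the memory-size effect Terry describes is easy to
observe. This is only a rough sketch and assumes CPython 3.3 or later:
sys.getsizeof() is implementation-specific, and the helper name
bytes_per_char is mine, not anything from the standard library. On a
typical CPython build I would expect widths of 1, 1, 2 and 4 bytes.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate the internal storage per character for a string of `ch`.

    Taking the difference between two lengths cancels out the fixed
    object header, leaving only the per-character cost.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

samples = [
    ("ASCII", "a"),                      # U+0061
    ("Latin-1", "\u00e9"),               # U+00E9
    ("BMP", "\u0151"),                   # U+0151
    ("astral", "\U0001F600"),            # U+1F600
]

widths = {label: bytes_per_char(ch) for label, ch in samples}
for label, width in widths.items():
    print(f"{label:8} {width} byte(s) per code point")

# Whatever the internal width, len() still counts code points.
assert all(len(ch * 10) == 10 for _, ch in samples)
```

None of this changes what the string *is*: indexing and len() behave
the same for all four, which is why the internal width is not the same
thing as UTF-32.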

>>> UTF-32, after all, is a variable-width encoding.
>>
>> Nope.  It is a fixed-width (32 bits, 4 bytes) encoding.
>>
>> Perhaps you should ask more questions before pontificating.
>
> You mean each code point is one code point wide. But that's rather an
> irrelevant thing to state. The main point is that UTF-32 (aka Unicode)
> uses one or more code points to represent what people would consider an
> individual character.

No, in UTF-32 each code point is exactly one code unit wide, which is
what makes it a fixed-width encoding. It's not irrelevant.
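
A small sketch of the distinction, assuming nothing beyond the standard
codecs (the helper name utf32_units is made up for illustration). It
counts 32-bit code units in the UTF-32 encoding and compares that with
the number of code points, including for a string that a person would
read as a single character.

```python
def utf32_units(s):
    """Number of 32-bit code units in the UTF-32 encoding of `s`.

    "utf-32-le" is used so that no byte order mark is prepended.
    """
    return len(s.encode("utf-32-le")) // 4

samples = {
    "letter a": "a",
    "o + combining acute": "o\u0301",
    "UK flag": "\U0001F1EC\U0001F1E7",   # two regional indicator symbols
}

for label, s in samples.items():
    print(f"{label:22} code points={len(s)}  UTF-32 units={utf32_units(s)}")

# One code unit per code point, always, even when several code points
# combine into one user-perceived character.
assert all(len(s) == utf32_units(s) for s in samples.values())
```

So the flag is two code points and two code units. What varies is how
many code points make up a perceived character, not how wide a code
point is.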

> The letter "a" is encoded as a single code point, but 🇬🇧 (Flag, United
> Kingdom) is two code points wide and 🏴 (Flag, England) is seven (!)
> code points wide, not to forget 🧖‍♂️ (Man in Steamy Room) with four code
> points. <URL: https://unicode.org/emoji/charts/full-emoji-list.html>
>
> And of course, regular West-European letters can be represented by
> multiple code points.
>
> Code points are about as interesting as individual bytes in UTF-8.

Individual bytes in UTF-8 do not have individual meaning. Individual
code points do, with the partial exception of the flag characters
(which are pretty poorly supported anyway). Otherwise, every code
point is either a base character with general meaning, or a combining
character (or variant selector) that represents a specific change.
They can be composed in different ways. For example:

U+006F U+0301 "ó" LATIN SMALL LETTER O WITH ACUTE
U+006F U+030B "ő" LATIN SMALL LETTER O WITH DOUBLE ACUTE
U+0075 U+0301 "ú" LATIN SMALL LETTER U WITH ACUTE
U+0075 U+030B "ű" LATIN SMALL LETTER U WITH DOUBLE ACUTE

The UTF-8 representations of the precomposed forms of these characters
(U+00F3, U+0151, U+00FA, U+0171) are:
C3 B3
C5 91
C3 BA
C5 B1

What does byte value C5 mean? What does 91 mean? Neither of these has
meaning on its own. The only way you can interpret them is as a full
set. In contrast, each code point in the sequences above has meaning
on its own: it is either a base character or a combining character.
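
This can be checked with the unicodedata module. The following is a
minimal sketch; the variable names are mine. It names each code point
of the decomposed sequences, composes them with NFC normalization, and
then shows that a lone lead byte from the UTF-8 form cannot be decoded.

```python
import unicodedata

# The four decomposed sequences from the table above.
decomposed = ["o\u0301", "o\u030b", "u\u0301", "u\u030b"]

utf8_forms = []
for seq in decomposed:
    # Every code point has a name, i.e. a meaning of its own.
    parts = [unicodedata.name(cp) for cp in seq]
    # NFC composes base + combining mark into the precomposed character.
    composed = unicodedata.normalize("NFC", seq)
    utf8 = " ".join(f"{b:02X}" for b in composed.encode("utf-8"))
    utf8_forms.append(utf8)
    print(f"{' + '.join(parts)}  ->  U+{ord(composed):04X}  UTF-8: {utf8}")

# A single byte of a multi-byte UTF-8 sequence is not decodable.
try:
    b"\xc5".decode("utf-8")
    lone_byte_ok = True
except UnicodeDecodeError:
    lone_byte_ok = False
print("lone C5 decodable:", lone_byte_ok)
```

The names come straight from the Unicode Character Database that ships
with Python, so the exact database version depends on the interpreter.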

So, no, individual code points are very interesting.

ChrisA


