evaluation question

Wed Feb 1 20:21:07 EST 2023

On 2/1/23 3:59 AM, Muttley at dastardlyhq.com wrote:
> On Wed, 1 Feb 2023 11:59:25 +1300
> Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
>> On 31/01/23 10:24 pm, Muttley at dastardlyhq.com wrote:
>>> All languages have their ugly corners due to initial design mistakes and/or
>>> constraints. Eg: java with the special behaviour of its string class, C++
>>> with "=0" pure virtual declaration. But they don't dump them and make all old
>>> code suddenly cease to execute.
>> No, but it was decided that Python 3 would have to be backwards
>> incompatible, mainly to sort out the Unicode mess. Given that,
>> the opportunity was taken to clean up some other mistakes as well.
> Unicode is just a string of bytes. C supports it with a few extra library
> functions to get unicode length vs byte length and similar. Its really
> not that hard. Rewriting an entire language just to support that sounds a
> bit absurd to me but hey ho...
>
No, Unicode is a sequence of code points, each of which fits in 21
bits. UTF-8 is one encoding of those code points as bytes, but it
isn't itself "Unicode".

The key fact is that a Python 3 string (str) is indexed not by the
bytes of its UTF-8 encoding, but by actual characters (code points).
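
A small sketch of the difference, in case it helps. The sample text is
just something made up for illustration; it mixes an ASCII letter, a
Latin-1 letter, a BMP symbol, and one code point outside the BMP:

```python
# Made-up sample text: "naive" with a diaeresis, a euro sign, an emoji.
s = "na\u00efve \u20ac5 \U0001F600"
encoded = s.encode("utf-8")

# len() of a str counts code points; len() of bytes counts bytes.
print(len(s), len(encoded))

# Indexing a str yields a one-code-point str.
print(ascii(s[2]), ascii(s[9]))

# Indexing bytes yields an int: here the first byte of the two-byte
# UTF-8 sequence for U+00EF.
print(hex(encoded[2]))
```

The same index lands on a whole character in the str, but only on the
first byte of a multi-byte sequence in the encoded form.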

Python 3 will store a string as a sequence of bytes if the data is
all Latin-1, as a sequence of 16-bit words if the data all fits in the
BMP, or as a sequence of 32-bit words if it has a value outside the BMP.
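
One rough way to observe this, assuming CPython 3.3 or later (the
storage layout is an implementation detail, and sys.getsizeof reports
implementation-specific sizes). The helper name bytes_per_code_point is
made up for this sketch; it compares two strings of different lengths
so that the fixed object header cancels out:

```python
import sys

def bytes_per_code_point(ch, n=1000):
    """Estimate per-code-point storage from the size difference
    between a string of 2*n copies of ch and one of n copies."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

samples = [
    ("ASCII letter", "a"),
    ("Latin-1 letter", "\u00e9"),
    ("BMP symbol", "\u20ac"),
    ("outside the BMP", "\U0001F600"),
]

for label, ch in samples:
    print(label, bytes_per_code_point(ch))
```

On CPython this should report 1, 1, 2 and 4 bytes per code point
respectively; other implementations are free to do something else.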

-- 
Richard Damon