Glyphs and graphemes [was Re: Cult-like behaviour]

Tue Jul 17 02:20:16 EDT 2018

On Tue, Jul 17, 2018 at 2:31 PM Marko Rauhamaa <marko at pacujo.net> wrote:
>
> Steven D'Aprano <steve+comp.lang.python at pearwood.info>:
> > On Mon, 16 Jul 2018 22:51:32 +0300, Marko Rauhamaa wrote:
> >> UTF-8 bytes can only represent the first 128 code points of Unicode.
> >
> > This is DailyWTF material. Perhaps you want to rethink your wording
> > and maybe even learn a bit more about Unicode and the UTF encodings
> > before making such statements.
> >
> > The idea that UTF-8 bytes cannot represent the whole of Unicode is not
> > even wrong. Of course a *single* byte cannot, but a single byte is not
> > "UTF-8 bytes".
>
> So I hope that by now you have understood my point and been able to
> decide if you agree with it or not.
>
>
> Marko

I still don't understand what your original point is.
I think UTF-8 vs UTF-32 is a completely different question from Python 2 vs 3.

For example, strings in Rust and Swift (2010s languages!) are *valid* UTF-8.
There is a strong separation between byte arrays and strings, even though they use UTF-8.
They look similar to Python 3, not Python 2.
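
To make the comparison concrete, here is a minimal Python 3 sketch of that separation. The variable names are only illustrative. It shows that str and bytes do not mix implicitly, and that decoding as UTF-8 rejects invalid input by default.

```python
# Python 3 keeps text (str) and raw bytes (bytes) as distinct types.
text = "naïve café"
data = text.encode("utf-8")  # explicit conversion from str to bytes

# Mixing the two types is an error rather than an implicit conversion.
mixing_error = None
try:
    text + data
except TypeError as exc:
    mixing_error = exc

# Decoding validates the input: this is Latin-1 data, not valid UTF-8.
invalid = b"caf\xe9"
decode_error = None
try:
    invalid.decode("utf-8")
except UnicodeDecodeError as exc:
    decode_error = exc

print(type(mixing_error).__name__)
print(type(decode_error).__name__)
```

Python 2 would have attempted an implicit ASCII conversion when mixing its byte and Unicode string types, which is the behaviour Python 3 removed.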

And Python could use UTF-8 as its internal encoding in the future.
AFAIK, PyPy is trying it now.  If they succeed, I want to try porting it
to CPython after we have removed the legacy Unicode APIs (ref. PEP 393).

So "UTF-8 is better than UTF-32" is a totally different problem from
"Python 2 is better than 3".

Is your point that "accepting invalid UTF-8 implicitly by default is better
than an explicit 'surrogateescape' error handler", as in Go?
(It is also a 2010s language with UTF-8 based strings, but it accepts invalid
UTF-8.)
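
For reference, a small sketch of what the explicit handler does in Python 3. It compares the default strict decoding with 'surrogateescape' and with 'replace'. The 'replace' case is only a rough analogue of tolerating invalid input. It is not a model of how Go stores strings.

```python
raw = b"abc\xffdef"  # 0xFF can never appear in valid UTF-8

# Default (strict) decoding refuses the invalid byte.
strict_error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    strict_error = exc

# surrogateescape smuggles the bad byte through as a lone surrogate
# (U+DCFF here), so the original bytes can be recovered exactly.
escaped = raw.decode("utf-8", errors="surrogateescape")
roundtrip = escaped.encode("utf-8", errors="surrogateescape")

# replace is lossy: the bad byte becomes U+FFFD and cannot be recovered.
replaced = raw.decode("utf-8", errors="replace")

print(ascii(escaped))
print(ascii(replaced))
print(roundtrip == raw)
```

The point of the explicit handler is that the caller has to opt in to carrying invalid bytes inside a str, and has to opt in again when encoding it back.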

Regards,

--
INADA Naoki  <songofacandy at gmail.com>