Glyphs and graphemes [was Re: Cult-like behaviour]

Tue Jul 17 04:04:29 EDT 2018

On Tue, 17 Jul 2018 15:20:16 +0900, INADA Naoki wrote (replying to Marko):

> I still don't understand what your original point is. I think UTF-8 vs
> UTF-32 is totally different from Python 2 vs 3.
> 
> For example, strings in Rust and Swift (2010s languages!) are *valid*
> UTF-8. There is a strong separation between byte arrays and strings, even
> though they use UTF-8. They look similar to Python 3, not Python 2.
> 
> And Python can use UTF-8 for its internal encoding in the future. AFAIK,
> PyPy is trying it now.  After they succeed,  I want to try porting it to
> CPython after we remove the legacy Unicode APIs. (ref PEP 393)

I'm not sure about PyPy, but I'm fairly certain that MicroPython uses 
UTF-8.

I would be very interested to see the results of using UTF-8 in CPython. 
At the least, it would remove the need to cache a separate UTF-8 
representation in the string object, as CPython does now. It might even 
be more compact, although a naive implementation would lose the ability 
to do constant-time indexing into strings.

That might be a tradeoff worth making, if indexing remained sufficiently 
fast.
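
To make the tradeoff concrete, here is a rough sketch in pure Python. It 
is only an illustration, not how CPython, PyPy or MicroPython actually 
do it: the helper name utf8_index and the sample string are my own 
inventions. It compares the size of a UTF-8 buffer with a fixed-width 
UTF-32 one, and shows why naive indexing by code point into UTF-8 has to 
scan from the start of the buffer.

```python
def utf8_index(buf: bytes, i: int) -> str:
    """Return the i-th code point of a UTF-8 buffer by scanning from the start.

    Hypothetical helper for illustration only. The cost is O(n) in the
    length of the buffer, unlike indexing a fixed-width representation.
    """
    if i < 0:
        raise IndexError("negative indexes are not supported in this sketch")
    count = -1      # index of the code point most recently started
    start = None    # byte offset where code point i begins
    for pos, byte in enumerate(buf):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # begins a new code point.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == i:
                start = pos
            elif count == i + 1:
                # The next code point begins here, so code point i ends here.
                return buf[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("string index out of range")
    # Code point i was the last one in the buffer.
    return buf[start:].decode("utf-8")


# Sample text mixing ASCII, Latin-1 and CJK characters (escaped so the
# source file encoding does not matter).
s = "na\u00efve caf\u00e9 \u65e5\u672c\u8a9e"
buf = s.encode("utf-8")

# Variable width versus four bytes per code point.
print(len(s), "code points")
print(len(buf), "bytes as UTF-8")
print(len(s.encode("utf-32-le")), "bytes as UTF-32")

# The scan agrees with ordinary str indexing at every position.
print(all(utf8_index(buf, k) == s[k] for k in range(len(s))))
```

A real implementation would presumably avoid the full scan for the 
common cases, for example by treating pure-ASCII strings specially or by 
keeping some kind of index of byte offsets, but that is speculation on 
my part about how "sufficiently fast" might be reached.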

-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson