Grapheme clusters, a.k.a. real characters

Steve D'Aprano steve+python at pearwood.info
Sat Jul 15 07:08:38 EDT 2017


On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:

> Steve D'Aprano <steve+python at pearwood.info>:
> 
>> On Sat, 15 Jul 2017 04:10 am, Marko Rauhamaa wrote:
>>> Python3's strings don't give me any better random access than UTF-8.
>>
>> Say what? Of course they do.
>>
>> Python 3 strings (since 3.3) are a compact form of UTF-32. Without loss of
>> generality, we can say that each string is an array of four-byte code units.
> 
> Yes, and a UTF-8 byte array gives me random access to the UTF-8
> single-byte code units.

Which is irrelevant. Single code units in UTF-8 aren't important. Nobody needs
to start a slice at the middle byte of a three-byte code point in UTF-8. It's
not a useful operation, and allowing slices at arbitrary positions inside
UTF-8 sequences means you soon won't have valid UTF-8 any more.

Now since I am interested in a good-faith discussion, I can even point out
something that supports your argument: perhaps we could introduce restrictions
on where you can slice, ensuring that slices fall only on code point
boundaries. So if you try to slice string[100:120], say, what you actually get
is string[98:119], because that's where the nearest code point boundaries fall.

Or should it move forward? string[101:122], say.
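To make the idea concrete, here is a sketch of what "snap backwards to the
nearest boundary" might look like (the helper name is hypothetical; this is an
illustration of the mechanics, not a proposal for an API):

```python
def snap_back(b: bytes, i: int) -> int:
    """Move index i backwards to the nearest UTF-8 code point boundary.

    In UTF-8, continuation bytes match the bit pattern 10xxxxxx
    (i.e. byte & 0xC0 == 0x80); any other byte starts a code point.
    """
    while 0 < i < len(b) and (b[i] & 0xC0) == 0x80:
        i -= 1
    return i

text = "caf\u00e9s".encode("utf-8")   # b'caf\xc3\xa9s'
# Index 4 points at the continuation byte of the é, so it snaps back to 3:
assert snap_back(text, 4) == 3
assert text[:snap_back(text, 4)].decode("utf-8") == "caf"
```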

Perhaps the Zen of Python is better: when faced with ambiguity, avoid the
temptation to guess. We should either prohibit slicing anywhere except on a
code point boundary, or better still use a data structure that doesn't expose
the internal implementation of code points.

Whichever way we go, it doesn't get us any closer to our ultimate aim, which is
a text data type based on graphemes rather than code points. All it does is
give us what Python's unicode strings already give us: code points.

So what does that extra complexity forced on us by UTF-8 give us, apart from a
headache? Why use UTF-8?


> Neither gives me random access to the "Grapheme clusters, a.k.a. real
> characters". For example, the HFS+ file system uses a variant of
> NFD for filenames, meaning both UTF-32 and UTF-8 give you random access
> to pure ASCII filenames only.

And they're not graphemes either. Normalisation doesn't give you graphemes.

It's ironic that you give the example of Apple using NFD, since that makes the
problem you are railing against *worse* rather than better. Decomposition has
its uses, but the specific problem this thread started with is made worse due
to decomposition.
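The effect is easy to demonstrate with the stdlib unicodedata module: NFD
turns one precomposed code point into two, so the code point count changes,
indexes shift, and even string equality by code points breaks down:

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "café")   # é as one code point, U+00E9
nfd = unicodedata.normalize("NFD", "café")   # é as 'e' + U+0301 combining acute

assert len(nfc) == 4
assert len(nfd) == 5
assert nfc != nfd     # canonically equivalent text, but unequal by code points
assert nfd[3] == "e"  # indexing now lands on the bare base letter, not "é"
```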


>> UTF-8 is not: it is a variable-width encoding,
> 
> UTF-32 is a variable-width encoding as well.

No it isn't. All code points are exactly one four-byte code unit in size.


> For example, "baby: medium skin tone" is U+1F476 U+1F3FD:

That's two code points, not one. Emoji modifiers, like variation selectors,
present the same issues as combining characters.
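Python shows this directly: the emoji renders as a single glyph, but it is two
code points, and naive indexing can separate the skin-tone modifier from its
base:

```python
baby = "\U0001F476\U0001F3FD"   # BABY + EMOJI MODIFIER FITZPATRICK TYPE-4

assert len(baby) == 2           # two code points, displayed as one glyph
assert baby[0] == "\U0001F476"  # slicing can split the pair apart
```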


>   <URL: http://unicode.org/emoji/charts/full-emoji-list.html#1f476_1f3fd>
> 
>> Go ignores this problem by simply not offering random access to code
>> points in strings.
> 
> Random access to code points is as uninteresting as random access to
> UTF-8 bytes.

I have random access to code points in Python right now, and I use it all the
time to extract code points and even build up new strings from slices. I
wouldn't do that with UTF-8 bytes; it's too bloody hard.


> I might want random access to the "Grapheme clusters, a.k.a. real
> characters".

That would be nice to have, but the truth is that for most coders, Unicode code
points are the low-hanging fruit that get you 95% of the way, and for many
applications that's "close enough".

Support for the Unicode grapheme breaking algorithm would get you probably 90%
of the rest of the way. And then some sort of configurable system where
defaults were based on the locale would probably get you a fairly complete
grapheme-based text library.
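As a taste of what such a library would do, here is a deliberately simplified
sketch using only the stdlib. It groups combining marks with their base
character, but it is nowhere near the full UAX #29 rules (no emoji modifiers,
ZWJ sequences, or Hangul jamo), so treat it as an approximation only:

```python
import unicodedata

def simple_graphemes(s):
    """Yield clusters of one base character plus its trailing combining marks.

    A rough approximation of UAX #29 grapheme cluster breaking.
    """
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch):
            cluster += ch          # attach combining mark to its base
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster

nfd = unicodedata.normalize("NFD", "café")
assert len(nfd) == 5                                              # five code points...
assert list(simple_graphemes(nfd)) == ["c", "a", "f", "e\u0301"]  # ...four "characters"
```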

I'm interested in such a thing. That's why I pointed out the issue on the bug
tracker, to try to garner interest in it. As far as I can tell, you seem to be
more interested in cheap point scoring, digs against Unicode, and an insistence
that UTF-8 is better than strings (which doesn't even make sense).


> As you have pointed out, that wish is impossible to grant 
> unambiguously.

I never said that. Just because it is *difficult*, and no one answer will
satisfy everyone all of the time, doesn't mean we can't solve the problem.




-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.