Grapheme clusters, a.k.a. real characters

Fri Jul 14 10:54:48 EDT 2017

On Sat, Jul 15, 2017 at 12:32 AM, Michael Torrie <torriem at gmail.com> wrote:
> On 07/14/2017 08:05 AM, Rhodri James wrote:
>> On 14/07/17 14:31, Marko Rauhamaa wrote:
>>> Of course, UTF-8 in a bytes object doesn't make the situation any
>>> better, but does it make it any worse?
>>
>> Speaking as someone who has been up to his elbows in this recently, I
>> would say emphatically that it does make things worse.  It adds an extra
>> layer of complexity to all of the questions you were asking, and more.
>> A single codepoint is a meaningful thing, even if its meaning may be
>> modified by combining.  A single byte may or may not be meaningful.
>
> Are you saying that dealing with Unicode in Google Go, which uses UTF-8
> in memory, is adding an extra layer of complexity and makes things worse
> than they might be in Python?

Can you reverse a string in Go? How do you do it?

With Python, you can sometimes get tripped up, e.g. if you have:

* combining characters
* Arabic letters, which can look very different when reordered
* explicit directionality markers

But the semantics are at least easy to comprehend: you have a strict
reversal of code point order. So you can reverse a string for parsing
purposes, and then re-reverse the subsections.
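
A minimal sketch of what that looks like in Python 3, assuming a str
holding "e" plus a combining acute accent, then "x" (the sample string
is my own choice for illustration):

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT, then "x".
s = "e\u0301x"

# Slicing with a step of -1 is a strict code point reversal.
rev = s[::-1]

# The accent now follows "x" instead of "e", so it attaches to the
# wrong letter when rendered.
names = [unicodedata.name(c) for c in rev]

# The operation is still well defined and lossless: reversing again
# restores the original string exactly.
assert rev[::-1] == s
```

The visual result may be wrong, but nothing is corrupted, which is what
makes the reverse, parse, re-reverse trick workable.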

If you have a UTF-8 bytestring, a naive reversal will trip you up if
you have *any* non-ASCII values in there. You will have invalid UTF-8.
So *at the very least*, your "reverse string" code has to be UTF-8 aware -
it has to keep continuation bytes with the correct start byte. And you
*still* have all the concerns that Python has.
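
To make that concrete, here is a rough sketch in Python using a bytes
object to stand in for a UTF-8 string. The helper name reverse_utf8 is
hypothetical, and it assumes the input is already valid UTF-8; it is one
possible way to do this, not how Go or any particular library does it.

```python
def reverse_utf8(data: bytes) -> bytes:
    """Reverse valid UTF-8 data one code point at a time, without decoding."""
    out = bytearray()
    end = len(data)
    # Walk backwards; every byte that is not a continuation byte
    # (bit pattern 10xxxxxx) starts a new encoded code point.
    for i in range(len(data) - 1, -1, -1):
        if data[i] & 0xC0 != 0x80:
            # Copy the start byte together with its continuation bytes.
            out += data[i:end]
            end = i
    return bytes(out)


data = "na\u00efve".encode("utf-8")  # one non-ASCII character is enough

# Naive byte reversal puts a continuation byte ahead of its start byte,
# so the result no longer decodes.
try:
    data[::-1].decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# The UTF-8-aware version matches a code point reversal of the text.
aware = reverse_utf8(data).decode("utf-8")
```

Even this version only gets you back to where Python's str reversal
already was: combining characters and directionality markers still end
up in the wrong place.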

Extra complexity. QED.

ChrisA