Grapheme clusters, a.k.a. real characters

Tue Jul 18 23:57:52 EDT 2017

On Mon, 17 Jul 2017 04:12 am, Ben Finney wrote:

> Steven D'Aprano <steve at pearwood.info> writes:
> 
>> On Sun, 16 Jul 2017 12:33:10 +1000, Ben Finney wrote:
>>
>> > And yet the ASCII and Unicode standard says code point 0x0A (U+000A
>> > LINE FEED) is a character, by definition.
>> [...]
>> > > Is an acute accent a character?
>> > 
>> > Yes, according to Unicode. ‘´’ (U+0301 ACUTE ACCENT) is a character.
>>
>> Do you have references for those claims?
> 
> The Unicode Standard <URL:http://www.unicode.org/versions/Unicode10.0.0/>
> frequently uses “character” as the unit of semantic value that Unicode
> deals in. See the “Contents” table for many references.
> 
> In §2.2 under the sub-heading “Characters, Not Glyphs” it defines the
> term, and thereafter uses “character” in a way that includes all such
> units, even formatting codes.

Thanks for that. TIL something new.

I'm not sure whether I had misunderstood, whether the standard has changed, or
possibly a combination of both, but I recall the Unicode Consortium previously
being very reticent about giving a formal definition for the term "character".

Even now, they do seem to prefer to use "character" in the sense of an abstract
character, not necessarily something that ordinary users of a language will
recognise as a character or letter. E.g. they include control codes, variation
selectors, diacritic marks on their own with no base, and more.
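
To make that concrete, here is a rough sketch using Python's standard
unicodedata module. The helper name describe() and the sample code points are
my own choices for illustration, not anything taken from the standard. It
reports the general category and name of a few code points of the kinds listed
above, then shows that one user-perceived character can be two code points.

```python
import unicodedata

def describe(ch):
    """Return (code point, general category, name) for a single code point."""
    # Control codes have no name in the character database, so pass a
    # fallback instead of letting unicodedata.name() raise ValueError.
    name = unicodedata.name(ch, "<no name>")
    return "U+%04X" % ord(ch), unicodedata.category(ch), name

samples = [
    "\n",      # a control code (LINE FEED)
    "\u0301",  # a combining diacritic on its own, with no base
    "\ufe0f",  # a variation selector
    "a",       # something everybody agrees is a letter
]
for ch in samples:
    print(describe(ch))

# "e" followed by a combining acute accent looks like one character to a
# reader, but it is two code points until normalised to the composed form.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))
```

Note that len() counts code points in both cases, not grapheme clusters. The
standard library has no grapheme cluster segmentation that I know of, so
counting "real characters" in general needs a third-party library.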

Unicode defines exactly 66 noncharacters:

http://www.unicode.org/faq/private_use.html#noncharacters
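
The count is easy to check by hand. As I read that FAQ, the noncharacters are
the contiguous range U+FDD0..U+FDEF plus the last two code points of each of
the 17 planes. A small sketch, where the function name noncharacters() is just
something I made up for the purpose:

```python
def noncharacters():
    """Yield every Unicode noncharacter code point as an integer."""
    # A contiguous block of 32 code points, U+FDD0 through U+FDEF.
    yield from range(0xFDD0, 0xFDF0)
    # The last two code points of each of the 17 planes: U+nFFFE and U+nFFFF.
    for plane in range(17):
        base = plane << 16
        yield base | 0xFFFE
        yield base | 0xFFFF

nonchars = list(noncharacters())
print(len(nonchars))  # 32 + 2*17 = 66
```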

I found the table on page 30 here:

http://www.unicode.org/versions/Unicode10.0.0/ch02.pdf#G25564

very useful. That helped to clarify my thinking.

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.