Grapheme clusters, a.k.a. real characters

Ben Finney ben+python at benfinney.id.au
Sun Jul 16 14:12:50 EDT 2017


Steven D'Aprano <steve at pearwood.info> writes:

> On Sun, 16 Jul 2017 12:33:10 +1000, Ben Finney wrote:
>
> > And yet the ASCII and Unicode standard says code point 0x0A (U+000A
> > LINE FEED) is a character, by definition.
> [...]
> > > Is an acute accent a character?
> > 
> > Yes, according to Unicode. ‘´’ (U+0301 ACUTE ACCENT) is a character.
>
> Do you have references for those claims?

The Unicode Standard <URL:http://www.unicode.org/versions/Unicode10.0.0/>
frequently uses “character” as the unit of semantic value that Unicode
deals in. See the “Contents” table for many references.

In §2.2 under the sub-heading “Characters, Not Glyphs” it defines the
term, and thereafter uses “character” in a way that includes all such
units, even formatting codes.

See §2.11 “Combining Characters” for a definition that includes accent
characters like U+0301:

    Combining Characters. Characters intended to be positioned relative
    to an associated base character are depicted in the character code
    charts above, below, or through a dotted circle.

The standard even uses the terms “control codes” and “format characters”
to refer to code points with a functional purpose and no glyph
representation, such as U+000A LINE FEED.
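
For what it's worth, Python's own view of these code points can be
checked with the standard `unicodedata` module. This is only a rough
sketch, not anything taken from the Standard; the helper name `describe`
is made up for illustration. It reports the general category and name of
U+0301 and U+000A, and shows that a base letter plus a combining accent
is two code points until it is normalised.

```python
import unicodedata

def describe(ch):
    """Return (code point label, general category, name or None)."""
    label = "U+%04X" % ord(ch)
    # unicodedata.name() raises ValueError for code points that have no
    # name (control codes such as U+000A), so pass a default instead.
    return label, unicodedata.category(ch), unicodedata.name(ch, None)

accent = describe("\u0301")
line_feed = describe("\n")

# A base letter followed by a combining accent is two code points ...
decomposed = "e\u0301"
# ... which NFC normalisation composes into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(accent)      # ('U+0301', 'Mn', 'COMBINING ACUTE ACCENT')
print(line_feed)   # ('U+000A', 'Cc', None)
print(len(decomposed), len(composed))  # 2 1
```

Python reports U+0301 as a nonspacing mark (“Mn”) and U+000A as a
control code (“Cc”); `len()` on a `str` counts code points, not grapheme
clusters.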

> Because I'm pretty sure that Unicode is very, very careful to never
> use the word "character" in a formal or normative manner, only as an
> informal term for "the kinds of things that regular folk consider
> letters or characters or similar".

I don't know whether you consider the Core Specification document to be
speaking in a “formal or normative manner”. Either way, that doesn't
affect my point that Unicode does define “character” and it includes all
code points in that definition.

If you're going to disqualify anything that isn't in a “formal or
normative manner” from what we're allowed to infer the Unicode Standard
is telling us is a character, then you're going to have to either
disregard most of the Core Specification document, or allow it as formal
and/or normative.

> And I don't think regular folks would know what a line feed was if it
> jumped out of their computer and bit them :-)

Are we talking about definitions, or are we talking about what regular
folks would know?

Regular folks know that “fish” has meaning, but I wouldn't want to try
matching that regular-folk knowledge with a definition of what a “fish”
is and is not. Quite frequently, a definition useful for a formal
standard is *not* coterminous with what regular folk will think is in or
out of that definition.

-- 
 \       “I have said to you to speak the truth is a painful thing. To |
  `\          be forced to tell lies is much worse.” —Oscar Wilde, _De |
_o__)                                                 Profundis_, 1897 |
Ben Finney



