Grapheme clusters, a.k.a. real characters

Rick Johnson rantingrickjohnson at gmail.com
Sun Jul 16 10:40:14 EDT 2017


On Sunday, July 16, 2017 at 2:55:57 AM UTC-5, Marko Rauhamaa wrote:
> Mikhail V <mikhailwas at gmail.com>:
> > On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:
> > >
> > > Random access to code points is as uninteresting as
> > > random access to UTF-8 bytes. I might want random access
> > > to the "Grapheme clusters, a.k.a.real characters".
> >
> > What _real_ characters are you referring to? If your data
> > has "á" (U00E1), then it is one real character, if you
> > have "a" (U0061) and "ˊ" (U02CA) then it is _two_ real
> > characters. So in both cases you have access to code
> > points = real characters.
> 
> It's true that confusion is caused by the ambiguity of the
> term "character."
> 
> > For metaphysical discussion - in _my_ definition there is
> > no such "real" character as "á", since it is the "a" glyph
> > with some dirt, so according to my definition, it should
> > be two separate characters, both semantically and
> > technically seen.
> 
> Here's the problem: when the human user types in "á" (with
> one, two or three keyclicks), they don't know how the
> computer represents it internally. The Unicode standard
> allows for two *equivalent* code point sequences (<URL:
> https://en.wikipedia.org/wiki/Unicode_equivalence>). When
> the computer outputs the sequence, the visible result is
> the single letter "á". The human user doesn't know—or
> care—about the internal representation.
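
For anyone who wants to see the two equivalent sequences in the
interpreter, here is a minimal sketch using the standard
unicodedata module. The variable names are just for illustration.
Note that it uses the combining accent U+0301; the U+02CA mentioned
above is a spacing modifier letter and, as far as I know, does not
compose with a preceding "a".

```python
import unicodedata

composed = "\u00e1"      # "á" as one code point (U+00E1)
decomposed = "a\u0301"   # "a" followed by COMBINING ACUTE ACCENT (U+0301)

# The two strings render the same but compare unequal as code point sequences.
same_raw = composed == decomposed

# Normalizing both to one form (NFC = composed, NFD = decomposed) makes them equal.
same_nfc = unicodedata.normalize("NFC", decomposed) == composed
same_nfd = unicodedata.normalize("NFD", composed) == decomposed

print(same_raw, same_nfc, same_nfd)  # False True True
```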

*EXACTLY*. But your statement is far too narrow. Not only
should the _human_user_ be spared these low-level aspects of
strings; the _programmer_ should be spared them too. The
programmer should only have to view strings from a practical
standpoint:
    
    "Can I index the chars within them?"

    "Can I determine the length of them?"

    "Can I slice and dice and combine them?"

    "Can I trust that the character positions will maintain
    order?"

    "Can I, and my target users, display them in a human
    readable form using various rendering specifications defined
    by graphic designers (aka: font-o-philes)?"
    
If the answer to all of these questions is *YES*, then you
know all you need to know about strings. Now get to work!!!

> The user's expectation is that the visible letter "á"
> should behave like any other single letter. For example, a
> text editor should move the cursor past it with a single
> click of a left or right arrow key. Also, if I perform a
> regular-expression search in the editor and look for
> 
>    Alv[aá]rez
> 
> I should get a match with either Alvarez or Alvárez.
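
A rough sketch of how that search behaves with Python's re module.
The helper search_nfc is a made-up name, and normalizing to NFC
before matching is only one possible approach, not something the
re module does for you.

```python
import re
import unicodedata

# Pattern written with the precomposed "á" (U+00E1) in the character class.
pattern = re.compile("Alv[a\u00e1]rez")

def search_nfc(compiled, text):
    """Hypothetical helper: normalize text to NFC, then search."""
    return compiled.search(unicodedata.normalize("NFC", text))

plain = "Alvarez"
composed_name = "Alv\u00e1rez"     # á as one code point
decomposed_name = "Alva\u0301rez"  # a + COMBINING ACUTE ACCENT

print(pattern.search(plain) is not None)            # True
print(pattern.search(composed_name) is not None)    # True
print(pattern.search(decomposed_name) is not None)  # False
print(search_nfc(pattern, decomposed_name) is not None)  # True
```

The pattern itself would need the same normalization if it might
have been typed in decomposed form.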

While what you say is relevant to _text_editors_ and
substring-searching tools, you have wandered beyond the topic
we are discussing here, which is the practical interface
between a programmer and his/her strings. How a text editor
handles strings is irrelevant to a programmer, unless of
course we are writing a custom text editor ourselves. In
which case we can be the BDFL for a day, or two. *wink*

> > And, in my definition, the whole Unicode is a huge
> > junkyard, to start with.
> 
> I don't think anybody denies that. However, it's the best
> thing available and—more importantly—a universally accepted
> standard.
> 
> > But opinions may vary, and in case you prefer or forced to
> > write "á", then it can be impractical to store it as two
> > characters, regardless of encoding.
> 
> Now I'm not following you.

Mikhail is referring to the claims made earlier in this
thread that accents are themselves distinct characters,
which I think is utter hooey. For instance, some folks here
would wish for len("á") to return 2. Does that seem
reasonable?
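
For reference, a small sketch of what len() reports in Python 3,
depending on how the "á" happens to be stored. The function
approx_grapheme_len is a hypothetical helper that simply skips
combining marks; it is a crude approximation and not the full
grapheme cluster segmentation defined by Unicode (it ignores
things like emoji sequences and Hangul jamo).

```python
import unicodedata

def approx_grapheme_len(text):
    """Hypothetical helper: count code points that are not combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

composed = "\u00e1"      # "á" as a single code point
decomposed = "a\u0301"   # "a" + COMBINING ACUTE ACCENT

print(len(composed))     # 1
print(len(decomposed))   # 2

# With combining marks skipped, both spellings count as one "character".
print(approx_grapheme_len(composed))    # 1
print(approx_grapheme_len(decomposed))  # 1
```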



