Grapheme clusters, a.k.a. real characters

Rick Johnson rantingrickjohnson at gmail.com
Sun Jul 16 00:25:20 EDT 2017


On Saturday, July 15, 2017 at 8:54:40 PM UTC-5, MRAB wrote:
> You need to be careful about the terminology.

You are correct. I admit I was a little loose with my
terms there.

> Is linefeed a character? 

Since LineFeed is the same as NewLine: yes, IMO, linefeed is
a character.

> You might call [linefeed] a "control character", but it's
> not really a _character_, it's control/format _code_.

True. 

Allow me to try to define some concrete terms that we can
use.

In the old days, long before I was born, and even long
before I downloaded my first compiler (ah, the memories!),
the concept of strings was so much simpler. Yep, back in
those days all you had were, basically, two discrete sub-
components of a string: the "actual chars" and the "virtual
chars".

    (Disambiguation)

    The "actual chars"[1] are any chars that a programmer could
    insert by pressing a single key on the keyboard, such as:
    "1", "2", "3", "a", "b", "c" , "!", "@", "#" -- etc..    
    
    The "virtual chars" -- or the "control codes" as you put it
    (the ones that start with a "\") -- are the chars
    that represent "structural elements" of the string (f.i. \n,
    \t, etc..). But in reality, the implementation of strings
    has complicated the idea of "virtual chars as solely structural
    elements" of the display, by including such absurdities as:
         
        (1) Sounds ("\a")
        (2) Virtual interactions such as: BackSpace("\b"),
            CarrigeReturn ("\r") and FormFeed ("\f")
             
    intermixed with control codes that constitute _actual_
    structural elements such as:
        
        (1) LineFeed or NewLine ("\n")
        (2) HorizontalTab ("\t")
        (3) VerticalTab ("\v")

    And a few other non-structural codes that allow embedding
    delimiters or hex/octal escapes -- see the quick sketch
    below.
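
A minimal interactive sketch (plain CPython, builtins only)
showing that these escape codes live inside the string as
ordinary one-char elements, yet act as structure only when
rendered:

    >>> s = "col1\tcol2\nrow1\trow2"
    >>> len(s)    # \t and \n each count as one char
    19
    >>> print(s)  # ...but they become structure on display
    col1    col2
    row1    row2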
            
And furthermore, there are two distinct "realms", if I may,
in which a string can exist: the "virtual character realm"
and the "display realm".

    (Disambiguation)
    
    The "virtual character realm" is sort of like an operating
    room where a doctor (aka: programmer) performs operations on
    the patient (aka: string), or if you like, a castle where a
    mad scientist builds a Unicode monster from a hodgepodge
    of body parts he stole from local grave yards and is later
    lynched by a mob of angry peasants for his perceived sins
    against nature. But I digress...
    
    Whereas the "display realm" is sort of like an awards
    ceremony for celebrities, except here, strings take the
    place of strung-out celebs and characters are dressed in the
    over-hyped rags (aka: font) of an overpaid fashion designer.
    
But the two "realms" and two "character types" are but only a
small sample of the syntactical complexity of Python
strings. For we haven't even discussed the many types of
string literals that Python defines. Some include:

    (1) "Normal Strings"
    (2) r"Raw Strings
    (3) b"Byte Strings"
    (4) u"Unicode Strings"
    (5) ru"Raw Unicode"
    (6) ur'Unicode "that is _raw_"'
    (7) f"Format literals"
    ...
        
Whew!
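
For the curious, a minimal Python 3 session exercising a few
of these (in Python 3, u"" is just an alias for "", and ur""
is a SyntaxError):

    >>> r"raw\n" == "raw\\n"   # raw: the backslash survives
    True
    >>> b"bytes" + b"!"        # a bytes object, not a str
    b'bytes!'
    >>> name = "world"
    >>> f"hello {name}"        # format literal (3.6+)
    'hello world'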

IMO, the reason the implementation of strings has been such
a tough nut to crack (Python3000 notwithstanding) is due in
large part to what I call a "syntactical circus".

> Is an acute accent a character? No, it's a diacritic mark
> that's added to a character.

And I agree.

Chris was arguing that zero-width spaces should not be
counted as characters when the `len()` function is applied
to the string, with which I disagree on the basis of
consistency. My first reaction is: "Why would you inject a
char into a string -- even a zero-width char! -- and then
expect that the char should not affect the length of the
string as returned by `len`?"
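
A quick demonstration in CPython 3 (U+200B is ZERO WIDTH
SPACE):

    >>> s = "ab\u200bcd"
    >>> len(s)     # the zero-width char still counts
    5
    >>> s[2]       # ...and is indexable like any other char
    '\u200b'
    >>> print(s)   # yet on most terminals it renders as nothing
    abcd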

Being that strings (at the highest level) are merely linear
arrays of chars, such an assumption defies all logic.
Furthermore, the length of a string (in chars) and the
"perceived" length of a string (when rendered on a screen,
or printed on paper) need not bear any relation to one
another.

When we, as programmers, are manipulating strings (slicing,
munging, concatenating, etc.), our only concern should be
that _every_ char is accessible, indexable, quantifiable,
and will maintain its order. And whether or not a char will
be visible, when rendered on a screen or paper, is
irrelevant to these "programmer-centric" operations.
Rendering is the domain of graphic designers, not software
developers.
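
To make that concrete, a small sketch with the standard
`unicodedata` module, showing that indexing and slicing
operate on codepoints no matter how the string renders:

    >>> import unicodedata
    >>> s = "cafe\u0301"   # "café" built with a combining acute
    >>> len(s)             # five codepoints, four visible glyphs
    5
    >>> s[4]               # the combining mark is just element 4
    '\u0301'
    >>> unicodedata.normalize("NFC", s)  # fold to precomposed form
    'café'
    >>> len(unicodedata.normalize("NFC", s))
    4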

> When you're working with Unicode strings, you're not
> working with strings of characters as such, but with
> strings of 'codepoints', some of which are characters,
> others combining marks, yet others format codes, and so on.

Which is unfortunate for the programmer, who would like to
get things done without a viscous implementation mucking up
the gears.
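
That taxonomy is easy to inspect from Python itself via the
standard `unicodedata` module (Ll = lowercase letter, Mn =
nonspacing mark, Cf = format, Cc = control):

    >>> import unicodedata as ud
    >>> for ch in "e\u0301\u200b\n":
    ...     print(hex(ord(ch)), ud.category(ch))
    ...
    0x65 Ll
    0x301 Mn
    0x200b Cf
    0xa Cc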

[1] Of course, even in the realm of ASCII, there are chars
that cannot be inserted by the programmer _simply_ by
pressing a single key on the keyboard. But most of these
chars were useless anyway, so we will ignore this small
detail for now. One point to mention is that Unicode
greatly increased the number of useless chars.



