Grapheme clusters, a.k.a. real characters

Rustom Mody rustompmody at gmail.com
Sun Jul 16 11:40:34 EDT 2017


On Sunday, July 16, 2017 at 8:10:41 PM UTC+5:30, Rick Johnson wrote:
> On Sunday, July 16, 2017 at 2:55:57 AM UTC-5, Marko Rauhamaa wrote:
> > Mikhail V :
> > > On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:
> > > >
> > > > Random access to code points is as uninteresting as
> > > > random access to UTF-8 bytes. I might want random access
> > > > to the "Grapheme clusters, a.k.a. real characters".
> > >
> > > What _real_ characters are you referring to? If your data
> > > has "á" (U+00E1), then it is one real character; if you
> > > have "a" (U+0061) and "ˊ" (U+02CA), then it is _two_ real
> > > characters. So in both cases you have access to code
> > > points = real characters.
> > 
> > It's true that confusion is caused by the ambiguity of the
> > term "character."
> > 
> > > For metaphysical discussion - in _my_ definition there is
> > > no such "real" character as "á", since it is the "a" glyph
> > > with some dirt, so according to my definition, it should
> > > be two separate characters, both semantically and
> > > technically seen.
> > 
> > Here's the problem: when the human user types in "á" (with
> > one, two or three keyclicks), they don't know how the
> > computer represents it internally. The Unicode standard
> > allows for two *equivalent* code point sequences (<URL:
> > https://en.wikipedia.org/wiki/Unicode_equivalence>). When
> > the computer outputs the sequence, the visible result is
> > the single letter "á". The human user doesn't know—or
> > care—about the internal representation.
> 
> *EXACTLY*. But your statement is far too general. Not only
> need not the _human_user_ be concerned with these low level
> aspects of strings, but the _programmer_ need not be concerned
> either. The programmer should only see strings from a
> practical standpoint:
>     
>     "Can i index the chars within them?"
> 
>     "Can i determine the length of them?"
>     
>     "Can i slice and dice and combine them?"
>     
>     "Can i trust that the character positions will maintain
>     order?"
>     
>     "Can i, and my target users, display them in a human
>     readable form using various rendering specifications defined
>     by graphic designers (aka: font-o-philes)?"
>     
> If the answer to all of these questions is *YES*, then you
> know all you need to know about strings. Now get to work!!!
> 
> > The user's expectation is that the visible letter "á"
> > should behave like any other single letter. For example, a
> > text editor should move the cursor past it with a single
> > click of a left or right arrow key. Also, if I perform a
> > regular-expression search in the editor and look for
> > 
> >    Alv[aá]rez
> > 
> > I should get a match with either Alvarez or Alvárez.
> 
> While what you say is relevant to _text_editors_ and sub
> string searching tools, you have wandered beyond the topic
> we are discussing here, which is practical interfacing
> between a programmer and his/her strings. How a text editor
> handles strings is irrelevant to a programmer. Unless of
> course we are writing a custom text editor
> ourselves. In which case we can be the BDFL for a day, or
> two. *wink*
> 
> > > And, in my definition, the whole Unicode is a huge
> > > junkyard, to start with.
> > 
> > I don't think anybody denies that. However, it's the best
> > thing available and—more importantly—a universally accepted
> > standard.
> > 
> > > But opinions may vary, and in case you prefer, or are
> > > forced, to write "á", then it can be impractical to store
> > > it as two characters, regardless of encoding.
> > 
> > Now I'm not following you.
> 
> Mikhail is referring to the claims made earlier in this
> thread that accents are themselves distinct characters.
> Which i think is utter hooey. For instance, some folks here
> would wish for len("á") to return 2. Does that seem
> reasonable?

$ python
Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 12:22:00) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> len("á")
1
>>> len("á")
2

Shall we stipulate it to be 1.5? [¿ Maybe 1½ ?]
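
For anyone who wants to reproduce the session above without depending on
how a mail client renders the two strings, here is a minimal sketch using
escape sequences and the standard unicodedata module. It assumes the first
string is the precomposed U+00E1 and the second is "a" followed by U+0301
(COMBINING ACUTE ACCENT); that pairing is an assumption, not something the
session itself shows. The helper name approx_cluster_count is made up for
illustration, and counting non-combining code points is only a rough
stand-in for real grapheme cluster segmentation.

```python
import re
import unicodedata

# Assumption: these are the two spellings of "á" behind the session above.
precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE, one code point
decomposed = "a\u0301"   # "a" + COMBINING ACUTE ACCENT, two code points

print(len(precomposed), len(decomposed))   # 1 2
print(precomposed == decomposed)           # False: different code point sequences

# Normalization maps canonically equivalent sequences onto one form.
nfc = unicodedata.normalize("NFC", decomposed)    # composed form
nfd = unicodedata.normalize("NFD", precomposed)   # decomposed form
print(nfc == precomposed, nfd == decomposed)      # True True


def approx_cluster_count(s):
    """Hypothetical helper: count code points that are not combining marks.

    This only approximates user-perceived characters; it ignores emoji
    sequences, Hangul jamo, and other cases a real segmenter handles.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


print(approx_cluster_count(precomposed), approx_cluster_count(decomposed))  # 1 1

# The regex from the quoted text, with the precomposed letter in the class.
# It matches only after the searched text is normalized to NFC.
pattern = "Alv[a\u00e1]rez"
name_decomposed = "Alva\u0301rez"
match_raw = re.search(pattern, name_decomposed)
match_nfc = re.search(pattern, unicodedata.normalize("NFC", name_decomposed))
print(match_raw is None, match_nfc is not None)   # True True
```

So len() is counting code points in both cases; whether the answer "should"
be 1 depends on normalizing first or on counting clusters instead.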


