Grapheme clusters, a.k.a. real characters

Rick Johnson rantingrickjohnson at gmail.com
Sun Jul 16 12:46:07 EDT 2017


On Sunday, July 16, 2017 at 10:41:02 AM UTC-5, Rustom Mody wrote:
> On Sunday, July 16, 2017 at 8:10:41 PM UTC+5:30, Rick Johnson wrote:
> > On Sunday, July 16, 2017 at 2:55:57 AM UTC-5, Marko Rauhamaa wrote:
> > > Mikhail V :
> > > > On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:

[...]

> > Mikhail is referring to the claims made earlier in this
> > thread that accents are themselves distinct characters.
> > Which i think is utter hooey. For instance, some folks
> > here would wish for len("á") to return 2. Does that seem
> > reasonable?
> 
> $ python
> Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 12:22:00) 
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> len("á")
> 1
> >>> len("á")
> 2
> 
> Shall we stipulate it to be 1.5? [¿ Maybe 1½ ?]

Well, heck. If we are to wade into the fraction weeds as it
relates to "character decorations" (aka: accents), we should
at least be realistic about it. For instance, the bounding
box of that *AHEM* "speck of dirt" (aka: accent) above the
"a" is hardly half the size of the bounding box that
contains the "a" itself. If i were to guess, i would say
something around 0.1-ish of a "real character". So if we are
to accept your implementation, `len("á")` would return ~1.1.
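
For anyone puzzled by the two different answers in the quoted session: the
most likely explanation is that one string holds the precomposed code point
U+00E1 and the other holds a plain "a" followed by the combining acute accent
U+0301. The sketch below spells both out with escapes so the difference is
visible. The helper `approx_grapheme_len` is a hypothetical name made up for
illustration. It only skips combining marks and is NOT a full grapheme
cluster segmentation (it ignores emoji sequences, Hangul jamo and so on).

```python
import unicodedata

composed = "\u00e1"     # LATIN SMALL LETTER A WITH ACUTE, one code point
decomposed = "a\u0301"  # "a" + COMBINING ACUTE ACCENT, two code points

# len() counts code points, so the two spellings disagree.
print(len(composed), len(decomposed))

# They render alike but do not compare equal until normalized.
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)


def approx_grapheme_len(text):
    """Rough count of user-perceived characters (hypothetical helper).

    Assumption: a code point with a non-zero canonical combining class
    attaches to the preceding base character, so it is not counted.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


print(approx_grapheme_len(composed), approx_grapheme_len(decomposed))
```

Normalizing to NFC before calling len() is usually enough for accented Latin
text, but it will not help where no precomposed form exists.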


