Grapheme clusters, a.k.a. real characters

Rustom Mody rustompmody at gmail.com
Mon Jul 17 00:10:37 EDT 2017


On Monday, July 17, 2017 at 6:58:57 AM UTC+5:30, Steve D'Aprano wrote:
> On Mon, 17 Jul 2017 01:40 am, Rustom Mody wrote:
> 
> > On Sunday, July 16, 2017 at 8:10:41 PM UTC+5:30, Rick Johnson wrote:
> [...] 
> > $ python
> > Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 12:22:00)
> > [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
> > Type "help", "copyright", "credits" or "license" for more information.
> >>>> len("á")
> > 1
> >>>> len("á")
> > 2
> > 
> > Shall we stipulate it to be 1.5? [¿ Maybe 1½ ?]
> 
> Please don't feed the trolls. 

It's usually called a 'joke', Steven! Did the word fall out of your dictionary
in the last upgrade?
Rick was no more trolling than Marko or you or Chris or Mikhail or anyone else.
If anyone's trolling it's me…  len("á") == 1½ is so obviously nonsense on so
many levels that I did not think
"And now ladies (are there any?) and gentlemen I am going to tell a joke!"
would be necessary.

On a more serious note, every other post on this thread (as on many discussing
Unicode more broadly) is so ridiculously Euro (or Anglo) centric I would not
know where to begin.
Witness your own…

> If you have to respond to Ranting Rick, at least
> write something sensible that people following this thread might learn from,
> instead of encouraging his nonsense.
> 
> I don't believe for a second you seriously would like len(some_string) to
> return '1½', but just in case anyone is taking that proposal seriously, that
> would break backwards compatibility. len() must return an int, not a float, a
> complex number, or a string.
> 
> If you want to know the length of a string *in bytes*, you have to encode it to
> bytes first, using some specific encoding, then call len() on those bytes.
> 
> If you want to know the length of a string *in code points*, then just call
> len() on the string.
> 
> If you want to know the height or width of a string in pixels in some specific
> font, see your GUI toolkit.
> 
> If you want to know the length of a string in "characters" (graphemes), well,
> Python doesn't have a built-in function to do that, or a standard library
> solution. Yet.
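
For anyone following along, the first two of those ifs can be sketched as below.
This is only an illustration: the names `composed` and `decomposed` are mine,
and the two spellings are assumed to be the ones behind the len("á") results
quoted earlier (1 and 2).

```python
import unicodedata

composed = "\u00e1"      # á as one precomposed code point (U+00E1)
decomposed = "a\u0301"   # a followed by U+0301 COMBINING ACUTE ACCENT

# Length in code points: what len() reports on a str.
print(len(composed), len(decomposed))        # 1 2

# Length in bytes: depends on the encoding you pick.
print(len(composed.encode("utf-8")))         # 2
print(len(decomposed.encode("utf-8")))       # 3
print(len(composed.encode("utf-16-le")))     # 2

# NFC normalization composes the pair, so the two spellings compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
```

Normalizing to NFC makes this particular example come out as 1, but only
because a precomposed á happens to exist in Unicode.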

You've given 4 ifs.
A user of language L would assume that the atomic units of language L would
be supported.  Your 4th if suggests that's OK. Is it?
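
A rough sketch of why the 4th if matters outside Europe. The helper
`approx_grapheme_count` is hypothetical and deliberately crude: it just skips
code points whose general category is a Mark (M*). It is NOT the Unicode
grapheme-cluster algorithm, which the standard library does not provide.

```python
import unicodedata

def approx_grapheme_count(s):
    """Crude approximation: count code points that are not combining marks.

    Hypothetical helper for illustration only; not a UAX #29 implementation.
    """
    return sum(1 for ch in s if not unicodedata.category(ch).startswith("M"))

# Devanagari "namaste": NA, MA, SA, VIRAMA, TA, VOWEL SIGN E
word = "\u0928\u092e\u0938\u094d\u0924\u0947"

print(len(word))                                  # 6 code points
print(len(unicodedata.normalize("NFC", word)))    # 6: no precomposed forms to fall back on
print(approx_grapheme_count(word))                # 4 with this crude rule
print(approx_grapheme_count("a\u0301"))           # 1
```

Neither 6 nor 4 need match what a reader of the language counts: the conjunct
formed by SA, VIRAMA, TA and the vowel sign may well be seen as a single unit,
which would give 3.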

Hint1: Ask your grandmother whether Unicode's notion of character makes sense.
Ask 10 gmas from 10 language-L's.
Hint2: When in doubt, gma is usually right.

PS Claims such as Euro (or some other) centrism usually imply a corresponding
call for "rights", "equality", etc.
No such politically correct call is being made or implied (by me).
There never was equality in the world; there never will be.


