Micro Python -- a lean and efficient implementation of Python 3

Steven D'Aprano steve+comp.lang.python at pearwood.info
Wed Jun 4 09:51:42 EDT 2014


On Wed, 04 Jun 2014 12:53:19 +0100, Robin Becker wrote:

> I believe that we should distinguish between glyph/character indexing
> and string indexing. Even in unicode it may be hard to decide where a
> visual glyph starts and ends. I assume most people would like to assign
> one glyph to one unicode, but that's not always possible with composed
> glyphs.
> 
>  >>> for a in (u'\xc5',u'A\u030a'):
> ... 	for o in (u'\xf6',u'o\u0308'):
> ... 		u=a+u'ngstr'+o+u'm'
> ... 		print("%s %s" % (repr(u),u))
> ...
> u'\xc5ngstr\xf6m' Ångström
> u'\xc5ngstro\u0308m' Ångström
> u'A\u030angstr\xf6m' Ångström
> u'A\u030angstro\u0308m' Ångström
> >>> u'\xc5ngstr\xf6m'==u'\xc5ngstro\u0308m'
> False
> 
> so even unicode doesn't always allow for O(1) glyph indexing.

What you're talking about here is "graphemes", not glyphs. Glyphs are the 
little pictures that represent the characters when written down. 
Graphemes (technically, "grapheme clusters") are the things which native 
speakers of a language believe ought to be considered a single unit. 
Think of them as similar to letters. Determining grapheme boundaries can 
be quite tricky, and depends on the language being spoken. The letters 
"ch" are considered two letters in English, but only a single letter in 
Czech and Slovak.

I believe that *grapheme-aware* text processing is *far* too complicated 
for a programming language to promise. If you think that len() needs to 
count graphemes, then what should len("ch") return, 1 or 2? Grapheme 
processing is a complex task best left up to powerful libraries built on 
top of a sturdy Unicode base.
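To make the distinction concrete, here is a small sketch (in Python 3 
syntax, so no u'' prefixes needed) showing that len() counts code points, 
not graphemes, and that the standard library's unicodedata module can at 
least normalize the two spellings of Å to the same code-point sequence:

```python
import unicodedata

# Two ways to write the single grapheme "Å": one precomposed code
# point, or "A" followed by a combining ring above (U+030A).
precomposed = '\xc5'        # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = 'A\u030a'      # "A" + COMBINING RING ABOVE

# len() counts code points, not graphemes -- both render as one letter.
print(len(precomposed))     # 1
print(len(decomposed))      # 2

# At the code-point level the strings compare unequal...
print(precomposed == decomposed)    # False

# ...but canonical composition (NFC normalization) maps both to the
# same code-point sequence, so they compare equal afterwards.
nfc_a = unicodedata.normalize('NFC', precomposed)
nfc_b = unicodedata.normalize('NFC', decomposed)
print(nfc_a == nfc_b)       # True
```

Normalization handles canonical equivalence, but it is not full grapheme 
awareness: counting or iterating over grapheme clusters in the general 
case still needs the kind of dedicated library mentioned above.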

> I know this is artificial, 

But it isn't artificial in the least. Unicode isn't complicated because 
it's badly designed, or complicated for the sake of complexity. It's 
complicated because human language is complicated. That, and because of 
legacy encodings.


> but this is the same situation as utf8 faces just
> the frequency of occurrence is different. A very large amount of
> computing is still western centric so searching a byte string for latin
> characters is still efficient; searching for an n with a tilde on top
> might not be so easy.

This is a good point, but on balance I disagree. A grapheme-aware library 
is likely to need to be based on more complex data structures than simple 
strings (arrays of code points). But for the underlying relatively simple 
string library, graphemes are too hard. Code points are simple, and the 
language can deal with code points without caring about their semantics. 
For instance, I might not want to insert letters between the q and u of 
"queen", since in English u (nearly) always follows q. It would be 
inappropriate for the programming language string library to care about 
that, and similarly it would be inappropriate for it to care that 
u'A\u030a' represents a single grapheme Å.
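A short sketch (Python 3 syntax) of what that code-point-level view 
means in practice: indexing, length, and substring search are all 
oblivious to grapheme boundaries, so slicing the decomposed spelling of 
"Ångström" can separate a base letter from its combining mark.

```python
# Decomposed spelling of "Ångström": "A" + combining ring, then the
# rest, with a precomposed ö (U+00F6).
s = 'A\u030angstr\xf6m'

# Nine code points, but only eight graphemes (Å is two code points).
print(len(s))           # 9

# Code-point indexing splits the grapheme Å apart:
print(s[0])             # 'A' -- the base letter alone
print(s[1] == '\u030a') # True -- the combining ring above

# Substring search is also purely code-point based: the precomposed
# Å (U+00C5) is not found in the decomposed string.
print('\xc5' in s)      # False
```

The language can do all of this without knowing anything about the 
semantics of the code points involved; deciding that positions 0 and 1 
together form one "letter" is the grapheme-aware layer's job.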



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


