Ah Python, you have spoiled me for all other languages

Steven D'Aprano steve at pearwood.info
Sun Jun 7 12:28:35 EDT 2015


On Mon, 8 Jun 2015 12:58 am, random832 at fastmail.us wrote:

> On Sun, Jun 7, 2015, at 07:42, Steven D'Aprano wrote:
>> The question of graphemes (what "ordinary people" consider letters and
>> characters, e.g. "ch" is two letters to an English speaker but one letter
>> to a Czech speaker) should be left to libraries.
> 
> Do Czech speakers expect to be able to select and delete it as a single
> unit and never have the cursor in the middle of it?

You'd have to ask one. I expect the answer is No, because they're used to
using software written by English speakers who think that "ch" is two
letters.

Whether they would *like* to stick the cursor between the c and the h is a
different question to whether they would *expect* it.

There may even be words where "ch" counts as two letters, where the "c" is
at the end of one syllable but the "h" is the beginning of the next.
(That's certainly the case for Dutch "ij".) Natural language is *hard*.

But generally speaking, I expect that when Czech speakers are playing (say)
Scrabble, they would want to have a tile called "CH" which they can play as
a single letter.


> If not, then this is 
> not really fundamentally the same thing as what we have with combining
> characters or certain sequences of Indic letters.

I'll have to take your word for that.

> 
> Also, "should be left to libraries" isn't really a coherent statement
> when we are talking about the design of the standard *library*.

The language offers a certain view of strings, which is reflected in the
methods that strings have, and built-in functions that operate on strings.
Should len('ch') return 1 or 2? If you think that the language should treat
strings as sequences of graphemes, then you will answer "sometimes 1".
Maybe there is a global setting for the locale:

LANG = 'Cz'
len('ch')
=> returns 1

or an optional parameter that you can pass to len:

len('ch', lang='Cz')
=> returns 1

len('ch', lang='En')
=> returns 2

But if you think that the language should treat strings as sequences of code
points, as I do, then there's only one reasonable thing for len('ch') to
return, and that is 2. But *some library* (as opposed to the built-in str
type) can offer a grapheme view of strings:

from language_tools import Graphemes
g = Graphemes.fromstr('ch', lang='Cz', exceptions=['xchx', 'ychy'])
len(g)
=> returns 1
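
To make that concrete, here is a minimal sketch of what such a library class
might look like. Everything in it is an assumption made for illustration:
language_tools and Graphemes are hypothetical names, the DIGRAPHS table is a
toy that knows only about Czech "ch", and the exceptions handling is the
simplest thing that could work (a listed word is split code point by code
point). The 'Cz' tag just follows the example above; the ISO 639-1 code for
Czech is actually 'cs'.

```python
import re


class Graphemes:
    """Hypothetical grapheme view of a str; not a real library API."""

    # Assumed, deliberately tiny table of multi-letter graphemes per language.
    # The 'Cz' and 'En' tags follow the example above, not any standard.
    DIGRAPHS = {'Cz': ('ch',), 'En': ()}

    def __init__(self, units):
        self._units = tuple(units)

    @classmethod
    def fromstr(cls, text, lang='En', exceptions=()):
        # Words listed in `exceptions` are split code point by code point.
        if text.lower() in {w.lower() for w in exceptions}:
            return cls(text)
        digraphs = cls.DIGRAPHS.get(lang, ())
        if not digraphs:
            return cls(text)
        # Try the digraphs first, then fall back to any single code point.
        alternatives = [re.escape(d) for d in digraphs] + ['.']
        pattern = re.compile('|'.join(alternatives), re.IGNORECASE | re.DOTALL)
        return cls(pattern.findall(text))

    def __len__(self):
        return len(self._units)

    def __iter__(self):
        return iter(self._units)


print(len('ch'))                                    # 2 code points
print(len(Graphemes.fromstr('ch', lang='Cz')))      # 1 grapheme under Czech rules
print(list(Graphemes.fromstr('chata', lang='Cz')))  # ['ch', 'a', 't', 'a']
```

The built-in len('ch') stays at 2 no matter what; only the wrapper object
changes its answer depending on the language it is told to assume.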

Do you still think this is incoherent?



-- 
Steven



