Grapheme clusters, a.k.a. real characters

Marko Rauhamaa marko at pacujo.net
Fri Jul 14 04:15:54 EDT 2017


Chris Angelico <rosuav at gmail.com>:

> On Fri, Jul 14, 2017 at 4:30 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> When people use Unicode, they are expecting to be able to deal in real
>> characters. I would expect:
>>
>>    len(text)               to give me the length in characters
>>    text[-1]                to evaluate to the last character
>>    re.match("a.c", text)   to match a character between a and c
>>
>> So the question is, should we have a third type for text. Or should the
>> semantics of strings be changed to be based on characters?
>
> What is the length of a string? How often do you actually care about
> the number of grapheme clusters - and not, for example, about the
> pixel width?

A good question. In the past I have argued that the string should be a
special data type for specialist text-processing needs.

However, I have recently been fooling around with a character-graphics
game, and even professionally I use character-based alignment quite
often. Consider, for example, a Python source code editor, where you
typically want to limit line length by the number of characters rather
than by the number of pixels.

Furthermore, you only dismissed my question about

   len(text)

What about

   text[-1]
   re.match("a.c", text)
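
To make the question concrete, here is a minimal sketch of what Python 3
does today with a decomposed "é" (an "e" followed by U+0301 COMBINING
ACUTE ACCENT). The helper approx_clusters is a hypothetical name used
only for illustration, not a standard-library function. It merely glues
combining marks onto the preceding code point; real grapheme cluster
segmentation (UAX #29) covers more cases, such as emoji sequences and
Hangul jamo, so treat it as a rough approximation.

```python
import re
import unicodedata

text = "cafe\u0301"          # "café" spelled with e + U+0301 (combining acute)

# Today's str operations count code points, not characters as a reader sees them.
code_point_len = len(text)                 # 5, although a reader sees 4 characters
last_code_point = text[-1]                 # the bare combining accent, not "é"
dot_match = re.match("a.c", "ae\u0301c")   # None: "." consumes only the "e"


def approx_clusters(s):
    """Group each combining mark with the code point before it.

    Hypothetical helper for illustration; an approximation only, not full
    UAX #29 grapheme cluster segmentation.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to its base character
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


clusters = approx_clusters(text)
char_len = len(clusters)            # 4
last_char = clusters[-1]            # "e\u0301", i.e. the whole "é"
```

Normalizing with unicodedata.normalize("NFC", text) happens to bring
len() down to 4 here, but only because a precomposed "é" exists; that is
not true of every base/mark combination.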


Marko


