Grapheme clusters, a.k.a. real characters

Chris Angelico rosuav at gmail.com
Fri Jul 14 04:30:01 EDT 2017


On Fri, Jul 14, 2017 at 6:15 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Chris Angelico <rosuav at gmail.com>:
>
>> On Fri, Jul 14, 2017 at 4:30 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
>>> When people use Unicode, they are expecting to be able to deal in real
>>> characters. I would expect:
>>>
>>>    len(text)               to give me the length in characters
>>>    text[-1]                to evaluate to the last character
>>>    re.match("a.c", text)   to match a character between a and c
>>>
>>> So the question is, should we have a third type for text. Or should the
>>> semantics of strings be changed to be based on characters?
>>
>> What is the length of a string? How often do you actually care about
>> the number of grapheme clusters - and not, for example, about the
>> pixel width?
>
> A good question. I have in the past argued that the string should be a
> special data type for the specialist text processing needs.
>
> However, I happen to have fooled around with a character-graphics based
> game in recent days, and even professionally, I use character-based
> alignment quite often. Consider, for example, a Python source code
> editor where you want to limit the length of the line based on the
> number of characters more typically than based on the number of pixels.
>
> Furthermore, you only dismissed my question about
>
>    len(text)
>
> What about
>
>    text[-1]
>    re.match("a.c", text)

The considerations and concerns in the second half of my paragraph -
the bit you didn't quote - directly address these two.
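
To make the three operations under discussion concrete, here is a minimal
sketch of how Python 3 strings treat them, assuming a sample string whose
final letter is written as "e" plus U+0301 COMBINING ACUTE ACCENT. The
sample strings and variable names are illustrative only. NFC normalization
happens to help here because a precomposed form exists; that is not true
of every grapheme cluster.

```python
import re
import unicodedata

# "café" spelled with a combining accent: 5 code points, 4 visible characters.
decomposed = "cafe\u0301"
# NFC folds e + U+0301 into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# len() counts code points, not grapheme clusters.
len_decomposed = len(decomposed)   # 5
len_composed = len(composed)       # 4

# Indexing returns one code point: the bare combining mark in the first case.
last_decomposed = decomposed[-1]   # "\u0301"
last_composed = composed[-1]       # "\u00e9"

# "." in a regex matches a single code point, so it consumes only the "e"
# and the following combining mark fails to match "c".
sample = "ae\u0301c"
match_decomposed = re.match("a.c", sample)                                # None
match_composed = re.match("a.c", unicodedata.normalize("NFC", sample))    # a match
```

All three operations work on code points, so the answers change with the
normalization form even though the rendered text looks the same.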

ChrisA


