Grapheme clusters, a.k.a. real characters

Marko Rauhamaa marko at pacujo.net
Fri Jul 14 04:53:26 EDT 2017


Chris Angelico <rosuav at gmail.com>:

> On Fri, Jul 14, 2017 at 6:15 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> Furthermore, you only dismissed my question about
>>
>>    len(text)
>>
>> What about
>>
>>    text[-1]
>>    re.match("a.c", text)
>
> The considerations and concerns in the second half of my paragraph -
> the bit you didn't quote - directly address these two.

I guess you refer to:

   These kinds of linguistic considerations shouldn't be codified into
   the core of the language.

Then, why bother with Unicode to begin with? Why not just use bytes?
After all, Python 3's strings have the very same pitfalls:

  - you don't know the length of a text in characters

  - chr(n) doesn't return a character

  - you can't easily find the 7th character in a piece of text

  - you can't compare the equality of two pieces of text

  - you can't use a piece of text as a reliable dict key

etc.
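
A minimal sketch of several of the pitfalls above, assuming the text in
question contains "é" written once as a single precomposed code point and
once as "e" followed by a combining accent. The variable names are only
illustrative, and the normalization step at the end is one possible
workaround, not something this thread prescribes:

```python
import re
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# len() counts code points, not what a reader perceives as characters.
length_composed = len(composed)
length_decomposed = len(decomposed)

# chr(n) can return a combining mark, which is not a character on its own.
is_combining_mark = unicodedata.category(chr(0x301)) == "Mn"

# text[-1] yields the trailing code point, here the bare accent.
last_item = decomposed[-1]

# "." in a regular expression matches one code point, so "a.c" does not
# match "a", decomposed "é", "c".
dot_match = re.match("a.c", "a" + decomposed + "c")

# The two spellings render identically but compare unequal, so they also
# behave as different dict keys.
equal_raw = composed == decomposed
lookup = {composed: "found"}.get(decomposed)

# Normalizing both sides to NFC makes this particular pair compare equal.
equal_nfc = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
```

Normalization only helps where a precomposed form exists; it does not turn
str indexing into indexing by grapheme cluster.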


Marko


