Grapheme clusters, a.k.a. real characters

Marko Rauhamaa marko at pacujo.net
Fri Jul 14 06:59:32 EDT 2017


Chris Angelico <rosuav at gmail.com>:

> On Fri, Jul 14, 2017 at 6:53 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> Chris Angelico <rosuav at gmail.com>:
>> Then, why bother with Unicode to begin with? Why not just use bytes?
>> After all, Python3's strings have the very same pitfalls:
>>
>>   - you don't know the length of a text in characters
>>   - chr(n) doesn't return a character
>>   - you can't easily find the 7th character in a piece of text
>
> First you have to define "character".

I'm referring to the

    Grapheme clusters, a.k.a. real characters
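
To make the first three pitfalls concrete, here is a minimal sketch using
only the standard library. The sample string is my own choice for
illustration; any base letter followed by a combining mark behaves the
same way.

```python
import unicodedata

# "é" written as a base letter plus a combining acute accent (U+0301).
decomposed = "e\u0301"
# The same visible character as a single precomposed code point (U+00E9).
composed = "\u00e9"

# len() counts code points, not grapheme clusters.
decomposed_len = len(decomposed)   # 2 code points, one visible character
composed_len = len(composed)       # 1 code point, one visible character

# Indexing splits the cluster: position 0 is a bare "e" without its accent.
first = decomposed[0]

# chr() can hand back a code point that is not a character on its own.
accent = chr(0x301)
is_combining = unicodedata.combining(accent) != 0
```

Both strings render as one character, yet Python reports different
lengths, and indexing or chr() can yield a fragment of a character.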

>>   - you can't compare the equality of two pieces of text
>>   - you can't use a piece of text as a reliable dict key
>
> (Dict key usage is defined in terms of equality, so these two are the
> same concern.)

Ideally, yes. However, someone might say, "don't use == to test
equality; use unicode.textually_equal() instead". That advice might
satisfy the first requirement but not the second.
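
A rough sketch of the distinction. The textually_equal() function below
is hypothetical (no such function exists in Python); I am assuming it
would compare NFC-normalized forms via unicodedata.normalize().

```python
import unicodedata

def textually_equal(a, b):
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "caf\u00e9"      # "café" with a precomposed é
decomposed = "cafe\u0301"   # "café" as e + combining acute accent

same_text = textually_equal(composed, decomposed)   # the helper says equal
same_str = composed == decomposed                   # == says not equal

# dict lookup relies on hash() and ==, so the helper is never consulted.
table = {composed: 1}
found = decomposed in table
```

Here same_text is true while same_str and found are both false: the
helper fixes comparison, but a dict keyed on the raw string still treats
the two spellings as different keys.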

> Yes, you can. For most purposes, textual equality should be defined in
> terms of NFC or NFD normalization. Python already gives you that. You
> could argue that a string should always be stored NFC (or NFD, take
> your pick), and then the equality operator would handle this; but I'm
> not sure the benefit is worth it.

As I said, Python3's strings are neither here nor there. They don't
quite solve the problem Python2's strings had. They will push the
internationalization problems a bit farther out but fall short of the
mark.

The developer still has to worry a lot. Unicode seemingly solved one
problem only to present the developer with a bagful of new problems.

And if Python3's strings are a half-measure, why not stick to bytes?

> If you're trying to use strings as identifiers in any way (say, file
> names, or document lookup references), using the NFC/NFD normalized
> form of the string should be sufficient.

Show me ten Python3 database applications, and I'll show you ten Python3
database applications that don't normalize their primary keys.
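
What normalizing would have to look like, as a sketch only: a plain dict
stands in for the database table, and normalize_key(), put() and get()
are names I made up for illustration. The choice of NFC over NFD is
arbitrary here.

```python
import unicodedata

def normalize_key(text):
    """Hypothetical helper: one canonical form for every identifier."""
    return unicodedata.normalize("NFC", text)

# A plain dict standing in for a table keyed on a text column.
records = {}

def put(key, value):
    # Normalize on the way in...
    records[normalize_key(key)] = value

def get(key):
    # ...and on the way out, so both spellings reach the same row.
    return records.get(normalize_key(key))

put("cafe\u0301", "row 1")      # stored under the decomposed spelling
row = get("caf\u00e9")          # looked up under the precomposed spelling
```

The lookup only succeeds because every access path goes through
normalize_key(); a single code path that skips it reintroduces the
problem.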


Marko


