Grapheme clusters, a.k.a. real characters

Fri Jul 14 08:32:15 EDT 2017

On Fri, Jul 14, 2017 at 8:59 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Chris Angelico <rosuav at gmail.com>:
>
>> On Fri, Jul 14, 2017 at 6:53 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
>>> Chris Angelico <rosuav at gmail.com>:
>>> Then, why bother with Unicode to begin with? Why not just use bytes?
>>> After all, Python3's strings have the very same pitfalls:
>>>
>>>   - you don't know the length of a text in characters
>>>   - chr(n) doesn't return a character
>>>   - you can't easily find the 7th character in a piece of text
>>
>> First you have to define "character".
>
> I'm referring to the
>
>     Grapheme clusters, a.k.a. real characters

Okay. Just as long as you know that that's not the only valid definition.
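
To make the competing definitions concrete, here is a minimal sketch of
the three pitfalls listed above. The sample strings and the helper
rough_clusters() are assumptions for illustration, not anything from
the thread. The helper only glues combining marks onto the preceding
code point; it is not the full Unicode text segmentation algorithm
(UAX #29) and will mis-split things like emoji sequences.

```python
import unicodedata

# Illustrative strings (assumptions, not from the thread).
composed = "\u00e9"      # "é" as a single code point, U+00E9
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# len() counts code points, so the two spellings have different lengths.
len_composed = len(composed)
len_decomposed = len(decomposed)

# chr() returns one code point, which may be a bare combining mark
# rather than anything a reader would call a character.
mark = chr(0x301)
mark_is_combining = unicodedata.combining(mark) != 0


def rough_clusters(text):
    """Approximate grapheme clusters (hypothetical helper).

    Attaches each combining mark to the code point before it.
    This is a rough stand-in for UAX #29, not an implementation of it.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "re\u0301sume\u0301"        # "résumé" spelled with combining accents
code_points = len(word)            # counts code points
clusters = rough_clusters(word)    # counts approximate "real characters"
```

Indexing word[1] gives a plain "e", while clusters[1] gives the "e"
together with its accent, which is the difference the "7th character"
complaint is about.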

>> Yes, you can. For most purposes, textual equality should be defined in
>> terms of NFC or NFD normalization. Python already gives you that. You
>> could argue that a string should always be stored NFC (or NFD, take
>> your pick), and then the equality operator would handle this; but I'm
>> not sure the benefit is worth it.
>
> As I said, Python3's strings are neither here nor there. They don't
> quite solve the problem Python2's strings had. They will push the
> internationalization problems a bit farther out but fall short of the
> mark.
>
> The developer still has to worry a lot. Unicode seemingly solved one
> problem only to present the developer with a bagful of new problems.
>
> And if Python3's strings are a half-measure, why not stick to bytes?

Python's float type can't represent all possible non-integer values.
If it's such a half-measure, why not stick to integers and do all your
own fraction handling?
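
A small sketch of the float limitation being used as the analogy here.
The specific values are assumptions chosen for illustration, and
fractions.Fraction stands in for "doing your own fraction handling".

```python
from fractions import Fraction

# 0.1, 0.2 and 0.3 have no exact binary floating-point representation,
# so the float sum does not compare equal to the float literal 0.3.
float_sum_matches = (0.1 + 0.2 == 0.3)

# Exact rational arithmetic on integer numerators and denominators
# gets the expected answer, at the cost of more work.
fraction_sum_matches = (
    Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)
)
```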

>> If you're trying to use strings as identifiers in any way (say, file
>> names, or document lookup references), using the NFC/NFD normalized
>> form of the string should be sufficient.
>
> Show me ten Python3 database applications, and I'll show you ten Python3
> database applications that don't normalize their primary keys.

I don't have ten open source ones handy, but I can tell you for sure
that I've worked with far more than ten that don't NEED to normalize
their primary keys. Why? Because they are *by definition* normal
already. Mostly because they use integers for keys. Tada!
Normalization is unnecessary.
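
For the cases where keys really are strings, here is a minimal sketch
of the NFC approach mentioned earlier. The helper normalize_key() and
the sample strings are hypothetical, chosen for illustration; a plain
dict stands in for whatever lookup table or database is in use.

```python
import unicodedata


def normalize_key(key):
    """Return the NFC form of a string key (hypothetical helper)."""
    return unicodedata.normalize("NFC", key)


# Two spellings of the same text (assumed sample data).
composed = "caf\u00e9"      # precomposed U+00E9
decomposed = "cafe\u0301"   # "e" plus U+0301 COMBINING ACUTE ACCENT

# The built-in equality operator compares code points, so these differ.
raw_equal = (composed == decomposed)

# After normalizing both sides, they compare equal.
normalized_equal = normalize_key(composed) == normalize_key(decomposed)

# Normalizing on both insert and lookup keeps the table consistent.
table = {normalize_key(composed): 1}
found = normalize_key(decomposed) in table
```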

ChrisA