Glyphs and graphemes [was Re: Cult-like behaviour]

Chris Angelico rosuav at gmail.com
Tue Jul 17 04:49:51 EDT 2018


On Tue, Jul 17, 2018 at 6:27 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
> It is essential for people to understand that the very same issues that
> plague UTF-8 plague UTF-32 as well. Using UTF in both highlights that
> fact.

What wonderful nonsense. I suppose that the same issues plague Elon
Musk as plague the musk sticks in the sweets aisle in the supermarket
- they do use the same letters, after all.

>> If by "very many things", you mean "not very many things", I agree
>> with you. In my experience, dealing with code points is "good enough",
>> especially if you use Western European alphabets, and even more so if
>> you're willing to do a normalization step before processing text.
>
> Of course, UTF-8 doesn't relieve you from Unicode problems. But it has
> one big advantage: it can usually deal with non-Unicode data without any
> extra considerations while Python3's strings make you have to take
> elaborate measures to handle those special cases. Why, even print() must
> be guarded against UnicodeEncodeError when the printed string is not in
> the programmer's control.

What is this "non-Unicode data" that UTF-8 can handle? Do you mean
arbitrary byte sequences? Because no, it cannot; properly-formed UTF-8
sequences MUST comply with the precise requirements of the format.
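
To make that concrete, here is a minimal sketch using only the built-in
strict UTF-8 codec. The helper name is_valid_utf8 and the sample byte
strings are my own illustrations, not anything from the earlier posts:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical samples chosen to exercise different ways of being malformed.
samples = {
    "plain ASCII": b"hello",
    "two-byte sequence (U+00E9)": b"caf\xc3\xa9",
    "lone continuation byte": b"\x80",
    "truncated two-byte sequence": b"\xc3",
    "overlong encoding of '/'": b"\xc0\xaf",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
    "byte 0xFF": b"\xff",
}

results = {label: is_valid_utf8(raw) for label, raw in samples.items()}

for label, ok in results.items():
    print(f"{label}: {'valid' if ok else 'invalid'}")
```

Only the first two samples should come out as valid; the rest are byte
sequences that a strict decoder refuses, which is the point: arbitrary
bytes are not automatically UTF-8.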

Can you give an example of how Python 3's print function can raise
UnicodeEncodeError when given a Python 3 string?

ChrisA


