Cult-like behaviour [was Re: Kindness]

Marko Rauhamaa marko at pacujo.net
Sun Jul 15 17:28:39 EDT 2018


Terry Reedy <tjreedy at udel.edu>:

> On 7/15/2018 7:37 AM, Marko Rauhamaa wrote:
>> One of the classic Unix and Internet tenets is that text is bytes is
>> text.
>
> Tenets of a faith may be wrong ;-).  An informatic paradigm from more
> than 45 years ago may be outdated and in need of revision.
>
> [...]
>
>> Of course, much of it was naïve, but UTF-8 has miraculously given
>> it a new life.  
>
> UTF-8 makes 'bytes is text' even less true. Not only are some leading
> bytes not text, but some byte sequences are illegal. Bytes are not
> UTF-8 text. As n increases, the probability that a string of n random
> bytes will be utf-8 text approaches 0 faster than interpreting the
> same bytes as Latin1.
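
The probability claim above is easy to check empirically. Here is a
minimal sketch, assuming strict decoding with Python's built-in codecs;
the helper names is_valid_utf8 and valid_fraction are made up for
illustration and the trial count is arbitrary:

```python
import random


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def valid_fraction(n: int, trials: int = 2000, seed: int = 0) -> float:
    """Estimate the share of random n-byte strings that are valid UTF-8."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    hits = 0
    for _ in range(trials):
        data = bytes(rng.getrandbits(8) for _ in range(n))
        hits += is_valid_utf8(data)
    return hits / trials


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 64):
        print(n, valid_fraction(n))
    # Latin-1 assigns a character to every byte value, so this never fails.
    print(len(bytes(range(256)).decode("latin-1")))
```

For n = 1 the estimate should land near one half, since only bytes
0x00-0x7F are valid on their own, and it drops off quickly as n grows.
Latin-1, by contrast, accepts every byte string.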

Yes, but Linux and the Internet are my bread and butter (and more). The
45-year-old axioms still hold, whatever complications they lead to. If
you wanted to change that, you'd have to build your system from the
ground up.

Windows, BTW, isn't that system, nor is macOS. They made some moves in
that direction, but made some missteps along the way. And beware: if
your new system used Python 3's UTF-32-style code point strings as a
foundation, that would be an equally naïve misstep. You'd need to reach
a notch higher and use glyphs or other "semiotic atoms" as building
blocks. UTF-32, after all, is in effect a variable-width encoding: one
glyph can take several code points.
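
To illustrate that last point, here is a minimal sketch using only the
standard library. The sample strings are my own picks, and the
"one glyph" reading of each depends on the font and renderer:

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: one glyph, two code points.
decomposed = "e\u0301"
# NFC normalization folds the pair into the precomposed U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Two REGIONAL INDICATOR symbols (F, I), typically drawn as a single
# flag. No precomposed form exists, so NFC leaves them alone.
flag = "\U0001F1EB\U0001F1EE"


def utf32_units(s: str) -> int:
    """Number of 32-bit units in the UTF-32 encoding of s (no BOM)."""
    return len(s.encode("utf-32-le")) // 4


if __name__ == "__main__":
    for label, s in (("decomposed", decomposed),
                     ("composed", composed),
                     ("flag", flag)):
        names = [unicodedata.name(ch) for ch in s]
        print(label, len(s), utf32_units(s), names)
```

len() and the UTF-32 unit count agree with each other, but neither
agrees with the number of glyphs a reader would see.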


Marko


