Cult-like behaviour [was Re: Kindness]

Ed Kellett e+python-list at kellett.im
Sun Jul 15 13:04:59 EDT 2018


On 2018-07-15 15:52, Steven D'Aprano wrote:
> On Sun, 15 Jul 2018 14:17:51 +0300, Marko Rauhamaa wrote:
> 
>> Steven D'Aprano <steve+comp.lang.python at pearwood.info>:
>>
>>> On Sun, 15 Jul 2018 11:43:14 +0300, Marko Rauhamaa wrote:
>>>> Paul Rubin <no.email at nospam.invalid>:
>>>>> I don't think Go is the answer either, but it probably got strings
>>>>> right.  What is the answer?
>>>
>>> Go strings aren't text strings. They're byte strings. When you say that
>>> Go got them right, that depends on your definition of success.
>>>
>>> If your definition of "success" is:
>>>
>>> - fail to be able to support 80% + of the world's languages
>>>   and a majority of the world's text;
>>
>> Of course byte strings can support at least as many languages as
>> Python3's code point strings and at least equally well.
> 
> You cannot possibly be serious.
> 
> There are 256 possible byte values. China alone has over 10,000 different 
> characters. You can't represent 10,000+ characters using only 256 
> distinct code points.
> 
> You can't even represent the world's languages using 16-bit word-strings 
> instead of byte strings.

I think you're tearing down a straw man here. (So is Marko.)

The byte-string-only argument is to use byte strings containing encoded
text. This does always work. It's just very easy to make mistakes like
double-encoding.
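
To make the double-encoding hazard concrete, here is a minimal sketch.
The sample word and the variable names are illustrative assumptions,
not taken from any real code base; the mistake shown is a layer that
forgets its input is already UTF-8 and encodes it a second time.

```python
# Hypothetical example: text carried around as UTF-8 encoded bytes.
original = "caf\u00e9"                      # "café"
encoded_once = original.encode("utf-8")     # 5 bytes: b'caf\xc3\xa9'

# A later layer forgets the data is already UTF-8, reads each byte as a
# Latin-1 character, and encodes the result again.
encoded_twice = encoded_once.decode("latin-1").encode("utf-8")

# Decoding once now yields mojibake instead of the original text.
mojibake = encoded_twice.decode("utf-8")
print(mojibake)                             # cafÃ©
```

Nothing raises an exception at any step, which is what makes the
mistake easy to miss.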

The "do what Python 3 does" argument is, as I see it, that it's better
to deal with text independently of its encoding, and to convert
explicitly to and from byte representations. I'm very much in favour,
not particularly because it prevents errors (though it does), but
because it saves me from having to manage irrelevant details like the
encoding of the text in question.
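
A rough sketch of that style, assuming a hypothetical function shout()
that receives bytes from somewhere and must return bytes: the encoding
is mentioned only at the two boundaries, and the logic in the middle
works on str without knowing or caring how the text was stored.

```python
def shout(raw: bytes, encoding: str) -> bytes:
    """Hypothetical helper: upper-case some encoded text and add a '!'."""
    text = raw.decode(encoding)       # bytes -> str at the input boundary
    result = text.upper() + "!"       # pure text logic, no encoding involved
    return result.encode(encoding)    # str -> bytes at the output boundary


# The same text logic serves two different byte representations.
utf8_out = shout("na\u00efve".encode("utf-8"), "utf-8")
latin1_out = shout("na\u00efve".encode("latin-1"), "latin-1")

# The bytes differ, but the text they represent is identical.
print(utf8_out != latin1_out)
print(utf8_out.decode("utf-8") == latin1_out.decode("latin-1"))
```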

Imagine if people made the same argument, "byte strings are better than
a representation-independent type", about, say, integers. Using byte
strings instead of integers is great! You can roundtrip any integer and
not care how it's encoded! You can print it to a terminal or a file or
anything without having to pointlessly re-encode it! Okay, so things get
a bit hairy if someone uses hex instead of the obviously-superior
decimal, but nobody does that. And when they do, you can just
bytes.decode('int-hex'). Just remember not to do it more than once, a
famously easy problem in programming that has never bitten anyone ever,
and you're golden. Look at all the problems this solves! Now we can even
parse a file format with integers in it and emit them again without
having to know what encoding the integers are, which doesn't actually
save us from any encoding headaches because we need to figure out the
encoding to work with those integers at all, but will make for good
ammunition against those ridiculous integer zealots.
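
The analogy can be sketched in real Python, with the base of the
digits standing in for the text encoding. The byte values below are
made-up illustrations; int() accepting bytes together with an explicit
base is standard behaviour.

```python
# Two "encodings" of the same integer, as they might appear in a file.
wire_decimal = b"255"
wire_hex = b"ff"

# Same number, different bytes: comparing the raw bytes tells us nothing.
print(wire_decimal == wire_hex)       # False

# To do arithmetic we must know the base, i.e. the "encoding".
n_dec = int(wire_decimal, 10)
n_hex = int(wire_hex, 16)
print(n_dec == n_hex)                 # True

# Guessing the wrong base does not fail here; it silently yields a
# different number, much as a wrong text encoding yields mojibake.
wrong = int(b"10", 16)
print(wrong)                          # 16
```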

On a more serious note, I think this particular aspect of Python causes
quite a lot of difficulty for Python 2 programs that make heavy use of
the bytes-text duality, and quite a lot of peace of mind for every other
case. So, Marko, I don't know what code you work on, but I think it's
unfair to attack Python 3's unicode handling too hard if you haven't
written a new project with it.


