Unicode [was Re: Cult-like behaviour]

Mon Jul 16 06:26:23 EDT 2018

On Sun, 15 Jul 2018 17:39:55 -0700, Jim Lee wrote:

> On 07/15/18 17:18, Steven D'Aprano wrote:
>> On Sun, 15 Jul 2018 16:08:15 -0700, Jim Lee wrote:
>>
>>> Python3 is intrinsically tied to Unicode for string handling.
>>> Therefore, the Python programmer is forced to deal with it (in all but
>>> trivial cases), rather than given a choice.  So I don't understand how
>>> I can illustrate my point with Python code since Python won't let me
>>> deal with strings without also dealing with Unicode.
>> Nonsense.
>>
>> b"Look ma, a Python 2 style ASCII string."
>>
>>
> As I said, all but trivial cases.
> 
> Do you consider separating Unicode strings from byte strings, having to
> decode and encode from one to the other, 

If you use nothing but byte strings, you don't need to separate the non-
existent text strings from the byte strings, nor do you need to decode or 
encode.
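
For instance (a minimal sketch; the variable names are mine, purely for
illustration), the usual string-ish methods work directly on bytes, with
no decode or encode step anywhere:

```python
data = b"Look ma, a Python 2 style ASCII string."

words = data.split()       # a list of bytes objects
shouted = data.upper()     # ASCII letters are upper-cased, still bytes
has_ma = b"ma" in data     # substring test, bytes against bytes
```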

> and knowing which
> functions/methods accept one, the other, or both as arguments, 

That's certainly a real complication, if I may stretch the meaning of the 
word "complication" beyond breaking point. Surely you are already having 
to read the documentation of the function to learn what arguments it 
takes, and what types they are (int or float, list or iterator, 'r' or 
'a', etc). If someone can't deal with the question of "unicode or bytes" 
as well, then perhaps they ought to consider a career change to something 
less demanding, like politics.

If, as you insinuate, all your data is 100% ASCII, then you have nothing 
to fear. Just treat 

    str(bytes_obj, 'ASCII')
    bytes(str_obj, 'ASCII')

as the equivalent of a cast or coercion, and you won't go wrong. (Of 
course, in 2018, the number of applications that can truly say all their 
data is pure ASCII is vanishingly small.)
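
As a rough illustration (the sample values are made up), the two "casts"
round-trip cleanly for ASCII data, and they fail loudly rather than
silently if a non-ASCII byte sneaks in:

```python
raw = b"plain ASCII data"

text = str(raw, 'ASCII')      # same result as raw.decode('ASCII')
back = bytes(text, 'ASCII')   # same result as text.encode('ASCII')

# A byte outside the ASCII range is rejected instead of being guessed at.
try:
    str(b"caf\xe9", 'ASCII')
    failure = None
except UnicodeDecodeError as err:
    failure = err
```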

Or use Latin-1, if you want to do the most simple-minded thing that you 
can to make errors go away, without caring about correctness.
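
A small sketch of what that trade-off looks like, assuming for the sake
of the example that the bytes really were UTF-8: Latin-1 never raises,
and it round-trips every possible byte, but the text you get back may be
garbage.

```python
utf8_bytes = "caf\u00e9".encode("utf-8")   # b'caf\xc3\xa9'

wrong = utf8_bytes.decode("latin-1")       # no exception, but two junk chars
right = utf8_bytes.decode("utf-8")         # the text that was actually meant

# Latin-1 maps each byte 0..255 to one code point, so nothing is ever lost.
every_byte = bytes(range(256))
round_trip = every_byte.decode("latin-1").encode("latin-1")
```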

But the thing is, that complexity is *inherent in the domain*. You can
try to deal with it without Unicode, but as soon as you have users
expecting to use more than one code page, you're doomed.
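
To make the code page problem concrete, here is a sketch using three
legacy encodings that ship with Python (picked arbitrarily as examples).
The same single byte is a different character in each one, and nothing
in the byte itself tells you which was intended:

```python
mystery = b"\xe9"   # one byte, with no label saying where it came from

as_western = mystery.decode("cp1252")    # Windows Western European
as_cyrillic = mystery.decode("cp1251")   # Windows Cyrillic
as_dos = mystery.decode("cp437")         # original IBM PC code page

all_different = len({as_western, as_cyrillic, as_dos}) == 3
```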

> as "not dealing with Unicode"?  I don't.

Frankly, I do.

Dealing with all the vagaries of human text *is* complicated; that's the
nature of the beast. Handling Unicode properly can be as complex as
handling floating point arithmetic properly.

(But neither of those is even in the same ballpark as dealing with the
complexities of *not* using Unicode: legacy code pages and encodings are
a nightmare to deal with.)

Nevertheless, just as casual users can go a very, very long way just
treating floats as the real numbers we learn about in school, trusting
that IEEE-754 semantics will make their answers "close enough", so the
casual user can go a very long way ignoring the complexities of Unicode,
so long as they control their own data and know what it is.
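
The float half of that analogy, as a quick sketch: the answer is not
exactly what school arithmetic says, but it is close enough that a casual
user comparing with a tolerance never notices.

```python
import math

total = 0.1 + 0.2

exactly_equal = (total == 0.3)            # False: binary floats are inexact
close_enough = math.isclose(total, 0.3)   # True with the default tolerance
```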

If you don't know what your data is, then you're doomed, Unicode or no 
Unicode. (If you don't think that's a problem, if you think that "just 
treat text as octets" works, then people like you are the reason there is 
so much mojibake in the world, screwing it up for the rest of us.)
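
Here is a minimal sketch of how "just treat text as octets" goes wrong,
assuming UTF-8 data (the sample word is my own). Slicing by byte count
cuts a character in half, and what is left is no longer valid text:

```python
text = "na\u00efve"               # five characters
octets = text.encode("utf-8")     # six bytes: the i-diaeresis takes two
chopped = octets[:3]              # b'na\xc3', not "the first three characters"

try:
    chopped.decode("utf-8")
    broke = False
except UnicodeDecodeError:
    broke = True

# Papering over the error leaves a U+FFFD replacement character behind.
patched = chopped.decode("utf-8", errors="replace")
```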

-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson