Time we switched to unicode? (was Explanation of this Python language feature?)

Chris Angelico rosuav at gmail.com
Tue Mar 25 02:12:50 EDT 2014


On Tue, Mar 25, 2014 at 4:47 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> On Tue, 25 Mar 2014 14:57:02 +1100, Chris Angelico wrote:
>> No, I'm not missing that. But the human brain is a tokenizer, just as
>> Python is. Once you know what a token means, you comprehend it as that
>> token, and it takes up space in your mind as a single unit. There's not
>> a lot of readability difference between a one-symbol token and a
>> one-word token.
>
> Hmmm, I don't know about that. Mathematicians are heavy users of symbols.
> Why do they write ∀ instead of "for all", or ⊂ instead of "subset"?
>
> Why do we write "40" instead of "forty"?

Because the shorter symbols lend themselves better to the
"super-tokenization" where you don't read the individual parts but the
whole. The difference between "40" and "forty" is minimal, but the
difference between "86400" and "eighty-six thousand [and] four
hundred" is significant; the first is a single token, which you could
then instantly recognize as the number of seconds in a day (leap
seconds aside), but the second is a lengthy expression.
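
(As an aside, in actual code you'd often get the best of both by spelling
out the arithmetic once; a minimal sketch, with a constant name I've made
up purely for illustration:

    # illustrative name; 24 hours * 60 minutes * 60 seconds
    SECONDS_PER_DAY = 24 * 60 * 60  # == 86400

so the reader gets the derivation and the single recognizable token.)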

There's also ease of writing. On paper or blackboard, it's really easy
to write little strokes and curvy lines to mean things, and to write a
bolded letter R to mean "Real numbers". In Python, it's much easier to
use a few more ASCII letters than to write ⊂ ℝ.

>> Also, since the human brain works largely with words,
>
> I think that's a fairly controversial opinion. The Chinese might have
> something to say about that.

Well, all the people I interviewed (three of them: me, myself, and I)
agree that the human brain works with words. My research is 100%
scientific, and is therefore unassailable. So there. :)

> I think that heavy use of symbols is a form of Huffman coding -- common
> things should be short, and uncommon things longer. Mathematicians tend
> to be *extremely* specialised, so they're all inventing their own Huffman
> codings, and the end result is a huge number of (often ambiguous) symbols.

Yeah. That's about the size of it. Usually, each symbol has some read
form; "ℕ ⊂ ℝ" would be read as "Naturals are a subset of Reals" (or
maybe "Naturals is a subset of Reals"?), and in program code, using
the word "subset" or "issubset" wouldn't be much worse. It would be
some worse, and the exact cost depends on how frequently your code
does subset comparisons; my view is that the worseness of words is
less than the worseness of untypable symbols. (And I'm about to be
arrested for murdering the English language.)
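
For reference, the word forms are already pretty compact in Python, since
the built-in set type supports both a method and operator spelling; a
minimal sketch with throwaway set names:

    a = {1, 2}
    b = {1, 2, 3}
    a.issubset(b)   # True -- the word form
    a <= b          # True -- subset-or-equal, operator form
    a < b           # True -- proper subset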

> Personally, I think that it would be good to start accepting, but not
> requiring, Unicode in programming languages. We can already write:
>
> from math import pi as π
>
> Perhaps we should be able to write:
>
> setA ⊂ setB

It would be nice, if subset testing is considered common enough to
warrant it. (I'm not sure it is, but I'm certainly not sure it isn't.)
But it violates "one obvious way". Python doesn't, as a general rule,
offer us two ways of spelling the exact same thing. So the bar for
inclusion would be quite high: it has to be so much better than the
alternative that it justifies the creation of a duplicate notation.
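
(The identifier half of this already works, for what it's worth: Python 3
accepts non-ASCII names under PEP 3131, so the "pi as π" import is legal
today; it's only new operators like ⊂ that would need a language change.
A quick sketch:

    # Unicode *identifiers* are already legal (PEP 3131, Python 3):
    from math import pi as π
    area = π * 2 ** 2

    # Unicode *operators* are not; uncommenting this line is a SyntaxError:
    # setA ⊂ setB

So the question really is whether the operator is worth a second spelling.)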

ChrisA
