[Python-ideas] Proposal for default character representation

Greg Ewing greg.ewing at canterbury.ac.nz
Fri Oct 14 05:36:35 EDT 2016


Mikhail V wrote:

> if "\u1230" <= c <= "\u123f":
> 
> and:
> 
> o = ord (c)
> if 100 <= o <= 150:

Note that, if need be, you could also write that as

   if 0x64 <= o <= 0x96:

> So yours is valid code, but for me it's freaky,
> and I will surely stick to the second variant.

The thing is, where did you get those numbers from in
the first place?

If you got them in some way that gives them to you
in decimal, such as print(ord(c)), there is nothing
to stop you from writing them as decimal constants
in the code.

But if you got them e.g. by looking up a character
table that gives them to you in hex, you can equally
well put them in as hex constants. So there is no
particular advantage either way.
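
To be concrete, either spelling ends up as the same integer,
so the program itself can't tell the difference -- a quick
sketch, using an arbitrary character 'e' as the example:

   c = "e"                   # arbitrary example character
   o = ord(c)                # 101

   print(o)                  # 101  -- decimal, as print(ord(c)) shows it
   print(hex(o))             # 0x65 -- hex, as a character table shows it

   # so these two checks are identical:
   print(100 <= o <= 150)    # True
   print(0x64 <= o <= 0x96)  # True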

> You said I can better see which Unicode page
> I am in by looking at the hex ordinal, but I hardly
> need that, I just need to know one integer, namely
> where some range begins, that's it.
> Furthermore, this is the code which an average
> programmer would better read and maintain.

To a maintainer who is familiar with the layout of
the Unicode code space, the hex representation of
a character is likely to have some meaning, whereas
the decimal representation will not. So for that
person, using decimal would make the code *harder*
to maintain.
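
For example (an arbitrary Cyrillic character, just to illustrate
the point):

   ch = "Ж"
   print(ord(ch))        # 1046  -- tells you little by itself
   print(hex(ord(ch)))   # 0x416 -- obviously inside the Cyrillic
                         #          block at 0x0400-0x04FF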

To a maintainer who doesn't have that familiarity,
it makes no difference either way.

So your proposal would result in a *decrease* of
maintainability overall.

> if I make a mistake or typo, or want to expand the range
> by some value, I need to do sum and subtract
> operations in my head to progress with my code effectively.
> Is it clear now what I mean by
> conversions back and forth?

Yes, but in my experience the number of times I've
had to do that kind of arithmetic with character codes
is very nearly zero. And when I do, I'm more likely to
get the computer to do it for me than work out the
numbers and then type them in as literals. I just
don't see this as being anywhere near being a
significant problem.
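
For example, rather than adjust a boundary by hand, I'd just ask
the interpreter (made-up boundary values here, borrowing the range
from the quoted code above):

   start = 0x1230
   end = 0x123f

   new_end = end + 0x10      # widen the range by 16 code points
   print(hex(new_end))       # 0x124f

   # or skip the literals entirely and compare characters directly:
   c = "\u1234"
   print("\u1230" <= c <= chr(new_end))   # True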

> In standard ASCII
> there are enough glyphs that would work way better
> together,

Out of curiosity, what glyphs do you have in mind?

> ұұ-ұ ---- ---- ---ұ
> 
> you can downscale the strings, so a 16-bit
> value would be ~60 pixels wide

Yes, you can make the characters narrow enough that
you can take 4 of them in at once, almost as though
they were a single glyph... at which point you've
effectively just substituted one set of 16 glyphs
for another. Then you'd have to analyse whether the
*combined* 4-element glyphs were easier to distinguish
from each other than the ones they replaced. Since
the new ones are made up of repetitions of just two
elements, whereas the old ones contain a much more
varied set of elements, I'd be skeptical about that.

BTW, your choice of ұ because of its "peak readability"
seems to be a case of taking something out of context.
The readability of a glyph can only be judged in terms
of how easy it is to distinguish from other glyphs.
Here, the only thing that matters is distinguishing it
from the other symbol, so something like "|" would
perhaps be a better choice.

||-| ---- ---- ---|
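
(To make the comparison concrete, here's a throwaway sketch that
prints an integer in that two-symbol form; the helper and its
output format are invented purely for illustration:

   def two_symbol(n, width=16):
       # "|" for a set bit, "-" for a clear bit, grouped in fours
       bits = format(n, "0%db" % width).replace("1", "|").replace("0", "-")
       return " ".join(bits[i:i+4] for i in range(0, width, 4))

   print(two_symbol(0xd001))   # ||-| ---- ---- ---|
   print(hex(0xd001))          # 0xd001, the same value in hex
)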

> So if you are more
> than 40 years old (sorry for some familiarity)
> this can be a really strong issue and unfortunately
> hardly changeable.

Sure, being familiar with the current system means that
it would take me some effort to become proficient with
a new one.

What I'm far from convinced of is that I would gain any
benefit from making that effort, or that a fresh person
would be noticeably better off if they learned your new
system instead of the old one.

At this point you're probably going to say "Greg, it's
taken you 40 years to become that proficient in hex.
Someone learning my system would do it much faster!"

Well, no. When I was about 12 I built a computer whose
only I/O devices worked in binary. From the time I first
started toggling programs into it to the time I had the
whole binary/hex conversion table burned into my neurons
was maybe about 1 hour. And I wasn't even *trying* to
memorise it, it just happened.
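
(The whole "table" is only sixteen rows, for what it's worth;
a throwaway loop prints it:

   for n in range(16):
       print(format(n, "04b"), format(n, "X"))

giving 0000 0, 0001 1, ... 1111 F.)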

> It is not about speed, it is about brain load.
> Chinese can read their hieroglyphs fast, but
> the cognitive load on the brain is 100 times higher
> than with the current Latin set.

Has that been measured? How?

This one sets off my skepticism alarm too, because
people that read Latin scripts don't read them a
letter at a time -- they recognise whole *words* at
once, or at least large chunks of them. The number of
English words is about the same order of magnitude
as the number of Chinese characters.

> I know people who can read bash scripts
> fast, but would you claim that bash syntax can be
> any good compared to Python syntax?

For the things that bash was designed to be good for,
yes, it can. Python wins for anything beyond very
simple programming, but bash wasn't designed for that.
(The fact that some people use it that way says more
about their dogged persistence in the face of
adversity than it does about bash.)

I don't doubt that some sets of glyphs are easier to
distinguish from each other than others. But the
letters and digits that we currently use have already
been pretty well optimised by scribes and typographers
over the last few hundred years, and I'd be surprised
if there's any *major* room left for improvement.

Mixing up letters and digits is certainly jarring to
many people, but I'm not sure that isn't largely just
because we're so used to mentally categorising them
into two distinct groups. Maybe there is some objective
difference that can be measured, but I'd expect it
to be quite small compared to the effect of these
prior "habits" as you call them.

-- 
Greg
