PEP 3131: Supporting Non-ASCII Identifiers

Tue May 15 20:46:21 EDT 2007

On Tue, 15 May 2007 20:43:31 +1000, Aldo Cortesi wrote:

> Thus spake Steven D'Aprano (steven at REMOVE.THIS.cybersource.com.au):
> 
>> >> Me, I try to understand a patch by reading it. Call me
>> >> old-fashioned.
>> >
>> > I concur, Aldo.  Indeed, if I _can't_ be sure I understand a patch, I
>> > don't accept it -- I ask the submitter to make it clearer.
>>
>>
>> Yes, but there is a huge gulf between what Aldo originally said he does
>> ("visual inspection") and *reading and understanding the code*.
> 
> Let's set aside the fact that you're guilty of sloppy quoting here,
> since the phrase "visual inspection" is yours, not mine. 

Yes, my bad, I apologize, that was sloppy of me. What you actually said 
was "I can't reliably verify it by eye".

> Regardless,
> your interpretation of my words is just plain dumb. My phrasing was
> intended to draw attention to the fact that one needs to READ code in
> order to understand it. You know - with one's eyes. VISUALLY. And VISUAL
> INSPECTION of code becomes unreliable if this PEP passes.

Notwithstanding my misquote, I find it ... amusing ... that after 
hauling me over the coals for using the term "visual inspection", you're 
not only using it, but shouting it.

Perhaps you aren't aware that doing something "by eye" is idiomatic 
English for doing it quickly, roughly, imprecisely. It is the opposite of 
taking the time and effort to do the job carefully and accurately. If you 
measure something "by eye", you just look at it and take a guess. 

So, as I said, if you're relying on VISUAL INSPECTION (your words _now_) 
you're already vulnerable. Fortunately for you, you're not relying on 
visual inspection, you are actually _reading_ and _comprehending_ the 
code. That might even mean, in extreme cases, you sit down with pencil 
and paper and sketch out the program flow to understand what it is doing.

Now that (I hope!) you understand why I said what I said, can we agree 
that _understanding_ is critical to the process? If you don't understand 
the code, you don't accept it. If somebody submits a patch with 
identifiers like a9472302 and a9473202 you're going to reject it as too 
difficult to understand.

How do non-ASCII identifiers change that situation? What will be 
different?

>> If I've understood Martin's post, the PEP states that identifiers are
>> converted to normal form. If two identifiers look the same, they will
>> be the same.
> 
> I'm sorry to have to tell you, but you understood Martin's post no
> better than you did mine. There is no general way to detect homoglyphs
> and "convert them to a normal form". Observe:
> 
> import unicodedata
> print repr(unicodedata.normalize("NFC", u"\u2160"))
> print u"\u2160"
> print "I"

Yes, I observe two very different glyphs, as different as the ASCII 
characters I and |. What do you see?
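
For anyone who wants to poke at this themselves, here is a minimal sketch of 
what normalization does and does not merge. It is not taken from the PEP: the 
helper same_after() and the choice of test characters are my own, and NFKC is 
included only for comparison with the NFC call in the quoted snippet. It is 
written to run on Python 3.

```python
import unicodedata

# Test characters chosen for illustration (not from the PEP).
ROMAN_ONE = u"\u2160"    # ROMAN NUMERAL ONE
LATIN_I = u"I"           # LATIN CAPITAL LETTER I
CYRILLIC_A = u"\u0430"   # CYRILLIC SMALL LETTER A
LATIN_A = u"a"           # LATIN SMALL LETTER A


def same_after(form, left, right):
    """Hypothetical helper: do two strings compare equal once both
    have been normalized to the given Unicode normal form?"""
    return unicodedata.normalize(form, left) == unicodedata.normalize(form, right)


# NFC only composes canonical equivalents, so U+2160 stays distinct from "I".
print(same_after("NFC", ROMAN_ONE, LATIN_I))     # False

# NFKC also folds compatibility characters, so U+2160 becomes plain "I".
print(same_after("NFKC", ROMAN_ONE, LATIN_I))    # True

# Neither form knows about cross-script lookalikes such as Cyrillic "a".
print(same_after("NFKC", CYRILLIC_A, LATIN_A))   # False

# When in doubt, ask for the character's name instead of trusting the glyph.
print(unicodedata.name(ROMAN_ONE))               # ROMAN NUMERAL ONE
```

So normalization can fold some lookalikes, depending on the form chosen, but 
it is not a general homoglyph detector. Telling the remaining cases apart is a 
job for fonts and tools such as unicodedata.name().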

> So, a round 0 for reading comprehension this lesson, I'm afraid. Better
> luck next time.

Ha ha, very funny.

So, let's summarize... 

Non-ASCII identifiers are bad, because they are vulnerable to the exact 
same problems as ASCII identifiers, only we're happy to live with those 
problems if they are ASCII, and just install a font that makes I and l 
look different, but we won't install a font that makes I and Ⅰ look 
different, because that's too hard.

Well, you've convinced me. Obviously expecting Python programmers to cope 
with something as complicated as installing a decent set of fonts is such 
a major hurdle that people will abandon the language in droves, probably 
taking up Haskell and Visual Basic and Lisp and all those other languages 
that allow non-ASCII identifiers.

-- 
Steven.