PEP 3131: Supporting Non-ASCII Identifiers

Steven D'Aprano steven at REMOVE.THIS.cybersource.com.au
Sun May 13 22:41:41 EDT 2007


On Mon, 14 May 2007 09:42:13 +1000, Aldo Cortesi wrote:

>       I don't want to be in a situation where I need to mechanically
>       "clean" code (say, from a submitted patch) with a tool because I
>       can't reliably verify it by eye.

But you can't reliably verify code by eye. That's orders of magnitude 
harder than debugging by eye, and we all know that you can't reliably 
debug anything but the most trivial programs that way.

If you're relying on cursory visual inspection to recognize harmful code, 
you're already vulnerable to trojans.



>       We should learn from the plethora of
>       Unicode-related security problems that have cropped up in the last
>       few years.

Of course we should. And one of the things we should learn is when and 
how Unicode is a risk, and not imagine that Unicode is some sort of 
mystical contamination that creates security problems just by being used.



>     - Non-ASCII identifiers would be a barrier to code exchange. If I
>       know Python I should be able to easily read any piece of code
>       written in it, regardless of the linguistic origin of the author.
>       If PEP 3131 is accepted, this will no longer be the case.

But it isn't the case now, so that's no different. Code exchange 
regardless of human language is a nice principle, but it doesn't work in 
practice. How do you use "any piece of code ... regardless of the 
linguistic origin of the author" when you don't know what the functions 
and classes and arguments _mean_?

Here's a tiny doc string from one of the functions in the standard 
library, translated (more or less) to Portuguese. If you can't read 
Portuguese at least well enough to get by, how could you possibly use 
this function? What would you use it for? What does it do? What arguments 
does it take?

def dirsorteinsercao(a, x, baixo=0, elevado=None): 
    """da o artigo x insercao na lista a, e mantem-na a 
    supondo classificado e classificado. Se x estiver ja em a,
    introduza-o a direita do x direita mais. Os args opcionais 
    baixos (defeito 0) e elevados (len(a) do defeito) limitam 
    a fatia de a a ser procurarado.
    """
    # not a non-ASCII character in sight (unless I missed one...)

[Apologies to Portuguese speakers for the dog's breakfast I'm sure 
Babel Fish and I made of the translation.]
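
In case you'd rather follow the algorithm than the words, the body is 
only a few lines. Here's a from-memory sketch of the standard library 
original (so treat it as illustrative, not gospel), with the one local 
variable given the same rough translation treatment:

def dirsorteinsercao(a, x, baixo=0, elevado=None):
    # A plain binary search for the rightmost insertion point,
    # followed by an insert at that position.
    if elevado is None:
        elevado = len(a)
    while baixo < elevado:
        meio = (baixo + elevado) // 2
        if x < a[meio]:
            elevado = meio
        else:
            baixo = meio + 1
    a.insert(baixo, x)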

The particular function I chose is probably small enough and obvious 
enough that you could work out what it does just by following the 
algorithm. You might even be able to guess what it is, because Portuguese 
is similar enough to other Latin languages that most people can guess 
what some of the words might mean (elevados could be height, maybe?). Now 
multiply this difficulty by a thousand for a non-trivial module with 
multiple classes and dozens of methods and functions. And you might not 
even know what language it is in.

No, code exchange regardless of natural language is a nice principle, but 
it doesn't exist except in very special circumstances. 



>       A Python project that uses Urdu identifiers throughout is just as
>       useless to me, from a code-exchange point of view, as one written
>       in Perl.

That's because you can't read it, not because it uses Unicode. It could 
be written entirely in ASCII, and still be unreadable and impossible to 
understand.



>     - Unicode is harder to work with than ASCII in ways that are more
>       important in code than in human-language text. Human eyes don't
>       care if two visually indistinguishable characters are used
>       interchangeably. Interpreters do. There is no doubt that people
>       will accidentally introduce mistakes into their code because of
>       this.

That's no different from typos in ASCII. There's no doubt that we'll give 
the same answer we've always given for this problem: unit tests, pylint 
and pychecker.
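
To make that concrete -- purely an illustrative sketch, assuming PEP 
3131-style identifiers are in force, and with made-up names -- a 
confusable-character slip fails in exactly the same way an ASCII 
misspelling does:

total = 42                       # plain Latin letters

# The two literals below look identical, but the second contains the
# Cyrillic letter "о" (U+043E) in place of the Latin "o" (U+006F):
print("total" == "tоtal")        # prints False

def test_total():
    try:
        print(tоtal)             # the same Cyrillic "о", hiding in the name
    except NameError:
        # Undefined name: flagged on the first test run, or by
        # pylint/pychecker, exactly like the ASCII slip "totla".
        print("caught like any other typo")

test_total()

The only new wrinkle is that your eyes can't see the difference, which 
is precisely why the tools have to.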



-- 
Steven.


