PEP 3131: Supporting Non-ASCII Identifiers
Steven D'Aprano
steven at REMOVE.THIS.cybersource.com.au
Sun May 13 22:41:41 EDT 2007
On Mon, 14 May 2007 09:42:13 +1000, Aldo Cortesi wrote:
> I don't
> want to be in a situation where I need to mechanically "clean"
> code (say, from a submitted patch) with a tool because I can't
> reliably verify it by eye.
But you can't reliably verify by eye. That's orders of magnitude more
difficult than debugging by eye, and we all know that you can't reliably
debug anything but the most trivial programs by eye.
If you're relying on cursory visual inspection to recognize harmful code,
you're already vulnerable to trojans.
> We should learn from the plethora of
> Unicode-related security problems that have cropped up in the last
> few years.
Of course we should. And one of the things we should learn is when and
how Unicode is a risk, and not imagine that Unicode is some sort of
mystical contamination that creates security problems just by being used.
> - Non-ASCII identifiers would be a barrier to code exchange. If I know
> Python I should be able to easily read any piece of code written
> in it, regardless of the linguistic origin of the author. If PEP
> 3131 is accepted, this will no longer be the case.
But it isn't the case now, so that's no different. Code exchange
regardless of human language is a nice principle, but it doesn't work in
practice. How do you use "any piece of code ... regardless of the
linguistic origin of the author" when you don't know what the functions
and classes and arguments _mean_?
Here's a tiny doc string from one of the functions in the standard
library, translated (more or less) to Portuguese. If you can't read
Portuguese at least well enough to get by, how could you possibly use
this function? What would you use it for? What does it do? What arguments
does it take?
def dirsorteinsercao(a, x, baixo=0, elevado=None):
    """da o artigo x insercao na lista a, e mantem-na a
    supondo classificado e classificado. Se x estiver ja em a,
    introduza-o a direita do x direita mais. Os args opcionais
    baixos (defeito 0) e elevados (len(a) do defeito) limitam
    a fatia de a a ser procurarado.
    """
    # not a non-ASCII character in sight (unless I missed one...)
[Apologies to Portuguese speakers for the dog's breakfast I'm sure
Babelfish and I made of the translation.]
The particular function I chose is probably small enough and obvious
enough that you could work out what it does just by following the
algorithm. You might even be able to guess what it is, because Portuguese
is similar enough to other Latin languages that most people can guess
what some of the words might mean (elevados could be height, maybe?). Now
multiply this difficulty by a thousand for a non-trivial module with
multiple classes and dozens of methods and functions. And you might not
even know what language it is in.
No, code exchange regardless of natural language is a nice principle, but
it doesn't exist except in very special circumstances.
> A Python
> project that uses Urdu identifiers throughout is just as useless
> to me, from a code-exchange point of view, as one written in Perl.
That's because you can't read it, not because it uses Unicode. It could
be written entirely in ASCII, and still be unreadable and impossible to
understand.
> - Unicode is harder to work with than ASCII in ways that are more
> important in code than in human-language text. Human eyes don't care
> if two visually indistinguishable characters are used interchangeably.
> Interpreters do. There is no doubt that people will accidentally
> introduce mistakes into their code because of this.
That's no different from typos in ASCII. There's no doubt that we'll give
the same answer we've always given for this problem: unit tests, pylint
and pychecker.
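
As a rough illustration of that answer, here is a minimal sketch (not part
of the original post), assuming Python 3.7 or later. The helper name
non_ascii_names and the sample source are hypothetical, made up for this
example. It shows that a look-alike character in an identifier behaves like
any other typo: the misspelled name fails with a NameError as soon as the
code is exercised, and a small lint-style check can report exactly which
character is the odd one out.

```python
import ast
import unicodedata


def non_ascii_names(source):
    """Map each non-ASCII identifier in `source` to the Unicode names
    of its non-ASCII characters. (Hypothetical helper, for illustration.)"""
    found = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and not node.id.isascii():
            found[node.id] = [
                unicodedata.name(ch, "UNKNOWN")
                for ch in node.id
                if not ch.isascii()
            ]
    return found


# Assumed sample: "count" is spelled with Latin letters on the first line,
# but with a Cyrillic "o" (U+043E) on the second line.
SAMPLE = "count = 1\nresult = c\u043eunt + 1\n"


def run_sample():
    """Execute SAMPLE the way a unit test would and report what happens."""
    try:
        exec(SAMPLE, {})
    except NameError:
        return "NameError"
    return "ok"


if __name__ == "__main__":
    print(run_sample())
    print(non_ascii_names(SAMPLE))
```

Real tools such as pylint have their own checks; this sketch only stands in
for the general idea that the mistake is caught mechanically rather than by
eye.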
--
Steven.