Allowing non-ASCII identifiers
Dietrich Epp
dietrich at zdome.net
Wed Feb 11 18:38:50 EST 2004
On Feb 10, 2004, at 8:59 AM, Scott David Daniels wrote:
> Also, we would have to solve the issue of multiple representations
> for the same identifier (normalized identifiers)? There are four
> equivalent representations:
>
> (u'\N{Latin small letter e with acute}l'
> u'\N{Latin small letter e with grave}ve')
>
> (u'\N{Latin small letter e with acute}l'
> u'e\N{Combining grave accent}ve')
>
> (u'e\N{Combining acute accent}l'
> u'\N{Latin small letter e with grave}ve')
>
> (u'e\N{Combining acute accent}l'
> u'e\N{Combining grave accent}ve')
>
> Unicode says we should treat these four identically. Further,
> they each have a distinct hash code, so a dictionary will not
> necessarily even try to compare them to find them equal.
You could require that all identifiers be the canonically decomposed
Unicode representations encoded into UTF-8. This would mean that no
matter which string is chosen from the above, the result is always the
same sequence of characters. This is how many filesystems use Unicode;
Mac HFS+, for example, works this way (but filesystems usually also
require a specific version of Unicode for backwards compatibility).
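As a rough illustration of the idea (a sketch only, written for Python 3
and the standard unicodedata module, not a proposal for how the
interpreter should do it), the four spellings quoted above collapse to
one byte string once they are canonically decomposed (NFD) and encoded
as UTF-8. The helper name canonical_identifier is made up for this
example.

```python
import unicodedata

E_ACUTE = "\N{LATIN SMALL LETTER E WITH ACUTE}"
E_GRAVE = "\N{LATIN SMALL LETTER E WITH GRAVE}"
COMB_ACUTE = "\N{COMBINING ACUTE ACCENT}"
COMB_GRAVE = "\N{COMBINING GRAVE ACCENT}"

# The four equivalent spellings from the quoted message.
spellings = [
    E_ACUTE + "l" + E_GRAVE + "ve",
    E_ACUTE + "l" + "e" + COMB_GRAVE + "ve",
    "e" + COMB_ACUTE + "l" + E_GRAVE + "ve",
    "e" + COMB_ACUTE + "l" + "e" + COMB_GRAVE + "ve",
]


def canonical_identifier(name):
    """Hypothetical helper: canonically decompose (NFD), then encode as UTF-8."""
    return unicodedata.normalize("NFD", name).encode("utf-8")


# As raw strings the four spellings are all different...
distinct_raw = set(spellings)

# ...but after NFD + UTF-8 they are the same sequence of bytes,
# so they compare equal and hash identically.
distinct_canonical = {canonical_identifier(s) for s in spellings}

print(len(distinct_raw), len(distinct_canonical))
```

With this scheme a dictionary keyed on the canonical form would treat
all four spellings as the same key.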
I personally think that Unicode identifiers would be catastrophic.
With Unicode on the web, if you can't represent some characters, you
can't read the web page. With programming, it could mean that you are
unable to use a particular module at all, so people who can't enter
certain characters lose functionality. There is also the issue of which
characters to allow, because some characters look like numbers. Is
Unicode 'IV' a number or an identifier? What about a circled 4? What
about Unicode line breaks and paragraph breaks? What about opening and
closing quote marks? What about right-to-left characters? What about
ligatures? Non-breaking spaces? Function application?
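To make these questions concrete, here is a small sketch (Python 3,
standard unicodedata module) that prints the general category and
numeric value the Unicode database assigns to some of the characters in
question. The selection of code points and the helper name describe are
my own choices for illustration; the sketch only shows what a language
designer would have to rule on, not what the answer should be.

```python
import unicodedata

# Code points chosen for illustration; each matches one question above.
samples = {
    "roman numeral four": "\u2163",
    "circled digit four": "\u2463",
    "line separator": "\u2028",
    "paragraph separator": "\u2029",
    "left double quotation mark": "\u201c",
    "right-to-left mark": "\u200f",
    "ligature fi": "\ufb01",
    "no-break space": "\u00a0",
    "function application": "\u2061",
}


def describe(ch):
    """Hypothetical helper: return (general category, numeric value or None)."""
    return unicodedata.category(ch), unicodedata.numeric(ch, None)


for label, ch in samples.items():
    category, value = describe(ch)
    print(f"U+{ord(ch):04X} {label}: category={category} numeric={value}")
```

The Roman numeral is classed as a letter-like number (Nl) and the
circled 4 as an "other" number (No), yet both carry the numeric value 4,
which is exactly the kind of ambiguity a tokenizer would have to
resolve.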
I think the assumption some people have is that Unicode will only ever
be used for things that are like the Roman alphabet: adding diacritical
marks, etc. It sounds like the most worthless extension ever, and the
only language I think of when I think of special characters is
Intercal.