Allowing non-ASCII identifiers

Wed Feb 11 18:38:50 EST 2004

On Feb 10, 2004, at 8:59 AM, Scott David Daniels wrote:

> Also, we would have to solve the issue of multiple representations
> for the same identifier (normalized identifiers)?  There are four
> equivalent representations:
>
>     (u'\N{Latin small letter e with acute}l'
>                        u'\N{Latin small letter e with grave}ve')
>
>     (u'\N{Latin small letter e with acute}l'
>                        u'e\N{Combining grave accent}ve')
>
>     (u'e\N{Combining acute accent}l'
>                        u'\N{Latin small letter e with grave}ve')
>
>     (u'e\N{Combining acute accent}l'
>                        u'e\N{Combining grave accent}ve')
>
> Unicode says we should treat these four identically.  Further,
> they each have a distinct hash code, so a dictionary will not 
> necessarily even try to compare them to find them equal.

You could require that all identifiers be stored in canonically
decomposed Unicode form (NFD), encoded as UTF-8.  Then, no matter which
of the strings above is chosen, the result is always the same sequence
of characters.  This is how some filesystems handle Unicode; for
example, Mac HFS+ works this way (but filesystems usually also pin a
specific version of Unicode for backwards compatibility).
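
For illustration, here is a minimal sketch of that idea using the
standard unicodedata module.  The helper name canonical_identifier is
hypothetical, and nothing here reflects how any Python implementation
actually treats identifiers; it only shows that the four spellings from
the quoted message collapse to one string (and one UTF-8 byte sequence)
after canonical decomposition.

```python
import unicodedata

# Building blocks for the four spellings in the quoted message.
E_ACUTE = "\N{LATIN SMALL LETTER E WITH ACUTE}"      # precomposed
E_GRAVE = "\N{LATIN SMALL LETTER E WITH GRAVE}"      # precomposed
E_COMB_ACUTE = "e\N{COMBINING ACUTE ACCENT}"         # base + combining mark
E_COMB_GRAVE = "e\N{COMBINING GRAVE ACCENT}"         # base + combining mark

spellings = [
    E_ACUTE + "l" + E_GRAVE + "ve",
    E_ACUTE + "l" + E_COMB_GRAVE + "ve",
    E_COMB_ACUTE + "l" + E_GRAVE + "ve",
    E_COMB_ACUTE + "l" + E_COMB_GRAVE + "ve",
]


def canonical_identifier(name):
    """Hypothetical helper: canonically decompose a name (NFD)."""
    return unicodedata.normalize("NFD", name)


# Without normalization the four strings are all different keys.
raw_keys = set(spellings)

# After normalization they are one key, and one UTF-8 byte sequence.
normalized_keys = {canonical_identifier(s) for s in spellings}
utf8_forms = {canonical_identifier(s).encode("utf-8") for s in spellings}

# A namespace keyed on the normalized form finds the same entry
# regardless of which spelling was used to store or look it up.
namespace = {canonical_identifier(spellings[0]): "value"}
lookups = [namespace[canonical_identifier(s)] for s in spellings]
```

Normalizing once, at the point where a name enters the namespace,
avoids the hash-code problem raised in the quoted text, because only
one representation is ever hashed.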

I personally think that Unicode identifiers would be catastrophic.  
With Unicode on the web, if you can't display some characters, you 
can't read the web page.  With programming, it could mean that you are 
unable to use a particular module at all, which changes what is 
available to people who can't type certain characters.  There is also 
the issue of which characters to allow, because some characters look 
like numbers.  Is the Unicode Roman numeral 'IV' a number or an 
identifier?  What about a circled 4?  What about Unicode line breaks 
and paragraph breaks?  What about opening and closing quote marks?  
What about right-to-left characters?  What about ligatures?  
Non-breaking spaces?  Function application?
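
To make those questions concrete, the following sketch looks up the
Unicode general category of one representative character for each case,
again using the standard unicodedata module.  The particular characters
chosen are my own picks for illustration, and the sketch takes no
position on which categories an identifier syntax should accept.

```python
import unicodedata

# One representative character per question raised above.
candidates = [
    "\N{ROMAN NUMERAL FOUR}",           # the single-character 'IV'
    "\N{CIRCLED DIGIT FOUR}",
    "\N{LINE SEPARATOR}",
    "\N{PARAGRAPH SEPARATOR}",
    "\N{LEFT DOUBLE QUOTATION MARK}",
    "\N{RIGHT DOUBLE QUOTATION MARK}",
    "\N{RIGHT-TO-LEFT MARK}",
    "\N{LATIN SMALL LIGATURE FI}",
    "\N{NO-BREAK SPACE}",
    "\N{FUNCTION APPLICATION}",
]

# Map each character's Unicode name to its two-letter general category
# (e.g. 'Nl' = letter-like number, 'Zs' = space separator).
categories = {unicodedata.name(ch): unicodedata.category(ch)
              for ch in candidates}

# Both "number-looking" characters carry a numeric value of 4.
numeric_values = {
    unicodedata.name(ch): unicodedata.numeric(ch)
    for ch in candidates[:2]
}

for name, category in categories.items():
    print("%-30s %s" % (name, category))
```

The Roman numeral is classed as a letter-like number while the circled
digit is classed as an "other" number, so even the two characters that
look like numbers do not fall into the same category.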

I think the assumption some people have is that Unicode will only ever 
be used for things that resemble the Roman alphabet: adding diacritical 
marks, etc.  It sounds like the most worthless extension ever, and the 
only language that comes to mind when I think of special characters is 
INTERCAL.