languages with full unicode support

Joachim Durchholz jo at durchholz.org
Sat Jul 1 03:46:50 EDT 2006


Chris Uppal wrote:
> Joachim Durchholz wrote:
> 
>>> This is implementation-defined in C.  A compiler is allowed to accept
>>> variable names with alphabetic Unicode characters outside of ASCII.
>> Hmm... that code would be nonportable, so C support for Unicode is
>> half-baked at best.
> 
> Since the interpretation of characters which are yet to be added to
> Unicode is undefined (will they be digits, "letters", operators, symbols,
> punctuation.... ?), there doesn't seem to be any sane way that a language could
> allow an unrestricted choice of Unicode in identifiers.

I don't think this is a problem in practice. E.g. if a language uses the 
usual definition for identifiers (first letter, then letters/digits), 
you end up with a language that changes its definition on the whims of 
the Unicode consortium, but that's less of a problem than one might 
think at first.
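
To make the "first letter, then letters/digits" rule concrete, here is a minimal Python sketch that classifies characters by their Unicode general category. The helper names (is_letter, is_digit, is_identifier) are made up for illustration, and this is not how any particular language actually defines its identifiers; it only shows that the rule can be stated purely in terms of the Unicode tables.

```python
import unicodedata

def is_letter(ch):
    # General categories starting with "L" are letters (Lu, Ll, Lt, Lm, Lo).
    return unicodedata.category(ch).startswith("L")

def is_digit(ch):
    # "Nd" is the decimal-digit category.
    return unicodedata.category(ch) == "Nd"

def is_identifier(name):
    # Hypothetical rule: one letter, followed by any mix of letters and digits.
    if not name or not is_letter(name[0]):
        return False
    return all(is_letter(c) or is_digit(c) for c in name[1:])
```

With this rule, a name is accepted or rejected depending only on which revision of the Unicode tables the unicodedata module was built from, which is exactly the dependency discussed here.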

I'd expect two kinds of changes in character categorization: additions 
and corrections. (Any other?)

Additions are relatively unproblematic. Existing code will remain valid 
and retain its semantics. The new characters will be available for new 
programs.
There's a slight technological complication: the compiler needs to be 
able to look up the newest definition. In other words, for a compiler to 
run, it needs to be able to access http://unicode.org, or the language 
infrastructure needs a way to carry around various revisions of the 
Unicode tables and select the newest one.
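
As a rough illustration of "carrying around various revisions", Python's unicodedata module ships the current tables plus a frozen Unicode 3.2.0 copy (unicodedata.ucd_3_2_0). The sketch below compares how the two revisions classify a character; the function name letter_status is made up for this example, and U+1E9E is used on the assumption that the running Python is built against Unicode 5.1 or later.

```python
import unicodedata

def letter_status(ch):
    # Returns (is a letter in Unicode 3.2.0, is a letter in the current tables).
    old = unicodedata.ucd_3_2_0.category(ch).startswith("L")
    new = unicodedata.category(ch).startswith("L")
    return old, new

# The revision of the tables this interpreter was built with.
print(unicodedata.unidata_version)

# U+1E9E (LATIN CAPITAL LETTER SHARP S) was added after 3.2.0, so the old
# tables treat it as unassigned while the newer ones treat it as a letter.
print(letter_status("\u1e9e"))
```

This is the "addition" case: anything that was a valid identifier under the old tables stays valid under the new ones, and only the newly added character becomes usable.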

Corrections are technically more problematic, but then we can rely on 
the common sense of the programmers. If the Unicode consortium 
miscategorized a character as a letter, the programmers that use that 
character set will probably know it well enough to avoid its use. It 
will probably not even occur to them that that character could be a 
letter ;-)


Actually I'm not sure that Unicode is important for long-lived code. 
Code tends to not survive very long unless it's written in English, in 
which case anything outside of strings is in 7-bit ASCII. So the 
majority of code won't ever be affected by Unicode problems - Unicode is 
more a way of lowering entry barriers.

Regards,
Jo


