Java identifiers (was: languages with full unicode support)

Wed Jun 28 09:56:18 EDT 2006

Note Followup-To: comp.lang.java.programmer

Chris Uppal wrote:
> Since the interpretation of characters which are yet to be added to
> Unicode is undefined (will they be digits, "letters", operators, symbol,
> punctuation.... ?), there doesn't seem to be any sane way that a language could
> allow an unrestricted choice of Unicode in identifiers.  Hence, it must define
> a specific allowed sub-set.  C certainly defines an allowed subset of Unicode
> characters -- so I don't think you could call its Unicode support "half-baked"
> (not in that respect, anyway).  A case -- not entirely convincing, IMO -- could
> be made that it would be better to allow a wider range of characters.
> 
> And no, I don't think Java's approach -- where there /is no defined set of
> allowed identifier characters/ -- makes any sense at all :-(

Java does have a defined set of allowed identifier characters. However, you
certainly have to go around the houses a bit to work out what that set is:

<http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.8>

# An identifier is an unlimited-length sequence of Java letters and Java digits,
# the first of which must be a Java letter. An identifier cannot have the same
# spelling (Unicode character sequence) as a keyword (§3.9), boolean literal
# (§3.10.3), or the null literal (§3.10.7).
[...]
# A "Java letter" is a character for which the method
# Character.isJavaIdentifierStart(int) returns true. A "Java letter-or-digit"
# is a character for which the method Character.isJavaIdentifierPart(int)
# returns true.
[...]
# Two identifiers are the same only if they are identical, that is, have the
# same Unicode character for each letter or digit.
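Read literally, the rule quoted above can be sketched in a few lines of Java. This is only an illustration: the class and method names are made up here, and the exclusion of keywords, boolean literals and null is deliberately left out. The string is walked by code point so that supplementary characters reach the int overloads.

```java
// Hypothetical helper; the names are illustrative, not part of any API.
final class JlsIdentifierChars {
    // True if s is a non-empty sequence of "Java letters" and
    // "Java letter-or-digits" whose first character is a "Java letter".
    // Keywords, true/false and null are NOT excluded here.
    static boolean isIdentifierChars(String s) {
        if (s.length() == 0) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isIdentifierChars("foo_1")); // true
        System.out.println(isIdentifierChars("1foo"));  // false
        System.out.println(isIdentifierChars("$"));     // true
    }
}
```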

For Java 1.5.0:

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html>

# Character information is based on the Unicode Standard, version 4.0.

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isJavaIdentifierStart(int)>

# A character may start a Java identifier if and only if one of the following
# conditions is true:
#
#   * isLetter(codePoint) returns true
#   * getType(codePoint) returns LETTER_NUMBER
#   * the referenced character is a currency symbol (such as "$")

[This means that getType(codePoint) returns CURRENCY_SYMBOL, i.e. Unicode
General Category Sc.]

#   * the referenced character is a connecting punctuation character (such as "_").

[This means that getType(codePoint) returns CONNECTOR_PUNCTUATION, i.e. Unicode
General Category Pc.]
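
The two bracketed readings above can be spot-checked with Character.getType(int). A minimal sketch follows; the class and method names are hypothetical, and the categories reported are those of whatever Unicode version the running JVM implements (not necessarily 4.0).

```java
// Hypothetical spot check; not part of the API documentation.
final class CategorySpotCheck {
    // General Category Sc
    static boolean isCurrencySymbol(int cp) {
        return Character.getType(cp) == Character.CURRENCY_SYMBOL;
    }

    // General Category Pc
    static boolean isConnectorPunctuation(int cp) {
        return Character.getType(cp) == Character.CONNECTOR_PUNCTUATION;
    }

    public static void main(String[] args) {
        System.out.println(isCurrencySymbol('$'));       // true
        System.out.println(isConnectorPunctuation('_')); // true
        // Both are therefore accepted as the first character of an identifier.
        System.out.println(Character.isJavaIdentifierStart('$')); // true
        System.out.println(Character.isJavaIdentifierStart('_')); // true
    }
}
```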

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isJavaIdentifierPart(int)>

# A character may be part of a Java identifier if any of the following are true:
#
#   * it is a letter
#   * it is a currency symbol (such as '$')
#   * it is a connecting punctuation character (such as '_')
#   * it is a digit
#   * it is a numeric letter (such as a Roman numeral character)

[General Category Nl.]

#   * it is a combining mark

[General Category Mc (see <http://www.unicode.org/versions/Unicode4.0.0/ch04.pdf>).]

#   * it is a non-spacing mark

[General Category Mn (ditto).]

#   * isIdentifierIgnorable(codePoint) returns true for the character

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isDigit(int)>

# A character is a digit if its general category type, provided by
# getType(codePoint), is DECIMAL_DIGIT_NUMBER.

[General Category Nd.]

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isIdentifierIgnorable(int)>

# The following Unicode characters are ignorable in a Java identifier or a Unicode
# identifier:
#
#   * ISO control characters that are not whitespace
#         o '\u0000' through '\u0008'
#         o '\u000E' through '\u001B'
#         o '\u007F' through '\u009F'
#   * all characters that have the FORMAT general category value

[FORMAT is General Category Cf.]

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isLetter(int)>

# A character is considered to be a letter if its general category type, provided
# by getType(codePoint), is any of the following:
#
#   * UPPERCASE_LETTER
#   * LOWERCASE_LETTER
#   * TITLECASE_LETTER
#   * MODIFIER_LETTER
#   * OTHER_LETTER

====

To cut a long story short, the syntax of identifiers in Java 1.5 is therefore:

  Keyword ::= one of
        abstract    continue    for           new          switch
        assert      default     if            package      synchronized
        boolean     do          goto          private      this
        break       double      implements    protected    throw
        byte        else        import        public       throws
        case        enum        instanceof    return       transient
        catch       extends     int           short        try
        char        final       interface     static       void
        class       finally     long          strictfp     volatile
        const       float       native        super        while

  Identifier        ::= IdentifierChars butnot (Keyword | "true" | "false" | "null")
  IdentifierChars   ::= JavaLetter | IdentifierChars JavaLetterOrDigit
  JavaLetter        ::= Lu | Ll | Lt | Lm | Lo | Nl | Sc | Pc
  JavaLetterOrDigit ::= JavaLetter | Nd | Mn | Mc |
                        U+0000..0008 | U+000E..001B | U+007F..009F | Cf

where the two-letter terminals refer to General Categories in Unicode 4.0.0
(exactly).
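
For what it's worth, the grammar above can be transcribed fairly directly into Java using Character.getType(int). The following is a sketch under stated assumptions, not a reference implementation: the class and method names are invented, and the categories come from the Unicode version of the JVM it runs on rather than from Unicode 4.0.0 exactly.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical transcription of the grammar; names are illustrative only.
final class IdentifierGrammar {
    // Keyword | "true" | "false" | "null"
    private static final Set<String> RESERVED = new HashSet<String>(Arrays.asList(
        "abstract", "continue", "for", "new", "switch",
        "assert", "default", "if", "package", "synchronized",
        "boolean", "do", "goto", "private", "this",
        "break", "double", "implements", "protected", "throw",
        "byte", "else", "import", "public", "throws",
        "case", "enum", "instanceof", "return", "transient",
        "catch", "extends", "int", "short", "try",
        "char", "final", "interface", "static", "void",
        "class", "finally", "long", "strictfp", "volatile",
        "const", "float", "native", "super", "while",
        "true", "false", "null"));

    // JavaLetter ::= Lu | Ll | Lt | Lm | Lo | Nl | Sc | Pc
    static boolean isJavaLetter(int cp) {
        switch (Character.getType(cp)) {
            case Character.UPPERCASE_LETTER:
            case Character.LOWERCASE_LETTER:
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
            case Character.LETTER_NUMBER:
            case Character.CURRENCY_SYMBOL:
            case Character.CONNECTOR_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // JavaLetterOrDigit ::= JavaLetter | Nd | Mn | Mc | control ranges | Cf
    static boolean isJavaLetterOrDigit(int cp) {
        if (isJavaLetter(cp)) {
            return true;
        }
        switch (Character.getType(cp)) {
            case Character.DECIMAL_DIGIT_NUMBER:
            case Character.NON_SPACING_MARK:
            case Character.COMBINING_SPACING_MARK:
            case Character.FORMAT:
                return true;
            default:
                break;
        }
        return (cp >= 0x0000 && cp <= 0x0008)
            || (cp >= 0x000E && cp <= 0x001B)
            || (cp >= 0x007F && cp <= 0x009F);
    }

    // Identifier ::= IdentifierChars butnot (Keyword | "true" | "false" | "null")
    static boolean isIdentifier(String s) {
        if (s.length() == 0 || RESERVED.contains(s)) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!isJavaLetter(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isJavaLetterOrDigit(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("foo"));   // true
        System.out.println(isIdentifier("class")); // false
        System.out.println(isIdentifier("1x"));    // false
    }
}
```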

Note that the so-called "ignorable" characters (for which
isIdentifierIgnorable(codePoint) returns true) are not ignorable; they are
treated like any other identifier character. This quote from the API spec:

# The following Unicode characters are ignorable in a Java identifier [...]

should be ignored (no pun intended). It is contradicted by:

# Two identifiers are the same only if they are identical, that is, have the
# same Unicode character for each letter or digit.

in the language spec. Unicode does have a concept of ignorable characters in
identifiers, which is probably where this documentation bug crept in.
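
To make the point concrete, here is a small sketch (class and member names are hypothetical). It takes U+200E LEFT-TO-RIGHT MARK, a Cf character, and U+0000: both are reported as "ignorable" and both are accepted as identifier parts. String.equals stands in here for the character-by-character comparison the language spec requires, as an assumption about how a compiler would compare the two names.

```java
// Hypothetical demonstration; String.equals is a stand-in for the
// comparison of identifiers described in the language spec.
final class IgnorableDemo {
    static final String PLAIN = "foo";
    // Same letters with U+200E (general category Cf) in the middle.
    static final String WITH_MARK = "fo\u200Eo";

    // True if cp is called "ignorable" by the API and is nevertheless
    // an accepted identifier character.
    static boolean isIgnorableButAccepted(int cp) {
        return Character.isIdentifierIgnorable(cp)
            && Character.isJavaIdentifierPart(cp);
    }

    public static void main(String[] args) {
        System.out.println(isIgnorableButAccepted(0x200E)); // true
        System.out.println(isIgnorableButAccepted(0x0000)); // true
        // Not identical character sequences, hence different identifiers.
        System.out.println(PLAIN.equals(WITH_MARK));        // false
    }
}
```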

The inclusion of U+0000 and various control characters in the set of valid
identifier characters is also a dubious decision, IMHO.

Note that I am not defending in any way the complexity of this definition; there's
clearly no excuse for it (or for the "ignorable" documentation bug). The language
spec should have been defined directly in terms of the Unicode General Categories,
and then the API in terms of the language spec. The way it is done now is
completely backwards.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>