[Python-Dev] Internationalization Toolkit

Tim Peters tim_one@email.msn.com
Wed, 10 Nov 1999 01:25:07 -0500


> Marc-Andre Lemburg has a proposal for work that I'm asking him to do
> (under pressure from HP who want Python i18n badly and are willing to
> pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt

I can't make time for a close review now.  Just one thing that hit my eye
early:

    Python should provide a built-in constructor for Unicode strings
    which is available through __builtins__:

    u = unicode(<encoded Python string>[,<encoding name>=
                                         <default encoding>])

    u = u'<utf-8 encoded Python string>'

Two points on the Unicode literals (u'abc'):

UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by
hand -- it breaks apart and rearranges bytes at the bit level, and
everything other than 7-bit ASCII requires solid strings of "high-bit"
characters.  This is painful for people to enter manually on both counts --
and no common reference gives the UTF-8 encoding of glyphs directly.  So, as
discussed earlier, we should follow Java's lead and also introduce a \u
escape sequence:

    octet:           hexdigit hexdigit
    unicodecode:     octet octet
    unicode_escape:  "\\u" unicodecode

Inside a u'' string, I guess this should expand to the UTF-8 encoding of the
Unicode character at the unicodecode code position.  For consistency, then,
it should probably expand the same way inside "regular strings" too.  Unlike
Java, I'd rather not give it a meaning outside string literals.
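
To make the "bit level" point concrete, here is a rough sketch of expanding
\u escapes into UTF-8 by hand.  It is written for present-day Python 3, and
the names utf8_encode and expand_unicode_escapes are made up for
illustration; nothing here comes from the proposal.  It assumes exactly four
hex digits per the grammar above, and it ignores escaped backslashes and
surrogate code points for brevity.

```python
import re

def utf8_encode(code):
    """Encode one code point in 0..0xFFFF as UTF-8 bytes, by hand."""
    if code < 0x80:
        # 7-bit ASCII passes through unchanged.
        return bytes([code])
    if code < 0x800:
        # 11 bits split as 5 + 6 across two high-bit bytes.
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    # 16 bits split as 4 + 6 + 6 across three high-bit bytes.
    return bytes([0xE0 | (code >> 12),
                  0x80 | ((code >> 6) & 0x3F),
                  0x80 | (code & 0x3F)])

# Hypothetical scanner: a backslash, "u", then two octets (four hex digits).
_ESCAPE = re.compile(rb"\\u([0-9a-fA-F]{4})")

def expand_unicode_escapes(literal_body):
    """Replace each \\uXXXX in a bytes literal body with its UTF-8 bytes."""
    return _ESCAPE.sub(lambda m: utf8_encode(int(m.group(1), 16)),
                       literal_body)

# U+20AC (EURO SIGN) becomes three bytes, none of which look like "20AC".
print(utf8_encode(0x20AC))
print(expand_unicode_escapes(rb"caf\u00e9"))
```

Typing \u20ac is a lot friendlier than working out E2 82 AC on paper, which
is the whole argument for the escape.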

The other point is a nit:  The vast bulk of UTF-8 encodings encode
characters in UCS-4 space outside of Unicode.  In good Pythonic fashion,
those must either be explicitly outlawed, or explicitly defined.  I vote for
outlawed, in the sense of a detected error that raises an exception.  That
leaves our future options open.
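
A sketch of what "outlawed" could look like in practice.  It leans on the
UTF-8 decoder in present-day Python 3, which treats U+10FFFF as the top of
the Unicode range and raises UnicodeDecodeError for anything beyond it; that
limit and the helper name is_legal_utf8 are assumptions for illustration,
not part of the proposal.

```python
def is_legal_utf8(data):
    """Return True if data decodes as UTF-8 within the Unicode range."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # The decoder detected the error and raised; report it as illegal.
        return False
    return True

samples = [
    b"\xe2\x82\xac",          # U+20AC, well inside Unicode
    b"\xf4\x8f\xbf\xbf",      # U+10FFFF, the last code point accepted
    b"\xf4\x90\x80\x80",      # would be U+110000, outside Unicode
    b"\xf8\x88\x80\x80\x80",  # 5-byte form, reaches into UCS-4 only
]
for raw in samples:
    print(raw, is_legal_utf8(raw))
```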

BTW, is ord(unicode_char) defined?  And as what?  And does ord have an
inverse in the Unicode world?  Both seem essential.
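
For illustration only, one plausible answer is a pair that round-trips
between a one-character Unicode string and its code position.  The sketch
below shows how present-day Python 3 spells it, with chr as the inverse of
ord; the proposal itself does not say, so treat this as an assumption about
the desired behaviour.

```python
euro = "\u20ac"            # one Unicode character, EURO SIGN

code = ord(euro)           # character -> integer code position
back = chr(code)           # integer code position -> character

print(hex(code))
print(back == euro)

# The pair should round-trip across the whole 16-bit range, except for the
# surrogate block, which chr() accepts but which is not a real character.
for position in (0x41, 0xE9, 0x20AC, 0xFFFD):
    assert ord(chr(position)) == position
```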

international-in-spite-of-himself-ly y'rs  - tim