[I18n-sig] Pre-PEP: Proposed Python Character Model

Paul Prescod paulp@ActiveState.com
Tue, 06 Feb 2001 06:49:09 -0800


I went to a very interesting talk about internationalization by Tim
Bray, one of the editors of the XML spec and a real expert on i18n. It
inspired me to wrestle one more time with the architectural issues in
Python that are preventing us from saying that it is a really
internationalized language. Those geek cruises aren't just about sun,
surf and sand. There's a pretty high level of intellectual give and take
also! Email me for more info...

Anyhow, we deferred many of these issues (probably
out of exhaustion) the last time we talked about it but we cannot and
should not do so forever. In particular, I do not think that we should
add more features for working with Unicode (e.g. unichr) before thinking
through the issues.

---

Abstract

    Many of the world's written languages have more than 255 characters.
    Therefore Python is out of date in its insistence that "basic strings"
    are sequences of characters with ordinals between 0 and 255. Python's
    basic character type must allow at least enough distinct ordinals to
    cover an Eastern language such as Chinese or Japanese.

Problem Description 

    Python's western bias stems from a variety of issues.

    The first problem is that Python's native character type is an 8-bit
    character. You can see that it is an 8-bit character by trying to
    create one with an ordinal higher than 255. Python should allow
    ordinals at least large enough to cover the character repertoire of
    a single Eastern language such as Chinese or Japanese. Whenever a
    Python file object is "read", it returns one of these strings of
    8-bit characters. The standard file object "read" method can never
    return a string of Chinese or Japanese characters. This is an
    unacceptable state of affairs in the 21st century.
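
    For illustration, this is roughly how the limitation shows up in
    current Python, where wide characters require the separate unichr
    builtin (the exact error message may vary between versions):

        # In current Python, chr() is limited to 8-bit ordinals:
        try:
            chr(1500)
        except ValueError:
            pass                   # "chr() arg not in range(256)"
        wide = unichr(1500)        # works, but yields a separate Unicode type
        assert type(wide) != type("")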

Goals

    1. Python should have a single string type. It should support
       Eastern characters as well as it does European characters.
       Operationally speaking:

    type("") == type(chr(150)) == type(chr(1500)) == type(file.read())

    2. It should be easier and more efficient to encode and decode
       information being sent to and retrieved from devices.

    3. It should remain possible to work with the byte-level representation. 
       This is sometimes useful for performance reasons.

Definitions

    Character Set

        A character set is a mapping from integers to characters. Note
        that both integers and characters are abstractions. In other
        words, a decision to use a particular character set does not in
        any way mandate a particular implementation or representation
        for characters.

        In Python terms, a character set can be thought of as no more
        or less than a pair of functions: ord() and chr().  ASCII, for
        instance, is a pair of functions defined only for 0 through 127
        and ISO Latin 1 is defined only for 0 through 255. Character
        sets typically also define a mapping from characters to names
        of those characters in some natural language (often English)
        and to a simple graphical representation that native language
        speakers would recognize.

        It is not possible to have a concept of "character" without having
        a character set. After all, characters must be chosen from some
        repertoire and there must be a mapping from characters to integers
        (defined by ord).
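
        To make the definition concrete, here is the pair of functions
        at work in current Python (illustrative only):

            # A character set viewed as a pair of functions: ASCII is
            # chr()/ord() restricted to 0..127, and ISO Latin 1 is the
            # same pair restricted to 0..255.
            assert ord("A") == 65
            assert chr(65) == "A"
            latin1_char = chr(200)    # legal in ISO Latin 1, not in ASCII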

    Character Encoding

        A character encoding is a mechanism for representing characters
        in terms of bits. Character encodings are only relevant when
        information is passed from Python to some system that works
        with the characters in terms of representation rather than
        abstraction. Just as a Python programmer would not care about
        the representation of a long integer, they should not care about
        the representation of a string.  Understanding the distinction
        between an abstract character and its bit level representation
        is essential to understanding this Python character model.

        A Python programmer does not need to know or care whether a long
        integer is represented as two's complement, one's complement or
        in terms of ASCII digits. Similarly, a Python programmer does
        not need to know or care how characters are represented in
        memory. We might even change the representation over time to
        achieve higher performance.


    Universal Character Set

        There is only one standardized international character set that
        allows for mixed-language information. It is called the Universal
        Character Set (UCS). It is logically defined for characters 0
        through 2^32 but in practice is deployed for characters 0 through
        2^16. The Universal Character Set is an international standard
        in the sense that it is standardized by ISO and has the force
        of law in international agreements.

        A popular subset of the Universal Character Set is called
        Unicode. The most popular subset of Unicode is called the "Unicode
        Basic Multilingual Plane (Unicode BMP)". The Unicode BMP has
        space for all of the world's major languages including Chinese,
        Korean, Japanese and Vietnamese.  There are 2^16 characters in
        the Unicode BMP.

        The Unicode BMP subset of UCS is becoming a de facto standard on
        the Web.  In any modern browser you can create an HTML or XML
        document with the character reference &#301; and get back a
        rendered version of Unicode character 301. In other words, Unicode
        is becoming the de facto character set for the Internet in addition
        to being the officially mandated character set for international
        commerce.

        In addition to defining ord() and chr(), Unicode provides a
        database of information about characters. Each character has an
        English-language name, a classification (letter, number, etc.), a
        "demonstration" glyph and so forth.

    The Unicode Controversy

        Unicode is not entirely uncontroversial. In particular there are
        Japanese speakers who dislike the way Unicode merges characters
        from various languages that were considered "the same" by the
        experts that defined the specification. Nevertheless, Unicode is
        in use as the character set for important Japanese software such
        as the two most popular word processors, Ichitaro and Microsoft 
        Word. 

        Other programming languages have also moved to use Unicode as the 
        basic character set instead of ASCII or ISO Latin 1. From memory, 
        I believe that this is the case for:

            Java 
            Perl
            JavaScript
            Visual Basic 
            TCL

        XML is also Unicode based. Note that the difference between
        all of these languages and Python is that Unicode is the
        *basic* character type. Even when you type ASCII literals, they
        are immediately converted to Unicode.
       
        It is the author's belief that this "running code" is evidence of
        Unicode's practical applicability. Arguments against it seem
        more rooted in theory than in practical problems. On the other
        hand, this belief is informed by reports from those who have done
        heavy work with Asian characters rather than by the author's own
        direct experience.

Python Character Set

    As discussed before, Python's native character set happens to consist
    of exactly 256 characters. If we increase the size of Python's
    character set, no existing code would break and there would be no
    cost in functionality.

    Given that Unicode is a standard character set and is richer
    than Python's, Python should move to that character set.
    Once Python moves to that character set it will no longer be necessary
    to have a distinction between "Unicode string" and "regular string."
    This means that Unicode literals and escape codes can also be
    merged with ordinary literals and escape codes. unichr can be merged
    with chr.
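
    Operationally, the merged model could behave like the following
    purely hypothetical session (this is proposed behaviour, not current
    Python):

        s = chr(1500)                # today this requires unichr(1500)
        assert type(s) == type("")   # a single string type, as in Goal 1
        assert len("abc" + s) == 4   # literals and wide characters mix freely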

Character Strings and Byte Arrays

    Two of the most common constructs in computer science are strings of
    characters and strings of bytes. A string of bytes can be represented
    as a string of characters with ordinals between 0 and 255. Therefore
    the only reason to have a distinction between Unicode strings and
    byte strings is for implementation simplicity and performance purposes.
    This distinction should only be made visible to the average Python
    programmer in rare circumstances.

    Advanced Python programmers will sometimes care about true "byte
    strings". They will sometimes want to build and parse information
    according to its representation instead of its abstract form. This
    should be done with byte arrays. It should be possible to read bytes
    from and write bytes to arrays. It should also be possible to use
    regular expressions on byte arrays.

Character Encodings for I/O

    Information is typically read from devices such as file systems
    and network cards one byte at a time. Unicode BMP characters
    can have values up to 2^16 (or even higher, if you include all of
    UCS). There is a fundamental disconnect there. Each character cannot
    be represented as a single byte anymore. To solve this problem,
    there are several "encodings" for large characters that describe
    how to represent them as series of bytes.

    Unfortunately, there is not one, single, dominant encoding. There are
    at least a dozen popular ones including ASCII (which supports only
    0-127), ISO Latin 1 (which supports only 0-255), others in the ISO
    "extended ASCII" family (which support different European scripts),
    UTF-8 (used heavily in C programs and on Unix), UTF-16 (preferred by
    Java and Windows), Shift-JIS (preferred in Japan) and so forth. This
    means that the only safe way to read data from a file into Python
    strings is to specify the encoding explicitly.
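
    For example, the single character LATIN SMALL LETTER E WITH ACUTE
    (ordinal 233) has quite different byte representations under
    different encodings, as can be shown today with the separate Unicode
    string type:

        e_acute = u"\u00e9"
        assert e_acute.encode("utf-8") == "\xc3\xa9"    # two bytes
        assert e_acute.encode("latin-1") == "\xe9"      # one byte
        try:
            e_acute.encode("ascii")       # not representable in ASCII
        except UnicodeError:
            pass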

    Python's current assumption is that each byte translates into a
    character of the same ordinal. This is only true for "ISO Latin 1".
    Python should require the user to specify this explicitly instead.

    Any code that does I/O should be changed to require the user to
    specify the encoding that the I/O should use. It is the opinion of
    the author that there should be no default encoding at all. If you
    want to read ASCII text, you should specify ASCII explicitly. If
    you want to read ISO Latin 1, you should specify it explicitly.

    Once data is read into Python objects the original encoding is
    irrelevant. This is similar to reading an integer from a binary file,
    an ASCII file or a packed decimal file. The original bits and bytes
    representation of the integer is disconnected from the abstract
    representation of the integer object.

Proposed I/O API

    This encoding could be chosen at various levels. In some applications
    it may make sense to specify the encoding on every read or write as
    an extra argument to the read and write methods. In most applications
    it makes more sense to attach that information to the file object as
    an attribute and have the read and write methods default the encoding
    to that attribute's value. The attribute value could be initially set
    as an extra argument to the "open" function.

    Here is some Python code demonstrating a proposed API:

        fileobj = fopen("foo", "r", "ASCII") # only accepts values < 128 
        fileobj2 = fopen("bar", "r", "ISO Latin 1")  # byte-values "as is" 
        fileobj3 = fopen("baz", "r", "UTF-8")
        fileobj2.encoding = "UTF-16" # changed my mind!  
        data = fileobj2.read(1024, "UTF-8" ) # changed my mind again

    For efficiency, it should also be possible to read raw bytes into
    a memory buffer without doing any interpretation:

        moredata = fileobj2.readbytes(1024)

    This will generate a byte array, not a character string. This
    is logically equivalent to reading the file as "ISO Latin 1"
    (which happens to map bytes to characters with the same ordinals)
    and generating a byte array by copying characters to bytes but it
    is much more efficient.

Python File Encoding

    It should be possible to create Python files in any of the common
    encodings that are backwards compatible with ASCII. This includes
    ASCII itself, all language-specific "extended ASCII" variants
    (e.g. ISO Latin 1), Shift-JIS, and UTF-8, which can actually encode
    any UCS character value.
 
    The precise variant of "super-ASCII" must be declared with a
    specialized comment that precedes all lines other than the shebang
    line, if one is present. It has a syntax like this:

    #?encoding="UTF-8"
    #?encoding="ISO-8859-1"
    ...
    #?encoding="ISO-8859-9"
    #?encoding="Shift_JIS"

    For now, this is the complete list of legal encodings. Others may
    be added in the future.
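
    For concreteness, a small hypothetical source file using the proposed
    declaration might look like this (the encoding comment is part of
    this proposal, not of current Python):

        #!/usr/bin/env python
        #?encoding="ISO-8859-1"

        # The non-ASCII bytes in the literal below are interpreted as
        # ISO Latin 1 characters because of the declaration above.
        greeting = "Voilà!"
        print greeting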

    Python files which use non-ASCII characters without defining an
    encoding should be immediately deprecated and made illegal in some
    future version of Python.

C APIs

    The only time representation matters is when data is being moved from
    Python's internal model to something outside of Python's control
    or vice versa. Reading and writing from a device is a special case
    discussed above. Sending information from Python to C code is also
    an issue.

    Python already has a rule that allows the automatic conversion
    of characters up to 255 into their C equivalents. Once the Python
    character type is expanded, characters outside of that range should
    trigger an exception (just as converting a large long integer to a
    C int triggers an exception).
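
    This rule can be illustrated at the Python level with the existing
    (separate) Unicode type and the ISO Latin 1 codec; the proposed
    C-level conversion would behave analogously:

        # A byte-per-character conversion succeeds up to ordinal 255
        # and fails above it, which is the behaviour proposed here for
        # conversions into C character arrays.
        assert u"\u00ff".encode("latin-1") == "\xff"   # ordinal 255: fine
        try:
            u"\u0100".encode("latin-1")                # ordinal 256: error
        except UnicodeError:
            pass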

    Some might claim it is inappropriate to presume that the
    character-for-byte mapping is the correct "encoding" for
    information passing from Python to C. It is best not to think of it
    as an encoding. It is merely the most straightforward mapping from
    a Python type to a C type. In addition to being straightforward,
    I claim it is the best thing for several reasons:

    * It is what Python already does with string objects (but not
    Unicode objects).

    * Once I/O is handled "properly", (see above) it should be extremely
    rare to have characters in strings above 128 that mean anything
    OTHER than character values. Binary data should go into byte arrays.

    * It preserves the length of the string so that the length C sees
    is the same as the length Python sees.

    * It does not require us to make an arbitrary choice of UTF-8 versus
    UTF-16.

    * It means that C extensions can be internationalized by switching
    from C's char type to a wchar_t and switching from the string format
    code to the Unicode format code.

    Python's built-in modules should migrate from char to wchar_t (aka
    Py_UNICODE) over time. That is, more and more functions should
    support characters greater than 255 over time.

Rough Implementation Requirements

    Combine String and Unicode Types:

        The StringType and UnicodeType objects should be aliases for
        the same object. All PyString_* and PyUnicode_* functions should 
        work with objects of this type.

    Remove Unicode String Literals

        Ordinary string literals should allow large character escape codes
        and generate Unicode string objects.

        Unicode objects should "repr" themselves as Python string objects.

        Unicode string literals should be deprecated.

    Generalize C-level Unicode conversion

        The format string "S" and the PyString_AsString functions should
        accept Unicode values and convert them to character arrays
        by converting each value to its equivalent byte-value. Values
        greater than 255 should generate an exception.

    New function: fopen

        fopen should be like Python's current open function except that
        it should allow and require an encoding parameter. It should
        be considered a replacement for open. fopen should return an 
        encoding-aware file object. open should eventually
        be deprecated.
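
        Much of this behaviour can already be sketched with the standard
        codecs module; the following is only an approximation of the
        proposed fopen (it supports neither the readbytes method nor
        changing the encoding attribute after the fact):

            import codecs

            def fopen(filename, mode, encoding):
                # Approximate the proposed fopen with the codecs module:
                # reads decode from, and writes encode to, the given
                # encoding.
                return codecs.open(filename, mode, encoding)

            fileobj = fopen("foo", "r", "ascii")
            data = fileobj.read(1024)   # a Unicode string under today's model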


    Add byte arrays

        The regular expression library should be generalized to handle
        byte arrays without converting them to Python strings. This will
        allow those who need to work with bytes to do so more efficiently.

        In general, it should be possible to use byte arrays wherever
        it is possible to use strings. Byte arrays could be thought of
        as a special kind of "limited but efficient" string. Arguably we
        could go so far as to call them "byte strings" and reuse Python's
        current string implementation. The primary differences would be
        in their "repr", "type" and literal syntax.

        In a sense we would have kept the existing distinction between
        Unicode strings and 8-bit strings but made Unicode the "default"
        and provided 8-bit strings as an efficient alternative.

Appendix: Using Non-Unicode character sets

    Let's presume that a linguistics researcher objected to the
    unification of Han characters in Unicode and wanted to invent a
    character set that included separate characters for all Chinese,
    Japanese and Korean character sets. Perhaps they also wanted to support
    some non-standard character set like Klingon. Klingon is actually
    scheduled to become part of Unicode eventually but let's presume
    it wasn't. 

    This section will demonstrate that this researcher is no worse off
    under the new system than they were under historical Python. Adopting
    Unicode as a standard has no down-side for someone in this
    situation. They have several options under the new system:

    1. Ignore Unicode

        Read in the bytes using the encoding "RAW" which would mean that
        each byte would be translated into a character between 0 and
        255. It would be a synonym for ISO Latin 1. Now you can process
        the data using exactly the same Python code that you would have
        used in Python 1.5 through Python 2.0. The only difference is
        that the in-memory representation of the data MIGHT be less
        space efficient because Unicode characters MIGHT be implemented
        internally as 16 or 32 bit integers.

        This solution is the simplest and easiest to code.
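
        The "RAW" encoding name above is hypothetical; its effect can
        already be seen with the existing ISO Latin 1 codec, which maps
        every byte to the character with the same ordinal:

            raw = "\x00\x41\xc8\xff"
            chars = unicode(raw, "latin-1")     # one character per byte
            assert len(chars) == len(raw)
            assert ord(chars[2]) == ord(raw[2]) == 0xC8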

    2. Use Byte Arrays

        As discussed earlier, a byte array is like a string where
        the characters are restricted to characters between 0 and
        255. The only virtues of byte arrays are that they enforce this
        rule and they can be implemented in a more memory-efficient
        manner. According to the proposal, it should be possible to load
        data into a byte array (or "byte string") using the "readbytes"
        method.

        This solution is the most efficient.

    3. Use Unicode's Private Use Area (PUA)

        Unicode is an extensible standard. There are certain character
        codes reserved for private use between consenting parties. You
        could map characters like Klingon or certain Korean ideographs
        into the private use area. Obviously the Unicode character
        database would not have meaningful information about these
        characters and rendering systems would not know how to render
        them. But this situation is no worse than in today's Python. There
        is no character database for arbitrary character sets and there
        is no automatic way to render them.

        One limitation of this approach is that the Private Use Area can
        only hold so many characters. The BMP PUA can hold thousands
        and if we step up to "full" Unicode support we have room for 
        hundreds of thousands.

        This solution gets the maximum benefit from Unicode for the
        characters that are defined by Unicode without losing the ability
        to refer to characters outside of Unicode.
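
        A minimal sketch of such a private-use mapping, with code point
        assignments invented purely for illustration:

            # The BMP Private Use Area runs from U+E000 through U+F8FF.
            # These Klingon assignments are made up for this example.
            KLINGON_KAHK = u"\ue000"
            KLINGON_GHAY = u"\ue001"

            word = KLINGON_KAHK + KLINGON_GHAY
            assert 0xE000 <= ord(word[0]) <= 0xF8FF   # inside the PUA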

    4. Use A Higher Level Encoding

        You could wrap Korean characters in <KOREA>...</KOREA> tags. You
        could describe a character as \KLINGON-KAHK (i.e. 13 Unicode
        characters).  You could use a special Unicode character as an
        "escape flag" to say that the next character should be interpreted
        specially.

        This solution is the most self-descriptive and extensible.

    In summary, expanding Python's character type to support Unicode
    characters does not restrict even the most esoteric, Unicode-hostile
    types of text processing. Therefore there is no basis for objecting
    to Unicode as some form of restriction. Those who need to use
    another logical character set have as much ability to do so as they
    always have.

Conclusion

    Python needs to support international characters. The "ASCII" of
    internationalized characters is Unicode. Most other languages have
    moved or are moving their basic character and string types to
    support Unicode. Python should also.