From martin@loewis.home.cs.tu-berlin.de Sun Feb 4 15:13:21 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sun, 4 Feb 2001 16:13:21 +0100 Subject: [I18n-sig] Re: [4suite] 4Suite 0.10.2 alpha 1 In-Reply-To: <3A7D15EB.970D327E@fourthought.com> (message from Uche Ogbuji on Sun, 04 Feb 2001 01:42:19 -0700) References: <3A7D15EB.970D327E@fourthought.com> Message-ID: <200102041513.f14FDLZ01273@mira.informatik.hu-berlin.de> > Please test the new internationalization: French and German translations > have been added courtesy of Alexandre and Martin. This is indeed causing problems for me. Invoking 4xslt gives: Traceback (most recent call last): File "/usr/local/bin/4xslt", line 4, in ? from xml.xslt import _4xslt File "/usr/local/lib/python2.1/site-packages/_xmlplus/xslt/__init__.py", line 16, in ? from xml import xpath File "/usr/local/lib/python2.1/site-packages/_xmlplus/xpath/__init__.py", line 41, in ? import XPathParserBase File "/usr/local/lib/python2.1/site-packages/_xmlplus/xpath/XPathParserBase.py", line 7, in ? gettext.install('4Suite', locale_dir) File "/usr/local/lib/python2.1/gettext.py", line 251, in install translation(domain, localedir).install(unicode) File "/usr/local/lib/python2.1/gettext.py", line 238, in translation raise IOError(ENOENT, 'No translation file found for domain', domain) IOError: [Errno 2] No translation file found for domain: '4Suite' The problem is two-fold: For one thing, there is no German xpath message catalog. However, it shouldn't fail if LANG is set to an unsupported language, so you should catch IOError also. I consider this a gettext bug: gettext should not fail in the absence of a catalog, but default to the "C" locale. Regards, Martin From paulp@ActiveState.com Tue Feb 6 14:49:09 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 06 Feb 2001 06:49:09 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model Message-ID: <3A800EE5.A8122B3C@ActiveState.com> I went to a very interesting talk about internationalization by Tim Bray, one of the editors of the XML spec and a real expert on i18n. It inspired me to wrestle one more time with the architectural issues in Python that are preventing us from saying that it is a really internationalized language. Those geek cruises aren't just about sun, surf and sand. There's a pretty high level of intellectual give and take also! Email me for more info... Anyhow, we deferred many of these issues (probably out of exhaustion) the last time we talked about it, but we cannot and should not do so forever. In particular, I do not think that we should add more features for working with Unicode (e.g. unichr) before thinking through the issues. --- Abstract Many of the world's written languages have more than 255 characters. Therefore Python is out of date in its insistence that "basic strings" are lists of characters with ordinals between 0 and 255. Python's basic character type must allow at least enough characters for Eastern languages. Problem Description Python's Western bias stems from a variety of issues. The first problem is that Python's native character type is an 8-bit character. You can see that it is an 8-bit character by trying to insert a value with an ordinal higher than 255. Python should allow for ordinal numbers up to at least the size of the character repertoire of a single Eastern language such as Chinese or Japanese. Whenever a Python file object is "read", it returns one of these lists of 8-bit characters. 
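A minimal sketch of that limit, against the Python 2.0/2.1 behaviour described here: the native string type tops out at ordinal 255, while the separate Unicode type does not.

print repr(chr(65))            # 'A'
try:
    chr(0x4E2D)                # a CJK ideograph -- beyond the 8-bit range
except ValueError:
    print "chr() refuses ordinals above 255"
print repr(unichr(0x4E2D))     # u'\u4e2d'
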
The standard file object "read" method can never return a list of Chinese or Japanese characters. This is an unacceptable state of affairs in the 21st century. Goals 1. Python should have a single string type. It should support Eastern characters as well as it does European characters. Operationally speaking: type("") == type(chr(150)) == type(chr(1500)) == type(file.read()) 2. It should be easier and more efficient to encode and decode information being sent to and retrieved from devices. 3. It should remain possible to work with the byte-level representation. This is sometimes useful for performance reasons. Definitions Character Set A character set is a mapping from integers to characters. Note that both integers and characters are abstractions. In other words, a decision to use a particular character set does not in any way mandate a particular implementation or representation for characters. In Python terms, a character set can be thought of as no more or less than a pair of functions: ord() and chr(). ASCII, for instance, is a pair of functions defined only for 0 through 127 and ISO Latin 1 is defined only for 0 through 255. Character sets typically also define a mapping from characters to names of those characters in some natural language (often English) and to a simple graphical representation that native language speakers would recognize. It is not possible to have a concept of "character" without having a character set. After all, characters must be chosen from some repertoire and there must be a mapping from characters to integers (defined by ord). Character Encoding A character encoding is a mechanism for representing characters in terms of bits. Character encodings are only relevant when information is passed from Python to some system that works with the characters in terms of representation rather than abstraction. Just as a Python programmer would not care about the representation of a long integer, they should not care about the representation of a string. Understanding the distinction between an abstract character and its bit-level representation is essential to understanding this Python character model. A Python programmer does not need to know or care whether a long integer is represented as two's complement, one's complement or in terms of ASCII digits. Similarly a Python programmer does not need to know or care how characters are represented in memory. We might even change the representation over time to achieve higher performance. Universal Character Set There is only one standardized international character set that allows for mixed-language information. It is called the Universal Character Set and it is logically defined for characters 0 through 2^32 but practically is deployed for characters 0 through 2^16. The Universal Character Set is an international standard in the sense that it is standardized by ISO and has the force of law in international agreements. A popular subset of the Universal Character Set is called Unicode. The most popular subset of Unicode is called the "Unicode Basic Multilingual Plane (Unicode BMP)". The Unicode BMP has space for all of the world's major languages including Chinese, Korean, Japanese and Vietnamese. There are 2^16 characters in the Unicode BMP. The Unicode BMP subset of UCS is becoming a de facto standard on the Web. In any modern browser you can create an HTML or XML document with the character reference &#301; and get back a rendered version of Unicode character 301. 
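As a quick check with the Unicode type Python already ships, that numeric character reference names the same abstract character that unichr() produces -- a small sketch:

c = unichr(301)
print repr(c)    # u'\u012d', LATIN SMALL LETTER I WITH BREVE
print ord(c)     # 301
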
In other words, Unicode is becoming the de facto character set for the Internet in addition to being the officially mandated character set for international commerce. In addition to defining ord() and chr(), Unicode provides a database of information about characters. Each character has an English-language name, a classification (letter, number, etc.), a "demonstration" glyph and so forth. The Unicode Controversy Unicode is not entirely uncontroversial. In particular there are Japanese speakers who dislike the way Unicode merges characters from various languages that were considered "the same" by the experts that defined the specification. Nevertheless Unicode is in use as the character set for important Japanese software such as the two most popular word processors, Ichitaro and Microsoft Word. Other programming languages have also moved to use Unicode as the basic character set instead of ASCII or ISO Latin 1. From memory, I believe that this is the case for: Java Perl JavaScript Visual Basic TCL XML is also Unicode based. Note that the difference between all of these languages and Python is that Unicode is the *basic* character type. Even when you type ASCII literals, they are immediately converted to Unicode. It is the author's belief that this "running code" is evidence of Unicode's practical applicability. Arguments against it seem more rooted in theory than in practical problems. On the other hand, this belief is informed by those who have done heavy work with Asian characters and not based on my own direct experience. Python Character Set As discussed before, Python's native character set happens to consist of exactly 256 characters. If we increase the size of Python's character set, no existing code would break and there would be no cost in functionality. Given that Unicode is a standard character set and it is richer than Python's, Python should move to that character set. Once Python moves to that character set it will no longer be necessary to have a distinction between "Unicode string" and "regular string." This means that Unicode literals and escape codes can also be merged with ordinary literals and escape codes. unichr can be merged with chr. Character Strings and Byte Arrays Two of the most common constructs in computer science are strings of characters and strings of bytes. A string of bytes can be represented as a string of characters between 0 and 255. Therefore the only reason to have a distinction between Unicode strings and byte strings is for implementation simplicity and performance purposes. This distinction should only be made visible to the average Python programmer in rare circumstances. Advanced Python programmers will sometimes care about true "byte strings". They will sometimes want to build and parse information according to its representation instead of its abstract form. This should be done with byte arrays. It should be possible to read bytes from and write bytes to arrays. It should also be possible to use regular expressions on byte arrays. Character Encodings for I/O Information is typically read from devices such as file systems and network cards one byte at a time. Unicode BMP characters can have values up to 2^16 (or even higher, if you include all of UCS). There is a fundamental disconnect there. Each character cannot be represented as a single byte anymore. To solve this problem, there are several "encodings" for large characters that describe how to represent them as series of bytes. Unfortunately, there is not one, single, dominant encoding. 
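A small sketch with the codecs that Python 2.0/2.1 already ships makes the disconnect concrete: one character, several byte representations, and sometimes none at all.

ch = u"\u00e9"                        # LATIN SMALL LETTER E WITH ACUTE
print repr(ch.encode("latin-1"))      # '\xe9'      -- one byte
print repr(ch.encode("utf-8"))        # '\xc3\xa9'  -- two bytes
print repr(ch.encode("utf-16-be"))    # '\x00\xe9'  -- two different bytes
try:
    ch.encode("ascii")
except UnicodeError:
    print "no ASCII representation at all"
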
There are at least a dozen popular ones including ASCII (which supports only 0-127), ISO Latin 1 (which supports only 0-255), others in the ISO "extended ASCII" family (which support different European scripts), UTF-8 (used heavily in C programs and on Unix), UTF-16 (preferred by Java and Windows), Shift-JIS (preferred in Japan) and so forth. This means that the only safe way to read data from a file into Python strings is to specify the encoding explicitly. Python's current assumption is that each byte translates into a character of the same ordinal. This is only true for "ISO Latin 1". Python should require the user to specify this explicitly instead. Any code that does I/O should be changed to require the user to specify the encoding that the I/O should use. It is the opinion of the author that there should be no default encoding at all. If you want to read ASCII text, you should specify ASCII explicitly. If you want to read ISO Latin 1, you should specify it explicitly. Once data is read into Python objects the original encoding is irrelevant. This is similar to reading an integer from a binary file, an ASCII file or a packed decimal file. The original bits and bytes representation of the integer is disconnected from the abstract representation of the integer object. Proposed I/O API This encoding could be chosen at various levels. In some applications it may make sense to specify the encoding on every read or write as an extra argument to the read and write methods. In most applications it makes more sense to attach that information to the file object as an attribute and have the read and write methods default the encoding to the property value. This attribute value could be initially set as an extra argument to the "open" function. Here is some Python code demonstrating a proposed API: fileobj = fopen("foo", "r", "ASCII") # only accepts values < 128 fileobj2 = fopen("bar", "r", "ISO Latin 1") # byte-values "as is" fileobj3 = fopen("baz", "r", "UTF-8") fileobj2.encoding = "UTF-16" # changed my mind! data = fileobj2.read(1024, "UTF-8" ) # changed my mind again For efficiency, it should also be possible to read raw bytes into a memory buffer without doing any interpretation: moredata = fileobj2.readbytes(1024) This will generate a byte array, not a character string. This is logically equivalent to reading the file as "ISO Latin 1" (which happens to map bytes to characters with the same ordinals) and generating a byte array by copying characters to bytes but it is much more efficient. Python File Encoding It should be possible to create Python files in any of the common encodings that are backwards compatible with ASCII. This includes ASCII itself, all language-specific "extended ASCII" variants (e.g. ISO Latin 1), Shift-JIS and UTF-8 which can actually encode any UCS character value. The precise variant of "super-ASCII" must be declared with a specialized comment that precedes any other lines other than the shebang line if present. It has a syntax like this: #?encoding="UTF-8" #?encoding="ISO-8859-1" ... #?encoding="ISO-8859-9" #?encoding="Shift_JIS" For now, this is the complete list of legal encodings. Others may be added in the future. Python files which use non-ASCII characters without defining an encoding should be immediately deprecated and made illegal in some future version of Python. C APIs The only time representation matters is when data is being moved from Python's internal model to something outside of Python's control or vice versa. 
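Referring back to the Proposed I/O API above: a rough, hypothetical sketch of fopen() can be built today on top of the codecs module that ships with Python 2.0/2.1 (the encoding names below are the codec registry's spellings, e.g. "latin-1" rather than "ISO Latin 1", and the wrapper itself is only an illustration, not the proposed implementation).

import codecs

def fopen(filename, mode="r", encoding=None):
    # Hypothetical helper: insist on an explicit encoding, then delegate
    # to codecs.open(), which decodes on read and encodes on write.
    if encoding is None:
        raise ValueError("an explicit encoding is required")
    return codecs.open(filename, mode, encoding)

out = fopen("baz.txt", "w", "utf-8")
out.write(u"\u4e2d\u6587")                           # two Chinese characters
out.close()
print repr(fopen("baz.txt", "r", "utf-8").read())    # u'\u4e2d\u6587'
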
Reading and writing from a device is a special case discussed above. Sending information from Python to C code is also an issue. Python already has a rule that allows the automatic conversion of characters up to 255 into their C equivalents. Once the Python character type is expanded, characters outside of that range should trigger an exception (just as converting a large long integer to a C int triggers an exception). Some might claim it is inappropriate to presume that the character-for- byte mapping is the correct "encoding" for information passing from Python to C. It is best not to think of it as an encoding. It is merely the most straightforward mapping from a Python type to a C type. In addition to being straightforward, I claim it is the best thing for several reasons: * It is what Python already does with string objects (but not Unicode objects). * Once I/O is handled "properly", (see above) it should be extremely rare to have characters in strings above 128 that mean anything OTHER than character values. Binary data should go into byte arrays. * It preserves the length of the string so that the length C sees is the same as the length Python sees. * It does not require us to make an arbitrary choice of UTF-8 versus UTF-16. * It means that C extensions can be internationalized by switching from C's char type to a wchar_t and switching from the string format code to the Unicode format code. Python's built-in modules should migrate from char to wchar_t (aka Py_UNICODE) over time. That is, more and more functions should support characters greater than 255 over time. Rough Implementation Requirements Combine String and Unicode Types: The StringType and UnicodeType objects should be aliases for the same object. All PyString_* and PyUnicode_* functions should work with objects of this type. Remove Unicode String Literals Ordinary string literals should allow large character escape codes and generate Unicode string objects. Unicode objects should "repr" themselves as Python string objects. Unicode string literals should be deprecated. Generalize C-level Unicode conversion The format string "S" and the PyString_AsString functions should accept Unicode values and convert them to character arrays by converting each value to its equivalent byte-value. Values greater than 255 should generate an exception. New function: fopen fopen should be like Python's current open function except that it should allow and require an encoding parameter. It should be considered a replacement for open. fopen should return an encoding-aware file object. open should eventually be deprecated. Add byte arrays The regular expression library should be generalized to handle byte arrays without converting them to Python strings. This will allow those who need to work with bytes to do so more efficiently. In general, it should be possible to use byte arrays where-ever it is possible to use strings. Byte arrays could be thought of as a special kind of "limited but efficient" string. Arguably we could go so far as to call them "byte strings" and reuse Python's current string implementation. The primary differences would be in their "repr", "type" and literal syntax. In a sense we would have kept the existing distinction between Unicode strings and 8-bit strings but made Unicode the "default" and provided 8-bit strings as an efficient alternative. 
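The byte-for-character narrowing described above can be expressed with today's codecs, since Latin-1 maps each character 0-255 to the byte with the same ordinal; narrow() below is a hypothetical helper standing in for the proposed "S" / PyString_AsString behaviour.

def narrow(s):
    # Characters 0-255 map to the byte with the same ordinal;
    # anything larger is an error, mirroring the proposal.
    return s.encode("latin-1")

print repr(narrow(u"caf\u00e9"))      # 'caf\xe9' -- length preserved
try:
    narrow(u"\u4e2d")
except UnicodeError:
    print "characters above 255 cannot be narrowed to single bytes"
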
Appendix: Using Non-Unicode character sets Let's presume that a linguistics researcher objected to the unification of Han characters in Unicode and wanted to invent a character set that included separate characters for all Chinese, Japanese and Korean character sets. Perhaps they also want to support some non-standard character set like Klingon. Klingon is actually scheduled to become part of Unicode eventually but let's presume it wasn't. This section will demonstrate that this researcher is no worse off under the new system than they were under historical Python. Adopting Unicode as a standard has no downside for someone in this situation. They have several options under the new system: 1. Ignore Unicode Read in the bytes using the encoding "RAW" which would mean that each byte would be translated into a character between 0 and 255. It would be a synonym for ISO Latin 1. Now you can process the data using exactly the same Python code that you would have used in Python 1.5 through Python 2.0. The only difference is that the in-memory representation of the data MIGHT be less space efficient because Unicode characters MIGHT be implemented internally as 16- or 32-bit integers. This solution is the simplest and easiest to code. 2. Use Byte Arrays As discussed earlier, a byte array is like a string where the characters are restricted to characters between 0 and 255. The only virtues of byte arrays are that they enforce this rule and they can be implemented in a more memory-efficient manner. According to the proposal, it should be possible to load data into a byte array (or "byte string") using the "readbytes" method. This solution is the most efficient. 3. Use Unicode's Private Use Area (PUA) Unicode is an extensible standard. There are certain character codes reserved for private use between consenting parties. You could map characters like Klingon or certain Korean ideographs into the private use area. Obviously the Unicode character database would not have meaningful information about these characters and rendering systems would not know how to render them. But this situation is no worse than in today's Python. There is no character database for arbitrary character sets and there is no automatic way to render them. One limitation to this approach is that the Private Use Area can only handle so many characters. The BMP PUA can hold thousands and if we step up to "full" Unicode support we have room for hundreds of thousands. This solution gets the maximum benefit from Unicode for the characters that are defined by Unicode without losing the ability to refer to characters outside of Unicode. 4. Use A Higher Level Encoding You could wrap Korean characters in ... tags. You could describe a character as \KLINGON-KAHK (i.e. 13 Unicode characters). You could use a special Unicode character as an "escape flag" to say that the next character should be interpreted specially. This solution is the most self-descriptive and extensible. In summary, expanding Python's character type to support Unicode characters does not restrict even the most esoteric, Unicode-hostile types of text processing. Therefore there is no basis for objecting to Unicode as some form of restriction. Those who need to use another logical character set have as much ability to do so as they always have. Conclusion Python needs to support international characters. The "ASCII" of internationalized characters is Unicode. Most other languages have moved or are moving their basic character and string types to support Unicode. 
Python should also. From mal@lemburg.com Tue Feb 6 15:09:46 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 06 Feb 2001 16:09:46 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> Message-ID: <3A8013BA.2FF93E8B@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > > [pre-PEP] > > > > You have a lot of good points in there (also some inaccuracies) and > > I agree that Python should move to using Unicode for text data > > and arrays for binary data. > > That's my primary goal. If we can all agree that is the goal then we can > start to design new features with that mind. I'm overjoyed to have you > on board. I'm pretty sure Fredrick agrees with the goals (probably not > every implementation detail). I'll send to i18n sig and see if I can get > buy-in from Andy Robinson et. al. Then it's just Guido. Oh, I think that everybody agrees on moving to Unicode as basic text storage container. The question is how to get there ;-) Today we are facing a problem in that strings are also used as containers for binary data and no distinction is made between the two. We also have to watch out for external interfaces which still use 8-bit character data, so there's a lot ahead. > > Some things you may be missing though is that Python already > > has support for a few features you mention, e.g. codecs.open() > > provide more or less what you have in mind with fopen() and > > the compiler can already unify Unicode and string literals using > > the -U command line option. > > The problem with unifying string literals without unifying string > *types* is that many functions probably check for and type("") not > type(u""). Well, with -U on, Python will compile "" into u"", so you can already test Unicode compatibility today... last I tried, Python didn't even start up :-( > > What you don't talk about in the PEP is that Python's stdlib isn't > > even Unicode aware yet, and whatever unification steps we take, > > this project will have to preceed it. > > I'm not convinced that is true. We should be able to figure it out > quickly though. We can use that knowledge to base future design upon. The problem with many stdlib modules is that they don't make a difference between text and binary data (and often can't, e.g. take sockets), so we'll have to figure out a way to differentiate between the two. We'll also need an easy-to-use binary data type -- as you mention in the PEP, we could take the old string implementation as basis and then perhaps turn u"" into "" and use b"" to mean what "" does now (string object). > > The problem with making the > > stdlib Unicode aware is that of deciding which parts deal with > > text data or binary data -- the code sometimes makes assumptions > > about the nature of the data and at other times it simply doesn't > > care. > > Can you give an example? If the new string type is 100% backwards > compatible in every way with the old string type then the only code that > should break is silly code that did stuff like: > > try: > something = chr( somethingelse ) > except ValueError: > print "Unicode is evil!" > > Note that I expect types.StringType == types(chr(10000)) etc. Sure, but there are interfaces which don't differentiate between text and binary data, e.g. many IO-operations don't care about what exactly they are writing or reading. 
We'd probably define a new set of text data APIs (meaning methods) to make this difference clear and visible, e.g. .writetext() and .readtext(). > > In this light I think you ought to focus Python 3k with your > > PEP. This will also enable better merging techniques due to the > > lifting of the type/class difference. > > Python3K is a beautiful dream but we have problems we need to solve > today. We could start moving to a Unicode future in baby steps right > now. Your "open" function could be moved into builtins as "fopen". > Python's "binary" open function could be deprecated under its current > name and perhaps renamed. Hmm, I'd prefer to keep things separate for a while and then switch over to new APIs once we get used to them. > The sooner we start the sooner we finish. You and /F laid some beautiful > groundwork. Now we just need to keep up the momentum. I think we can do > this without a big backwards compatibility earthquake. VB and TCL > figured out how to do it... ... and we should probably try to learn from them. They have put a considerable amount of work into getting the low-level interfacing issues straight. It would be nice if we could avoid adding more conversion magic... -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From paulp@ActiveState.com Tue Feb 6 15:54:49 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 06 Feb 2001 07:54:49 -0800 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> Message-ID: <3A801E49.F8DF70E2@ActiveState.com> "M.-A. Lemburg" wrote: > > ... > > Oh, I think that everybody agrees on moving to Unicode as > basic text storage container. The last time we went around there was an anti-Unicode faction who argued that adding Unicode support was fine but making it the default would inconvenience Japanese users. > ... > Well, with -U on, Python will compile "" into u"", so you can > already test Unicode compatibility today... last I tried, Python > didn't even start up :-( I'm going to say again that I don't see that as a test of Unicode-compatibility. It is a test of compatibility with our existing Unicode object. If we simply allowed string objects to support higher character numbers I *cannot see* how that could break existing code. > ... > We can use that knowledge to base future design upon. The problem > with many stdlib modules is that they don't make a difference > between text and binary data (and often can't, e.g. take sockets), > so we'll have to figure out a way to differentiate between the > two. We'll also need an easy-to-use binary data type -- as you > mention in the PEP, we could take the old string implementation > as basis and then perhaps turn u"" into "" and use b"" to mean > what "" does now (string object). I agree that we need all of this but I strongly disagree that there is any dependency relationship between improving the Unicode-awareness of I/O routines (sockets and files) and allowing string objects to support higher character numbers. I claim that allowing higher character numbers in strings will not break socket objects. It might simply be the case that for a while socket objects never create these higher charcters. 
Similarly, we could improve socket objects so that they have different readtext/readbinary and writetext/writebinary without unifying the string objects. There are lots of small changes we can make without breaking anything. One I would like to see right now is a unification of chr() and unichr(). We are just making life harder for ourselves by walking further and further down one path when "everyone agrees" that we are eventually going to end up on another path. > ... It would be nice if we could avoid > adding more conversion magic... We already have more "magic" in our conversions than we need. I don't think I'm proposing any new conversions. Paul Prescod From tdickenson@geminidataloggers.com Tue Feb 6 16:54:22 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Tue, 06 Feb 2001 16:54:22 +0000 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <3A800EE5.A8122B3C@ActiveState.com> References: <3A800EE5.A8122B3C@ActiveState.com> Message-ID: Its annoying (for me) that the discussion of this is happening on python-dev, rather than the i18n-sig list. should I join python-dev list too? Toby Dickenson tdickenson@geminidataloggers.com From mal@lemburg.com Tue Feb 6 17:43:05 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 06 Feb 2001 18:43:05 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> Message-ID: <3A8037A9.2E842800@lemburg.com> [Moving the follow ups to i18n-sig...] Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > > ... > > > > Oh, I think that everybody agrees on moving to Unicode as > > basic text storage container. > > The last time we went around there was an anti-Unicode faction who > argued that adding Unicode support was fine but making it the default > would inconvenience Japanese users. Unicode is the defacto international standard for unified script encodings. Discussing whether Unicode is good or bad is really beyond the scope of language design and should be dealt with in other more suitable forums, IMHO. > > ... > > Well, with -U on, Python will compile "" into u"", so you can > > already test Unicode compatibility today... last I tried, Python > > didn't even start up :-( > > I'm going to say again that I don't see that as a test of > Unicode-compatibility. It is a test of compatibility with our existing > Unicode object. If we simply allowed string objects to support higher > character numbers I *cannot see* how that could break existing code. It's a nice way of identifying problem locations in existing Python code. I don't understand your statement about allowing string objects to support "higher" ordinals... are you proposing to add a third character type ? > > ... > > We can use that knowledge to base future design upon. The problem > > with many stdlib modules is that they don't make a difference > > between text and binary data (and often can't, e.g. take sockets), > > so we'll have to figure out a way to differentiate between the > > two. We'll also need an easy-to-use binary data type -- as you > > mention in the PEP, we could take the old string implementation > > as basis and then perhaps turn u"" into "" and use b"" to mean > > what "" does now (string object). 
> > I agree that we need all of this but I strongly disagree that there is > any dependency relationship between improving the Unicode-awareness of > I/O routines (sockets and files) and allowing string objects to support > higher character numbers. I claim that allowing higher character numbers > in strings will not break socket objects. It might simply be the case > that for a while socket objects never create these higher charcters. > > Similarly, we could improve socket objects so that they have different > readtext/readbinary and writetext/writebinary without unifying the > string objects. There are lots of small changes we can make without > breaking anything. One I would like to see right now is a unification of > chr() and unichr(). This won't work: programs simply do not expect to get Unicode characters out of chr() and would break. OTOH, programs using unichr() don't expect 8bit-strings as output. Let's keep the two worlds well separated for a while and unify afterwards (this is much easier to do when everything's in place and well tested). > We are just making life harder for ourselves by walking further and > further down one path when "everyone agrees" that we are eventually > going to end up on another path. No. We are just sending off a pioneer team to try to find an alternative path. Once that path is found we can switch signs to have the mainstream use the new alternative path. > > ... It would be nice if we could avoid > > adding more conversion magic... > > We already have more "magic" in our conversions than we need. I don't > think I'm proposing any new conversions. Well, let's hope so :-) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From paulp@ActiveState.com Tue Feb 6 18:27:10 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 06 Feb 2001 10:27:10 -0800 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <3A8037A9.2E842800@lemburg.com> Message-ID: <3A8041FE.F506891F@ActiveState.com> "M.-A. Lemburg" wrote: > > ... > > Unicode is the defacto international standard for unified > script encodings. Discussing whether Unicode is good or bad is > really beyond the scope of language design and should be dealt > with in other more suitable forums, IMHO. We are in violent agreement. >... > > I don't understand your statement about allowing string objects > to support "higher" ordinals... are you proposing to add a third > character type ? Yes and no. I want to make a type with a superset of the functionality of strings and Unicode strings. > > Similarly, we could improve socket objects so that they have different > > readtext/readbinary and writetext/writebinary without unifying the > > string objects. There are lots of small changes we can make without > > breaking anything. Before we go on: do you agree that we could add fopen and readtext/readbinary on various I/O types without breaking anything? And that that we should do so? > > One I would like to see right now is a unification of > > chr() and unichr(). > > This won't work: programs simply do not expect to get Unicode > characters out of chr() and would break. 
Why would a program pass a large integer to chr() if it cannot handle the resulting wide string???? > OTOH, programs using > unichr() don't expect 8bit-strings as output. Where would an 8bit string break code that expected a Unicode string? The upward conversion is automatic and lossless! Having chr() and unichr() is like having a special function for adding integers versus longs. IMO it is madness. > Let's keep the two worlds well separated for a while and > unify afterwards (this is much easier to do when everything's > in place and well tested). No, the more we keep the worlds seperated the more code will be written that expects to deal with two separate types. We need to get people thinking in terms of strings of characters not strings of bytes and we need to do it as soon as possible. Paul Prescod From martin@loewis.home.cs.tu-berlin.de Tue Feb 6 20:49:42 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 6 Feb 2001 21:49:42 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <3A800EE5.A8122B3C@ActiveState.com> (message from Paul Prescod on Tue, 06 Feb 2001 06:49:09 -0800) References: <3A800EE5.A8122B3C@ActiveState.com> Message-ID: <200102062049.f16Kngq01092@mira.informatik.hu-berlin.de> Hi Paul, Interesting remarks. I comment only on those where I disagree. > 1. Python should have a single string type. I disagree. There should be a character string type and a byte string type, at least. I would agree that a single character string type is desirable. > type("") == type(chr(150)) == type(chr(1500)) == type(file.read()) I disagree. For the last one, much depends on what file is. If it is a byte-oriented file, reading from it should not return character strings. > 2. It should be easier and more efficient to encode and decode > information being sent to and retrieved from devices. I disagree. Easier, maybe; more efficient - I don't think Python is particular inefficient in encoding/decoding. > It is not possible to have a concept of "character" without having > a character set. After all, characters must be chosen from some > repertoire and there must be a mapping from characters to integers > (defined by ord). Sure it is possible. Different character sets (in your terminology) have common characters, which is a phenomenon that your definition cannot describe. Mathematically speaking, there is an unlimited domain CHAR (the set of all characters), and then a character set would map a subset of NAT (the set of all natural numbers, including zero) to a subset of CHAR. Then, a character is an element of CHAR. Depending on the character set, it has different associated numbers, though (or may not have an associated ordinal at all). > A character encoding is a mechanism for representing characters > in terms of bits. More generally, it is a mechanism for representing character sequences in terms of bit sequences. Otherwise, you can not cover the phenomenon that the encoding of a string is not the concatenation of the encodings of the individual characters in some encodings. Also, this term is often called "coded character set" (CCS). > A Python programmer does not need to know or care whether a long > integer is represented as twos complement, ones complement or > in terms of ASCII digits. They need to know if they want to explain the outcome of, say, hex(~1) (for that, they need the size of the internal representation at a minimum). In general, I agree. 
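Martin's point above that an encoding maps character *sequences* (not individual characters) to byte sequences can be seen with the existing utf-16 codec, which emits a byte-order mark once per stream -- a small sketch:

a  = u"a".encode("utf-16")
b  = u"b".encode("utf-16")
ab = u"ab".encode("utf-16")
print len(a), len(b), len(ab)   # 4 4 6 -- only one BOM in the combined string
print ab == a + b               # prints 0: not the concatenation of the parts
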
> Similarly a Python programmer does not need to know or care > how characters are represented in memory. We might even > change the representation over time to achieve higher > performance. Programmers need to know the character set, at a minimum. Since you were assuming that you can't have characters without character sets, I guess you've assumed that as implied. > Universal Character Set > > There is only one standardized international character set that > allows for mixed-language information. Not true. E.g. ISO 8859-5 allows both Russian and English text, ISO 8859-2 allows English, Polish, German, Slovakian, and a few others. ISO 2022 (and by reference all incorporated character sets) supports virtually all existing languages. > A popular subset of the Universal Character Set is called > Unicode. The most popular subset of Unicode is called the "Unicode > Basic Multilingual Plane (Unicode BMP)". Isn't the BMP the same as Unicode, as it is the BMP (i.e. group 0, plane 0) of ISO 10646? > Java > It is the author's belief this "running code" is evidence of > Unicode's practical applicability. At least in the case of Java, I disagree. It very much depends on the exact version of the JVM that you are using, but I had the following problems: - AWT would not find a font to display a specific character, although such a font was available. After changing JDK configuration files, AWT would not be able to display strings that mix languages. - JDK could not print a non-Latin-1 string to System.out; there was no way of telling it that it should use UTF-8 for output. (sounds familiar ?-) - While javac would accept non-ASCII letters in class names, the interpreter would refuse to load class files with "funny characters". Please note that all of these occured on the first attempt to use a certain feature which works "in theory". Since Java's Unicode support is considered as most advanced by many, I think there is still a long way to go. BTW, for dealing with GUI output, I believe that Tk's handling is most advanced. > As discussed before, Python's native character set happens to consist > of exactly 255 characters. If we increase the size of Python's > character set, no existing code would break and there would be no > cost in functionality. Sure. Code that treats character strings as if they are byte strings will break. > Once Python moves to that character set it will no longer be necessary > to have a distinction between "Unicode string" and "regular string." Right. The distinction will between "character string" and "byte string". > This means that Unicode literals and escape codes can also be > merged with ordinary literals and escape codes. unichr can be merged > with chr. Not sure. That means that there won't be byte string literals. It is particular worrying that you want to remove the way to get the numeric value of a byte in a byte string. > Two of the most common constructs in computer science are strings of > characters and strings of bytes. A string of bytes can be represented > as a string of characters between 0 and 255. Therefore the only > reason to have a distinction between Unicode strings and byte > strings is for implementation simplicity and performance purposes. > This distinction should only be made visible to the average Python > programmer in rare circumstances. Are you saying that byte strings are visible to the average programmer in rare circumstances only? Then I disagree; byte strings are extremely common, as they are what file.read returns. 
> Unfortunately, there is not one, single, dominant encoding. There are > at least a dozen popular ones including ASCII (which supports only > 0-127), ISO Latin 1 (which supports only 0-255), others in the ISO > "extended ASCII" family (which support different European scripts), > UTF-8 (used heavily in C programs and on Unix), UTF-16 (preferred by > Java and Windows), Shift-JIS (preferred in Japan) and so forth. This > means that the only safe way to read data from a file into Python > strings is to specify the encoding explicitly. Note how you are mixing character sets and encodings here. As you had defined earlier, a single character set (such as US-ASCII) can have multiply encodings (e.g. with checksum bit or without). > Python's current assumption is that each byte translates into a > character of the same ordinal. This is only true for "ISO Latin 1". I disagree. With your definition of character set, many character sets have the property that a single byte is sufficient to represent a single character (e.g. all of ISO 8859). You seem to assume that the current Python character set is Latin-1, which it is not. Instead, Python's character set is defined by the application and the operating system. > Any code that does I/O should be changed to require the user to > specify the encoding that the I/O should use. It is the opinion of > the author that there should be no default encoding at all. Not sure. IMO, the default should be to read and write byte strings. > Here is some Python code demonstrating a proposed API: > > fileobj = fopen("foo", "r", "ASCII") # only accepts values < 128 > fileobj2 = fopen("bar", "r", "ISO Latin 1") # byte-values "as is" > fileobj3 = fopen("baz", "r", "UTF-8") Sounds good. Note that the proper way to write this is fileobj = codecs.open("foo", "r", "ASCII") # etc > fileobj2.encoding = "UTF-16" # changed my mind! Why is that a requirement. In a normal stream, you cannot change the encoding in the middle - in particular not from Latin 1 single-byte to UTF-16. > For efficiency, it should also be possible to read raw bytes into > a memory buffer without doing any interpretation: > > moredata = fileobj2.readbytes(1024) Disagree. If a file is open for reading characters, reading bytes from the middle is not possible. If made possible, it won't be more efficient, as you have to keep track of the encoder's state. Instead, the right way to write this is fileobj2 = open("bar", "rb") moredata = fileobj2.read(1024) > It should be possible to create Python files in any of the common > encodings that are backwards compatible with ASCII. By "Python files", you mean source code, I assume? > #?encoding="UTF-8" > #?encoding="ISO-8859-1" The specific syntax may be debatable; I dislike semantics being put in comments. There should be first-class syntax for that. Agree on the principle approach. > Python files which use non-ASCII characters without defining an > encoding should be immediately deprecated and made illegal in some > future version of Python. Agree. > Python already has a rule that allows the automatic conversion > of characters up to 255 into their C equivalents. If it is a character (i.e. Unicode) string, it only converts 127 characters in that way. > Once the Python character type is expanded, characters outside > of that range should trigger an exception (just as converting a > large long integer to a C int triggers an exception). Agree; that is what it does today. 
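Both behaviours Martin refers to can be demonstrated under Python 2.0/2.1, where the default conversion of character strings stops at ASCII and an oversized long refuses to become an int -- a small sketch:

print str(u"abc")             # fine: every ordinal is below 128
try:
    str(u"caf\u00e9")         # the default ASCII conversion refuses 0xE9
except UnicodeError:
    print "default conversion only covers ASCII"
try:
    int(10L ** 20)
except OverflowError:
    print "long too large to convert to int"
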
> Some might claim it is inappropriate to presume that > the character-for- byte mapping is the correct "encoding" for > information passing from Python to C. Indeed, I would claim so. I could not phrase a rebuttal, though, because your understanding of the desired Python type system seems not to match mine. > Python's built-in modules should migrate from char to wchar_t (aka > Py_UNICODE) over time. That is, more and more functions should > support characters greater than 255 over time. Some certainly should. Others, which were designed for dealing with byte strings, should not. > The StringType and UnicodeType objects should be aliases for > the same object. All PyString_* and PyUnicode_* functions should > work with objects of this type. Disagree. There should be support for a byte string type. > Ordinary string literals should allow large character escape codes > and generate Unicode string objects. That is available today with the -U option. I'm -0 on disallowing byte string literals, as I don't consider them too important. > The format string "S" and the PyString_AsString functions should > accept Unicode values and convert them to character arrays > by converting each value to its equivalent byte-value. Values > greater than 255 should generate an exception. Disagree. Conversion should be automatic only up to 127; everything else gives questionable results. > fopen should be like Python's current open function except that > it should allow and require an encoding parameter. Disagree. This is codec.open. > In general, it should be possible to use byte arrays where-ever > it is possible to use strings. Byte arrays could be thought of > as a special kind of "limited but efficient" string. Arguably we > could go so far as to call them "byte strings" and reuse Python's > current string implementation. The primary differences would be > in their "repr", "type" and literal syntax. Agreed. > Appendix: Using Non-Unicode character sets > > Let's presume that a linguistics researcher objected to the > unification of Han characters in Unicode and wanted to invent a > character set that included separate characters for all Chinese, > Japanese and Korean character sets. With ISO 10646, he could easily do so in a private-use plane. Of course, implementations that only provide BMP support are somewhat handicapped here. > Python needs to support international characters. The "ASCII" of > internationalized characters is Unicode. Most other languages have > moved or are moving their basic character and string types to > support Unicode. Python should also. And indeed, Python does today. I don't see a problem *at all* with the structure of the Unicode support in Python 2.0. As initial experiences show, application *will* need to be modified to take Unicode into account; I doubt that any enhancements will change that. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Feb 6 21:04:10 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 6 Feb 2001 22:04:10 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: (message from Toby Dickenson on Tue, 06 Feb 2001 16:54:22 +0000) References: <3A800EE5.A8122B3C@ActiveState.com> Message-ID: <200102062104.f16L4AY01228@mira.informatik.hu-berlin.de> > Its annoying (for me) that the discussion of this is happening on > python-dev, rather than the i18n-sig list. i18n-sig clearly seems to be the right place; I'm equally annoyed. 
Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Feb 6 21:16:38 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 6 Feb 2001 22:16:38 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A8041FE.F506891F@ActiveState.com> (message from Paul Prescod on Tue, 06 Feb 2001 10:27:10 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <3A8037A9.2E842800@lemburg.com> <3A8041FE.F506891F@ActiveState.com> Message-ID: <200102062116.f16LGcE01306@mira.informatik.hu-berlin.de> > Before we go on: do you agree that we could add fopen and > readtext/readbinary on various I/O types without breaking anything? That's a trivial question: Simply adding the functions will likely not break anything, unless somebody else already had been using these names. > And that that we should do so? No. Your fopen is already available, and readtext/readbinary only work on a per-file basis, not on a per-read basis. > > This won't work: programs simply do not expect to get Unicode > > characters out of chr() and would break. > > Why would a program pass a large integer to chr() if it cannot handle > the resulting wide string???? It won't. What it might do is to interpret the result as a byte string, which would break depending on how exactly your new type system works. > No, the more we keep the worlds seperated the more code will be written > that expects to deal with two separate types. We need to get people > thinking in terms of strings of characters not strings of bytes and we > need to do it as soon as possible. For that, we need a patch first. Any volunteer attempting such a patch risks being ignored, thus wasting his time. E.g. I invented a Unicode-for-Python solution several years ago which was used rarely. Marc-Andre developed one which was integrated in Python 2.0; that is the one you want to tear down now. Why do yo think you will have more luck? In any case, I encourage you to try. I promise I will analyse your patch and find its weaknesses with respect to existing applications (I'm pretty sure there will be weaknesses). Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Feb 6 21:00:59 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 6 Feb 2001 22:00:59 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A801E49.F8DF70E2@ActiveState.com> (message from Paul Prescod on Tue, 06 Feb 2001 07:54:49 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> Message-ID: <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> > If we simply allowed string objects to support higher character > numbers I *cannot see* how that could break existing code. To take a specific example: What would you change about imp and py_compile.py? What is the type of imp.get_magic()? If character string, what about this fragment? import imp MAGIC = imp.get_magic() def wr_long(f, x): """Internal; write a 32-bit int to a file in little-endian order.""" f.write(chr( x & 0xff)) f.write(chr((x >> 8) & 0xff)) f.write(chr((x >> 16) & 0xff)) f.write(chr((x >> 24) & 0xff)) ... fc = open(cfile, 'wb') fc.write('\0\0\0\0') wr_long(fc, timestamp) fc.write(MAGIC) Would that continue to write the same file that the current version writes? 
> We are just making life harder for ourselves by walking further and > further down one path when "everyone agrees" that we are eventually > going to end up on another path. I think a problem of discussing on a theoretical level is that the impact of changes is not clear. You seem to claim that you want changes that have zero impact on existing programs. Can you provide a patch implementing these changes, so that others can experiment and find out whether their application would break? Regards, Martin From paulp@ActiveState.com Tue Feb 6 23:05:29 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 06 Feb 2001 15:05:29 -0800 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> Message-ID: <3A808339.7B2BD5D6@ActiveState.com> "Martin v. Loewis" wrote: > > > If we simply allowed string objects to support higher character > > numbers I *cannot see* how that could break existing code. > > To take a specific example: What would you change about imp and > py_compile.py? What is the type of imp.get_magic()? If character > string, what about this fragment? > > ... > > Would that continue to write the same file that the current version > writes? Yes. Why wouldn't it? You haven't specified an encoding for the file write so it would default to what it does today. You aren't using any large characters so there is no need for multi-byte encoding. Below is some code that may further illuminate my idea. wr_long is basically your code but it shows that chr and unichr are interchangable by allowing you to pass in "func". magic is also passed in as a string or unicode string with no ill effects. I had to define a unicode() and oldstr() function to work around a bug in the way Python does default conversions between Unicode strings and ordinary strings. It should just map equivalent ordinals as my functions do. import imp def wr_long(f, x, func, magic): """Internal; write a 32-bit int to a file in little-endian order.""" f.write(func( x & 0xff)) f.write(func((x >> 8) & 0xff)) f.write(func((x >> 16) & 0xff)) f.write(func((x >> 24) & 0xff)) f.write('\0\0\0\0') f.write(oldstr(magic)) def unicode(string): return u"".join([unichr(ord(char)) for char in string]) def oldstr(string): return "".join([chr(ord(char)) for char in string]) wr_long(open("out1.txt","wb"), 5, chr, str(imp.get_magic())) wr_long(open("out2.txt","wb"), 5, chr, str(imp.get_magic())) wr_long(open("out3.txt","wb"), 5, unichr, unicode(imp.get_magic())) wr_long(open("out4.txt","wb"), 5, unichr, str(imp.get_magic())) assert( open("out1.txt").read() == open("out2.txt").read() == open("out3.txt").read() == open("out4.txt").read()) Paul Prescod From paulp@ActiveState.com Tue Feb 6 23:07:08 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 06 Feb 2001 15:07:08 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <3A800EE5.A8122B3C@ActiveState.com> <200102062104.f16L4AY01228@mira.informatik.hu-berlin.de> Message-ID: <3A80839C.19C69C35@ActiveState.com> The sig is the right place to work out the details but I think that Guido needs to decide that unifying the string and unicode types is the right thing before we can get there (and hopefully before we spend too much energy arguing about details). "Martin v. 
Loewis" wrote: > > > Its annoying (for me) that the discussion of this is happening on > > python-dev, rather than the i18n-sig list. > > i18n-sig clearly seems to be the right place; I'm equally annoyed. > > Regards, > Martin > > _______________________________________________ > I18n-sig mailing list > I18n-sig@python.org > http://mail.python.org/mailman/listinfo/i18n-sig From paulp@ActiveState.com Tue Feb 6 23:21:38 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 06 Feb 2001 15:21:38 -0800 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> Message-ID: <3A808702.5FF36669@ActiveState.com> Let me say one more thing. Unicode and string types are *already widely interoperable*. You run into problems: a) when you try to convert a character greater than 128. In my opinion this is just a poor design decision that can be easily reversed b) some code does an explicit check for types.StringType which of course is not compatible with types.UnicodeType. This can only be fixed by merging the features of types.StringType and types.UnicodeType so that they can be the same object. This is not as trivial as the other fix in terms of lines of code that must change but conceptually it doesn't seem complicated at all. I think a lot of Unicode interoperability problems would just go away if "a" was fixed... Paul Prescod From martin@loewis.home.cs.tu-berlin.de Tue Feb 6 23:50:52 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 7 Feb 2001 00:50:52 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A808339.7B2BD5D6@ActiveState.com> (message from Paul Prescod on Tue, 06 Feb 2001 15:05:29 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808339.7B2BD5D6@ActiveState.com> Message-ID: <200102062350.f16Noqc02391@mira.informatik.hu-berlin.de> > Yes. Why wouldn't it? > > You haven't specified an encoding for the file write so it would default > to what it does today. You aren't using any large characters so there is > no need for multi-byte encoding. I'm certainly using characters > 128. In UTF-8, they would become multi-byte. I'm not certain whether this would cause a problem; you did not give all implementation details of your approach, so it is hard to say. For example, f.write would use the s# conversion (since the file was opened in binary). What exactly would that do? If your change would be to *just* widen the internal representation of characters, it would do PyString_AS_STRING/PyString_GET_SIZE, so it would return a pointer to the internal representation. As a result, writing the MAGIC would result in only two bytes of the magic being written, with intermediate \0 bytes; that would be wrong. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Feb 6 23:30:23 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Wed, 7 Feb 2001 00:30:23 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A808339.7B2BD5D6@ActiveState.com> (message from Paul Prescod on Tue, 06 Feb 2001 15:05:29 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808339.7B2BD5D6@ActiveState.com> Message-ID: <200102062330.f16NUNX02359@mira.informatik.hu-berlin.de> From martin@loewis.home.cs.tu-berlin.de Tue Feb 6 23:54:47 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 7 Feb 2001 00:54:47 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <3A80839C.19C69C35@ActiveState.com> (message from Paul Prescod on Tue, 06 Feb 2001 15:07:08 -0800) References: <3A800EE5.A8122B3C@ActiveState.com> <200102062104.f16L4AY01228@mira.informatik.hu-berlin.de> <3A80839C.19C69C35@ActiveState.com> Message-ID: <200102062354.f16NslG02393@mira.informatik.hu-berlin.de> > The sig is the right place to work out the details but I think that > Guido needs to decide that unifying the string and unicode types is > the right thing before we can get there (and hopefully before we > spend too much energy arguing about details). I think it must be exactly vice versa. An agreement "in principle" is worth nothing if it then turns out that an implementation is not feasible, or would have undesirable side effects. That is how PEPs work: you first work out all the details, get feedback from the community, and *then* can ask for BDFL pronouncement. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Wed Feb 7 00:00:11 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 7 Feb 2001 01:00:11 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A808702.5FF36669@ActiveState.com> (message from Paul Prescod on Tue, 06 Feb 2001 15:21:38 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> Message-ID: <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> > a) when you try to convert a character greater than 128. In my opinion > this is just a poor design decision that can be easily reversed Technically, you can easily convert expand it to 256; not that easily beyond. Then, people who put KOI8-R into their Python source code will complain why the strings come out incorrectly, even though they set their language to Russion, and even though it worked that way in earlier Python versions. Or, if they then tag their sources as KOI8-R, writing strings to a "plain" file will fail, as they have characters > 256 in the string. > I think a lot of Unicode interoperability problems would just go > away if "a" was fixed... No, that would be just open a new can of worms. Again, provide a specific patch, and I can tell you specific problems. 
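A minimal sketch of the failure mode described above (Python 2.0 semantics assumed; the KOI8-R byte values are only illustrative): bytes that are silently widened ordinal-for-ordinal round-trip unchanged only as long as nothing downstream re-encodes them.

    # Six KOI8-R bytes intended as Russian text, held in a plain byte string.
    koi8_bytes = '\xf0\xd2\xc9\xd7\xc5\xd4'
    # Ordinal-preserving widening, the kind of automatic conversion discussed
    # in this thread.
    widened = unicode(koi8_bytes, 'latin-1')
    # Writing back by ordinal reproduces the original bytes...
    print repr(widened.encode('latin-1'))
    # ...but any component that re-encodes the text (say, as UTF-8) silently
    # produces different bytes than the ones the programmer put in the source.
    print repr(widened.encode('utf-8'))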
Regards, Martin From paulp@ActiveState.com Wed Feb 7 00:07:47 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 06 Feb 2001 16:07:47 -0800 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808339.7B2BD5D6@ActiveState.com> <200102062350.f16Noqc02391@mira.informatik.hu-berlin.de> Message-ID: <3A8091D3.F45F666A@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > I'm certainly using characters > 128. In UTF-8, they would become > multi-byte. I'm not certain whether this would cause a problem; you > did not give all implementation details of your approach, so it is > hard to say. I think this is specified properly in the PEP but I know it is way too much learn in one day so I'm not blaming you. I'm just pointing out that it isn't as underspecified as it seems: Python already has a rule that allows the automatic conversion of characters up to 255 into their C equivalents. Once the Python character type is expanded, characters outside of that range should trigger an exception (just as converting a large long integer to a C int triggers an exception). > For example, f.write would use the s# conversion (since the file was > opened in binary). What exactly would that do? Answer above. > If your change would be to *just* widen the internal representation of > characters, it would do PyString_AS_STRING/PyString_GET_SIZE, so it > would return a pointer to the internal representation. Is it a requirement that PyString_AS_STRING return a pointer to the internal representation instead of a narrowed equivalent? Paul Prescod From paulp@ActiveState.com Wed Feb 7 00:09:28 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 06 Feb 2001 16:09:28 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <3A800EE5.A8122B3C@ActiveState.com> <200102062104.f16L4AY01228@mira.informatik.hu-berlin.de> <3A80839C.19C69C35@ActiveState.com> <200102062354.f16NslG02393@mira.informatik.hu-berlin.de> Message-ID: <3A809238.2219FA2E@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > I think it must be exactly vice versa. An agreement "in principle" is > worth nothing if it then turns out that an implementation is not > feasible, or would have undesirable side effects. That is how PEPs > work: you first work out all the details, get feedback from the > community, and *then* can ask for BDFL pronouncement. If Guido is philosophically opposed to Unicode as some people were the last time we discussed it, then I do not have time to work out details and then later find out that the project was doomed from the start because of the philosophical issue. Paul Prescod From paulp@ActiveState.com Wed Feb 7 00:21:50 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 06 Feb 2001 16:21:50 -0800 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> Message-ID: <3A80951E.DF725F03@ActiveState.com> "Martin v. Loewis" wrote: > > > a) when you try to convert a character greater than 128. 
In my opinion > > this is just a poor design decision that can be easily reversed > > Technically, you can easily convert expand it to 256; not that easily > beyond. Beyond that is like putting a long integer into a 32 bit integer slot. It's a TypeError. > Then, people who put KOI8-R into their Python source code will > complain why the strings come out incorrectly, even though they set > their language to Russion, and even though it worked that way in > earlier Python versions. I don't follow. If I have: a="abcXXXdef" XXX is a series of non-ASCII bytes. Those are mapped into Unicode characters with the same ordinals. Now you write them to a file. You presumably do not specify an encoding on the file write operation. So the characters get mapped back to bytes with the same ordinals. It all behaves as it did in Python 1.0 ... You can only introduce characters greater than 256 into strings explicitly and presumably legacy code does not do that because there was no way to do that! > > I think a lot of Unicode interoperability problems would just go > > away if "a" was fixed... > > No, that would be just open a new can of worms. > > Again, provide a specific patch, and I can tell you specific problems. It isn't the appropriate time to create such a core code patch. I'm trying to figure out our direction so that we can figure out what can be done in the short term. The only two things I can think of are merge chr/unichr (easy) and provide encoding-smart alternatives to open() and read() (also easy). The encoding-smart alternatives should also be documented as preferred replacements as soon as possible. Paul Prescod From paulp@ActiveState.com Wed Feb 7 01:12:43 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 06 Feb 2001 17:12:43 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <3A800EE5.A8122B3C@ActiveState.com> <200102062049.f16Kngq01092@mira.informatik.hu-berlin.de> Message-ID: <3A80A10B.1E978B30@ActiveState.com> "Martin v. Loewis" wrote: > > ... > I disagree. There should be a character string type and a byte string > type, at least. I would agree that a single character string type is > desirable. It depends on whether we decide to talk about "byte strings" or "byte arrays". > > type("") == type(chr(150)) == type(chr(1500)) == type(file.read()) > > I disagree. For the last one, much depends on what file is. If it is a > byte-oriented file, reading from it should not return character > strings. I don't think that there should be such a thing as a byte-oriented file...but that's a pretty small detail. I think that the result of the read() function should be consistently a character string and not different from one type of file object to another...getting a byte array/string/thing should be a seperate method. > > 2. It should be easier and more efficient to encode and decode > > information being sent to and retrieved from devices. > > I disagree. Easier, maybe; more efficient - I don't think Python is > particular inefficient in encoding/decoding. Once I have a file object, I don't know of a way to read unicode from it without reading bytes and then decoding into another string...but I may just not know that there is a more efficient way. > Sure it is possible. Different character sets (in your terminology) > have common characters, which is a phenomenon that your definition > cannot describe. 
Mathematically speaking, there is an unlimited domain > CHAR (the set of all characters), CHAR is not a useful set in a computer science sense because if items from it are addressable or comparable then there exists an ord() function. Therefore there is a character set. If the items are not addressable or comparable then how would you make use of it? We could argue about the platonic truth embedded in the word "character" but I think that's a waste of time. > More generally, it is a mechanism for representing character sequences > in terms of bit sequences. Otherwise, you can not cover the phenomenon > that the encoding of a string is not the concatenation of the > encodings of the individual characters in some encodings. > > Also, this term is often called "coded character set" (CCS). Fair enough. > > Similarly a Python programmer does not need to know or care > > how characters are represented in memory. We might even > > change the representation over time to achieve higher > > performance. > > Programmers need to know the character set, at a minimum. Since you > were assuming that you can't have characters without character sets, I > guess you've assumed that as implied. The whole point of these two sections is that programmers should care alot about the character set and not at all about its in-memory representation. > > Universal Character Set > > > > There is only one standardized international character set that > > allows for mixed-language information. > > Not true. E.g. ISO 8859-5 allows both Russian and English text, > ISO 8859-2 allows English, Polish, German, Slovakian, and a few > others. If you want to use a definition of "international" that means "European" then I guess that's fair. But you don't say you've internationalized a computer program when you've added support for the Canadian dollar along with the American one. :) > ISO 2022 (and by reference all incorporated character sets) > supports virtually all existing languages. I do not believe that ISO 2022 is really considered a character set. > > A popular subset of the Universal Character Set is called > > Unicode. The most popular subset of Unicode is called the "Unicode > > Basic Multilingual Plane (Unicode BMP)". > > Isn't the BMP the same as Unicode, as it is the BMP (i.e. group 0, > plane 0) of ISO 10646? No, Unicode has space for 16 planes: UTF-16 extra planes (to be filled by Unicode 4 and ISO-10646-2) Non-Han Supplementary Plane 1: {U-00010000..U-0001FFFF} Etruscan: {U-00010200..U-00010227} Gothic: {U-00010230..U-0001024B} Klingon: {U-000123D0..U-000123F9} Western Musical Symbols: {U-0001D103..U-0001D1D7} Han Supplementary Plane 2: {U-00020000..U-0002FFFF} Reserved Planes 3..13: {U-00030000..U-000DFFFF} Plane 14: {U-000E0000..U-000EFFFF} Language Tag Characters: {U-000E0000..U-000E007F} Private Use Planes: {U-000F0000..U-0010FFFF} > > Java > > It is the author's belief this "running code" is evidence of > > Unicode's practical applicability. > > At least in the case of Java, I disagree. It very much depends on the > exact version of the JVM that you are using, but I had the following > problems: I'm not saying that any particular Unicode-using system is perfect. I'm saying that they work. I don't think that Java would work better if it used something other than Unicode. > Sure. Code that treats character strings as if they are byte strings > will break. We've discussed this further and I think I may yet convince you otherwise... 
> > This means that Unicode literals and escape codes can also be > > merged with ordinary literals and escape codes. unichr can be merged > > with chr. > > Not sure. That means that there won't be byte string literals. It is > particular worrying that you want to remove the way to get the numeric > value of a byte in a byte string. I don't recall suggesting any such thing! chr() of a byte string should return the byte value. chr() of a unicode string should return the character value. > Are you saying that byte strings are visible to the average programmer > in rare circumstances only? Then I disagree; byte strings are > extremely common, as they are what file.read returns. Not under my proposal. file.read returns a character string. Sometimes the character string contains characters between 0 and 255 and is indistinguishable from today's string type. Sometimes the file object knows that you want the data decoded and it returns large characters. > > Unfortunately, there is not one, single, dominant encoding. There are > > at least a dozen popular ones including ASCII (which supports only > > 0-127), ISO Latin 1 (which supports only 0-255), others in the ISO > > "extended ASCII" family (which support different European scripts), > > UTF-8 (used heavily in C programs and on Unix), UTF-16 (preferred by > > Java and Windows), Shift-JIS (preferred in Japan) and so forth. This > > means that the only safe way to read data from a file into Python > > strings is to specify the encoding explicitly. > > Note how you are mixing character sets and encodings here. As you had > defined earlier, a single character set (such as US-ASCII) can have > multiply encodings (e.g. with checksum bit or without). I believe that ASCII is both a character set and an encoding. If not, what is the name for the encoding we've been using prior to Unicode? > > Any code that does I/O should be changed to require the user to > > specify the encoding that the I/O should use. It is the opinion of > > the author that there should be no default encoding at all. > > Not sure. IMO, the default should be to read and write byte strings. The default for current Python code, yes. The default going forward? We could debate that. > Sounds good. Note that the proper way to write this is We need a built-in function that everyone uses as an alternative to the byte/string-ambiguous "open". > fileobj = codecs.open("foo", "r", "ASCII") > # etc > > > fileobj2.encoding = "UTF-16" # changed my mind! > > Why is that a requirement. In a normal stream, you cannot change the > encoding in the middle - in particular not from Latin 1 single-byte to > UTF-16. What is a "normal stream?" Python must be able to handle all streams, right? I can imagine all kinds of pickle-like or structured stream file formats that switch back and forth between binary information, strings and unicode. I'd rather not require our users to handle these in multiple passes. BTW, you only know the encoding of an XML file after you've read the first line... > Disagree. If a file is open for reading characters, reading bytes from > the middle is not possible. If made possible, it won't be more efficient, > as you have to keep track of the encoder's state. Instead, the right way > to write this is > > fileobj2 = open("bar", "rb") > moredata = fileobj2.read(1024) I disagree on many levels...but I'm willing to put off this argument. > ... > > #?encoding="UTF-8" > > #?encoding="ISO-8859-1" > > The specific syntax may be debatable; I dislike semantics being put in > comments. 
There should be first-class syntax for that. Agree on the > principle approach. We need a backwards-compatible syntax... > > Python already has a rule that allows the automatic conversion > > of characters up to 255 into their C equivalents. > > If it is a character (i.e. Unicode) string, it only converts 127 > characters in that way. Yes, this is an annoying difference. But I was talking about *Python strings* not Unicode strings. > > Ordinary string literals should allow large character escape codes > > and generate Unicode string objects. > > That is available today with the -U option. I'm -0 on disallowing byte > string literals, as I don't consider them too important. I don't know what you mean by disallowing byte string literals. If I type: a="abcdef" Python is ambiguous whether this is a character string literal or a byte string literal. I'm planning on interpreting it as a character string literal. That's just a definitional thing and it doesn't break anything or remove anything. It doesn't even hurt if you use escapes to embed nulls or other control characters. Unicode character equivalents exist for all of them. > > The format string "S" and the PyString_AsString functions should > > accept Unicode values and convert them to character arrays > > by converting each value to its equivalent byte-value. Values > > greater than 255 should generate an exception. > > Disagree. Conversion should be automatic only up to 127; everything > else gives questionable results. This is a fundamental disagreement that we will have to work through. What is "questionable" about interpreting a unicode 245 as a character 245? If you wanted UTF-8 you would have asked for UTF-8!!! > > fopen should be like Python's current open function except that > > it should allow and require an encoding parameter. > > Disagree. This is codec.open. code.open will never become popular. > > Python needs to support international characters. The "ASCII" of > > internationalized characters is Unicode. Most other languages have > > moved or are moving their basic character and string types to > > support Unicode. Python should also. > > And indeed, Python does today. I don't see a problem *at all* with the > structure of the Unicode support in Python 2.0. As initial experiences > show, application *will* need to be modified to take Unicode into > account; I doubt that any enhancements will change that. Let's say you are a Chinese TCL programmer. If you know the escape code for a Kanji character you put it in a string literal just as a Westerner would do. The same Chinese Python programmer must use a special syntax of string literal and the object he creates has a different type and lots and lots of trivial, otherwise language-agnostic code crashes because it tests for type("") when it could handle large character codes without a problem. I see this as a big problem... Paul Prescod From brian@tomigaya.shibuya.tokyo.jp Wed Feb 7 06:01:06 2001 From: brian@tomigaya.shibuya.tokyo.jp (Hooper Brian) Date: Wed, 7 Feb 2001 15:01:06 +0900 (JST) Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model Message-ID: <20010207060106.20984.qmail@web102.mail.yahoo.co.jp> Hi there, this is Brian Hooper from Japan, --- Paul Prescod wrote: > If Guido is philosophically opposed to Unicode as > some people were the > last time we discussed it, then I do not have time > to work out details > and then later find out that the project was doomed > from the start > because of the philosophical issue. 
As someone who is frequently using Python with Japanese from day to day, I'd just like to offer that I think that most Japanese users are not philosophically opposed to Unicode, they would just like support for Unicode to have as little an impact as possible on older pre-Unicode-support code. One fairly extended discussion on this list concerned how to allow for a different encoding default than UTF-8, since a lot of programs here are written to handle EUC and SJIS directly as byte-string literals. The best thing, at least from the point of view of supporting old code, would be to be able to continue to have Python continue to handle SJIS and EUC (which, in spite of Unicode support in Windows, etc., are still by far the dominant encodings for information interchange in Japan) without trying to help out by converting it into characters. If my input is a blob of binary data, then having the bytes of that data automatically grouped into two- or four- bytes per character, or automatically converted into Unicode, isn't so nice if what I actually wanted was the binary data as is. What about adding an optional encoding argument to the existing open(), allowing encoding to be passed to that, and using 'raw' as the default format (what it does now)? As one example of this, Java (unless you give the compiler an -encoding flag) assumes that string literals and file input is in Unicode, but for example in web programming, where almost all the clients are using SJIS or EUC, and the designers of the web sites are also using SJIS or EUC, none of the input is in Unicode. This is also kind of a pain with JSP where pages are compiled int servlets by the server, again in the "wrong" encoding. Unicode _support_ is already here, on many fronts, but compatibility is important, because the old encodings will take a long time to go away, I think. I agree that Unicode is where we want to go - being able to do things like cleanly slice double-byte strings without having to worry about breaking the encoding would be a refreshing change from the current state of things, and it would be nice to be able to have a useful string length measure also! I do however think that some things _will_ break in the process of getting there... the question is just how much will break, and when. In this sense, adding new functions like fopen() seems like a reasonable solution to me, since it doesn't change the way already existing constructs work. Sorry that this message is kind of a ramble, but I hope it adds to the discussion. Cheers, -Brian __________________________________________________ Do You Yahoo!? インスタントメッセージを送ろう! Yahoo!メッセンジャー http://messenger.yahoo.co.jp/ From martin@loewis.home.cs.tu-berlin.de Wed Feb 7 07:25:04 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Wed, 7 Feb 2001 08:25:04 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A8091D3.F45F666A@ActiveState.com> (message from Paul Prescod on Tue, 06 Feb 2001 16:07:47 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808339.7B2BD5D6@ActiveState.com> <200102062350.f16Noqc02391@mira.informatik.hu-berlin.de> <3A8091D3.F45F666A@ActiveState.com> Message-ID: <200102070725.f177P4X00905@mira.informatik.hu-berlin.de> > Python already has a rule that allows the automatic conversion > of characters up to 255 into their C equivalents. Once the Python > character type is expanded, characters outside of that range should > trigger an exception (just as converting a large long integer to a > C int triggers an exception). > > > For example, f.write would use the s# conversion (since the file was > > opened in binary). What exactly would that do? > > Answer above. So every s and s# conversion would trigger a copying of the string. How is that implemented? Currently, every Unicode object has a reference to a string object that is produced by converting to the default character set. Would it grow another reference to a string object that is carrying the Latin-1-conversion? > Is it a requirement that PyString_AS_STRING return a pointer to the > internal representation instead of a narrowed equivalent? Certainly. Applications expect to write to the resulting memory, and expect to change the underlying string; this is valid only if one had been passing NULL to PyString_FromStringAndSize. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Wed Feb 7 07:32:53 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 7 Feb 2001 08:32:53 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A80951E.DF725F03@ActiveState.com> (message from Paul Prescod on Tue, 06 Feb 2001 16:21:50 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> Message-ID: <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> > > Then, people who put KOI8-R into their Python source code will > > complain why the strings come out incorrectly, even though they set > > their language to Russion, and even though it worked that way in > > earlier Python versions. > > I don't follow. > > If I have: > > a="abcXXXdef" > > XXX is a series of non-ASCII bytes. Those are mapped into Unicode > characters with the same ordinals. Now you write them to a file. You > presumably do not specify an encoding on the file write operation. So > the characters get mapped back to bytes with the same ordinals. It all > behaves as it did in Python 1.0 ... They don't write them to a file. Instead, they print them in the IDLE terminal, or display them in a Tk or PythonWin window. Both support arbitrary many characters, and will treat the bytes as characters originating from Latin-1 (according to their ordinals). 
Or, they pass them as attributes in a DOM method, which, on write-back, will encode every string as UTF-8 (as that is the default encoding of XML). Then the characters will get changed, when they shouldn't. > You can only introduce characters greater than 256 into strings > explicitly and presumably legacy code does not do that because there > was no way to do that! Legacy code will pass them to applications that know to operate with the full Unicode character set, e.g. by applying encodings where necessary, or selecting proper fonts (which might include applying encodings). *That* is where it will break, and the library has no way of telling whether the strings where meant as byte strings (in an unspecified character set), or as Unicode character strings. > It isn't the appropriate time to create such a core code patch. I'm > trying to figure out our direction so that we can figure out what can be > done in the short term. The only two things I can think of are merge > chr/unichr (easy) and provide encoding-smart alternatives to open() and > read() (also easy). The encoding-smart alternatives should also be > documented as preferred replacements as soon as possible. I'm not sure they are preferred. They are if you know the encoding of your data sources. If you don't, you better be safe than sorry. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Wed Feb 7 08:06:40 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 7 Feb 2001 09:06:40 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <3A80A10B.1E978B30@ActiveState.com> (message from Paul Prescod on Tue, 06 Feb 2001 17:12:43 -0800) References: <3A800EE5.A8122B3C@ActiveState.com> <200102062049.f16Kngq01092@mira.informatik.hu-berlin.de> <3A80A10B.1E978B30@ActiveState.com> Message-ID: <200102070806.f1786eg01079@mira.informatik.hu-berlin.de> > Once I have a file object, I don't know of a way to read unicode from it > without reading bytes and then decoding into another string...but I may > just not know that there is a more efficient way. Just try reader = codecs.lookup("ISO-8859-2")[2] charfile = reader(file) There could be a convenience function, but that also is a detail. > CHAR is not a useful set in a computer science sense because if items > from it are addressable or comparable then there exists an ord() > function. This domain was for definition purposes only; I would not assume that items are addressable or comparable except for equality (i.e. they are unordered). > Therefore there is a character set. If the items are not > addressable or comparable then how would you make use of it? To represent a character in a computer, you need to have a character set; I certainly agree with that. I was just pointing out that the *same* character can exist in different character sets. > > > There is only one standardized international character set that > > > allows for mixed-language information. > > > > Not true. E.g. ISO 8859-5 allows both Russian and English text, > > ISO 8859-2 allows English, Polish, German, Slovakian, and a few > > others. > > If you want to use a definition of "international" that means "European" > then I guess that's fair. But you don't say you've internationalized a > computer program when you've added support for the Canadian dollar along > with the American one. :) My definition of "international standard" is "defined by an international organization", such as ISO. So ISO 8859 certainly qualifies. 
ISO 646 (aka ASCII) is also an international standard; it even allows for "national variants", but it does not allow mixed-language information. As for ISO 8859, it also supports Arabic and Hebrew, BTW. > > Isn't the BMP the same as Unicode, as it is the BMP (i.e. group 0, > > plane 0) of ISO 10646? > > No, Unicode has space for 16 planes: > > UTF-16 extra planes (to be filled by Unicode 4 and ISO-10646-2) Ok. Good that they consider that part of Unicode now; that was not always the case. > I don't recall suggesting any such thing! chr() of a byte string should > return the byte value. chr() of a unicode string should return the > character value. chr of a byte string? How exactly do I write this down? I.e. if I have chr(42), what do I get? > Not under my proposal. file.read returns a character string. Sometimes > the character string contains characters between 0 and 255 and is > indistinguishable from today's string type. Sometimes the file object > knows that you want the data decoded and it returns large characters. I guess we have to defer this until I see whether it is feasible (which I believe it is not - it was the mistake Sun made in the early JDKs). > I believe that ASCII is both a character set and an encoding. If not, > what is the name for the encoding we've been using prior to Unicode? For ASCII, only a single encoding is common today. I think there used to be other modes of operation, but nobody cared to give them names. > > Sounds good. Note that the proper way to write this is > > We need a built-in function that everyone uses as an alternative to the > byte/string-ambiguous "open". Why is that a requirement? > > fileobj = codecs.open("foo", "r", "ASCII") > > # etc > > > > > fileobj2.encoding = "UTF-16" # changed my mind! > > > > Why is that a requirement. In a normal stream, you cannot change the > > encoding in the middle - in particular not from Latin 1 single-byte to > > UTF-16. > > What is a "normal stream?" I meant the one returned from open(). > I can imagine all kinds of pickle-like or structured stream file > formats that switch back and forth between binary information, > strings and unicode. For example? If a format supports mixing binary and text information, it needs to specify what encoding to use for the text fragments, and it needs to specify how exactly conversion is performed (in case of stateful codecs). It is certainly the application's job to get this right; only the application knows how the format is supposed to work. > BTW, you only know the encoding of an XML file after you've read the > first line... Certainly. You don't know the encoding of a MIME message until you have seen the Content-Type and Content-Transfer-Encoding fields. > > The specific syntax may be debatable; I dislike semantics being put in > > comments. There should be first-class syntax for that. Agree on the > > principle approach. > > We need a backwards-compatible syntax... Why is that? The backwards-compatible way of writing funny bytes is to use \x escapes. > This is a fundamental disagreement that we will have to work through. > What is "questionable" about interpreting a unicode 245 as a character > 245? If you wanted UTF-8 you would have asked for UTF-8!!! Likewise, if you want Latin-1 you should ask for it. Explicit is better than implicit. > > Disagree. This is codec.open. > > code.open will never become popular. Why is that? > Let's say you are a Chinese TCL programmer. 
If you know the escape code > for a Kanji character you put it in a string literal just as a Westerner > would do. If, as a programmer, I have to use escape codes to put a character into my source, I consider this quite inconvenient. Instead, I'd like to use my keyboard to put in the characters I care about, and I'd like them to be printed in the way I recognize them. > The same Chinese Python programmer must use a special syntax of string > literal and the object he creates has a different type and lots and lots > of trivial That Chinese Python programmer should use his editor of choice, and put _() around strings that are meant as text (as opposed to strings that are protocol). At the beginning of the module, he should write def _(str):return unicode(str, "BIG-5") (assuming BIG-5 is what his editor produces). Not that inconvenient, and I doubt the same thing is easier in Tcl. > otherwise language-agnostic code crashes because it tests for > type("") when it could handle large character codes without a > problem. Yes, using type("") is a problem. I'd like to see a symbolic name StringTypes = [StringType, UnicodeType] in the types module. Regards, Martin From fredrik@pythonware.com Wed Feb 7 10:00:03 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 7 Feb 2001 11:00:03 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> Message-ID: <00cf01c090ec$c4eb7220$0900a8c0@SPIFF> martin wrote: > To take a specific example: What would you change about imp and > py_compile.py? What is the type of imp.get_magic()? If character > string, what about this fragment? > > import imp > MAGIC = imp.get_magic() > > def wr_long(f, x): > """Internal; write a 32-bit int to a file in little-endian order.""" > f.write(chr( x & 0xff)) > f.write(chr((x >> 8) & 0xff)) > f.write(chr((x >> 16) & 0xff)) > f.write(chr((x >> 24) & 0xff)) > ... > fc = open(cfile, 'wb') > fc.write('\0\0\0\0') > wr_long(fc, timestamp) > fc.write(MAGIC) > > Would that continue to write the same file that the current version > writes? yes (file opened in binary mode, no encoding, no code points above 255) Cheers /F From tdickenson@geminidataloggers.com Wed Feb 7 10:35:53 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Wed, 07 Feb 2001 10:35:53 +0000 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A8041FE.F506891F@ActiveState.com> References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <3A8037A9.2E842800@lemburg.com> <3A8041FE.F506891F@ActiveState.com> Message-ID: <1c728tobr3u4impgmih5nn6mmr5i00o2gg@4ax.com> On Tue, 06 Feb 2001 10:27:10 -0800, Paul Prescod wrote: >"M.-A. Lemburg" wrote: >>=20 >> ... >>=20 >> Unicode is the defacto international standard for unified >> script encodings. Discussing whether Unicode is good or bad is >> really beyond the scope of language design and should be dealt >> with in other more suitable forums, IMHO. > >We are in violent agreement. > >>... >>=20 >> I don't understand your statement about allowing string objects >> to support "higher" ordinals... are you proposing to add a third >> character type ? > >Yes and no. 
I want to make a type with a superset of the functionality >of strings and Unicode strings. > >> > Similarly, we could improve socket objects so that they have = different >> > readtext/readbinary and writetext/writebinary without unifying the >> > string objects. There are lots of small changes we can make without >> > breaking anything.=20 > >Before we go on: do you agree that we could add fopen and >readtext/readbinary on various I/O types without breaking anything? >And >that that we should do so? I dislike the idea of burdening the file object interface with separate functions for binary and text IO, and a way of changing the encoding. There are many other types/classes that support the file interface, and I think it is desirable to support text IO on all of them. The wrapper approach from the codecs module seems better, since it can be used to convert any byte file into a text file. Also consider a hypothetical new storage device that stores unicode natively: how should it implement readbytes? (however, an implicit 'import codecs.open as fopen' may make sense) >> > One I would like to see right now is a unification of >> > chr() and unichr(). >>=20 >> This won't work: programs simply do not expect to get Unicode >> characters out of chr() and would break.=20 > >Why would a program pass a large integer to chr() if it cannot handle >the resulting wide string???? > >> OTOH, programs using >> unichr() don't expect 8bit-strings as output. We can unify these two only if we change the default encoding from ASCII to latin1, otherwise: Python 2.0 (#6, Oct 6 2000, 15:49:48) [MSC 32 bit (Intel)] on win32 Type "copyright", "credits" or "license" for more information. >>> >>> u'\310'+unichr(200) u'\310\310' >>> u'\310'+chr(200) Traceback (most recent call last): File "", line 1, in ? UnicodeError: ASCII decoding error: ordinal not in range(128) The counter-argument from last time around was that this will do the wrong thing for anyone mixing unicode objects with plain strings containing non-latin1 content. This argument goes away once there is only one type used for storing text. Toby Dickenson tdickenson@geminidataloggers.com From tdickenson@geminidataloggers.com Wed Feb 7 11:03:18 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Wed, 07 Feb 2001 11:03:18 +0000 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <200102062049.f16Kngq01092@mira.informatik.hu-berlin.de> References: <3A800EE5.A8122B3C@ActiveState.com> <200102062049.f16Kngq01092@mira.informatik.hu-berlin.de> Message-ID: On Tue, 6 Feb 2001 21:49:42 +0100, "Martin v. Loewis" wrote: >Hi Paul, > >Interesting remarks. I comment only on those where I disagree. > >> 1. Python should have a single string type.=20 > >I disagree. There should be a character string type and a byte string >type, at least. I would agree that a single character string type is >desirable. There is already a large body of code that mixes text and binary data in the same type. If we have separate text/binary types, then we need to plan a transition period to allow code to distinguish between the two uses. >> Two of the most common constructs in computer science are strings = of >> characters and strings of bytes. A string of bytes can be = represented >> as a string of characters between 0 and 255. Therefore the only >> reason to have a distinction between Unicode strings and byte >> strings is for implementation simplicity and performance purposes. 
>> This distinction should only be made visible to the average Python >> programmer in rare circumstances. I disagree. Many programmers will be satisfied when they read a byte string from a text file, print it, and see "Hello World". Much better that we distinguish the two types, so that it looks like binary data when printed. Toby Dickenson tdickenson@geminidataloggers.com From mal@lemburg.com Wed Feb 7 11:47:53 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 07 Feb 2001 12:47:53 +0100 Subject: [I18n-sig] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <3A8037A9.2E842800@lemburg.com> <3A8041FE.F506891F@ActiveState.com> Message-ID: <3A8135E9.E360A267@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > >... > > > > I don't understand your statement about allowing string objects > > to support "higher" ordinals... are you proposing to add a third > > character type ? > > Yes and no. I want to make a type with a superset of the functionality > of strings and Unicode strings. Hmm and I was under the impression that we try to replace strings with Unicode and then perhaps reuse the 8-bit string implementation for binary data. > > > Similarly, we could improve socket objects so that they have different > > > readtext/readbinary and writetext/writebinary without unifying the > > > string objects. There are lots of small changes we can make without > > > breaking anything. > > Before we go on: do you agree that we could add fopen and > readtext/readbinary on various I/O types without breaking anything? And > that that we should do so? Sure. We can always add new things, then deprecate the old stuff and slowly move to the new methods as standard. E.g. adding .readtext() and .writetext() would be a good start in that direction since those names make it clear that the code will deal with text rather than binary data. > > > One I would like to see right now is a unification of > > > chr() and unichr(). > > > > This won't work: programs simply do not expect to get Unicode > > characters out of chr() and would break. > > Why would a program pass a large integer to chr() if it cannot handle > the resulting wide string???? As result of an error. Ok, some other part in the program will then probably break, but this hides the original error location. > > OTOH, programs using > > unichr() don't expect 8bit-strings as output. > > Where would an 8bit string break code that expected a Unicode string? > The upward conversion is automatic and lossless! But why would you want to do upward conversion on single characters ? That would only cost performance. > Having chr() and unichr() is like having a special function for adding > integers versus longs. IMO it is madness. No. chr() is a constructor for a single 8-bit character, unichr() is the corresponding constructor for a single Unicode character. This is much like the difference between int() and long(). > > Let's keep the two worlds well separated for a while and > > unify afterwards (this is much easier to do when everything's > > in place and well tested). > > No, the more we keep the worlds seperated the more code will be written > that expects to deal with two separate types. We need to get people > thinking in terms of strings of characters not strings of bytes and we > need to do it as soon as possible. 
Ok, then let me put it this way: let's first make people aware that there is an important difference between text data and binary data. Once this is being accepted, we can move on to thinking about making Unicode the standard for text data. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Wed Feb 7 12:58:32 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 07 Feb 2001 13:58:32 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <20010207060106.20984.qmail@web102.mail.yahoo.co.jp> Message-ID: <3A814678.2F245D14@lemburg.com> Hooper Brian wrote: > ... > What about adding an > optional encoding argument to the existing open(), > allowing encoding to be passed to that, and using 'raw' as > the default format (what it does now)? This is what codecs.open() already provides. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From uche.ogbuji@fourthought.com Wed Feb 7 19:21:28 2001 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Wed, 07 Feb 2001 12:21:28 -0700 Subject: [I18n-sig] Re: [4suite] 4Suite 0.10.2 alpha 1 In-Reply-To: Message from "Martin v. Loewis" of "Sun, 04 Feb 2001 16:13:21 +0100." <200102041513.f14FDLZ01273@mira.informatik.hu-berlin.de> Message-ID: <200102071921.MAA07019@localhost.localdomain> > > Please test the new internationalization: French and German translations > > hve been added courtesy Alexandre and Martin. > > This is indeed causing problems for me. Invoking 4xslt gives [snip] O'oer. I'm glad I happened to read i18n-sig before releaseing 0.10.2. My procmail recipes were lame and dumped all three copies of your message here. > The problem is two-fold: For one thing, there is no German xpath > message catalog. However, it shouldn't fail if LANG is set to an > unsupported language, so you should catch IOError also. OK. By the way, did you have any comments on the update procedure I suggested to you and Alexandre? I'd like to get the German Translations of XPath (and ODS, etc.) in before release if possible. Meanwhile, I'll add the IOError to the exceptions list. Thanks. -- Uche Ogbuji Principal Consultant uche.ogbuji@fourthought.com +1 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From paulp@ActiveState.com Wed Feb 7 19:44:05 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 11:44:05 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <20010207060106.20984.qmail@web102.mail.yahoo.co.jp> <3A814678.2F245D14@lemburg.com> Message-ID: <3A81A585.F0771269@ActiveState.com> "M.-A. Lemburg" wrote: > > Hooper Brian wrote: > > ... > > What about adding an > > optional encoding argument to the existing open(), > > allowing encoding to be passed to that, and using 'raw' as > > the default format (what it does now)? > > This is what codecs.open() already provides. There is a reason that Brian and I independently invented the same idea. It's because Joe Programmer without a degree in rocket science is going to expect it to work that way. 
Joe Programmer does not know what a codec is, will not consider importing the codecs module and will have no idea what to do with the object once they've got there hands on it. It's a million times easier to tell a programmer: "If you expect to read ASCII data add a third argument with the string 'ASCII', if you know about encodings choose another one. If you know what raw binary data is, and want to read it, here's another function." One important part of Python philosophy is making it easy to do the right thing and a little bit more work to do the wrong thing. Right now we have the exact opposite situation. We make it incredibly convenient for programmers to read data that they may consider strings or may consider binary data into the same string type and then we complain: "Oh geez, we can't do anything intelligent with strings because we don't know whether the user intended them to be really strings or binary data." Paul Prescod From paulp@ActiveState.com Wed Feb 7 19:51:51 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 11:51:51 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <20010207060106.20984.qmail@web102.mail.yahoo.co.jp> Message-ID: <3A81A757.78B3F527@ActiveState.com> Hooper Brian wrote: > > ... > > As someone who is frequently using Python with Japanese > from day to day, I'd just like to offer that I think that > most Japanese users are not philosophically opposed to > Unicode, they would just like support for Unicode to have > as little an impact as possible on older > pre-Unicode-support code. One fairly extended discussion > on this list concerned how to allow for a different > encoding default than UTF-8, since a lot of programs here > are written to handle EUC and SJIS directly as byte-string > literals. In my opinion there should be *no* encoding default. New code should always specify an encoding. Old code should continue to work the same. > ... What about adding an > optional encoding argument to the existing open(), > allowing encoding to be passed to that, and using 'raw' as > the default format (what it does now)? I'm not content to have a "default" in the long term. Users should just choose their encodings. Why would your Japanese user prefer to work with the raw bytes of their Shift-JIS instead of having it decoded into Unicode characters? Requiring Asians hacking bytes instead of characters is what we are trying to avoid! Shift-JIS and Unicode are not at odds. Shift-JIS is a great *encoding* for Unicode (the abstract character set). Shift-JIS is what should be on the disk. Unicode is what you should be working with in memory. Of course there will always be some corner cases where this is not the case but that should be the general model... Paul Prescod From paulp@ActiveState.com Wed Feb 7 19:59:35 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 11:59:35 -0800 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808339.7B2BD5D6@ActiveState.com> <200102062350.f16Noqc02391@mira.informatik.hu-berlin.de> <3A8091D3.F45F666A@ActiveState.com> <200102070725.f177P4X00905@mira.informatik.hu-berlin.de> Message-ID: <3A81A927.FAE4303D@ActiveState.com> "Martin v. Loewis" wrote: > > ... 
> > So every s and s# conversion would trigger a copying of the > string. How is that implemented? Currently, every Unicode object has a > reference to a string object that is produced by converting to the > default character set. Would it grow another reference to a string > object that is carrying the Latin-1-conversion? I'm not clear on the status of the concept of "default charater set." First, I think you mean "default character encoding". Second, I thought that that idea was removed from user-view at least, wasn't it? I was thinking that we would use that slot to hold the char->ord->char conversion (which you can interpret as Latin-1 or not depending on your philosophy). > Certainly. Applications expect to write to the resulting memory, and > expect to change the underlying string; this is valid only if one had > been passing NULL to PyString_FromStringAndSize. The documentation says that the PyString_AsString and PyString_AS_STRING buffers must never be modified. I forgot that the "real" protocol is that that buffer can be modified. We'll need to copy its contents back to the Unicode string before the next operation that uses the Unicode value. Not rocket science but somewhat tedious. Paul Prescod From paulp@ActiveState.com Wed Feb 7 20:13:48 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 12:13:48 -0800 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> Message-ID: <3A81AC7C.3FFE73E5@ActiveState.com> "Martin v. Loewis" wrote: > > > ... > > XXX is a series of non-ASCII bytes. Those are mapped into Unicode > > characters with the same ordinals. Now you write them to a file. You > > presumably do not specify an encoding on the file write operation. So > > the characters get mapped back to bytes with the same ordinals. It all > > behaves as it did in Python 1.0 ... > > They don't write them to a file. Instead, they print them in the IDLE > terminal, or display them in a Tk or PythonWin window. Both support > arbitrary many characters, and will treat the bytes as characters > originating from Latin-1 (according to their ordinals). I'm lost here. Let's say I'm using Python 1.5. I have some KOI8-R data in a string literal. PythonWin and Tk expect Unicode. How could they display the characters correctly? > Or, they pass them as attributes in a DOM method, which, on > write-back, will encode every string as UTF-8 (as that is the default > encoding of XML). Then the characters will get changed, when they > shouldn't. What do you think *should* happen? These are the only choices I can think of: 1. DOM encodes it as UTF-8 2. DOM blindly passes it through and creates illegal XML 3. (correct) User explicitly decodes data into Unicode charset. 3) is unchanged today and under my proposal. You've got some bytes. Python doesn't know what you mean. The only way to let it know what you mean is to decode it. >... > Legacy code will pass them to applications that know to operate with > the full Unicode character set, e.g. by applying encodings where > necessary, or selecting proper fonts (which might include applying > encodings). 
*That* is where it will break, and the library has no way > of telling whether the strings where meant as byte strings (in an > unspecified character set), or as Unicode character strings. The only sane thing to do when you don't know is to pass the characters as-is, char->ord->char. > > It isn't the appropriate time to create such a core code patch. I'm > > trying to figure out our direction so that we can figure out what can be > > done in the short term. The only two things I can think of are merge > > chr/unichr (easy) and provide encoding-smart alternatives to open() and > > read() (also easy). The encoding-smart alternatives should also be > > documented as preferred replacements as soon as possible. > > I'm not sure they are preferred. They are if you know the encoding of > your data sources. If you don't, you better be safe than sorry. If you don't know the encoding of your data sources then you should say that explicitly in code rather than using the same functions as people who *do* know what their encoding is. Explicit is better than implicit, right? Our current default is totally implicit. Paul Prescod From paulp@ActiveState.com Wed Feb 7 20:35:51 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 12:35:51 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <3A800EE5.A8122B3C@ActiveState.com> <200102062049.f16Kngq01092@mira.informatik.hu-berlin.de> <3A80A10B.1E978B30@ActiveState.com> <200102070806.f1786eg01079@mira.informatik.hu-berlin.de> Message-ID: <3A81B1A7.4E1D022C@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > Just try > > reader = codecs.lookup("ISO-8859-2")[2] > charfile = reader(file) > > There could be a convenience function, but that also is a detail. Usability is not a detail in this particular case. We are trying to change people's behavior and help them make more robust code. >... > My definition of "international standard" is "defined by an > international organization", such as ISO. So ISO 8859 certainly > qualifies. ISO 646 (aka ASCII) is also an international standard; it > even allows for "national variants", but it does not allow > mixed-language information. As for ISO 8859, it also supports Arabic > and Hebrew, BTW. That's fine. I'll change the document to be more explicit. Would you agree that: "Unicode is the only *character set* that supports *all of the world's major written languages.*" > ... > > I don't recall suggesting any such thing! chr() of a byte string should > > return the byte value. chr() of a unicode string should return the > > character value. > > chr of a byte string? How exactly do I write this down? I.e. if I have > chr(42), what do I get? Sorry, I meant ord. ord of a byte string (or byte array) should return the byte value. Ord of a character string should return the character value. > > Not under my proposal. file.read returns a character string. Sometimes > > the character string contains characters between 0 and 255 and is > > indistinguishable from today's string type. Sometimes the file object > > knows that you want the data decoded and it returns large characters. > > I guess we have to defer this until I see whether it is feasible > (which I believe it is not - it was the mistake Sun made in the early > JDKs). What was the mistake? > > I can imagine all kinds of pickle-like or structured stream file > > formats that switch back and forth between binary information, > > strings and unicode. > > For example? 
If a format supports mixing binary and text information, > it needs to specify what encoding to use for the text fragments, and > it needs to specify how exactly conversion is performed (in case of > stateful codecs). It is certainly the application's job to get this > right; only the application knows how the format is supposed to work. You and I agree that streams can change encoding mid-stream. You probably think that should be handled by passing the stream to various codecs as you read (or by doing double-buffer reads). I think that it should be possible right in the read method. But I don't care enough to argue about it. > > > The specific syntax may be debatable; I dislike semantics being put in > > > comments. There should be first-class syntax for that. Agree on the > > > principle approach. > > > > We need a backwards-compatible syntax... > > Why is that? The backwards-compatible way of writing funny bytes is to use \x escapes. Maybe we don't need a backards-compatible syntax after all. I haven't thought through all of those issues. > > This is a fundamental disagreement that we will have to work through. > > What is "questionable" about interpreting a unicode 245 as a character > > 245? If you wanted UTF-8 you would have asked for UTF-8!!! > > Likewise, if you want Latin-1 you should ask for it. Explicit is > better than implicit. It's funny how we switch back and forth. If I say that Python reads byte 245 into character 245 and thus uses Latin 1 as its default encoding I'm told I'm wrong. Python has no native encoding. If I claim that in passing data to C we should treat character 245 as the C "char" with the value 245 you tell me that I'm proposing Latin 1 as the default encoding. Python has a concept of character that extends from 0 to 255. C has a concept of character that extends from 0 to 255. There is no issue of "encoding" as long as you stay within those ranges. This is *exactly* like the int/long int situation. Once you get out of these ranges you switch the type in C to wchar_t and you are off to the races. If you can't change the C code then that means you work around it from the Python side -- you UTF-8 encode it before passing it to the C code. > ... > That Chinese Python programmer should use his editor of choice, and > put _() around strings that are meant as text (as opposed to strings > that are protocol). I don't know what you mean by "protocol" here. But nevertheless, you are saying that the Chinese programmer must do more than the English programmer does and I consider that a problem. > Yes, using type("") is a problem. I'd like to see a symbolic name > > StringTypes = [StringType, UnicodeType] > > in the types module. That doesn't help to reform the mass of code out there. Paul Prescod From paulp@ActiveState.com Wed Feb 7 20:38:35 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 12:38:35 -0800 Subject: [I18n-sig] Re: [Python-Dev] unichr References: Message-ID: <3A81B24B.6AE348A9@ActiveState.com> Ka-Ping Yee wrote: > > ... > > At the moment, since the default encoding is ASCII, something like > > u"abc" + chr(200) > > would cause an exception because 200 is outside of the ASCII range. Yes, this is another mistake in Python's current handling of strings. there is absolutely nothing special about the 128-255 range of characters. We shouldn't start throwing exceptions until we get to 256. 
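(A quick illustration of the behaviour under discussion, assuming a stock Python 2.0/2.1 interpreter with the ASCII default left in place; the explicit decode is the workaround available today:

>>> u"abc" + chr(200)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII decoding error: ordinal not in range(128)
>>> u"abc" + unicode(chr(200), "latin-1")
u'abc\xc8'

Under the change argued for here, the first expression would simply return the second result.)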
Paul prescod From paulp@ActiveState.com Wed Feb 7 22:53:53 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 14:53:53 -0800 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <3A8037A9.2E842800@lemburg.com> <3A8041FE.F506891F@ActiveState.com> <1c728tobr3u4impgmih5nn6mmr5i00o2gg@4ax.com> Message-ID: <3A81D201.DC88CDC0@ActiveState.com> Toby Dickenson wrote: > > I dislike the idea of burdening the file object interface with > separate functions for binary and text IO, and a way of changing the > encoding. There are many other types/classes that support the file > interface, and I think it is desirable to support text IO on all of > them. It is not burdensome to change each of them over. It's probably about 10 lines of code each. > The wrapper approach from the codecs module seems better, since it can > be used to convert any byte file into a text file. The wrapper approach is not user friendly and users will not make use of it unless they are already i18n experts. My goal is to nudge people toward thinking about i18n. > Also consider a hypothetical new storage device that stores unicode > natively: how should it implement readbytes? It could simply choose not to. > We can unify these two only if we change the default encoding from > ASCII to latin1, otherwise: I prefer not to think of it as a "default encoding of Latin1" and more as "doing the obvious thing." C has a character 245. Python has a character 245. Only someone who knows too much would expect anything other than an obvious mapping. > The counter-argument from last time around was that this will do the > wrong thing for anyone mixing unicode objects with plain strings > containing non-latin1 content. This argument goes away once there is > only one type used for storing text. That's where I'm trying to get to but I'm trying to minimize the amount of cruft added to the language between here and there. Paul Prescod From andy@reportlab.com Wed Feb 7 23:06:12 2001 From: andy@reportlab.com (Andy Robinson) Date: Wed, 7 Feb 2001 23:06:12 -0000 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A801E49.F8DF70E2@ActiveState.com> Message-ID: > The last time we went around there was an anti-Unicode faction who > argued that adding Unicode support was fine but making it > the default would inconvenience Japanese users. Whoops, I nearly missed the biggest debate of the year! I guess the faction was Brian and I, and our concerns were misunderstood. We can lay this to rest forever now as the current implementation and forward direction incorporate everything I originally hoped for: (1) Frequently you need to work with byte arrays, but need a rich bunch of string-like routines - search and replace, regex etc. This applies both to non-natural-language data and also to the special case of corrupt native encodings that need repair. We loosely defined the 'string interface' in UserString, so that other people could define string-like types if they wished and so that users can expect to find certain methods and operations in both Unicode and Byte Array types. I'd be really happy one day to explicitly type x= ByteArray('some raw data') as long as I had my old friends split, join, find etc. (2) Japanese projects often need small extensions to codecs to deal with user-defined characters. 
Java and VB give you some canned codecs but no way to extend them. All the Python asian codec drafts involve 'open' code you can hack and use simple dictionaries for mapping tables; so it will be really easy to roll your own "Shift-JIS-plus" with 20 extra characters mapping to a private use area. This will be a huge win over other languages. (3) The Unicode conversion was based on a more general notion of 'stream conversion filters' which work with bytes. This leaves the door open to writing, for example, a direct Shift-JIS-to-EUC filter which adds nothing in the case of clean data but is much more robust in the case of user-defined characters or which can handle cleanup of misencoded data. We could also write image manipulation or crypto codecs. Some of us hope to provide general machinery for fast handling of byte-stream-filters which could be useful in image processing and crypto as well as encodings. This might need an extended or different lookup function (after all, neither end of the filter need be Unicode) but could be cleanly layered on top of the codec mechanism we have built in. (4) I agree 100% on being explicit whenever you do I/O or conversion and on generally using Unicode characters where possible. Defaults are evil. But we needed a compatibility route to get there. Guido has said that long term there will be Unicode strings and Byte Arrays. That's the time to require arguments to open(). > Similarly, we could improve socket objects so that they > have different > readtext/readbinary and writetext/writebinary without unifying the > string objects. There are lots of small changes we can make without > breaking anything. One I would like to see right now is a > unification of > chr() and unichr(). Here's a thought. How about BinaryFile/BinarySocket/ByteArray which do not need an encoding, and File/Socket/String which require explicit encodings on opeening. We keep broad parity between their methods. That seems more straightforward to me than having text/binary methods, and also provides a cleaner upgrade path for existing code. - Andy From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 00:22:50 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 01:22:50 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <3A81A757.78B3F527@ActiveState.com> (message from Paul Prescod on Wed, 07 Feb 2001 11:51:51 -0800) References: <20010207060106.20984.qmail@web102.mail.yahoo.co.jp> <3A81A757.78B3F527@ActiveState.com> Message-ID: <200102080022.f180Mo101584@mira.informatik.hu-berlin.de> > In my opinion there should be *no* encoding default. New code should > always specify an encoding. Old code should continue to work the same. However, matter-of-factually, you propose that ISO-8859-1 is the default encoding, as this is the encoding that is used when converting character strings to char* in the C API. I'd certainly call it a default. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 00:16:34 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 01:16:34 +0100 Subject: [I18n-sig] Re: [4suite] 4Suite 0.10.2 alpha 1 In-Reply-To: <200102071921.MAA07019@localhost.localdomain> (message from Uche Ogbuji on Wed, 07 Feb 2001 12:21:28 -0700) References: <200102071921.MAA07019@localhost.localdomain> Message-ID: <200102080016.f180GYD01555@mira.informatik.hu-berlin.de> > OK. By the way, did you have any comments on the update procedure I > suggested to you and Alexandre? 
I'd like to get the German > Translations of XPath (and ODS, etc.) in before release if possible. I don't know what the proposal exactly was (*). Here's how updates are typically done in the Linux Internationalization Project: - each version of the .pot (**) file has a unique identification (e.g. 0.10.1a). - each translator indicates which version of the pot file his translation corresponds to. - once the message catalog changes, the *full* .pot is distributed to translators. - each translator uses GNU msgmerge to carry-over old translations into the new catalog, and then updates the catalog (again indicating which version this is a translation of) Using both unique identifications and msgmerge allows for quite automatic processing, while at the same time giving good consistency checks and flexible analysis of the changes. Automation goes as far that the Robot produces the merged catalogs, but that is not a requirement for me. Regards, Martin (*) I think you suggested to send diffs; that would be troublesome. (**) What you call en_US.po really is the .pot file, as it is the output of the extractor. It would become a .po file if a translator checked it for proper application of US-English spelling etc. From martin@loewis.home.cs.tu-berlin.de Wed Feb 7 23:59:37 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 00:59:37 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: (message from Toby Dickenson on Wed, 07 Feb 2001 11:03:18 +0000) References: <3A800EE5.A8122B3C@ActiveState.com> <200102062049.f16Kngq01092@mira.informatik.hu-berlin.de> Message-ID: <200102072359.f17NxbL01137@mira.informatik.hu-berlin.de> > >> 1. Python should have a single string type. > > > >I disagree. There should be a character string type and a byte string > >type, at least. I would agree that a single character string type is > >desirable. > > There is already a large body of code that mixes text and binary data > in the same type. If we have separate text/binary types, then we need > to plan a transition period to allow code to distinguish between the > two uses. I think the current Unicode implementation has this property: Unicode is the type for representing character strings; the string type the one for representing byte strings. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 00:27:29 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 01:27:29 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A81A927.FAE4303D@ActiveState.com> (message from Paul Prescod on Wed, 07 Feb 2001 11:59:35 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808339.7B2BD5D6@ActiveState.com> <200102062350.f16Noqc02391@mira.informatik.hu-berlin.de> <3A8091D3.F45F666A@ActiveState.com> <200102070725.f177P4X00905@mira.informatik.hu-berlin.de> <3A81A927.FAE4303D@ActiveState.com> Message-ID: <200102080027.f180RTl01586@mira.informatik.hu-berlin.de> > I'm not clear on the status of the concept of "default charater set." > First, I think you mean "default character encoding". Both encoding and character set, yes. I disagree with the notion that any encoding is a Unicode encoding, since not all encodings can represent all of Unicode; nor where they originally designed to encode Unicode. 
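A small example of that distinction, as a hypothetical interpreter session (Python 2.0/2.1 era): ISO-8859-2 can represent U+013D but ASCII cannot, so neither can serve as a general-purpose encoding of Unicode:

>>> u"\u013d".encode("iso-8859-2")
'\xa5'
>>> u"\u013d".encode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)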
> Second, I thought that that idea was removed from user-view at > least, wasn't it? Yes, unless you modify sitecustomize.py. > I was thinking that we would use that slot to hold the > char->ord->char conversion (which you can interpret as Latin-1 or > not depending on your philosophy). I would interpret it that way. What do you do about t# conversions, then? > The documentation says that the PyString_AsString and PyString_AS_STRING > buffers must never be modified. I forgot that the "real" protocol is > that that buffer can be modified. We'll need to copy its contents back > to the Unicode string before the next operation that uses the Unicode > value. Not rocket science but somewhat tedious. This scheme is easy to break; the application could hold onto the pointer and start using the object already. It remains to be seen whether existing code would break; this I can only speculate about as I don't know the exact scheme that you have in mind. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 00:37:56 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 01:37:56 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A81AC7C.3FFE73E5@ActiveState.com> (message from Paul Prescod on Wed, 07 Feb 2001 12:13:48 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> Message-ID: <200102080037.f180bul01609@mira.informatik.hu-berlin.de> > > They don't write them to a file. Instead, they print them in the IDLE > > terminal, or display them in a Tk or PythonWin window. Both support > > arbitrary many characters, and will treat the bytes as characters > > originating from Latin-1 (according to their ordinals). > > I'm lost here. Let's say I'm using Python 1.5. I have some KOI8-R data > in a string literal. PythonWin and Tk expect Unicode. How could they > display the characters correctly? No, PythonWin and Tk both tell apart Unicode and byte strings (although Tk uses quite a funny algorithm to do so). If they see a byte string, they convert it using the platform encoding (which is user-settable on both Windows and Unix) to a Unicode string, and display that. > > Or, they pass them as attributes in a DOM method, which, on > > write-back, will encode every string as UTF-8 (as that is the default > > encoding of XML). Then the characters will get changed, when they > > shouldn't. > > What do you think *should* happen? These are the only choices I can > think of: > > 1. DOM encodes it as UTF-8 > 2. DOM blindly passes it through and creates illegal XML > 3. (correct) User explicitly decodes data into Unicode charset. What users expect to happen is 2; blindly pass-through. They think they can get it right; given enough control, this is feasible. It was even common practice in the absence of Unicode objects, so a lot of code depends on libraries passing things through as-is. > The only sane thing to do when you don't know is to pass the characters > as-is, char->ord->char. So libraries need a way of telling for sure. 
With Python 2.0, they can look at the type() and tell that something is really meant as a character string; otherwise, I agree, they have to pass through. Under your proposal, this strategy will fail: libraries cannot tell for sure anymore that something is really meant as a character string. > > > The encoding-smart alternatives should also be > > > documented as preferred replacements as soon as possible. > > > > I'm not sure they are preferred. They are if you know the encoding of > > your data sources. If you don't, you better be safe than sorry. > > If you don't know the encoding of your data sources then you should say > that explicitly in code rather than using the same functions as people > who *do* know what their encoding is. Explicit is better than implicit, > right? Our current default is totally implicit. No, it's not. The current default is: always produce byte strings. In many applications, people certainly *should* use character strings, but they have to change their code for that. Telling everybody to use fopen for everything is wrong; telling them to use codecs.open for character streams is right. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 00:21:00 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 01:21:00 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <3A81A585.F0771269@ActiveState.com> (message from Paul Prescod on Wed, 07 Feb 2001 11:44:05 -0800) References: <20010207060106.20984.qmail@web102.mail.yahoo.co.jp> <3A814678.2F245D14@lemburg.com> <3A81A585.F0771269@ActiveState.com> Message-ID: <200102080021.f180L0201582@mira.informatik.hu-berlin.de> > There is a reason that Brian and I independently invented the same idea. > It's because Joe Programmer without a degree in rocket science is going > to expect it to work that way. Joe Programmer does not know what a codec > is, will not consider importing the codecs module and will have no idea > what to do with the object once they've got there hands on it. > > It's a million times easier to tell a programmer: "If you expect to read > ASCII data add a third argument with the string 'ASCII', if you know > about encodings choose another one. If you know what raw binary data is, > and want to read it, here's another function." Of course, if Joe Programmer would suddenly be confronted with all his open calls failing, he'd hate the new release, and would start to flame comp.lang.python. If he guesses that there is some issue with ASCII in his program, he'd probably look into the documentation of open(); I agree. If that would point to codecs.open, I think Joe could arrange to import the codecs module and invoke the open function. Regards, Martin From paulp@ActiveState.com Thu Feb 8 01:10:53 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 17:10:53 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <20010207060106.20984.qmail@web102.mail.yahoo.co.jp> <3A814678.2F245D14@lemburg.com> <3A81A585.F0771269@ActiveState.com> <200102080021.f180L0201582@mira.informatik.hu-berlin.de> Message-ID: <3A81F21D.650888C2@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > Of course, if Joe Programmer would suddenly be confronted with all his > open calls failing, he'd hate the new release, and would start to > flame comp.lang.python. I don't believe that open() calls should fail! 
We should present a pair of explicit alternatives for strings and binary data that are as easy as open() and document them as the recommended way. We should change the tutorials and the books to encourage people to choose the right function for the right job. Years from now we should deprecate open as an old way of doing things that has been superceded. > If he guesses that there is some issue with ASCII in his program, he'd > probably look into the documentation of open(); I agree. How would a user guess that there is "some issue with ASCII." Only the I18N-heads in this mailing list even understand that there is an issue. We need to inform people that there is a decision to be made. > If that would > point to codecs.open, I think Joe could arrange to import the codecs > module and invoke the open function. I think it would be a really big mistake to make the right thing involve so much more code than the easy thing. What is your aversion to fopen/stropen/txtopen/binopen or whatever you want to call them? Why not make life easier? Paul Prescod From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 01:08:56 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 02:08:56 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <3A81B1A7.4E1D022C@ActiveState.com> (message from Paul Prescod on Wed, 07 Feb 2001 12:35:51 -0800) References: <3A800EE5.A8122B3C@ActiveState.com> <200102062049.f16Kngq01092@mira.informatik.hu-berlin.de> <3A80A10B.1E978B30@ActiveState.com> <200102070806.f1786eg01079@mira.informatik.hu-berlin.de> <3A81B1A7.4E1D022C@ActiveState.com> Message-ID: <200102080108.f1818uG01762@mira.informatik.hu-berlin.de> > > Just try > > > > reader = codecs.lookup("ISO-8859-2")[2] > > charfile = reader(file) > > > > There could be a convenience function, but that also is a detail. > > Usability is not a detail in this particular case. We are trying to > change people's behavior and help them make more robust code. Ok, just propose a specific patch; I'd recommend to add another function to the codecs module, rather than adding another built-in. > That's fine. I'll change the document to be more explicit. Would you > agree that: "Unicode is the only *character set* that supports *all of > the world's major written languages.*" That is certainly the case. > > > Not under my proposal. file.read returns a character string. Sometimes > > > the character string contains characters between 0 and 255 and is > > > indistinguishable from today's string type. Sometimes the file object > > > knows that you want the data decoded and it returns large characters. > > > > I guess we have to defer this until I see whether it is feasible > > (which I believe it is not - it was the mistake Sun made in the early > > JDKs). > > What was the mistake? Java early had methods that treated Strings and byte array interchangably if the strings had character values below 256. One left-over from that is public String(byte[] ascii, int hibyte); // in class java.lang.String It would use the ascii array, and fill it with hibyte in-between; hibyte was typically 0. The documentation now says # Deprecated. This method does not properly convert bytes into # characters. As of JDK 1.1, the preferred way to do this is via the # String constructors that take a character-encoding name or that use # the platform's default encoding. The reverse operation of that is getBytes(nt srcBegin, int srcEnd, byte[] dst, int dstBegin): # Deprecated. 
This method does not properly convert characters into # bytes. As of JDK 1.1, the preferred way to do this is via the # getBytes(String enc) method, which takes a character-encoding name, # or the getBytes() method, which uses the platform's default # encoding. I'd say your proposal is in the direction of repeating this mistake. > You and I agree that streams can change encoding mid-stream. You > probably think that should be handled by passing the stream to various > codecs as you read (or by doing double-buffer reads). I think that it > should be possible right in the read method. Please take it as a fact that it is impossible to do that at an arbitrary point in the stream; codecs that need to maintain state will result strangely. > It's funny how we switch back and forth. If I say that Python reads byte > 245 into character 245 and thus uses Latin 1 as its default encoding I'm > told I'm wrong. Python has no native encoding. If I claim that in > passing data to C we should treat character 245 as the C "char" with the > value 245 you tell me that I'm proposing Latin 1 as the default > encoding. Python has no default character set *in its byte string type*. Once you have Unicode objects, talking about language-specified character sets is meaningful. > Python has a concept of character that extends from 0 to 255. C has a > concept of character that extends from 0 to 255. There is no issue of > "encoding" as long as you stay within those ranges. C supports various character sets, depending on context. Encodings do matter here already, e.g. when selecting fonts. Some character sets supported in C have characters >256, even if they are stored in char* (in particular, MBCS have these properties). > > That Chinese Python programmer should use his editor of choice, and > > put _() around strings that are meant as text (as opposed to strings > > that are protocol). > > I don't know what you mean by "protocol" here. If you do print "GET "+url+" HTTP/1.0" then the strings are really not meant to be human-readable, they are part of some machine-to-machine communication protocol. > But nevertheless, you are saying that the Chinese programmer must do > more than the English programmer does and I consider that a problem. It just works for the English programmer by coincidence; that programmer should really tell apart text and byte strings in source as well. Following the Unicode path, source files should be UTF-8, but that won't work in practice because of missing editor support. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 01:37:05 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 02:37:05 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <3A81F21D.650888C2@ActiveState.com> (message from Paul Prescod on Wed, 07 Feb 2001 17:10:53 -0800) References: <20010207060106.20984.qmail@web102.mail.yahoo.co.jp> <3A814678.2F245D14@lemburg.com> <3A81A585.F0771269@ActiveState.com> <200102080021.f180L0201582@mira.informatik.hu-berlin.de> <3A81F21D.650888C2@ActiveState.com> Message-ID: <200102080137.f181b5101963@mira.informatik.hu-berlin.de> > What is your aversion to fopen/stropen/txtopen/binopen or whatever you > want to call them? I'm opposed to adding new builtins. There are already way too many builtins. Just have a look at dir(__builtins__) and try to explain what each and every of them exactly does. People had been using string.join happily without demanding that it is builtin. 
I'd admit that codecs.open seems wrong also - it is not a codec that is being opened. New builtins are worse, IMO (what is an f, a str, or a bin?). Adding flags to open looks acceptable, though. Regards, Martin From paulp@ActiveState.com Thu Feb 8 02:24:37 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 18:24:37 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <20010207060106.20984.qmail@web102.mail.yahoo.co.jp> <3A814678.2F245D14@lemburg.com> <3A81A585.F0771269@ActiveState.com> <200102080021.f180L0201582@mira.informatik.hu-berlin.de> <3A81F21D.650888C2@ActiveState.com> <200102080137.f181b5101963@mira.informatik.hu-berlin.de> Message-ID: <3A820365.3A351F72@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > I'd admit that codecs.open seems wrong also - it is not a codec that > is being opened. New builtins are worse, IMO (what is an f, a str, or > a bin?). Adding flags to open looks acceptable, though. open already has two optional arguments. I want to add a new mandatory argument. I don't see a way to do it cleanly. Actually, I thought of something which I'll explain in more detail further down. "fopen" stands for "file open". Now that you mention it, "fileopen" is probably the best name -- more descriptive even than today's "open". It would have a mandatory encoding attribute which can be None only if you use the "b" flag to indicate that you want binary data. ---- fileopen (filename, encoding, [mode[, bufsize]])) Return a new file object (described earlier under Built-in Types). The first and third argument are the same as for stdio's fopen(): filename is the file name to be opened, mode indicates how the file is to be opened: 'r' for reading, 'w' for writing (truncating an existing file), and 'a' opens it for appending (which on some Unix systems means that all writes append to the end of the file, regardless of the current seek position). Modes 'r+', 'w+' and 'a+' open the file for updating (note that 'w+' truncates the file). If the file cannot be opened, IOError is raised. If mode is omitted, it defaults to 'r'. The encoding attribute should be a string indicating the encoding of the file. Common values are "ASCII" (for English-only text), "ISO Latin 1" for most Western scripts. "UTF-8" and "UTF-16" are often used for mixed language documents. "Shift-JIS" and "Big5" are typically used to read Eastern scripts. The special value "RAW" means that the file object should return bytes as-is with no translation into a "byte string". The optional bufsize argument specifies the file's desired buffer size: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size. A negative bufsize means to use the system default, which is usually line buffered for for tty devices and fully buffered for other files. If omitted, the system default is used. --- "open" could actually be extended to be like "fileopen" if we look at the second parameter and interpret it according to its contents. If it matches the regexp [rwa]+?b? then we treat it as the "deprecated form." Otherwise we treat it as an encoding. I don't think we have to worry about an encoding whose name matches that pattern any time soon! So in documentation encoding would NOT be optional but in practice there would be a period in which it would be optional so that people could migrate their code. 
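To make the transition idea concrete, here is one possible sketch of how it could sit on top of the existing codecs module (the name open_compat, the exact regexp, and the handling of "RAW" are illustrative assumptions, not part of the proposal text):

import re, codecs

_OLD_MODE = re.compile(r"[rwa]\+?b?$")   # Paul's "deprecated form" pattern

def fileopen(filename, encoding, mode="r", bufsize=-1):
    # encoding is mandatory; the special value "RAW" means untranslated bytes
    if encoding == "RAW":
        return open(filename, mode + "b", bufsize)
    return codecs.open(filename, mode, encoding, "strict", bufsize)

def open_compat(filename, arg="r", *rest):
    # migration shim: a mode-looking second argument keeps today's meaning,
    # anything else is interpreted as an encoding name
    if _OLD_MODE.match(arg):
        return open(filename, arg, *rest)
    return fileopen(filename, arg, *rest)

Whether the builtin should grow this behaviour, or whether it belongs in the codecs module, is exactly the point argued over in the following messages.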
Paul Prescod From paulp@ActiveState.com Thu Feb 8 02:40:29 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 18:40:29 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <3A800EE5.A8122B3C@ActiveState.com> <200102062049.f16Kngq01092@mira.informatik.hu-berlin.de> <3A80A10B.1E978B30@ActiveState.com> <200102070806.f1786eg01079@mira.informatik.hu-berlin.de> <3A81B1A7.4E1D022C@ActiveState.com> <200102080108.f1818uG01762@mira.informatik.hu-berlin.de> Message-ID: <3A82071D.812227F1@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > public String(byte[] ascii, int hibyte); // in class java.lang.String > > It would use the ascii array, and fill it with hibyte in-between; > hibyte was typically 0. The documentation now says > > # Deprecated. This method does not properly convert bytes into > # characters. That's right. This function could generate invalid Unicode. That's totally different than what I'm proposing! > ... > It just works for the English programmer by coincidence; that > programmer should really tell apart text and byte strings in source as > well. Are you really saying that if you were a writing a Python book you would say that the appropriate way to write a "Hello World" program is: print _("Hello World") Please give some thought to usability! I love Python because it is syntactically clean and semantically simple. I can show people Python code and they immediately understand it. If you are right, then Python is a scripting language that truly has a simpler syntax for "byte strings" than it does for "character strings". If that's so then there is something seriously broken in the language and we need to figure out how to fix it. Paul Prescod From paulp@ActiveState.com Thu Feb 8 03:04:50 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 19:04:50 -0800 Subject: [I18n-sig] Re: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> Message-ID: <3A820CD2.25C3F978@ActiveState.com> "Martin v. Loewis" wrote: > > > > > I'm lost here. Let's say I'm using Python 1.5. I have some KOI8-R data > > in a string literal. PythonWin and Tk expect Unicode. How could they > > display the characters correctly? > > No, PythonWin and Tk both tell apart Unicode and byte strings > (although Tk uses quite a funny algorithm to do so). If they see a > byte string, they convert it using the platform encoding (which is > user-settable on both Windows and Unix) to a Unicode string, and > display that. And if they read in a file from a Frenchmen then they get random Russian characters on their screen. Or they crash the third-party software because it couldn't decode properly. Or ... This is what we need to move away from. The first step is to get people to stop accidently passing around character strings as byte strings. To do that we need to make it as easy as possible to get properly decoded strings into Python. > > ... > > What do you think *should* happen? These are the only choices I can > > think of: > > > > 1. DOM encodes it as UTF-8 > > 2. 
DOM blindly passes it through and creates illegal XML > > 3. (correct) User explicitly decodes data into Unicode charset. > > What users expect to happen is 2; blindly pass-through. They think > they can get it right; given enough control, this is feasible. It was > even common practice in the absence of Unicode objects, so a lot of > code depends on libraries passing things through as-is. Surely you agree with me that it is inappropriate for a user to *expect* a DOM implementation to pass on binary data unmolested. That some particular DOM may do so (like minidom) is probably just a performance optimizatoin quirk that could go away at any time. Why would we go out of our way to support people making this mistake? > > If you don't know the encoding of your data sources then you should say > > that explicitly in code rather than using the same functions as people > > who *do* know what their encoding is. Explicit is better than implicit, > > right? Our current default is totally implicit. > > No, it's not. The current default is: always produce byte strings. A "byte string" is not something you'll find defined in the Python tutorial, language reference or library reference. People who use open() do not know that they are making a choice. If you ask a hundred Python programmers whether the result of open() is a character stream or a byte stream, most will say character stream. The same goes for string literals. The section of the Python language reference describing string literals does not mention the word "byte" once. It mentions the world character on almost every other line. > In > many applications, people certainly *should* use character strings, > but they have to change their code for that. Telling everybody to use > fopen for everything is wrong; telling them to use codecs.open for > character streams is right. In another message you admitted that the codec mechanism is somewhat user unfriendly...so I hope we agree that we need something better. People need to start making a choice and we have to make that as easy for them as possible! Paul Prescod From uche.ogbuji@fourthought.com Thu Feb 8 03:22:12 2001 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Wed, 07 Feb 2001 20:22:12 -0700 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: Message from "M.-A. Lemburg" of "Wed, 07 Feb 2001 13:58:32 +0100." <3A814678.2F245D14@lemburg.com> Message-ID: <200102080322.UAA03196@localhost.localdomain> > Hooper Brian wrote: > > ... > > What about adding an > > optional encoding argument to the existing open(), > > allowing encoding to be passed to that, and using 'raw' as > > the default format (what it does now)? > > This is what codecs.open() already provides. I think this should be codecs.fopen() to avoid any confusion. -- Uche Ogbuji Principal Consultant uche.ogbuji@fourthought.com +1 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. 
C, Boulder, CO 80301-2537, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From paulp@ActiveState.com Thu Feb 8 03:30:59 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 19:30:59 -0800 Subject: [I18n-sig] Concatenation Message-ID: <3A8212F3.6F7371D2@ActiveState.com> Would anyone out there that would object if this were allowed in Python 2.1: >>> u"abc"+"\245" u"abc\245" I can vaguely (only vaguely) understand the arguments about casting when passing high-bit data to a C-API but I wonder if anyone would argue that the code above is ambiguous in its intent. Paul Prescod From mal@lemburg.com Thu Feb 8 10:01:38 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 08 Feb 2001 11:01:38 +0100 Subject: [I18n-sig] Re: [Python-Dev] unichr References: <3A81B24B.6AE348A9@ActiveState.com> Message-ID: <3A826E82.446C68F9@lemburg.com> Paul Prescod wrote: > > Ka-Ping Yee wrote: > > > > ... > > > > At the moment, since the default encoding is ASCII, something like > > > > u"abc" + chr(200) > > > > would cause an exception because 200 is outside of the ASCII range. > > Yes, this is another mistake in Python's current handling of strings. > there is absolutely nothing special about the 128-255 range of > characters. We shouldn't start throwing exceptions until we get to 256. You are forgetting that the range 128-255 is used by many codepages to support language specific characters. chr(0xE0) will give different characters in the US than e.g. in Russia. If we were to simply let these conversions slip through, then people would find garbled data in their text files. Of course, if a user explicitly sets the default encoding to Latin-1, then everything will be fine, but for ASCII (which is the base of most character encodings in use today) there is little other we can do except to raise an exception. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tdickenson@geminidataloggers.com Thu Feb 8 10:26:00 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Thu, 8 Feb 2001 10:26:00 -0000 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model Message-ID: <9FC702711D39D3118D4900902778ADC81287BF@JUPITER> > > There is already a large body of code that mixes text and > binary data > > in the same type. If we have separate text/binary types, > then we need > > to plan a transition period to allow code to distinguish between the > > two uses. > > I think the current Unicode implementation has this property: Unicode > is the type for representing character strings; the string type the > one for representing byte strings. The problem isnt so much in the current implementation; its in the code that has been written to that implementation. At the moment it is unnatural to write print u"hello world" rather than the easier print "hello world" even though the message is clearly text. I think we agree that, eventually, we would like the simple notation for a string literal to create a unicode string. What Im not sure about is whether we can make that change soon. How often are string literals used to create what is logically just binary data? 
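Two literals of the kind in question, to make the contrast concrete; both are written with today's plain-string notation even though only the first is logically text (the PNG signature is just an arbitrary example of byte-oriented data):

greeting = "hello world"          # logically a character string
png_magic = "\x89PNG\r\n\x1a\n"   # logically raw bytes, not text

The b"..." notation discussed below would let the second case say what it means.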
From tdickenson@geminidataloggers.com Thu Feb 8 11:03:16 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Thu, 08 Feb 2001 11:03:16 +0000 Subject: [I18n-sig] Concatenation In-Reply-To: <3A8212F3.6F7371D2@ActiveState.com> References: <3A8212F3.6F7371D2@ActiveState.com> Message-ID: On Wed, 07 Feb 2001 19:30:59 -0800, Paul Prescod wrote: >Would anyone out there that would object if this were allowed in Python >2.1: In 2.1, yes. For as long as we have text data stored in a mix of string and unicode objects, this rule is a good way of picking up encoding-assumption bugs early. (for 2.0 I argued against this, but today I can recognise its usefulness) >>>> u"abc"+"\245" >u"abc\245" Of course, this should work once type(u"abc")==type("\245"). I think we agree this is the long term goal. >I wonder if anyone would argue that >the code above is ambiguous in its intent. A small variation: >>> x = 'd' >>> print u"abc"+x abcd >>> x = "\245" >>> print u"abc"+x Traceback (most recent call last): File "", line 1, in ? UnicodeError: ASCII decoding error: ordinal not in range(128) Toby Dickenson tdickenson@geminidataloggers.com From mal@lemburg.com Thu Feb 8 11:29:11 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 08 Feb 2001 12:29:11 +0100 Subject: [I18n-sig] Concatenation References: <3A8212F3.6F7371D2@ActiveState.com> Message-ID: <3A828307.3AD1504C@lemburg.com> Paul Prescod wrote: > > Would anyone out there that would object if this were allowed in Python > 2.1: > > >>> u"abc"+"\245" > u"abc\245" > > I can vaguely (only vaguely) understand the arguments about casting when > passing high-bit data to a C-API but I wonder if anyone would argue that > the code above is ambiguous in its intent. Please see my other reply on this subject. We can't simply ignore the default encoding here or else people will lose data ! -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Feb 8 12:40:19 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 08 Feb 2001 13:40:19 +0100 Subject: [I18n-sig] Move to codecs.open() as builtin open() (Pre-PEP: Proposed Python Character Model) References: <200102080322.UAA03196@localhost.localdomain> Message-ID: <3A8293B3.D8D4A2B3@lemburg.com> Uche Ogbuji wrote: > > > Hooper Brian wrote: > > > ... > > > What about adding an > > > optional encoding argument to the existing open(), > > > allowing encoding to be passed to that, and using 'raw' as > > > the default format (what it does now)? > > > > This is what codecs.open() already provides. > > I think this should be codecs.fopen() to avoid any confusion. Isn't the need to import it from codecs enough to notice the difference ? from codecs import open as fopen also does the trick in 2.1, BTW. Perhaps we should make codecs.open the new open() in 2.2 ?! (the API would have to be tweaked a bit though to make the argument order match the open() API) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Feb 8 13:26:02 2001 From: mal@lemburg.com (M.-A.
Lemburg) Date: Thu, 08 Feb 2001 14:26:02 +0100 Subject: [I18n-sig] Re: Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808339.7B2BD5D6@ActiveState.com> <200102062350.f16Noqc02391@mira.informatik.hu-berlin.de> <3A8091D3.F45F666A@ActiveState.com> <200102070725.f177P4X00905@mira.informatik.hu-berlin.de> <3A81A927.FAE4303D@ActiveState.com> Message-ID: <3A829E6A.5129C048@lemburg.com> Paul Prescod wrote: > > "Martin v. Loewis" wrote: > > > > ... > > > > So every s and s# conversion would trigger a copying of the > > string. How is that implemented? Currently, every Unicode object has a > > reference to a string object that is produced by converting to the > > default character set. Would it grow another reference to a string > > object that is carrying the Latin-1-conversion? > > I'm not clear on the status of the concept of "default charater set." > First, I think you mean "default character encoding". Second, I thought > that that idea was removed from user-view at least, wasn't it? I was > thinking that we would use that slot to hold the char->ord->char > conversion (which you can interpret as Latin-1 or not depending on your > philosophy). The extra slot is a merely needed to implement s and s# conversions since these pass back references to a real C char buffer. Let's *not* do more of those... > > Certainly. Applications expect to write to the resulting memory, and > > expect to change the underlying string; this is valid only if one had > > been passing NULL to PyString_FromStringAndSize. > > The documentation says that the PyString_AsString and PyString_AS_STRING > buffers must never be modified. I forgot that the "real" protocol is > that that buffer can be modified. We'll need to copy its contents back > to the Unicode string before the next operation that uses the Unicode > value. Not rocket science but somewhat tedious. Paul, please have a look at the es and es# conversions -- I think these do what you have in mind here. Writing to buffers returned by s or s# is never permitted, you'd have to use w# to get at a writeable C buffer. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Feb 8 13:34:07 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 08 Feb 2001 14:34:07 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <9FC702711D39D3118D4900902778ADC81287BF@JUPITER> Message-ID: <3A82A04F.5A03CAB2@lemburg.com> Toby Dickenson wrote: > > > > There is already a large body of code that mixes text and > > binary data > > > in the same type. If we have separate text/binary types, > > then we need > > > to plan a transition period to allow code to distinguish between the > > > two uses. > > > > I think the current Unicode implementation has this property: Unicode > > is the type for representing character strings; the string type the > > one for representing byte strings. > > The problem isnt so much in the current implementation; its in the code that > has been written to that implementation. At the moment it is unnatural to > write > > print u"hello world" > > rather than the easier > > print "hello world" > > even though the message is clearly text. 
Sure, but how is Python going to deduce this information from the string ? I once proposed to use a new qualifier for binary data, e.g. b"binary data" or d"binary data". Don't remember the outcome though as this was during the heated debate over how to do Unicode right earlier last year. Perhaps the only new type we need is an easy to manage binary data type that behaves very much like the old-school strings. In Py3K we can then all fit them into a new class hierarchy to come close to unification:

        binary data string
          |           |
        text data string
          |           |
          |           |
   Unicode string   encoded 8-bit string
                     (with encoding information !)

> I think we agree that, eventually, we would like the simple notation for a > string literal to create a unicode string. What I'm not sure about is whether > we can make that change soon. How often are string literals used to create > what is logically just binary data? Often enough to make "python -U" fail badly... -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tdickenson@geminidataloggers.com Thu Feb 8 14:09:15 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Thu, 8 Feb 2001 14:09:15 -0000 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model Message-ID: <9FC702711D39D3118D4900902778ADC83244A2@JUPITER> > I once proposed to use a new qualifier for binary data, e.g. > b"binary data" or d"binary data". Don't remember the outcome though > as this was during the heated debate over how to do Unicode right > earlier last year. > > Perhaps the only new type we need is an easy to manage > binary data type that behaves very much like the old-school > strings. Yes, that all sounds like a good idea. I think changing some "strings" to b"strings" is a necessary step on the way to 'python -U'. I would want to avoid the need for a 2.0-style 'default encoding', so I suggest it shouldn't be possible to mix this type with other strings: >>> "1"+b"2" Traceback (most recent call last): File "", line 1, in ? TypeError: cannot add type "binary" to string >>> "3"==b"3" 0 From paulp@ActiveState.com Thu Feb 8 15:16:01 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 8 Feb 2001 07:16:01 -0800 (PST) Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <9FC702711D39D3118D4900902778ADC83244A2@JUPITER> Message-ID: I really like the idea of the b"..." prefix. Is anyone opposed? ------ I think we are in sight of agreement on 1. [file]?open(filename, encoding, ...) 2. b"..." 3. an encoding declaration at the top of files 4. that concatenating Python strings and Unicode strings should do the "obvious" thing for characters from 127-255 and nothing for characters beyond. 5. a bytestring type that behaves in every way shape and form like our current string type but has a different type() and repr(). These would all be small but important incremental moves to a better Python. As time goes by we can deprecate more and more "ambiguous" usages like: * regular string literals that use non-ASCII characters when there is no encoding declaration * open() calls that do not specify an encoding (or "RAW") Paul Prescod From paulp@ActiveState.com Thu Feb 8 15:22:51 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 8 Feb 2001 07:22:51 -0800 (PST) Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: Message-ID: On Thu, 8 Feb 2001, Paul Prescod wrote: > > 4.
that concatenating Python strings and Unicode strings should do the > "obvious" thing for charcters from 127-255 and nothing for characters > beyond. Sorry, I see now that this is still controversial... Paul Prescod From paulp@ActiveState.com Thu Feb 8 15:31:21 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 8 Feb 2001 07:31:21 -0800 (PST) Subject: [I18n-sig] Re: [Python-Dev] unichr In-Reply-To: <3A826E82.446C68F9@lemburg.com> Message-ID: On Thu, 8 Feb 2001, M.-A. Lemburg wrote: > You are forgetting that the range 128-255 is used by many codepages > to support language specific characters. No, I'm not forgetting that. I just don't think it is relevant. > chr(0xE0) will give different > characters in the US than e.g. in Russia. If we were to simply > let these conversions slip through, then people would find garbled > data in their text files. People in Russia understand the concept of code pages. They know that if they put "special" characters in their files they will be interpreted on other platforms as Western European characters. If we make it easy for them to explicitly state their encoding then the will do so and get better behavior then they did before. We can also simplify Python and remove an arbitrary restriction at the same time. > Of course, if a user explicitly sets the default encoding to > Latin-1, then everything will be fine, but for ASCII (which is > the base of most character encodings in use today) there is > little other we can do except to raise an exception. I don't think the "default encoding" is a relevant concept. Most people came out strongly against it on the Python lists and it was hidden from user view for that reason. It is a terrible idea to encourage people to write software that works right on their computer but not on anyone else's. I think that we should view the "default encoding" as an implementation artifact and nothing more. We need to define portable rules that will consistently make sense everywhere. Paul Prescod From tdickenson@geminidataloggers.com Thu Feb 8 15:33:48 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Thu, 08 Feb 2001 15:33:48 +0000 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: References: <9FC702711D39D3118D4900902778ADC83244A2@JUPITER> Message-ID: On Thu, 8 Feb 2001 07:16:01 -0800 (PST), Paul Prescod wrote: >5. a bytestring type that behaves in every way shape and form like our >current string type but has a different type() and repr(). I mentioned some other desirable differences too: >I would want to avoid the need for a 2.0-style 'default encoding', so I >suggest it shouldnt be possible to mix this type with other strings: > >>>> "1"+b"2" >Traceback (most recent call last): > File "", line 1, in ? >TypeError: cannot add type "binary" to string >>>> "3"=3D=3Db"3" >0 Without a default encoding would need to change its str() too. Toby Dickenson tdickenson@geminidataloggers.com From mal@lemburg.com Thu Feb 8 16:33:49 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 08 Feb 2001 17:33:49 +0100 Subject: [I18n-sig] Binary data b"strings" (Pre-PEP: Proposed Python Character Model) References: <9FC702711D39D3118D4900902778ADC83244A2@JUPITER> Message-ID: <3A82CA6D.313D5E39@lemburg.com> Toby Dickenson wrote: > > > I once proposed to use a new qualifier for binary data, e.g. > > b"binary data" or d"binary data". Don't remember the outcome though > > as this was during the heated debate over how to do Unicode right > > earlier last year. 
> > > > Perhaps the only new type we need is an easy to manage > > binary data type that behaves very much like the old-school > > strings. > > Yes, that all sounds like a good idea. I think changing some "strings" to > b"strings" is a necessary step on the way to 'python -U'. > > I would want to avoid the need for a 2.0-style 'default encoding', so I > suggest it shouldnt be possible to mix this type with other strings: > > >>> "1"+b"2" > Traceback (most recent call last): > File "", line 1, in ? > TypeError: cannot add type "binary" to string > >>> "3"==b"3" > 0 Right. This will cause people to rethink whether they are using the object for text data or binary data. I still think that at the interface level, b"" and "" should be treated the same (except that b""-strings should not implement the char buffer interface). OTOH, these b""-strings should implement the same methods as the array type and probably seemlessly interact with it too. I don't know which type should be considered "better" in coercion though, b""-strings or arrays (I guess b""-strings). (Waiting for someone to tear down the idea... ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tdickenson@geminidataloggers.com Thu Feb 8 16:41:09 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Thu, 8 Feb 2001 16:41:09 -0000 Subject: [I18n-sig] Binary data b"strings" (Pre-PEP: Proposed Python Character Model) Message-ID: <9FC702711D39D3118D4900902778ADC83244A4@JUPITER> > OTOH, these b""-strings should implement the same methods as the > array type and probably seemlessly interact with it too. I don't > know which type should be considered "better" in coercion > though, b""-strings or arrays (I guess b""-strings). That raises the question of mutability... instances of the array type are mutable. On every occasion that I have needed a mutable string type, it has been when the string was holding binary data. Do we want b"strings" to be mutable? Do we want b"strings" to be the same as the array type? (if not now, maybe at the same time we unify plain "strings" and u"strings") From mal@lemburg.com Thu Feb 8 16:45:21 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 08 Feb 2001 17:45:21 +0100 Subject: [I18n-sig] Re: [Python-Dev] unichr References: Message-ID: <3A82CD21.8F47E109@lemburg.com> Paul Prescod wrote: > > On Thu, 8 Feb 2001, M.-A. Lemburg wrote: > > > You are forgetting that the range 128-255 is used by many codepages > > to support language specific characters. > > No, I'm not forgetting that. I just don't think it is relevant. It is not irrelevant as you describe below... > > chr(0xE0) will give different > > characters in the US than e.g. in Russia. If we were to simply > > let these conversions slip through, then people would find garbled > > data in their text files. > > People in Russia understand the concept of code pages. They know that > if they put "special" characters in their files they will be interpreted > on other platforms as Western European characters. If we make it easy for > them to explicitly state their encoding then the will do so and get better > behavior then they did before. We can also simplify Python and remove an > arbitrary restriction at the same time. 
Well, we can remove the restriction for string literals, but the same coercion happens for generated strings and these are not under control of some source encoding parameter. I once suggested that strings (the 8-bit ones) get an .encoding attribute to carry along that information, but it quickly showed that the idea would not be of much use because of the generation problem and because the only coercion from a string with encoding information and one without that information is to produce a new string without encoding information (or maybe not coerce them at all). See the python-dev archives for more on this idea (early last year). > > Of course, if a user explicitly sets the default encoding to > > Latin-1, then everything will be fine, but for ASCII (which is > > the base of most character encodings in use today) there is > > little other we can do except to raise an exception. > > I don't think the "default encoding" is a relevant concept. Most people > came out strongly against it on the Python lists and it was hidden from > user view for that reason. It is a terrible idea to encourage people to > write software that works right on their computer but not on anyone > else's. I think that we should view the "default encoding" as an > implementation artifact and nothing more. We need to define portable rules > that will consistently make sense everywhere. That is exactly why we made as hard as possible for people to *change* the default. It is pretty obvious that they are on their own when trying to fiddle with site.py or sitecustomize.py. Still, I believe its a valid idea. Back when I wrote the proposal for Unicode integration I had fixed the default encoding to UTF-8. As the first working patches appeared, there was a long and heated discussion about what encoding to choose as default (people didn't like UTF-8). There were basically two camps: UTF-8 and Latin-1. We then decided to make the encoding a variable for have people try out different encodings. Next, the idea of a locale based default encoding was brought up. Fredrik and I then implemented the needed magic to figure out the platform specific default encoding, but subsequently the idea was dropped by our BDFL in favour of ASCII which is what we see now. The support code was left in the distribution... and Pythoneers quickly found it ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Feb 8 16:54:27 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 08 Feb 2001 17:54:27 +0100 Subject: [I18n-sig] Binary data b"strings" (Pre-PEP: Proposed Python Character Model) References: <9FC702711D39D3118D4900902778ADC83244A4@JUPITER> Message-ID: <3A82CF42.C1B2F426@lemburg.com> Toby Dickenson wrote: > > > OTOH, these b""-strings should implement the same methods as the > > array type and probably seemlessly interact with it too. I don't > > know which type should be considered "better" in coercion > > though, b""-strings or arrays (I guess b""-strings). > > That raises the question of mutability... instances of the array type are > mutable. On every occasion that I have needed a mutable string type, it has > been when the string was holding binary data. > > Do we want b"strings" to be mutable? 
No way -- that would confuse the hell out of newbies and everyone else ;-) I basically want to reuse the string and/or buffer implementation and add some more useful methods to it, eg. things from the struct module and array module. > Do we want b"strings" to be the same as the array type? (if not now, maybe > at the same time we unify plain "strings" and u"strings") Not really. Arrays should still be the right type for mutable binary data chunks, even at that point. (This idea clearly needs some more thought... :) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From paulp@ActiveState.com Thu Feb 8 17:44:08 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 8 Feb 2001 09:44:08 -0800 (PST) Subject: [I18n-sig] Strawman Proposal: Binary Strings Message-ID: A binary string is a string that is declared by the user to be a carrier of binary data and not (directly) of textual data (Unicode characters). In order to get a rapid adoption of binary strings, they are designed to be as similar to Python strings as is possible. This means that they have all of the same methods, are immutable and so forth. They also follow Python's existing string->Unicode coercion rules. These rules are arguably too "loose" but experience shows that coercion rules are often highly personal and the arguments one way or the other tend to be philosophical rather than practical. For example, Java and JavaScript automatically coerce objects to strings when they are added to strings. Python does not. Neither choice seems a large mistake. Binary strings differ from regular strings in the following ways: a) they have a unique type object named types.BinaryString b) they are constucted in Python code in one of three ways: 1. using a "b" prefix on string literals 2. using a function called binary() 3. from some other C-coded function such as a file i/o library c) they repr() themselves with a b"" prefix as per Unicode strings One reason to add the binary data type is because at some point in the future may deprecate the construction of binary data in ordinary string literals. Although details remain to be worked out, it is a goal that in the future string literals will always be interpreted as character strings. That might mean that non-ASCII characters will some day be disallowed or that they wil be interpeted according to a declared Unicode transformation encoding. Conventions for binary file I/O will be worked out in a separate proposal. From tdickenson@geminidataloggers.com Thu Feb 8 18:16:39 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Thu, 08 Feb 2001 18:16:39 +0000 Subject: [I18n-sig] Strawman Proposal: Binary Strings In-Reply-To: References: Message-ID: On Thu, 8 Feb 2001 09:44:08 -0800 (PST), Paul Prescod wrote: > b.2. using a function called binary() You say that precise coercion rules are a personal preference, but adding a coercion function just helps this ambiguity to persist. What if string.encode() returned a binary string.... would we need a 'binary()' builtin at all? >They also follow >Python's existing string->Unicode coercion rules. I agree any explicit coecion should follow the same rules as Unicode. Im not sure we agree on whether that coercion happens automatically and implicitly, as it does with Unicode strings; I feel fairly strongly that it shouldnt. (Ill justify that tomorrow if we do disagree). 
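To make the strawman's b""-strings easier to picture, here is a rough Python sketch of the intended surface behaviour. The BinaryString class, the binary() helper and the error message are illustrative stand-ins for the proposed built-in type, not an implementation; it leans on UserString (available since Python 1.6) only because built-in types cannot be subclassed in this era.

from UserString import UserString

class BinaryString(UserString):
    """Illustrative stand-in for the proposed immutable binary string type."""

    def __repr__(self):
        # The strawman asks for a b"" prefix on repr(), as u"" does for Unicode.
        return 'b' + repr(self.data)

    def __add__(self, other):
        # Refuse to silently mix binary data with Unicode text.
        if type(other) is type(u''):
            raise TypeError('cannot add type "binary" to Unicode string')
        return BinaryString(self.data + str(other))

def binary(obj):
    """Rough equivalent of the proposed binary() constructor."""
    return BinaryString(str(obj))

raw = binary("GIF89a\x01\x00")
print repr(raw)    # -> b'GIF89a\x01\x00'

Whether concatenation with plain 8-bit strings should also be refused is exactly the coercion question under discussion; the sketch only blocks the Unicode case.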
An extra difference: d) The str() is the same as the repr(). I think this makes sense. The library reference says str() returns "a nicely printable representation of an object" - and raw binary data definitely isnt. It gives users a chance to think about what they are storing in the string. Also, having repr the same as str is the same as lists, dicts, and other 'data container' types. Toby Dickenson tdickenson@geminidataloggers.com From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 19:29:36 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 20:29:36 +0100 Subject: [I18n-sig] Re: Python Character Model In-Reply-To: <3A820CD2.25C3F978@ActiveState.com> (message from Paul Prescod on Wed, 07 Feb 2001 19:04:50 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> Message-ID: <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> > And if they read in a file from a Frenchmen then they get random Russian > characters on their screen. Or they crash the third-party software > because it couldn't decode properly. Or ... > > This is what we need to move away from. Move-away-from, perhaps. Outright force moving by breaking people's code, no. > Surely you agree with me that it is inappropriate for a user to > *expect* a DOM implementation to pass on binary data > unmolested. That some particular DOM may do so (like minidom) is > probably just a performance optimizatoin quirk that could go away at > any time. Why would we go out of our way to support people making > this mistake? Because of backwards compatibility. Breaking people's programs is not good - even if they are using a style or an algorithm that you despise. > In another message you admitted that the codec mechanism is somewhat > user unfriendly...so I hope we agree that we need something better. No, I admitted that it is inconsequential if read as English. It is no more or less friendly than a module that's called, say, file, so you'd use file.open. In either case, the user will have to learn what to use. Many Python users won't guess the right meaning into codec, just as many people won't guess what "modem" stands for - yet they are fully capable of using it (despite its demodulating nature :). 
Regards, Martin From paulp@ActiveState.com Thu Feb 8 20:11:12 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 08 Feb 2001 12:11:12 -0800 Subject: [I18n-sig] Re: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> Message-ID: <3A82FD60.EFB38FAD@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > Move-away-from, perhaps. Outright force moving by breaking people's > code, no. Guido has been very clear that breaking incorrect code is not necessarily a problem. Remember the two-arg socket issue? Anyhow, I was against the two-arg socket change and I would be against a string/unicode unification *today*. But I strongly believe that we should announce that that is the direction we are going so that people can fix their code to conform with the coming world order. We have a deprecation/warning mechanism precisely so that we can change the language gradually -- even in backwards-incompatible ways. > > In another message you admitted that the codec mechanism is somewhat > > user unfriendly...so I hope we agree that we need something better. > > No, I admitted that it is inconsequential if read as English. It is no > more or less friendly than a module that's called, say, file, so you'd > use file.open. In either case, the user will have to learn what to use. I'll repeat the point that when you make the recommended way to do things harder than the non-recommended (in this case implicit) way, people will be slow to move if they ever move at all. Usability! > Many Python users won't guess the right meaning into codec, just as > many people won't guess what "modem" stands for - yet they are fully > capable of using it (despite its demodulating nature :). You seem to agree above that file is a better name so I'm not sure why we're still talking about "codec"! The only question is whether it should be file.open, fileopen or reusing our existing open. Do you have any problems with reusing open with a quasi-mandatory (HIGHLY RECOMMENDED!) encoding argument? Paul Prescod From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 19:58:20 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 20:58:20 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <9FC702711D39D3118D4900902778ADC81287BF@JUPITER> (message from Toby Dickenson on Thu, 8 Feb 2001 10:26:00 -0000) References: <9FC702711D39D3118D4900902778ADC81287BF@JUPITER> Message-ID: <200102081958.f18JwKd01167@mira.informatik.hu-berlin.de> > print u"hello world" > > rather than the easier > > print "hello world" > > even though the message is clearly text. You can easily have the latter being Unicode by invoking Python with the -U option. If the pragma PEP is ever implemented, one pragma should be reserved to declare the source file encoding, and another one to declare all strings as Unicode in this file. 
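For readers who have not tried it, the effect of the -U switch Martin mentions can be checked with a one-line script; run with plain "python" it prints 0, run with "python -U" it prints 1, because every plain literal is then compiled as a Unicode literal (this is the existing 2.0/2.1 option, nothing new is assumed).

# check_u.py
print type("hello world") is type(u"hello world")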
> I think we agree that, eventually, we would like the simple notation > for a string literal to create a unicode string. What Im not sure > about is whether we can make that change soon. How often are string > literals used to create what is logically just binary data? Let's have a look. Excluding __doc__ strings (which can be recognized syntactically), performing grep '"' in the Python library, I get BaseHTTPServer.py:__version__ = "0.2" BaseHTTPServer.py:__all__ = ["HTTPServer", "BaseHTTPRequestHandler"] Both are "protocol" in some sense, i.e. not meant to be human-readable. +2 for binary data BaseHTTPServer.py:DEFAULT_ERROR_MESSAGE = """\ This is text, giving +1 for binary data. Actually, it is HTML, so when transferring it, it needs to be encoded in some encoding; so it *could* be considered as the encoded message instead BaseHTTPServer.py: sys_version = "Python/" + string.split(sys.version)[0] BaseHTTPServer.py: server_version = "BaseHTTP/" + __version__ BaseHTTPServer.py: self.request_version = version = "HTTP/0.9" # Default BaseHTTPServer.py: self.send_error(400, "Bad request version (%s)BaseHTTPServer.py: "Bad HTTP/0.9 request type (%s BaseHTTPServer.py: self.send_error(400, "Bad request syntax (%s)" % ` BaseHTTPServer.py: self.send_error(501, "Unsupported method (%s)" % ` Part of the HTTP protocol, thus binary data. +9 BaseHTTPServer.py: self.log_error("code %d, message %s", code, message) Log file; this is text, so +8 self.wfile.write("%s %s %s\r\n" % HTTP protocol, +9 There are a few more. In total, BaseHTTPServer.py contains more binary strings than text strings. For other files, the ratio may vary. In general, I believe "binary" strings in source code, as many of the strings are typically processed by some other program which expects a specific byte sequence, rather than a character string. Human-readable strings or probably more common in GUI applications. One should think about i18n here, which means that the actual localized message catalogs must be separate from the program logic. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 20:09:26 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 21:09:26 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <3A82A04F.5A03CAB2@lemburg.com> (mal@lemburg.com) References: <9FC702711D39D3118D4900902778ADC81287BF@JUPITER> <3A82A04F.5A03CAB2@lemburg.com> Message-ID: <200102082009.f18K9QI01197@mira.informatik.hu-berlin.de> > encoded 8-bit string (with encoding > information !) I'd like to point out that this is something that Bill Janssen always wanted to see. In CORBA, they number encodings for efficient representation; that's something that Python could do as well. CORBA took the OSF charset registry. That was a mistake, they think about using the IANA registry now. This registry provides both textual and numeric identifiers for encodings (numeric in the form of MIBEnum values). Regards, Martin From paulp@ActiveState.com Thu Feb 8 20:24:49 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 08 Feb 2001 12:24:49 -0800 Subject: [I18n-sig] Strawman Proposal: Binary Strings References: Message-ID: <3A830091.3D855EDD@ActiveState.com> Toby Dickenson wrote: > > ... > > You say that precise coercion rules are a personal preference, but > adding a coercion function just helps this ambiguity to persist. > > What if string.encode() returned a binary string.... would we need a > 'binary()' builtin at all? I guess not. 
But the encode method might already be in use. If we combine your restrictive coercion suggestion with this suggestion we might break some (admittedly newish) code. How about "str.binencode(encoding)". Also, it isn't entirely unbelievable that someone might want to encode from a string to a string. e.g. base64 (do we call that an encoding??) So having an binencode() seperate from encode() might be a good idea. Alternate names are "binary", "asbinary", "tobinary", "getbinary" and any underscore-separated variant. > ... > I agree any explicit coecion should follow the same rules as Unicode. > Im not sure we agree on whether that coercion happens automatically > and implicitly, as it does with Unicode strings; I feel fairly > strongly that it shouldnt. (Ill justify that tomorrow if we do > disagree). If we were inventing something from whole cloth I would agree with you. But I want people to quickly port their string-using applications over to binary-strings and if we require a bunch more explicit conversions then they will move more slowly. Nevertheless, I'm not willing to fight about the issue. There are two votes against coercion already and if the response is similarly anti-coercion then I'll agree. > An extra difference: > > d) The str() is the same as the repr(). That sounds okay with me... Paul Prescod From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 20:46:16 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 21:46:16 +0100 Subject: [I18n-sig] Re: Python Character Model In-Reply-To: <3A82FD60.EFB38FAD@ActiveState.com> (message from Paul Prescod on Thu, 08 Feb 2001 12:11:12 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> Message-ID: <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> > > Move-away-from, perhaps. Outright force moving by breaking people's > > code, no. > > Guido has been very clear that breaking incorrect code is not > necessarily a problem. For that, a requirement is/should be that the code was documented as incorrect (which I think was the case in the socket calls - even though examples had it wrong). In this case, I think the code currently clearly *is* correct. Nowhere in the Python reference manual is using bytes > 128 in a string declared as incorrect. Instead, the documentation says # 8-bit characters may be used in string literals and comments but # their interpretation is platform dependent; the proper way to insert # 8-bit characters in string literals is by using octal or hexadecimal # escape sequences. So while people should be aware that their scripts many not work on other platforms, I think they are granted permission to use that, and can expect that this continues to work in the same platform-dependent way in the future (if used on the same platform). 
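For reference, the distinction the reference manual draws can be shown in a couple of lines: the escape spellings pin down the byte values regardless of how the source file itself was saved, while typing the accented character directly into the literal stores whatever byte the editor's character set assigns to it.

e_acute = "\xe9"      # byte 0xE9 (e-acute in Latin-1), spelled portably
e_acute2 = "\351"     # the same byte written as an octal escape
print ord(e_acute), ord(e_acute2)    # -> 233 233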
> > Many Python users won't guess the right meaning into codec, just as > > many people won't guess what "modem" stands for - yet they are fully > > capable of using it (despite its demodulating nature :). > > You seem to agree above that file is a better name so I'm not sure why > we're still talking about "codec"! Because that is how it works in Python 2.0. I'm against moving functions around like that, as this gives a clear message that python-dev can and will break anything anytime. Please see a recent message from Fredrik Lundh who was complaining about this very problem: people seem to be fond of breaking other people's code "for their own good". > Do you have any problems with reusing open with a quasi-mandatory > (HIGHLY RECOMMENDED!) encoding argument? No, that is certainly better than another builtin. It'd depend on the exact signature, though - as you point out, open has already two optional arguments. Adding a third one won't fly; it has to be a keyword argument, then. Regards, Martin From paulp@ActiveState.com Thu Feb 8 21:35:12 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 08 Feb 2001 13:35:12 -0800 Subject: [I18n-sig] Re: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> Message-ID: <3A831110.6AADE590@ActiveState.com> "Martin v. Loewis" wrote: > > > ... > > You seem to agree above that file is a better name so I'm not sure why > > we're still talking about "codec"! > > Because that is how it works in Python 2.0. I'm against moving > functions around like that, as this gives a clear message that > python-dev can and will break anything anytime. Aliasing a function into builtins would not break anything! > Please see a recent > message from Fredrik Lundh who was complaining about this very > problem: people seem to be fond of breaking other people's code "for > their own good". That will happen in the language's evolution. We sometimes make mistakes that must be fixed later. I think most of us agree that in the year 2001 it is a mistake to have literal strings map to byte strings instead of character strings. We're going to have to break some code to fix that mistake. The only question is whether we do it a little at a time like K&R C->ANSI C->C++ or in a big bang like C++ -> Java or VB to VB.NET. I prefer the former. > > Do you have any problems with reusing open with a quasi-mandatory > > (HIGHLY RECOMMENDED!) encoding argument? > > No, that is certainly better than another builtin. It'd depend on the > exact signature, though - as you point out, open has already two > optional arguments. Adding a third one won't fly; it has to be a > keyword argument, then. At the bottom of one of my messages I proposed that we insert it as the second argument. Although the encoding and mode are both strings there is no syntactic overlap between [rwa][+]?[tb]+ and the set of existent or proposed encodings. 
If we merely outlaw encodings with that name then we can quickly figure out whether the second argument is a mode or an encoding. So the documented syntax would be open(filename, encoding, [[mode], bytes]) And the documentation would say: "There is an obsolete variant that does not require an encoding string. This may cause a warning in future versions of Python and be removed sometime after that." Paul Prescod From paulp@ActiveState.com Thu Feb 8 23:14:40 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 08 Feb 2001 15:14:40 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <9FC702711D39D3118D4900902778ADC81287BF@JUPITER> <200102081958.f18JwKd01167@mira.informatik.hu-berlin.de> Message-ID: <3A832860.B5D15B3D@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > For other files, the ratio may vary. In general, I believe "binary" > strings in source code, as many of the strings are typically processed > by some other program which expects a specific byte sequence, rather > than a character string. I think that your counting methodology is highly suspect. I consider a binary string to be a string that contains elements that the author did not think of in terms of some subset of Unicode. So for example: sys_version = "Python/" + string.split(sys.version)[0] Nobody would ever expect sys_version to have anything other than Unicode characters in it. The pattern of strings produced here will always be composed only of Unicode-legal elements. A GIF file is binary because most bytes are not intended to be Unicode characters. According to your definition, an XML document comprising a SOAP message is "binary" rather than "text" despite what the XML specification says. After all, what could be more "protocol" than SOAP. Things like the Python version and SOAP messages are designed to be both protocol and text. Thats a major part of what distinguishes SOAP from DCOM or IIOP for example. Paul Prescod From paulp@ActiveState.com Thu Feb 8 23:23:50 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 08 Feb 2001 15:23:50 -0800 Subject: [I18n-sig] Binary data b"strings" (Pre-PEP: Proposed Python Character Model) References: <9FC702711D39D3118D4900902778ADC83244A2@JUPITER> <3A82CA6D.313D5E39@lemburg.com> Message-ID: <3A832A86.71833150@ActiveState.com> I've thought about this coercion issue more...I think we need to auto-coerece these binary strings using some well-defined rule (NOT a default encoding!). "M.-A. Lemburg" wrote: > > > ... > > > > I would want to avoid the need for a 2.0-style 'default encoding', so I > > suggest it shouldnt be possible to mix this type with other strings: > > > > >>> "1"+b"2" > > Traceback (most recent call last): > > File "", line 1, in ? > > TypeError: cannot add type "binary" to string > > >>> "3"==b"3" > > 0 > > Right. This will cause people to rethink whether they are > using the object for text data or binary data. I still think that > at the interface level, b"" and "" should be treated the same (except > that b""-strings should not implement the char buffer interface). If C functions auto-convert these things then people will coerce them by passing them through C functions. e.g. the regular expression engine or null encoding functions or whatever. If we do NOT auto-coerce these things then they will not be compatible with many parts of the Python infrastructure, the regular expression engine and codecs being the most important examples. 
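A minimal sketch of the open(filename, encoding, mode) calling convention Paul proposes above, under the assumption that a simple pattern is enough to tell a mode string from an encoding name. The wrapper name, the pattern and the reliance on codecs.open() are illustrative choices, not part of the proposal.

import re
import codecs

# Second arguments that look like a mode ('r', 'wb', 'a+', ...); anything
# else in that position is treated as an encoding name.
_MODE = re.compile(r'^[rwa]\+?[tb]*$')

def open_with_encoding(filename, encoding_or_mode='r', mode='r'):
    if _MODE.match(encoding_or_mode):
        # Old-style call: open_with_encoding('data.bin', 'rb')
        return open(filename, encoding_or_mode)
    # New-style call: open_with_encoding('readme.txt', 'koi8-r', 'r');
    # codecs.open() returns a file-like object whose read() yields Unicode.
    return codecs.open(filename, mode, encoding_or_mode)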
A clear requirement from Andy Robinson was that string-like code should work on binary data because often binary strings are "really" un-decoded strings. I think he is speaking on behalf of a lot of serious internationalizers there. > OTOH, these b""-strings should implement the same methods as the > array type and probably seemlessly interact with it too. I don't > know which type should be considered "better" in coercion > though, b""-strings or arrays (I guess b""-strings). Let's keep arrays separate. Arrays are mutable! If users ask for some particular features from arrays to be also implemented on byte strings, so be it. Let's only add magic after we know we really need it. Paul Prescod From paulp@ActiveState.com Thu Feb 8 23:45:26 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 08 Feb 2001 15:45:26 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes Message-ID: <3A832F96.B6FEEB51@ActiveState.com> Python source files may declare their encoding in the first few lines. An encoding declaration must be found before the first statement in the source file. The encoding declaration is not a pragma. It does not show up in the parse tree and has no semantic meaning for the compiler itself. It is conceptually handled in a pre-compile "encoding sniffing" step. This step is done using the Latin 1 encoding. The encoding declaration has the following basic syntax: #?encoding="" is the encoding name and must be associated with a registered codec. The appropriate codec is used to decode the source file. The decoded result is passed to the compiler. Once the decoding is done, the encoding declaration has no other effect. In other words, it does not further affect the interpretation of string literals with non-ASCII characters or anything else. The encoding declaration SHOULD be present in all Python source files encoded in any character encoding other than 7-bit ASCII. Some future version of Python may make this an absolute requirement. From paulp@ActiveState.com Fri Feb 9 00:24:39 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 08 Feb 2001 16:24:39 -0800 Subject: [I18n-sig] Strawman Proposal: Smart String Test Message-ID: <3A8338C7.6824679C@ActiveState.com> The types module will contain a new function called isstring(obj) types.isstring returns true if the object could be directly interpreted as a string. This is defined as: "implements the string interface and is compatible with the re regular expression engine." At the moment no user types fit this criteria so there is no mechanism for extending the behavior of the types.isstring function yet. It's initial behavior is: def isstring(obj): return type(obj) in (StringType, UnicodeType) Paul Prescod From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 22:03:23 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Thu, 8 Feb 2001 23:03:23 +0100 Subject: [I18n-sig] Re: Python Character Model In-Reply-To: <3A831110.6AADE590@ActiveState.com> (message from Paul Prescod on Thu, 08 Feb 2001 13:35:12 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> Message-ID: <200102082203.f18M3N105616@mira.informatik.hu-berlin.de> > At the bottom of one of my messages I proposed that we insert it as the > second argument. Although the encoding and mode are both strings there > is no syntactic overlap between [rwa][+]?[tb]+ and the set of existent > or proposed encodings. I missed that; that is a good approach. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Fri Feb 9 08:24:08 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 9 Feb 2001 09:24:08 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <3A832860.B5D15B3D@ActiveState.com> (message from Paul Prescod on Thu, 08 Feb 2001 15:14:40 -0800) References: <9FC702711D39D3118D4900902778ADC81287BF@JUPITER> <200102081958.f18JwKd01167@mira.informatik.hu-berlin.de> <3A832860.B5D15B3D@ActiveState.com> Message-ID: <200102090824.f198O8901238@mira.informatik.hu-berlin.de> > So for example: > > sys_version = "Python/" + string.split(sys.version)[0] > > Nobody would ever expect sys_version to have anything other than Unicode > characters in it. My point is that sys_version is used in self.send_header('Server', self.version_string()) That is, it is sent following a specific transfer syntax of the underlying protocol (HTTP), and that transfer syntax is defined in terms of byte sequences. There is a constraint in the protocol that most of the bytes must be restricted to the printable characters of ASCII, though. Suppose we raise exceptions at some time if something other than bytes are written into a byte stream which has no associated encoding. Then, I suspect, that fragment should rewritten as sys_version = b"Python/" + string.split(sys.version)[0].encode("ASCII") The Server: header that we send will be a byte sequence, not a text message. > According to your definition, an XML document comprising a SOAP message > is "binary" rather than "text" despite what the XML specification says. > After all, what could be more "protocol" than SOAP. It depends. If it goes through an encoding before being transmitted, then it should be represented as a character string. If it is written to a socket directly, e.g. with msg = "some SOAP specific elements I don't know" s.write(msg) Then certainly, yes, that document is represented in a binary string. Please note that some XML document can be represented in many ways: character strings, binary strings, DOM trees, SAX event sequences, etc. 
The "XML document comprising a SOAP message", in itself, has no inherent representation; whether a specific representation ought to be treated as text or binary primarily depends on whether there is encoding or not. Regards, Martin From tdickenson@geminidataloggers.com Fri Feb 9 09:46:12 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Fri, 09 Feb 2001 09:46:12 +0000 Subject: [I18n-sig] Strawman Proposal: Binary Strings In-Reply-To: <3A830091.3D855EDD@ActiveState.com> References: <3A830091.3D855EDD@ActiveState.com> Message-ID: On Thu, 08 Feb 2001 12:24:49 -0800, Paul Prescod wrote: >> What if string.encode() returned a binary string.... would we need a >> 'binary()' builtin at all? > >I guess not. But the encode method might already be in use. If we >combine your restrictive coercion suggestion with this suggestion we >might break some (admittedly newish) code. How about >"str.binencode(encoding)". > >Also, it isn't entirely unbelievable that someone might want to encode >from a string to a string. e.g. base64 (do we call that an encoding??) >So having an binencode() seperate from encode() might be a good idea. >Alternate names are "binary", "asbinary", "tobinary", "getbinary" and >any underscore-separated variant. Yes, the type of value returned from string.encode(x) depends on x. I intended to suggest that string.encode('latin1') would be the best way to convert from string to binary. However, I now see that wont work for plain strings: their .encode() method always goes via unicode, using the default encoding. So: Im happy with you .binary() method on strings. Add it bstrings too (as a 'return self'), but not unicode strings. >> I agree any explicit coecion should follow the same rules as Unicode. >> Im not sure we agree on whether that coercion happens automatically >> and implicitly, as it does with Unicode strings; I feel fairly >> strongly that it shouldnt. (Ill justify that tomorrow if we do >> disagree). > >If we were inventing something from whole cloth I would agree with you. >But I want people to quickly port their string-using applications over >to binary-strings and if we require a bunch more explicit conversions >then they will move more slowly. > >Nevertheless, I'm not willing to fight about the issue. There are two >votes against coercion already and if the response is similarly >anti-coercion then I'll agree. Waaaaaah. There are some backward-compatability issues that complicate my comparison proposal..... Consider some old code that print md5('some stuff').digest() =3D=3D 'reference' We want this to do the right thing after: * changing .digest() to return a string * changing 'reference' to b'reference' * changing both Therefore we have to allow string/bstring comparisons. However, raising an exception on unicode/bstring comparison still makes sense. Toby Dickenson tdickenson@geminidataloggers.com From mal@lemburg.com Fri Feb 9 10:10:33 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 09 Feb 2001 11:10:33 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A832F96.B6FEEB51@ActiveState.com> Message-ID: <3A83C219.86D3F355@lemburg.com> Paul Prescod wrote: > > Python source files may declare their encoding in the first few lines. > An encoding declaration must be found before the first statement in the > source file. > > The encoding declaration is not a pragma. It does not show up in the > parse tree and has no semantic meaning for the compiler itself. 
It is > conceptually handled in a pre-compile "encoding sniffing" step. This > step is done using the Latin 1 encoding. I'd rather restrict this to ASCII since codec names must be ASCII and this would also allow detecting wrong formats of the source file in addition to make UTF-16 detection possible. > The encoding declaration has the following basic syntax: > > #?encoding="" > > is the encoding name and must be associated with a > registered codec. The appropriate codec is used to decode the source > file. Decode to what other format ? Unicode, the current locale's encoding ? What would happen if the decoding step fails ? > The decoded result is passed to the compiler. Once the decoding is > done, the encoding declaration has no other effect. In other words, it > does not further affect the interpretation of string literals with > non-ASCII characters or anything else. But if it doesn't affect the interpretation of string literals then what benefits do we gain from knowing the encoding ? > The encoding declaration SHOULD be present in all Python source files > encoded in any character encoding other than 7-bit ASCII. Some future > version of Python may make this an absolute requirement. I think that such a scheme is indeed possible, but not until we have made all strings default to Unicode. Then decoding to Unicode would be the proper thing to do. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Feb 9 10:26:07 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 09 Feb 2001 11:26:07 +0100 Subject: [I18n-sig] Strawman Proposal: Smart String Test References: <3A8338C7.6824679C@ActiveState.com> Message-ID: <3A83C5BF.30606DD2@lemburg.com> Paul Prescod wrote: > > The types module will contain a new function called > > isstring(obj) > > types.isstring returns true if the object could be directly interpreted > as a string. This is defined as: "implements the string interface and is > compatible with the re regular expression engine." re compatibility is given by read buffer compatibility; it is not restricted to strings. In fact re works on buffers and mmap'ed files too. > At the moment no user > types fit this criteria so there is no mechanism for extending the > behavior of the types.isstring function yet. It's initial behavior is: > > def isstring(obj): > return type(obj) in (StringType, UnicodeType) Looks like a hack which would only serve a temporary need... The right thing to do would be to add a new abstract string type object and then have isinstance() return 1 for StringType and UnicodeType when asked for the new abstract type. The problem with this approach is that we would be constructing a forward compatible mechanism before having designed the class hierarchie (see my other post) for these types, e.g. binary data string (BinaryDataType) | | text data string (TextDataType) | | | | Unicode string encoded 8-bit string (with encoding (UnicodeType) (StringType) information !) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Feb 9 11:56:53 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 09 Feb 2001 12:56:53 +0100 Subject: [I18n-sig] BOF session on IPC9 DevDay about this ? 
Message-ID: <3A83DB05.15628F36@lemburg.com> Would anyone here like to talk through these proposals on DevDay ? Perhaps we could arrange some BOF-session for it, or integrate it into one of the scheduled sessions ?! (Don't know what the procedure for this is, that's why I put Jeff Bauer, the chair of the DevDay on CC). Thanks, -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From paulp@ActiveState.com Fri Feb 9 15:29:54 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 9 Feb 2001 07:29:54 -0800 (PST) Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: <3A83C219.86D3F355@lemburg.com> Message-ID: On Fri, 9 Feb 2001, M.-A. Lemburg wrote: > ...> > I'd rather restrict this to ASCII since codec names must be ASCII > and this would also allow detecting wrong formats of the source file > in addition to make UTF-16 detection possible. That's fine with me. > > is the encoding name and must be associated with a > > registered codec. The appropriate codec is used to decode the source > > file. > > Decode to what other format ? Unicode, the current locale's encoding ? > What would happen if the decoding step fails ? We would decode to Unicode. If Decoding fails you get some kind of EncodingException error. This would be trapped in import machinery to be raised as an ImportError for imported modules. > > The decoded result is passed to the compiler. Once the decoding is > > done, the encoding declaration has no other effect. In other words, it > > does not further affect the interpretation of string literals with > > non-ASCII characters or anything else. > > But if it doesn't affect the interpretation of string literals then > what benefits do we gain from knowing the encoding ? Let's say that you have a string literal: a="XX" XX are bytes representing a character. If the character represented has an ordinal less than 255 then this would work. More often you would say: a=u"XX" The system would treat those examples no differently than this one:t XX="a" This keeps the model very simple and allows us to evolve to wide-character variable names some day. > I think that such a scheme is indeed possible, but not until we > have made all strings default to Unicode. Then decoding to Unicode > would be the proper thing to do. Making all strings default to Unicode is a good idea but it is a separate project. I think that my proposal above is still useful. It means that a Russian can type Unicode characters into their document using their KOI8-R editor. They can't type those Unicode characters directly into a string literal but why would they want to now that we have Unicode? If there is some reason they want to keep typing wide chars into string literals then there must be some problem with our Unicode support and we should work that out. Until we work that out, they probably just wouldn't use our encoding declaration feature. Paul Prescod From paulp@ActiveState.com Fri Feb 9 15:39:02 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 9 Feb 2001 07:39:02 -0800 (PST) Subject: [I18n-sig] Strawman Proposal: Smart String Test In-Reply-To: <3A83C5BF.30606DD2@lemburg.com> Message-ID: On Fri, 9 Feb 2001, M.-A. Lemburg wrote: > > types.isstring returns true if the object could be directly interpreted > > as a string. 
This is defined as: "implements the string interface and is > > compatible with the re regular expression engine." > > re compatibility is given by read buffer compatibility; it is > not restricted to strings. In fact re works on buffers and mmap'ed > files too. There are two conditions listed above. mmap'd files (for example) do not support the string interface. There is no join(), search() etc. > > At the moment no user > > types fit this criteria so there is no mechanism for extending the > > behavior of the types.isstring function yet. It's initial behavior is: > > > > def isstring(obj): > > return type(obj) in (StringType, UnicodeType) > > Looks like a hack which would only serve a temporary need... So? Sometimes a temporary hack is the right thing to do. If you want to revive the types sig to figure out a hierarchical interface concept then go ahead. I trying to solve a very simple, localized and pervasive problem: type(foo)==type("") You proposed that we should handle it by having a tuple or list called StringTypes in the types module. I tried to make a solution that is more forward-thinking because we can bring in your interface hierarchy magic later. Code will just keep working when we figure out how to "subclass strings" because the determination will be made by the isstring abstraction. Is there a practical problem with this solution? Paul Prescod From paulp@ActiveState.com Fri Feb 9 15:40:57 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 9 Feb 2001 07:40:57 -0800 (PST) Subject: [I18n-sig] BOF session on IPC9 DevDay about this ? In-Reply-To: <3A83DB05.15628F36@lemburg.com> Message-ID: Are those BOFs ever useful? Everybody goes in with good intentions and leaves with good intentions and nothing happens...a good email war culminating in one or more PEPs seems more useful to me...the result is concrete. Paul Prescod From tdickenson@geminidataloggers.com Fri Feb 9 16:15:21 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Fri, 09 Feb 2001 16:15:21 +0000 Subject: [I18n-sig] Strawman Proposal: Smart String Test In-Reply-To: References: <3A83C5BF.30606DD2@lemburg.com> Message-ID: <4o588t4683cpu32srcmp928b1m5dr003i3@4ax.com> >Is there a practical problem with this solution? def isstring(obj): return type(obj) in (StringType, UnicodeType) or isinstance(obj, UserString) Toby Dickenson tdickenson@geminidataloggers.com From paulp@ActiveState.com Fri Feb 9 16:40:28 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 9 Feb 2001 08:40:28 -0800 (PST) Subject: [I18n-sig] Strawman Proposal: Smart String Test In-Reply-To: <4o588t4683cpu32srcmp928b1m5dr003i3@4ax.com> Message-ID: On Fri, 9 Feb 2001, Toby Dickenson wrote: > Paul Prescod wrote: > >Is there a practical problem with this solution? > > def isstring(obj): > return type(obj) in (StringType, UnicodeType) or isinstance(obj, > UserString) Are you saying that there is a problem with isstring? Or proposing a slightly different formulation? If the latter: does UserString really behave enough like a string to "work"? I've never tried it. In particular, does passing a UserString to a regexp do what you expect? Can you pass a UserString to the open() command as a filename, etc.? Paul Prescod From mal@lemburg.com Fri Feb 9 17:24:45 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 09 Feb 2001 18:24:45 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: Message-ID: <3A8427DD.2BABEF53@lemburg.com> Paul Prescod wrote: > > On Fri, 9 Feb 2001, M.-A. 
Lemburg wrote: > > > ...> > > I'd rather restrict this to ASCII since codec names must be ASCII > > and this would also allow detecting wrong formats of the source file > > in addition to make UTF-16 detection possible. > > That's fine with me. > > > > is the encoding name and must be associated with a > > > registered codec. The appropriate codec is used to decode the source > > > file. > > > > Decode to what other format ? Unicode, the current locale's encoding ? > > What would happen if the decoding step fails ? > > We would decode to Unicode. If Decoding fails you get some kind of > EncodingException error. This would be trapped in import machinery to be > raised as an ImportError for imported modules. > > > > The decoded result is passed to the compiler. Once the decoding is > > > done, the encoding declaration has no other effect. In other words, it > > > does not further affect the interpretation of string literals with > > > non-ASCII characters or anything else. > > > > But if it doesn't affect the interpretation of string literals then > > what benefits do we gain from knowing the encoding ? > > Let's say that you have a string literal: > > a="XX" > > XX are bytes representing a character. If the character represented has an > ordinal less than 255 then this would work. More often you would say: > > a=u"XX" > > The system would treat those examples no differently than this one:t > > XX="a" > > This keeps the model very simple and allows us to evolve to wide-character > variable names some day. > > > I think that such a scheme is indeed possible, but not until we > > have made all strings default to Unicode. Then decoding to Unicode > > would be the proper thing to do. > > Making all strings default to Unicode is a good idea but it is a separate > project. I think that my proposal above is still useful. It means that a > Russian can type Unicode characters into their document using their KOI8-R > editor. > > They can't type those Unicode characters directly into a string literal > but why would they want to now that we have Unicode? If there is some > reason they want to keep typing wide chars into string literals then there > must be some problem with our Unicode support and we should work that out. > Until we work that out, they probably just wouldn't use our encoding > declaration feature. Ah, ok. The encoding information will only be applied to literal Unicode strings (u"text"), right ? That's in line with what we have already discussed here or on python-dev some time ago. Only then we tried to achive this using some form of pragma statement. So what this strawman suggest is in summary: 1. add an encoding identifier to the top of a source code file 2. use that encoding information to decode u"..." literals into Unicode 3. leave all other literals and text alone Sounds ok, even though it should probably made clear that only the u"" literals actually use the encoding information (perhaps the name should be #?unicode-encoding="" ?) and nothing else. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Feb 9 17:31:38 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 09 Feb 2001 18:31:38 +0100 Subject: [I18n-sig] BOF session on IPC9 DevDay about this ? References: Message-ID: <3A84297A.1F9CAB19@lemburg.com> Paul Prescod wrote: > > Are those BOFs ever useful? 
Everybody goes in with good intentions and > leaves with good intentions and nothing happens...a good email war > culminating in one or more PEPs seems more useful to me...the result is > concrete. I just think that it might be a good idea since people usually don't have the time to read all the mails on high traffic threads like these. It is still useful to get some feedback from them, since the few participants in this thread are not representative for the typical i18n Python user and it is very likely that some aspects are simply forgotten due to a limited view on the implications of a proposal (believe me, I've made that experience more than once during the Unicode integration phase). Anyway, was just a thought. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From paulp@ActiveState.com Fri Feb 9 18:10:38 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 09 Feb 2001 10:10:38 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> Message-ID: <3A84329E.7B7CE012@ActiveState.com> "M.-A. Lemburg" wrote: > > ... > > Ah, ok. The encoding information will only be applied to literal > Unicode strings (u"text"), right ? No, that's very different than what I am suggesting. The encoding is applied to the *text file*. In the initial version, the only place Python would allow Unicode characters is in Unicode literals so currently the only USEFUL place to put those special characters is in Unicode literals. But Python may one day allow Unicode variable names or "simple string literals" and this mechanism will not change its definition or behavior. Only the Python grammar will change. The interpretation of string literals is a totally separate issue. Paul Prescod From mal@lemburg.com Fri Feb 9 19:30:33 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 09 Feb 2001 20:30:33 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> Message-ID: <3A844559.8C284F7A@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > > ... > > > > Ah, ok. The encoding information will only be applied to literal > > Unicode strings (u"text"), right ? > > No, that's very different than what I am suggesting. > > The encoding is applied to the *text file*. -1 The parser has no idea of what to do with Unicode input... this would mean that we would have to make it Unicode aware and this opens a new can of worms; not only in the case where this encoding specifier is used. Also, string literals ("text") would have to translate the Unicode input passed to the parser back to ASCII (or whatever the default encoding is) and this would break code which currently uses strings for data or some specific text encoding. The result would be way to much breakage. > In the initial version, the > only place Python would allow Unicode characters is in Unicode literals > so currently the only USEFUL place to put those special characters is in > Unicode literals. But Python may one day allow Unicode variable names or > "simple string literals" and this mechanism will not change its > definition or behavior. Only the Python grammar will change. Sorry, Paul, but this will never happen. Python is an ASCII programming language and does good at it. > The interpretation of string literals is a totally separate issue. See above. 
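For readers following the exchange, this is roughly what a module using the strawman declaration would look like. The file's on-disk encoding (Latin-1), the variable name and the print line are hypothetical, and which literals the decoding step ultimately affects is precisely the point still being argued.

#?encoding="latin-1"
# Hypothetical source file saved as Latin-1.  Under the strawman the
# declaration only drives a decode-before-compile step; the accented
# character below reaches the compiler as the intended code point.
city = u"Orléans"
print city.encode("utf-8")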
-- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From paulp@ActiveState.com Fri Feb 9 20:04:23 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 09 Feb 2001 12:04:23 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> <3A844559.8C284F7A@lemburg.com> Message-ID: <3A844D47.8DACEE2B@ActiveState.com> "M.-A. Lemburg" wrote: > > ... > > The parser has no idea of what to do with Unicode input... > this would mean that we would have to make it Unicode > aware and this opens a new can of worms; not only in the case > where this encoding specifier is used. Obviously the parser cannot be made unicode aware for Python 2.1 but why not for Python 2.2? What's so difficult about it? There's no rocket science. Also, if we wanted a quick hack, couldn't we implement it at first by "decoding" to UTF-8? Then the parser could look for UTF-8 in Unicode string literals and translate those into real Unicode. > Also, string literals ("text") would have to translate the > Unicode input passed to the parser back to ASCII (or whatever > the default encoding is) and this would break code which currently > uses strings for data or some specific text encoding. It would only break code that adds the encoding declaration. If you don't add the declaration you don't break any code! Plus, we all agree that passing binary data in literal strings should be a deprecated usage eventually. That's why we're inventing binary strings. > ... > Sorry, Paul, but this will never happen. Python is an ASCII > programming language and does good at it. I am amazed to hear you say that. Why SHOULDN'T we allow Chinese variables names some day? This is the 21st century. If we don't go after Asian markets someone else will! I've gotta admit that that kind of Euro-centric attitude sort of annoys me... Paul Prescod From martin@loewis.home.cs.tu-berlin.de Fri Feb 9 20:23:44 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 9 Feb 2001 21:23:44 +0100 Subject: [I18n-sig] Strawman Proposal: Smart String Test In-Reply-To: (message from Paul Prescod on Fri, 9 Feb 2001 07:39:02 -0800 (PST)) References: Message-ID: <200102092023.f19KNic00993@mira.informatik.hu-berlin.de> > So? Sometimes a temporary hack is the right thing to do. If you want to > revive the types sig to figure out a hierarchical interface concept then > go ahead. I trying to solve a very simple, localized and pervasive > problem: > > type(foo)==type("") > > You proposed that we should handle it by having a tuple or list called > StringTypes in the types module. I tried to make a solution that is more > forward-thinking because we can bring in your interface hierarchy magic > later. I'm in favour of adding types.isstring. I know that I have added try: StringTypes = [types.StringType, types.UnicodeType] except AttributeError: StringType = [types.StringType] ... if type(foo) in StringTypes: into many places, and that I had considered suggesting types.StringTypes as a standard feature. I did not provide a patch since it won't help for programs that need to be backwards-compatible. types.isstring won't help for backwards compatibility, either, but it is enough simplification over the original test (type(foo) == type("")), and has a great chance of being forwards-compatible. 
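Spelled out, the compatibility shim Martin describes might look like the following sketch; both branches bind the same StringTypes name so the membership test works on pre-Unicode Pythons as well.

import types

try:
    # Python 2.0 and later: both 8-bit and Unicode string types exist.
    StringTypes = (types.StringType, types.UnicodeType)
except AttributeError:
    # Older Pythons without Unicode: fall back to the 8-bit type alone.
    StringTypes = (types.StringType,)

def isstring(obj):
    # Forward-compatible spelling of the common type(foo) == type("") check.
    return type(obj) in StringTypes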
Regards, Martin From martin@loewis.home.cs.tu-berlin.de Fri Feb 9 20:27:14 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 9 Feb 2001 21:27:14 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: <3A8427DD.2BABEF53@lemburg.com> (mal@lemburg.com) References: <3A8427DD.2BABEF53@lemburg.com> Message-ID: <200102092027.f19KREW00995@mira.informatik.hu-berlin.de> > So what this strawman suggests is in summary: > > 1. add an encoding identifier to the top of a source code file > 2. use that encoding information to decode u"..." literals into > Unicode > 3. leave all other literals and text alone I think the proposal was to do 3. raise an error if another literal uses bytes > 127 instead. Since users need to actively change their source to use the encoding declaration, they'll combine this with putting u in front of every affected string. If they then still have strings with bytes >127, they need to use the \x notation, as the string should not contain text. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Fri Feb 9 20:47:26 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 9 Feb 2001 21:47:26 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: <3A844D47.8DACEE2B@ActiveState.com> (message from Paul Prescod on Fri, 09 Feb 2001 12:04:23 -0800) References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> <3A844559.8C284F7A@lemburg.com> <3A844D47.8DACEE2B@ActiveState.com> Message-ID: <200102092047.f19KlQO01127@mira.informatik.hu-berlin.de> > > Sorry, Paul, but this will never happen. Python is an ASCII > > programming language and does good at it. > > I am amazed to hear you say that. Why SHOULDN'T we allow Chinese > variable names some day? This is the 21st century. If we don't go after > Asian markets someone else will! I've gotta admit that that kind of > Euro-centric attitude sort of annoys me... I'm not sure it is Euro-centric; many European languages have characters that can't be used in identifiers, either. People have learned to accept this restriction. Furthermore, there have been experiments with allowing arbitrary characters in identifiers, so users could use their language for identifiers. Turns out that this is nonsense, since it gives a mix of English and local language; i.e.

    from string import atoi
    def zähle():
        global Zähler
        try:
            Zähler = Zähler + 1
            print >>Datei, atoi(Zähler)
        except IOError:
            raise Fehler()

does not read very well. People have then attempted to translate the keywords as well, which would roughly give

    Aus Zeichenkette importiere AzuG   # ASCII zu Ganzzahl
    def zähle():
        globaler Zähler   # oder vielleicht besser: globales Zähler?
        versuche:
            Zähler = Zähler + 1
            print >>Datei, atoi(Zähler)
        außer EinAusFehler:
            wirf Fehler()

which is clearly nonsense: for one thing, most people will have difficulty recognizing the logic, even if they know both Python and German. In addition, the constructs read well only in English - other languages have different grammatical structures, for which you'd pick different syntactical rules in your programming language. So Python's syntax and standard library are already English-centric; allowing additional identifiers won't fix that. I'd really like to know whether speakers of non-Roman and non-Germanic languages feel differently about this issue: Should it be possible to pick Kanji characters as identifiers? Regards, Martin P.S.
Something than can and should be done about Python itself is translating the doc strings. Any volunteers that are interested in doing so, please contact me - or just start translating the message catalog that sits in the non-dist branch of the Python CVS. From mal@lemburg.com Fri Feb 9 21:39:42 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 09 Feb 2001 22:39:42 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <200102092027.f19KREW00995@mira.informatik.hu-berlin.de> Message-ID: <3A84639E.B86D57F2@lemburg.com> "Martin v. Loewis" wrote: > > > So what this strawman suggest is in summary: > > > > 1. add an encoding identifier to the top of a source code file > > 2. use that encoding information to decode u"..." literals into > > Unicode > > 3. leave all other literals and text alone > > I think the proposal was to do > > 3. raise an error if another literal uses bytes > 127 > > instead. Since users need to actively change their source to use the > encoding declaration, they'll combine this with putting u in front of > every affected string. If they then still have strings with bytes > >127, they need to use the \x notation, as the string should not > contain text. Hmm, are you sure this would make the encoding declaration a popular tool ? If we would just allow ASCII-supersets as source file encoding, then we wouldn't have to make that restriction, since only the Unicode literal handling in the parser would have to be adjusted (and this is easy to do). This would make UTF-16 encodings impossible, but I think that two-byte encodings not the right approach to maintainable programs anyways ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Feb 9 21:55:46 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 09 Feb 2001 22:55:46 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> <3A844559.8C284F7A@lemburg.com> <3A844D47.8DACEE2B@ActiveState.com> Message-ID: <3A846762.C0735C2F@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > > ... > > > > The parser has no idea of what to do with Unicode input... > > this would mean that we would have to make it Unicode > > aware and this opens a new can of worms; not only in the case > > where this encoding specifier is used. > > Obviously the parser cannot be made unicode aware for Python 2.1 but why > not for Python 2.2? What's so difficult about it? There's no rocket > science. > > Also, if we wanted a quick hack, couldn't we implement it at first by > "decoding" to UTF-8? Then the parser could look for UTF-8 in Unicode > string literals and translate those into real Unicode. I don't want to do "quick hacks", so this is a non-option. Making the parser Unicode aware is non-trivial as it requires changing lots of the internals which expect 8-bit C char buffers. It will eventually happen, but is not high priority since it only servers a convenience and not a real need. > > Also, string literals ("text") would have to translate the > > Unicode input passed to the parser back to ASCII (or whatever > > the default encoding is) and this would break code which currently > > uses strings for data or some specific text encoding. > > It would only break code that adds the encoding declaration. 
If you > don't add the declaration you don't break any code! If we change the parser to use Unicode, then we would have to decode *all* program text into Unicode and this is very likely to fail for people who put non-ASCII characters into their string literals. > Plus, we all agree that passing binary data in literal strings should be > a deprecated usage eventually. That's why we're inventing binary > strings. Yes, but this move needs time... binary strings are meant as easy to use alternative, so that programmers can easily make the required changes to their code (adding a few b's in front of their string literals won't hurt that much). > > ... > > Sorry, Paul, but this will never happen. Python is an ASCII > > programming language and does good at it. > > I am amazed to hear you say that. Why SHOULDN'T we allow Chinese > variables names some day? This is the 21st century. If we don't go after > Asian markets someone else will! I've gotta admit that that kind of > Euro-centric attitude sort of annoys me... ASCII is not Euro-centric at all since it is a common subset of very many common encodings which are in use today. Latin-1 would be, though... which is why ASCII was chosen as standard default encoding. The added flexibility in choosing identifiers would soon turn against the programmers themselves. Others have tried this and failed badly (e.g. look at the language specific versions of Visual Basic). -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Fri Feb 9 22:13:36 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 9 Feb 2001 23:13:36 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: <3A84639E.B86D57F2@lemburg.com> (mal@lemburg.com) References: <3A8427DD.2BABEF53@lemburg.com> <200102092027.f19KREW00995@mira.informatik.hu-berlin.de> <3A84639E.B86D57F2@lemburg.com> Message-ID: <200102092213.f19MDaY01399@mira.informatik.hu-berlin.de> > If we would just allow ASCII-supersets as source file encoding, > then we wouldn't have to make that restriction, since only the > Unicode literal handling in the parser would have to be adjusted > (and this is easy to do). That would work, I'm not feeling to strongly either way. > This would make UTF-16 encodings impossible, but I think that > two-byte encodings not the right approach to maintainable programs > anyways ;-) I certainly agree. I think Python should assume UTF-8 for Unicode strings in the long run unless declared otherwise, as that seems to be the winning encoding - unless MS can talk everybody into putting a BOM into every file. Regards, Martin From paulp@ActiveState.com Sat Feb 10 03:07:39 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 09 Feb 2001 19:07:39 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> <3A844559.8C284F7A@lemburg.com> <3A844D47.8DACEE2B@ActiveState.com> <3A846762.C0735C2F@lemburg.com> Message-ID: <3A84B07B.24834996@ActiveState.com> "M.-A. Lemburg" wrote: > > > ... > > Also, if we wanted a quick hack, couldn't we implement it at first by > > "decoding" to UTF-8? Then the parser could look for UTF-8 in Unicode > > string literals and translate those into real Unicode. > > I don't want to do "quick hacks", so this is a non-option. 
If it works and it is easy, there should not be a problem! > Making the parser Unicode aware is non-trivial as it requires > changing lots of the internals which expect 8-bit C char buffers. Are you talking about the Python internals or the parser internals? If the former, then I do not think you are correct. Only the parser needs to change. > If we change the parser to use Unicode, then we would > have to decode *all* program text into Unicode and this is very > likely to fail for people who put non-ASCII characters into their > string literals. Files with no declaration could be interpreted byte for char just as they are today! > .... > ASCII is not Euro-centric at all since it is a common subset > of very many common encodings which are in use today. Oh come on! The ASCII characters are sufficient to encode English and a very few other languages. > Latin-1 > would be, though... which is why ASCII was chosen as standard > default encoding. We could go back and forth on this but let me suggest you type in a program with Latin 1 in your Unicode literals and try and see what happens. Python already "recognizes" that there is a single logical translation from "old style strings" to Unicode strings and vice versa. > The added flexibility in choosing identifiers would soon turn > against the programmers themselves. Others have tried this and > failed badly (e.g. look at the language specific versions of > Visual Basic). That's a totally different and unrelated issue. Nobody is talking about language specific Pythons. We're talking about allowing people to name variables in their own languages. I think that anything else is Euro-centric. Paul Prescod From paulp@ActiveState.com Sat Feb 10 03:11:33 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 09 Feb 2001 19:11:33 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <200102092027.f19KREW00995@mira.informatik.hu-berlin.de> <3A84639E.B86D57F2@lemburg.com> Message-ID: <3A84B165.5F1D20D4@ActiveState.com> "M.-A. Lemburg" wrote: > > ... > > Hmm, are you sure this would make the encoding declaration a > popular tool ? > > If we would just allow ASCII-supersets as source file encoding, > then we wouldn't have to make that restriction, since only the > Unicode literal handling in the parser would have to be adjusted > (and this is easy to do). We have always said that only ASCII-supersets should be legal source file encodings. The compromise is to make the use of non-ASCII bytes only legal inside of Unicode literals. Then in the future we can either go "my way" (decode the whole file) or "your way" (decode only literals). Is that acceptable? Paul Prescod From paulp@ActiveState.com Sat Feb 10 04:01:36 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 09 Feb 2001 20:01:36 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> <3A844559.8C284F7A@lemburg.com> <3A844D47.8DACEE2B@ActiveState.com> <200102092047.f19KlQO01127@mira.informatik.hu-berlin.de> Message-ID: <3A84BD20.3042C87A@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > Furthermore, there have been experiments with allowing arbitrary > characters in identifiers, so users could use their language for > identifiers. Turns out that this is nonsense, since it gives a mix of > English and local language; i.e. > > ... > > def zähle(): > global Zähler I have seen this kind of code on Python-list.
Maybe the examples did not use "funny characters" but they certainly used other languages. I see no reason to restrict it, or only to English-like languages. You and I may or may not agree that it is a great idea but do you really feel comfortable saying: "this is technically possible and maybe someone has even submitted a patch to allow it but we won't support it because we think everyone should code in English." Paul Prescod From tim.one@home.com Sat Feb 10 04:45:45 2001 From: tim.one@home.com (Tim Peters) Date: Fri, 9 Feb 2001 23:45:45 -0500 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: <3A84BD20.3042C87A@ActiveState.com> Message-ID: [Paul Prescod, to Martin] > ... do you really feel comfortable saying: "this is technically > possible and maybe someone has even submitted a patch to allow > it but we won't support it because we think everyone should code > in English." I'm comfortable with saying that, regardless of tech issues. The keywords and builtins and standard libraries are all written with English names no matter what, and that maximizes readability regardless of native tongue. Readability is important with Python. It's not like Python is unique here, either: from Algol to Ruby, virtually everyone from Europe to Asia who invents their own programming language sticks to English too. Now Java has supported Unicode source code since its beginning. If someone can dig up evidence that this has done more than complicate their compilers, *that* would be good to hear. I simply see no (zilch, nada) demand for this, outside of a handful of Gallic purists who would abandon the language anyway as soon as they realized Guido was Dutch <0.9 wink>. From martin@loewis.home.cs.tu-berlin.de Sat Feb 10 06:27:45 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sat, 10 Feb 2001 07:27:45 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: <3A84BD20.3042C87A@ActiveState.com> (message from Paul Prescod on Fri, 09 Feb 2001 20:01:36 -0800) References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> <3A844559.8C284F7A@lemburg.com> <3A844D47.8DACEE2B@ActiveState.com> <200102092047.f19KlQO01127@mira.informatik.hu-berlin.de> <3A84BD20.3042C87A@ActiveState.com> Message-ID: <200102100627.f1A6Rj400812@mira.informatik.hu-berlin.de> > you really feel comfortable saying: "this is technically possible > and maybe someone has even submitted a patch to allow it but we > won't support it because we think everyone should code in English." I would not feel comfortable, because I doubt it is technically possible. It may work for identifiers of variables (including functions and classes), however, it will surely fail for the names of packages and modules. As I said before, I tried to write such a program in Java, which is intended to support this. It failed, because the interpreter could not find the class file (which must have the name of the class in Java, unfortunately). Regards, Martin From martin@loewis.home.cs.tu-berlin.de Sat Feb 10 06:55:12 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Sat, 10 Feb 2001 07:55:12 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: <3A84B165.5F1D20D4@ActiveState.com> (message from Paul Prescod on Fri, 09 Feb 2001 19:11:33 -0800) References: <3A8427DD.2BABEF53@lemburg.com> <200102092027.f19KREW00995@mira.informatik.hu-berlin.de> <3A84639E.B86D57F2@lemburg.com> <3A84B165.5F1D20D4@ActiveState.com> Message-ID: <200102100655.f1A6tCT01130@mira.informatik.hu-berlin.de> > We have always said that only ASCII-supersets should be legal source > file encodings. That may be a bit too restrictive. I understand that people use all of EUC-JP, Shift-JIS, and ISO-2022-JP to encode Japanese text. I'm not certain whether iso-2022 is used in source code, but the first two certainly are (euc-jp on Unix, shift-jis on Windows). My understanding is that only EUC-JP is an ASCII superset (*) (i.e. all bytes representing JIS characters are >127); in Shift-JIS, the encoding of a character is two bytes, of which only the first byte is always >128. Since Shift-JIS is quite common, it should be supported as a file encoding. Regards, Martin (*) ignoring the question whether \x24 is the DOLLAR SIGN or the YEN SIGN. From mal@lemburg.com Sat Feb 10 12:14:57 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 10 Feb 2001 13:14:57 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <200102092027.f19KREW00995@mira.informatik.hu-berlin.de> <3A84639E.B86D57F2@lemburg.com> <3A84B165.5F1D20D4@ActiveState.com> Message-ID: <3A8530C1.479544D2@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > > ... > > > > Hmm, are you sure this would make the encoding declaration a > > popular tool ? > > > > If we would just allow ASCII-supersets as source file encoding, > > then we wouldn't have to make that restriction, since only the > > Unicode literal handling in the parser would have to be adjusted > > (and this is easy to do). > > We have always said that only ASCII-supersets should be legal source > file encodings. Right. > The compromise is to make the use of non-ASCII bytes only legal inside > of Unicode literals. Then in the future we can either go "my way" > (decode the whole file) or "your way" (decode only literals). > > Is that acceptable? No, it's too restrictive and would break programs written using non-ASCII characters in normal string literals. We could agree on this though: 1. programs which do not use the encoding declaration are free to use non-ASCII bytes in literals; Unicode literals must use Latin-1 (for historic reasons) 2. programs which do make use of the encoding declaration may only use non-ASCII bytes in Unicode literals; these are then interpreted using the given encoding information and decoded into Unicode during the compilation step Part 1 assures backward compatibility. Part 2 assures that programmers start to think about where they have to use Unicode and which program literals are allowed to go into string literals. Part 1 is already implemented, part 2 is easy to do, since only the compiler will have to be changed (in two places). How's that for a compromise ? -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sat Feb 10 12:37:06 2001 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Sat, 10 Feb 2001 13:37:06 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> <3A844559.8C284F7A@lemburg.com> <3A844D47.8DACEE2B@ActiveState.com> <3A846762.C0735C2F@lemburg.com> <3A84B07B.24834996@ActiveState.com> Message-ID: <3A8535F2.B56642B6@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > > > ... > > > Also, if we wanted a quick hack, couldn't we implement it at first by > > > "decoding" to UTF-8? Then the parser could look for UTF-8 in Unicode > > > string literals and translate those into real Unicode. > > > > I don't want to do "quick hacks", so this is a non-option. > > If it works and it is easy, there should not be a problem! This is how I started into the Unicode debate (making UTF-8 the default encoding). It doesn't work out... let's not restart that discussion. > > Making the parser Unicode aware is non-trivial as it requires > > changing lots of the internals which expect 8-bit C char buffers. > > Are you talking about the Python internals or the parser internals. If > the former, then I do not think you are correct. Only the parser needs > to change. The parser would have to accept Py_UNICODE strings and work on these. The compiler needs to be able to convert Py_UNICODE back to char for e.g. string literals. We'd also have to provide external interfaces which convert char input for the parser into Unicode. This would introduce many new locations of possible breakage (please remember that variable length encodings are *very* touchy about wrong byte sequences). > > If we change the parser to use Unicode, then we would > > have to decode *all* program text into Unicode and this is very > > likely to fail for people who put non-ASCII characters into their > > string literals. > > Files with no declaration could be interpreted byte for char just as > they are today! Then we'd have to write two sets of parsers and compilers: one for Py_UNICODE and one for char... no way ;-) > > .... > > ASCII is not Euro-centric at all since it is a common subset > > of very many common encodings which are in use today. > > Oh come on! The ASCII characters are sufficient to encode English and a > very few other languages. Paul, programs have been written in ASCII for many many years. Are you trying to tell me that 30+ years of common usage should be ignored ? Programmers have gotten along with ASCII quite well, not only English speaking ones -- ASCII can be used to approximate quite a few other languages as well (provided you ignore accents and the like). For most other languages there are transliterations into ASCII which are in common use. For other good arguments, see Tim's post on the subject. > > Latin-1 > > would be, though... which is why ASCII was chosen as standard > > default encoding. > > We could go back and forth on this but let me suggest you type in a > program with Latin 1 in your Unicode literals and try and see what > happens. Python already "recognizes" that there is a single logical > translation from "old style strings" to Unicode strings and vice versa. Fact is, I would never use Latin-1 characters outside of literals. All my programs are written in (more or less ;) English, even the comments and doc-strings. If you ever write applications which programmers from around the world are supposed to comprehend and maintain, then English is the only reasonable common base, at least IMHO. 
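As a concrete illustration of the literal behaviour referred to above - in Python 2.0/2.1, non-ASCII bytes inside a u"..." literal are taken to be Latin-1 - here is a minimal sketch; the \xe4 escape is used so the example is independent of the source file's own encoding, but a raw Latin-1 byte in the source has the same effect in those versions:

    s = "\xe4"      # plain 8-bit string: a single byte with value 0xE4
    u = u"\xe4"     # Unicode literal: the same value read as Latin-1,
                    # i.e. U+00E4, LATIN SMALL LETTER A WITH DIAERESIS

    print len(s), len(u)            # -> 1 1
    print repr(u)                   # -> u'\xe4'
    print repr(u.encode("utf-8"))   # -> '\xc3\xa4' (two bytes in UTF-8)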
> > The added flexibility in choosing identifiers would soon turn > > against the programmers themselves. Others have tried this and > > failed badly (e.g. look at the language specific versions of > > Visual Basic). > > That's a totally different and unrelated issue. Nobody is talking about > language specific Pythons. We're talking about allowing people to name > variables in their own languages. I think that anything else is > Euro-centric. Funny, how you always refer to "Euro"-centric... ASCII is an American standard ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sat Feb 10 14:56:10 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 10 Feb 2001 15:56:10 +0100 Subject: [I18n-sig] Strawman Proposal: Binary Strings References: <3A830091.3D855EDD@ActiveState.com> Message-ID: <3A85568A.5B694917@lemburg.com> Toby Dickenson wrote: > > On Thu, 08 Feb 2001 12:24:49 -0800, Paul Prescod > wrote: > > >> What if string.encode() returned a binary string.... would we need a > >> 'binary()' builtin at all? binary() is needed one way or another. It is standard Python philosophy that all types need to have an exposed constructor and these should do some form of implicit or explicit but well-defined coercion from other data types to binary strings. About changing .encode() or the existing codecs to return binary strings instead of normal strings: I'm -1 on this one since it will break existing code. The outcome of .encode() is totally up the codec doing the work, BTW (and this is by design), so new codecs could choose to return binary strings. For converting strings or Unicode to binary data, I'd suggest to add a "binary" codec which then returns the raw bytes of the string ior Unicode object in question as binary string. Note that changing e.g. .encode('latin-1') to return a binary string doesn't really make sense, since here we know the encoding ! Instead, strings should probably carry along the encoding information in an additional attribute (it is not always useful, but can help in a few situations) provided that it is known. This would give us three string types: 1. standard 8-bit strings with encoding attribute 2. binary 8-bit strings without encoding attribute or a constant value of 'binary' for this attribute 3. Unicode strings which don't need an encoding attribute :-) Hmm, getting all these to properly interoperate without breaking existing code will be troublesome... -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From paulp@ActiveState.com Sat Feb 10 15:17:29 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 07:17:29 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: Message-ID: <3A855B89.459A18E4@ActiveState.com> Tim Peters wrote: > > ... > > I'm comfortable with saying that, regardless of tech issues. The keywords > and builtins and standard libraries are all written with English names no > matter what, and that maximizes readability regardless of native tongue. People keep bringing up this issue of keywords. I've never disputed that the keywords should always be English. > Now Java has supported Unicode source code since its beginning. 
If someone > can dig up evidence that this has done more than complicate their compilers, > *that* would be good to hear. I simply see no (zilch, nada) demand for > this, outside of a handful of Gallic purists who would abandon the language > anyway as soon as they realized Guido was Dutch <0.9 wink>. I'm not personally willing to design in such a limitiation. I have seen a lot of code that mixes other languages with English. e.g.: http://starship.python.net/pipermail/python-de/2000q3/000597.html I don't think this guy is doing anything wrong. If a Japansese person asks me if they could do the same I would say: "Not now, but hopefully someday." There are a lot of people who write code that will never be seen by a speaker of an ASCII-compatible language. Why should they be forced to write it in ASCII? Paul Prescod From paulp@ActiveState.com Sat Feb 10 15:24:02 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 07:24:02 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> <3A844559.8C284F7A@lemburg.com> <3A844D47.8DACEE2B@ActiveState.com> <200102092047.f19KlQO01127@mira.informatik.hu-berlin.de> <3A84BD20.3042C87A@ActiveState.com> <200102100627.f1A6Rj400812@mira.informatik.hu-berlin.de> Message-ID: <3A855D12.46E4376E@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > I would not feel comfortable, because I doubt it is technically > possible. It may work for identifiers of variables (including > functions and classes), however, it will surely fail for the names of > packages and modules. It isn't worth arguing about because I'm not even proposing to move to Unicode variable names today. But can we agree not to cut ourselves off from that option? > As I said before, I tried to write such a program in Java, which is > intended to support this. It failed, because the interpreter could not > find the class file (which must have the name of the class in Java, > unfortunately). As you point out, the problem is much more serious in Java because of the classname/filename binding. Anyhow, file systems are getting more and more i18n aware. Paul Prescod From paulp@ActiveState.com Sat Feb 10 15:27:58 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 07:27:58 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <200102092027.f19KREW00995@mira.informatik.hu-berlin.de> <3A84639E.B86D57F2@lemburg.com> <3A84B165.5F1D20D4@ActiveState.com> <3A8530C1.479544D2@lemburg.com> Message-ID: <3A855DFE.6FA48392@ActiveState.com> "M.-A. Lemburg" wrote: > > > ... > > We could agree on this though: > > 1. programs which do not use the encoding declaration are free > to use non-ASCII bytes in literals; Unicode literals must > use Latin-1 (for historic reasons) > > 2. programs which do make use of the encoding declaration may > only use non-ASCII bytes in Unicode literals; these are then > interpreted using the given encoding information and decoded > into Unicode during the compilation step I thought that's what I suggested! I am comfortable with that design. Paul Prescod From guido@digicool.com Sat Feb 10 15:32:19 2001 From: guido@digicool.com (Guido van Rossum) Date: Sat, 10 Feb 2001 10:32:19 -0500 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: Your message of "Sat, 10 Feb 2001 07:24:02 PST." 
<3A855D12.46E4376E@ActiveState.com> References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> <3A844559.8C284F7A@lemburg.com> <3A844D47.8DACEE2B@ActiveState.com> <200102092047.f19KlQO01127@mira.informatik.hu-berlin.de> <3A84BD20.3042C87A@ActiveState.com> <200102100627.f1A6Rj400812@mira.informatik.hu-berlin.de> <3A855D12.46E4376E@ActiveState.com> Message-ID: <200102101532.KAA27642@cj20424-a.reston1.va.home.com> > As you point out, the problem is much more serious in Java because of > the classname/filename binding. And Python doesn't have this problem? --Guido van Rossum (home page: http://www.python.org/~guido/) From paulp@ActiveState.com Sat Feb 10 15:37:04 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 07:37:04 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <200102092027.f19KREW00995@mira.informatik.hu-berlin.de> <3A84639E.B86D57F2@lemburg.com> <3A84B165.5F1D20D4@ActiveState.com> <200102100655.f1A6tCT01130@mira.informatik.hu-berlin.de> Message-ID: <3A856020.7D55951C@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > My understanding is that only EUC-JP is an ASCII superset (*) > (i.e. all bytes representing JIS characters are >127); in Shift-JIS, > the encoding of a character is two bytes, of which only the first byte > is always >128. Since Shift-JIS is quite common, it should be > supported as a file encoding. I don't think it is reasonable in the short term to support characte sets that cannot be lexed with the current Python lexer. I think we should design with Shift-JIS in mind for the future but for now I think we should limit our list of supported encodings to those that don't require large Python parser changes. Paul Prescod From paulp@ActiveState.com Sat Feb 10 15:46:32 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 07:46:32 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> <3A844559.8C284F7A@lemburg.com> <3A844D47.8DACEE2B@ActiveState.com> <200102092047.f19KlQO01127@mira.informatik.hu-berlin.de> <3A84BD20.3042C87A@ActiveState.com> <200102100627.f1A6Rj400812@mira.informatik.hu-berlin.de> <3A855D12.46E4376E@ActiveState.com> <200102101532.KAA27642@cj20424-a.reston1.va.home.com> Message-ID: <3A856258.46EFCF68@ActiveState.com> Guido van Rossum wrote: > > > As you point out, the problem is much more serious in Java because of > > the classname/filename binding. > > And Python doesn't have this problem? The problem is not as serious in Python because of "import as". I suspect that clever use of introspective features would also allow you to map classnames back and forth between filesystem ASCII and your native language. Nevertheless, I want to point out that I am not advocating that Python support full Unicode source files in the short term. I will carefully scrutinize any new language feature designed with the assumption that Python will be "forever ASCII." I see that as Y2K thinking. Also, our i18n migration issues are painful enough right now to make me "twice shy" about assumptions. I don't want to go through this again in five years. 
Paul Prescod From paulp@ActiveState.com Sat Feb 10 15:58:22 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 07:58:22 -0800 Subject: [I18n-sig] Strawman Proposal: Encoding Declaration V2 Message-ID: <3A85651E.C11C7B2B@ActiveState.com> The encoding declaration controls the interpretation of non-ASCII bytes in the Python source file. The declaration manages the mapping of non-ASCII byte strings into Unicode characters. A source file with an encoding declaration must only use non-ASCII bytes in places that can legally support Unicode characters. In Python 2.x the only place is within a Unicode literal. This restriction may be lifted in future versions of Python. In Python 2.x, the initial parsing of a Python script is done in terms of the file's byte values. Therefore it is not legal to use any byte sequence that has a byte that would be interpreted as a special character (e.g. quote character or backslash) according to the ASCII character set. This restriction may be lifted in future versions of Python. The encoding declaration must be found before the first statement in the source file. The declaration is not a pragma. It does not show up in the parse tree and has no semantic meaning for the compiler itself. It is conceptually handled in a pre-compile "encoding sniffing" step. This step is also done using the ASCII encoding. The encoding declaration has the following basic syntax:

    #?encoding="<encoding name>"

<encoding name> is the encoding name and must be associated with a registered codec. The codec is used to interpret non-ASCII byte sequences. The encoding declaration should be present in all Python source files containing non-ASCII bytes. Some future version of Python may make this an absolute requirement. Paul Prescod From andy@reportlab.com Sat Feb 10 16:13:37 2001 From: andy@reportlab.com (Andy Robinson) Date: Sat, 10 Feb 2001 16:13:37 -0000 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? Message-ID: This reminds me a lot of another debate going on close to home :-)

- people who are in favour assume everyone else is, and that the only question is how to get there
- people who are against are just plain worried but can't say why
- the government stays very quiet and avoids asking for a referendum

I want to re-ask the big question: is it desirable that the standard string type should become a Unicode string one day? To my knowledge, all the pressure for making Unicode strings the fundamental data type comes from Americans and Western Europeans who think they are doing the right thing. This is far from proven. Please consider these points:

1. To my knowledge we have not seen posts from anyone outside the ISO-Latin-1 zone in this thread.
2. I have been told that there are angry mumblings on the Python-Japan mailing list that such a change would break all their existing Python programs; I'm trying to set up my tools to ask out loud in that forum.
3. Ruby was designed in Japan and that's where most of its users are. They have a few conversion functions and seem perfectly happy.
4. Visual Basic running under Windows 2000 with every international option I can find will accept unicode characters in string literals but will not accept characters outside of ISO-Latin-1 in identifiers.
5. All the Japanese-written code I have seen (not much of it is in Python, lots in Java and VB) either uses English variable names or the romanized Japanese ones ('total'='gokei'). No one I know of has complained about this limitation.
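Returning to the Encoding Declaration V2 strawman above, a source file using the proposed declaration might look like the following sketch. The #?encoding syntax is the strawman's own, not an implemented feature, and "iso-8859-1" is just an illustrative codec name:

    #?encoding="iso-8859-1"

    # Non-ASCII bytes may appear only inside Unicode literals; under the
    # proposal they are decoded with the declared codec at compile time.
    greeting = u"Grüße"

    # Plain string literals stay ASCII (use escapes for raw byte values).
    data = "\x47\x72"

    print greeting.encode("utf-8")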
I do NOT want to kill off this discussion, which is producing an interesting proposal and I am in favour of many point it raises. However, I think we should make a real effort to see what the market actually wants and if the implied goal is right. It would be tragic to break old code one day for improvements nobody cares about - or, worse, to alienate exactly the people we are trying to cater for. Now, who can we ask outide our own community who could have insights into this? My shortlist so far is: - Frank Chen (wrote our Chinese and Korean Codecs) - Tamito Kajiyama (wrote our Japanese Codecs) - Ruby mailing lists - Python Japan Mailing List - Basis Technologies (Tom, are you there?) - Digital Garage and recent escapees (Brian?) - the CTO of a Kuwaiti bank I know - Ken Lunde (author of that CJKV book) - Tony Graham (author, of Unicode: A Primer and a member of the Unicode consortium) - James Clark (XML fame, lives in Thailand) I'm going to try to think up a questionnaire. If anyone can suggest other domain experts, or mailing lists of user groups in other language zones, I will be happy to try and pursue them and get some real hard data. Best Regards, Andy Robinson From andy@reportlab.com Sat Feb 10 16:58:39 2001 From: andy@reportlab.com (Andy Robinson) Date: Sat, 10 Feb 2001 16:58:39 -0000 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? In-Reply-To: Message-ID: > Now, who can we ask outide our own community who could > have insights into this? My shortlist so far is: > [snip] > - Frank Chen (wrote our Chinese and Korean Codecs) > - Tamito Kajiyama (wrote our Japanese Codecs) > - Basis Technologies (Tom, are you there?) > - Digital Garage and recent escapees (Brian?) Sorry, really bad wording. The above are definitely valued parts of our community - no offence intended! - Andy Robinson From paulp@ActiveState.com Sat Feb 10 19:17:19 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 11:17:19 -0800 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? References: Message-ID: <3A8593BF.8AFCEBB3@ActiveState.com> Andy, I think that part of the reason that Westerners push harder for Unicode than Japanese is because we are pressured (rightly) to right software that works world-wide and it is simply not sane to try to do that by supporting multiple character sets. Multiple encodings maybe. Multiple character sets? Forget it. I don't know of any commercial software written in Japan but used in the west so I think that they probably have less I18N pressure than we do. Unicode is only interesting when you want the same software to run in multiple character set environments! Andy Robinson wrote: > > ... > > 2. I have been told that there are angry mumblings on the > Python-Japan mailing list that such a change would break all > their existing Python programs; I'm trying to set up my tools to > ask out loud in that forum. I don't think it is posssible to say in the abstract that a move to Unicode would break code. Depending on implementation strategy it might. But I can't imagine there is really a ton of code that would break merely from widening the character. > 3. Ruby was designed in Japan and that's where most of its users are. > They have a few conversion functions and seem perfectly happy. Don't know enough to comment except to point out that Ruby has a command line option to set the character set to Kanji. > 4. 
Visual Basic running under Windows 2000 with every international > option I can find will accept unicode characters in string literals > but will not accept characters outside of ISO-Latin-1 in The VB in Visual Studio 7 will happily accept wide characters (e.g. U+652B: CJK Unified Ideograph) on Windows 2000. Of course you need to set your font to have the right character. Compared to where we were a few years ago (better install DOS-J!) this is a real miracle. Of course Unix systems will move over more slowly (grumble..). Nevertheless its coming: http://www.li18nux.net/li18nux2k/ > > I'm going to try to think up a questionnaire. If anyone can suggest > other domain experts, or mailing lists of user groups in other > language > zones, I will be happy to try and pursue them and get some real hard > data. I like your list but I don't know that there is really a reasonable question we can ask. What does it mean for Python's "standard string type" to be "Unicode?" If you ask the question as: "Should Python's standard string type support ordinal values beyond 255?", who would say no? If you say: "Should Python standardize on the Unicode character set" you might get different answers. As you yourself point out, it depends on whether that means that you would LOSE the ability to do string-like things on byte-strings. Paul Prescod From paulp@ActiveState.com Sat Feb 10 19:45:58 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 11:45:58 -0800 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? References: Message-ID: <3A859A76.D4C30372@ActiveState.com> Andy Robinson wrote: > > ... > > 4. Visual Basic running under Windows 2000 with every international > option I can find will accept unicode characters in string literals > but will not accept characters outside of ISO-Latin-1 in The more I look at I18N in VB.NET, the more impressed I am. It has no language restrictions on variable names etc. Protected Sub Form1_Click(ByVal sender As Object, ByVal e As System.EventArgs) Dim ? As String Dim font As New System.Drawing.Font("Batang", 10) ? = "??" TextBox1.Text = ? End Sub Each "?" is an ideograph. It seems to "just work". Paul Prescod From tree@basistech.com Sat Feb 10 21:17:47 2001 From: tree@basistech.com (Tom Emerson) Date: Sat, 10 Feb 2001 16:17:47 -0500 Subject: [I18n-sig] Random thoughts on Unicode and Python Message-ID: <14981.45051.945099.633730@cymru.basistech.com> Andy has raised some important and interesting points. I'd like to chime in with some random thoughts. > 2. I have been told that there are angry mumblings on the > Python-Japan mailing list that such a change would break all > their existing Python programs; I'm trying to set up my tools to > ask out loud in that forum. Both Shift-JIS and EUC-JP are 8-bit, multibyte encodings. You can use them on systems that are 8-bit clean and things "just work". You don't need to worry about embedded nulls or any other such noise. While you can't use len() to get the number of *characters* in a Shift-JIS/EUC-JP encoded string, you can find out how many "octets" are in it so you can loop over it and calculate the character length. In essence the Japanese (and Chinese and Koreans) are using the existing Python string type as a raw-byte string, and imposing the semantics over that. The Ruby string class is a byte-string. You can specify how the bytes are to be treated for operations such as regular expression searches and such. It supports EUC-JP, Shift JIS, UTF-8, or just plan bytes. 
You can set the default when you configure the sources, on the command-line when you invoke the interpreter, or (I believe) at runtime. Ruby also contains a library with a replacement String class for dealing with EUC-JP and Shift-JIS encoded strings. ----- The internal representation used for strings is an orthogonal issue to how raw bytes are interpreted for string operations. This is what Emacs 20 does: in essence it uses ISO 2022 internally to allow characters from multiple character sets to be represented. ----- The interpretation of strings and the interpretation of bytes in a source file are different things: Dylan, for example, supports Unicode and byte strings, but the language definition requires identifiers and keywords to be in the US-ASCII range. Java, on the other hand, specifies Unicode as language's character set: even source files are encoded in UTF-8, allowing identifiers to be in the user's language. IMHO either is fine. Note that if the language allows identifiers to include 8-bit characters then users can already use identifiers in their local language: it "just works". ----- Japanese and Chinese arguments against Unicode are often ideological: "It doesn't contain all of the characters we need." Of course they forget to mention that the character sets in regular use in these locales, JIS X 0201-1990, JIS X 0212-1990, GB 2312-80, and Big Five, are all represented in Unicode. The same is true for Korean: all of the hanja in KS C 5601 et al. are available in Unicode, as are the precomposed han'gul. -- Tom Emerson Basis Technology Corp. Stringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Sat Feb 10 21:56:09 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 10 Feb 2001 22:56:09 +0100 Subject: [I18n-sig] Random thoughts on Unicode and Python References: <14981.45051.945099.633730@cymru.basistech.com> Message-ID: <3A85B8F9.1F494BF8@lemburg.com> Tom Emerson wrote: > > Andy has raised some important and interesting points. I'd like to > chime in with some random thoughts. > > > 2. I have been told that there are angry mumblings on the > > Python-Japan mailing list that such a change would break all > > their existing Python programs; I'm trying to set up my tools to > > ask out loud in that forum. > > Both Shift-JIS and EUC-JP are 8-bit, multibyte encodings. You can use > them on systems that are 8-bit clean and things "just work". You don't > need to worry about embedded nulls or any other such noise. While you > can't use len() to get the number of *characters* in a > Shift-JIS/EUC-JP encoded string, you can find out how many "octets" > are in it so you can loop over it and calculate the character length. > > In essence the Japanese (and Chinese and Koreans) are using the > existing Python string type as a raw-byte string, and imposing the > semantics over that. > > The Ruby string class is a byte-string. You can specify how the bytes > are to be treated for operations such as regular expression searches > and such. It supports EUC-JP, Shift JIS, UTF-8, or just plan > bytes. You can set the default when you configure the sources, on the > command-line when you invoke the interpreter, or (I believe) at > runtime. > > Ruby also contains a library with a replacement String class for > dealing with EUC-JP and Shift-JIS encoded strings. How does Ruby (which seems to be the direct Python-competitor in Japan) deal with the difference between binary data and text data ? 
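To make the earlier point about counting characters versus octets concrete, here is a minimal Python 2.x sketch for EUC-JP byte strings. It assumes only ASCII bytes and two-byte JIS X 0208 sequences and ignores the rarer SS2/SS3 forms, so it is an illustration rather than a complete decoder:

    def euc_jp_char_count(s):
        "Count characters (not octets) in a simplified EUC-JP byte string."
        count = 0
        i = 0
        while i < len(s):
            if ord(s[i]) >= 0x80:
                i = i + 2      # lead byte of a two-byte character
            else:
                i = i + 1      # plain ASCII byte
            count = count + 1
        return count

    print len("\xb0\xa1abc")                 # -> 5 octets
    print euc_jp_char_count("\xb0\xa1abc")   # -> 4 characters (1 kanji + "abc")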
I think that much concern about these proposals lies in a misunderstanding of the general idea behind the proposed move to Unicode for text data: We are trying to tell people that storing text data is better done in Unicode than in a raw data buffer like Python's current string data type. This doesn't mean that working with text encoded in such a binary data buffer will somehow fail in a future Python version, it only means that the programmer will sooner or later have to decide whether she wants to store text data or binary data and then choose the proper type of storage to be able to take advantage of the advanced features which a text data type can provide over a binary data buffer. The model which we are currently talking about can be outlined as follows:

              binary data string *)
                      |
                      |
              text data string
               |              |
               |              |
      Unicode string     encoded 8-bit string (with encoding
           *)                  information !)

    *) these are implemented in Python 1.6-2.1.

How does this compare to e.g. Ruby ? -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sat Feb 10 22:03:51 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 10 Feb 2001 23:03:51 +0100 Subject: [I18n-sig] Storing string encoding information (Pre-PEP: Proposed Python Character Model) References: <9FC702711D39D3118D4900902778ADC81287BF@JUPITER> <3A82A04F.5A03CAB2@lemburg.com> <200102082009.f18K9QI01197@mira.informatik.hu-berlin.de> Message-ID: <3A85BAC7.303460B1@lemburg.com> "Martin v. Loewis" wrote: > > > encoded 8-bit string (with encoding > > information !) > > I'd like to point out that this is something that Bill Janssen always > wanted to see. In CORBA, they number encodings for efficient > representation; that's something that Python could do as well. CORBA > took the OSF charset registry. That was a mistake, they think about > using the IANA registry now. This registry provides both textual and > numeric identifiers for encodings (numeric in the form of MIBEnum > values). I was thinking of using plain integers which map into a list of currently used encodings. Every time a new encoding is used, the new encoding is appended to the list and the new index is used in the generated string objects. This allows us to separate the internal representation of the encoding from an outside view, e.g. there could be translators which map the integers into IANA identifiers or OSF charset numbers. We'd have to find a way to store this encoding information in Python pickles and the marshal format, though... a job for our compression experts ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sat Feb 10 22:08:06 2001 From: mal@lemburg.com (M.-A.
Lemburg) Date: Sat, 10 Feb 2001 23:08:06 +0100 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> Message-ID: <3A85BBC6.BBAA8D70@lemburg.com> Paul Prescod wrote: > > At the bottom of one of my messages I proposed that we insert it as the > second argument. Although the encoding and mode are both strings there > is no syntactic overlap between [rwa][+]?[tb]+ and the set of existent > or proposed encodings. If we merely outlaw encodings with that name then > we can quickly figure out whether the second argument is a mode or an > encoding. So the documented syntax would be > > open(filename, encoding, [[mode], bytes]) > > And the documentation would say: > > "There is an obsolete variant that does not require an encoding string. > This may cause a warning in future versions of Python and be removed > sometime after that." Any reason why we cannot use a keyword argument for encoding and put it at the end of the argument list ? The result is: 1. no ambiguity 2. backward compatibility 3. good visibility of what the argument stands for (without having to look up the manual for e.g. the meaning of 'mbcs') -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sat Feb 10 22:16:02 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 10 Feb 2001 23:16:02 +0100 Subject: [I18n-sig] Binary data b"strings" (Pre-PEP: Proposed Python Character Model) References: <9FC702711D39D3118D4900902778ADC83244A2@JUPITER> <3A82CA6D.313D5E39@lemburg.com> <3A832A86.71833150@ActiveState.com> Message-ID: <3A85BDA2.BBE72C77@lemburg.com> Paul Prescod wrote: > > I've thought about this coercion issue more...I think we need to > auto-coerece these binary strings using some well-defined rule (NOT a > default encoding!). > > "M.-A. Lemburg" wrote: > > > > > ... > > > > > > I would want to avoid the need for a 2.0-style 'default encoding', so I > > > suggest it shouldnt be possible to mix this type with other strings: > > > > > > >>> "1"+b"2" > > > Traceback (most recent call last): > > > File "", line 1, in ? > > > TypeError: cannot add type "binary" to string > > > >>> "3"==b"3" > > > 0 > > > > Right. This will cause people to rethink whether they are > > using the object for text data or binary data. I still think that > > at the interface level, b"" and "" should be treated the same (except > > that b""-strings should not implement the char buffer interface). > > If C functions auto-convert these things then people will coerce them by > passing them through C functions. e.g. the regular expression engine or > null encoding functions or whatever. 
> > If we do NOT auto-coerce these things then they will not be compatible > with many parts of the Python infrastructure, the regular expression > engine and codecs being the most important examples. A clear requirement > from Andy Robinson was that string-like code should work on binary data > because often binary strings are "really" un-decoded strings. I think he > is speaking on behalf of a lot of serious internationalizers there. b""-strings will expose all necessary APIs to be compatible with the re-engine, with codecs and most other C level interfaces which use the s or s# parser marker. In reality the only breakage will be for code which explicitly requests a string object and these instances should really be modified to work using the above parser markers instead. Given these semantics, auto-conversion is not really necessary for b""-strings. Note that I see b""-string as replacement for our current 8-bit strings in the context of handling non-text data. 8-bit strings should still remain intact and available (even after making "" produce Unicode strings), but should be extended to provide additional encoding information (see the small image I posted on the "Storing encoding information" thread). -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sat Feb 10 22:20:05 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 10 Feb 2001 23:20:05 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <200102092027.f19KREW00995@mira.informatik.hu-berlin.de> <3A84639E.B86D57F2@lemburg.com> <3A84B165.5F1D20D4@ActiveState.com> <3A8530C1.479544D2@lemburg.com> <3A855DFE.6FA48392@ActiveState.com> Message-ID: <3A85BE95.565653D1@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > > > ... > > > > We could agree on this though: > > > > 1. programs which do not use the encoding declaration are free > > to use non-ASCII bytes in literals; Unicode literals must > > use Latin-1 (for historic reasons) > > > > 2. programs which do make use of the encoding declaration may > > only use non-ASCII bytes in Unicode literals; these are then > > interpreted using the given encoding information and decoded > > into Unicode during the compilation step > > I thought that's what I suggested! I am comfortable with that design. Well, not quite since 2. doesn't decode the whole file, but only the Unicode literals. The restriction on the rest of the file could be made a convention or be actually checked to assure forward compatibility. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tree@basistech.com Sat Feb 10 22:45:25 2001 From: tree@basistech.com (Tom Emerson) Date: Sat, 10 Feb 2001 17:45:25 -0500 Subject: [I18n-sig] Random thoughts on Unicode and Python In-Reply-To: <3A85B8F9.1F494BF8@lemburg.com> References: <14981.45051.945099.633730@cymru.basistech.com> <3A85B8F9.1F494BF8@lemburg.com> Message-ID: <14981.50309.44552.652650@cymru.basistech.com> M.-A. Lemburg writes: > How does Ruby (which seems to be the direct Python-competitor > in Japan) deal with the difference between binary data and > text data ? Strings are strings. 
The interpretation of the bytes in a string is affected by the setting of the KCODE built-in variable. > I think that much concern about these proposals lies in a misunder- > standing of the general idea behind the proposed move to Unicode for > text data: Agreed. > The module which we are currently talking about can be outlined > as follows: > > binary data string *) > | > | > text data string > | | > | | > Unicode string encoded 8-bit string (with encoding > *) information !) > > *) these are implemented in Python 1.6-2.1. > > How does this compare to e.g. Ruby ? As I said, Ruby has a String type, and an override for Japanese-encoded strings. The above is much more similar to the model used by Dylan. -tree -- Tom Emerson Basis Technology Corp. Stringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From guido@digicool.com Sat Feb 10 22:23:54 2001 From: guido@digicool.com (Guido van Rossum) Date: Sat, 10 Feb 2001 17:23:54 -0500 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) In-Reply-To: Your message of "Sat, 10 Feb 2001 23:08:06 +0100." <3A85BBC6.BBAA8D70@lemburg.com> References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> Message-ID: <200102102223.RAA28498@cj20424-a.reston1.va.home.com> > Paul Prescod wrote: > > > > At the bottom of one of my messages I proposed that we insert it as the > > second argument. Although the encoding and mode are both strings there > > is no syntactic overlap between [rwa][+]?[tb]+ and the set of existent > > or proposed encodings. If we merely outlaw encodings with that name then > > we can quickly figure out whether the second argument is a mode or an > > encoding. So the documented syntax would be > > > > open(filename, encoding, [[mode], bytes]) > > > > And the documentation would say: > > > > "There is an obsolete variant that does not require an encoding string. > > This may cause a warning in future versions of Python and be removed > > sometime after that." I am appalled at this lack of respect for existing conventions, when a simple and obvious alternative (see below) is easily available. I will have a hard time not to take this into account when I finally get to reading up on your proposals. > Any reason why we cannot use a keyword argument for encoding > and put it at the end of the argument list ? The result is: > > 1. no ambiguity > 2. backward compatibility > 3. good visibility of what the argument stands for (without having > to look up the manual for e.g. the meaning of 'mbcs') Of course this is what should be done when adding a new argument to an existing API. --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Sat Feb 10 22:26:10 2001 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Sat, 10 Feb 2001 23:26:10 +0100 Subject: [I18n-sig] Strawman Proposal: Encoding Declaration V2 References: <3A85651E.C11C7B2B@ActiveState.com> Message-ID: <3A85C002.60873564@lemburg.com> Paul Prescod wrote: > > The encoding declaration controls the interpretation of non-ASCII bytes > in the Python source file. The declaration manages the mapping of > non-ASCII byte strings into Unicode characters. > > A source file with an encoding declaration must only use non-ASCII bytes > in places that can legally support Unicode characters. In Python 2.x the > only place is within a Unicode literal. This restriction may be lifted > in future versions of Python. > > In Python 2.x, the initial parsing of a Python script is done in terms > of the file's byte values. Therefore it is not legal to use any byte > sequence that has a byte that would be interpreted as a special > character (e.g. quote character or backslash) according to the ASCII > character set. This restriction may be lifted in future versions of > Python. > > The encoding declaration must be found before the first statement in the > source file. The declaration is not a pragma. It does not show up in the > parse tree and has no semantic meaning for the compiler itself. It is > conceptually handled in a pre-compile "encoding sniffing" step. This > step is also done using the ASCII encoding. > > The encoding declaration has the following basic syntax: > > #?encoding="" > > is the encoding name and must be associated with a > registered codec. The codec is used to interpret non-ASCII byte > sequences. > > The encoding declaration should be present in all Python source files > containing non-ASCII bytes. Some future version of Python may make this > an absolute requirement. Sounds overly complicated to me; even though the resulting semantics seem to be the same as those which I summarized in the last mail on the original "Encoding Declaration" thread: """ 1. programs which do not use the encoding declaration are free to use non-ASCII bytes in literals; Unicode literals must use Latin-1 (for historic reasons) 2. programs which do make use of the encoding declaration may only use non-ASCII bytes in Unicode literals; these are then interpreted using the given encoding information and decoded into Unicode during the compilation step Part 1 assures backward compatibility. Part 2 assures that programmers start to think about where they have to use Unicode and which program literals are allowed to go into string literals. Part 1 is already implemented, part 2 is easy to do, since only the compiler will have to be changed (in two places). """ If you want to keep your version, please add an explicit section about 1. to it. Otherwise it will cause unnecessary confusion. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sat Feb 10 22:32:03 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 10 Feb 2001 23:32:03 +0100 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? 
References: Message-ID: <3A85C163.4CFAAE4@lemburg.com> Andy Robinson wrote: > > This reminds me a lot of another debating going on close to home :-) > > - people who are in favour assume everyone else is, and that the only > question is how to get there > - people who are against are just plain worried but can't say why > - the government stays very quiet and avoids asking for a referendum > > I want to re-ask the big question: is it desirable that the > standard string type should become a Unicode string one day? Note that we are not moving to *one* new string type, but instead make use of object orientation and fit the current use of strings into different subclasses of a binary string type: binary data string *) | | text data string | | | | Unicode string encoded 8-bit string (with encoding *) information !) *) these are implemented in Python 1.6-2.1. The basic idea here is to differentiate between text data and binary data. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@reportlab.com Sat Feb 10 23:43:04 2001 From: andy@reportlab.com (Andy Robinson) Date: Sat, 10 Feb 2001 23:43:04 -0000 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? In-Reply-To: <3A859A76.D4C30372@ActiveState.com> Message-ID: > The more I look at I18N in VB.NET, the more impressed I am. > It has no language restrictions on variable names etc. > > Protected Sub Form1_Click(ByVal sender As Object, ByVal e As > System.EventArgs) > Dim ? As String > Dim font As New System.Drawing.Font("Batang", 10) > > ? = "??" > > TextBox1.Text = ? > End Sub > > Each "?" is an ideograph. It seems to "just work". That is good news. I'm still on Visual Studio 6, MS Office 2000 and Win2000, and was still busy being impressed that I could write Word docs in Japanese. I'd better do some catching up! - Andy From andy@reportlab.com Sat Feb 10 23:43:06 2001 From: andy@reportlab.com (Andy Robinson) Date: Sat, 10 Feb 2001 23:43:06 -0000 Subject: [I18n-sig] Random thoughts on Unicode and Python In-Reply-To: <14981.45051.945099.633730@cymru.basistech.com> Message-ID: > Both Shift-JIS and EUC-JP are 8-bit, multibyte encodings. > You can use > them on systems that are 8-bit clean and things "just > work". You don't > need to worry about embedded nulls or any other such noise. > While you > can't use len() to get the number of *characters* in a > Shift-JIS/EUC-JP encoded string, you can find out how many "octets" > are in it so you can loop over it and calculate the > character length. > > In essence the Japanese (and Chinese and Koreans) are using the > existing Python string type as a raw-byte string, and imposing the > semantics over that. That's my concern, and the thing I want to poll people on. If Python "just works" for these users, and if we already offer Unicode strings and a good codec library for people to use when they want to, is there really a need to go further? > Japanese and Chinese arguments against Unicode are often > ideological: > "It doesn't contain all of the characters we need." Of course they > forget to mention that the character sets in regular use in these > locales, JIS X 0201-1990, JIS X 0212-1990, GB 2312-80, and Big Five, > are all represented in Unicode. The same is true for Korean: all of > the hanja in KS C 5601 et al. are available in Unicode, as are the > precomposed han'gul. That's interesting. 
I have never heard that objection voiced before and agree that it is unfounded. I have seen objections based on two specific families of problems: (1) user defined characters: the big three Japanese encodings use the Kuten space of 94x94 characters. There are lots of slight venddor variations on the basic JIS0208 character set, as well as people adding new Gaiji in their office workgroups. Generic conversion routines from, say, EUC to Shift-JIS still work perfectly whether you use Shift-JIS, cp932, or cp932 plus ten extra in-house characters. Conversions to Unicode involve selecting new codecs, or even making new ones, for all these situations. (2) slightly corrupt data: Let's say you are dealing with files or database fields containing some truncated kanji. If you use 8-bit-clean strings and no conversion, the data will not be corrupted or changed; if you try to magically convert it to Unicode you will get error messages or possibly even more corruption. Maybe you're writing an app whose job is to get text from machine A to machine B without changing it; suddenly it will stop working. I know people who spent weeks debugging a VB print spooler which was cutting up Postscript files containing kanji. Suddenly upgrading to a new version of Python where all your data undergoes invisible transformations to Unicode and back is going to cause trouble for quite a few people. Arguably, it is GOOD trouble which will force them to standardise their character sets, document their extensions and clean their data - but it it still going to be trouble. It's a bit different in a language like Java which was defined to be Unicode-based from day one. - Andy From paulp@ActiveState.com Sun Feb 11 03:44:35 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 19:44:35 -0800 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> Message-ID: <3A860AA3.655F4207@ActiveState.com> Guido van Rossum wrote: > > ... > > open(filename, encoding, [[mode], bytes]) > > > > And the documentation would say: > > > > "There is an obsolete variant that does not require an encoding string. > > This may cause a warning in future versions of Python and be removed > > sometime after that." > > I am appalled at this lack of respect for existing conventions, You're the one who told everyone to move from string functions to string methods. This is a move of similar scope but for a much more important purpose than merely changing coding style. > when a > simple and obvious alternative (see below) is easily available. I > will have a hard time not to take this into account when I finally get > to reading up on your proposals. 
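Just to put the two spellings under discussion side by side, here is a rough sketch (the file name and encoding are purely illustrative, and nothing here is a settled API):

    # the positional form I proposed: encoding as the second argument
    f = open("data.txt", "UTF-8", "r")

    # the keyword form Marc-Andre and Guido prefer
    f = open("data.txt", "r", encoding="UTF-8")
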
There is an important reason that we did not use a keyword argument. We (at least some subset of the people in the i18n-sig) want every single new instance of the "open" function to declare an encoding. Right now we allow a lot of "ambiguous data" into the system. We do not know whether the user meant it to be binary or textual data and so we don't know the correct/valid coercions, conversions and operations. We are trying to retroactively make an open function that strongly encourages (and perhaps finally forces) people to make their intent known. The open extension is a backwards compatible way to allow people to move from the "old" ambiguous form to the new form. I considered it pretty well thought out in terms of backwards and forwards compatibility. We could also just invent a new function like "file" or "fileopen" but upgrading "open" seemed to show the *most* respect for existing conventions (and clutters up builtins the least). Paul Prescod From tree@basistech.com Sun Feb 11 04:06:01 2001 From: tree@basistech.com (Tom Emerson) Date: Sat, 10 Feb 2001 23:06:01 -0500 Subject: [I18n-sig] Random thoughts on Unicode and Python In-Reply-To: References: <14981.45051.945099.633730@cymru.basistech.com> Message-ID: <14982.4009.542031.914222@cymru.basistech.com> Andy Robinson writes: > (1) user defined characters: the big three Japanese encodings > use the Kuten space of 94x94 characters. There are lots of slight > venddor variations on the basic JIS0208 character set, as well > as people adding new Gaiji in their office workgroups. Generic > conversion routines from, say, EUC to Shift-JIS still work > perfectly whether you use Shift-JIS, cp932, or cp932 plus > ten extra in-house characters. Conversions to Unicode involve > selecting new codecs, or even making new ones, for all these > situations. There is no reason that we couldn't provide a set of unified codecs for EUC-JP, Shift JIS, ISO-2022-JP, and CP932 that provide appropriate mappings between the EUDC sections in the legacy character sets and the PUA of Unicode, such that these conversions work. > (2) slightly corrupt data: Let's say you are dealing with files > or database fields containing some truncated kanji. If you > use 8-bit-clean strings and no conversion, the data will not > be corrupted or changed; if you try to magically convert > it to Unicode you will get error messages or possibly even > more corruption. Maybe you're writing an app whose job is > to get text from machine A to machine B without changing it; > suddenly it will stop working. I know people who spent > weeks debugging a VB print spooler which was cutting up > Postscript files containing kanji. Yes, this is a problem that I cannot suggest a good answer to: reality raises its ugly head. > Suddenly upgrading to a new version of Python where all > your data undergoes invisible transformations to Unicode > and back is going to cause trouble for quite a few people. Absolutely. -tree -- Tom Emerson Basis Technology Corp. Stringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From paulp@ActiveState.com Sun Feb 11 04:01:02 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 20:01:02 -0800 Subject: [I18n-sig] Random thoughts on Unicode and Python References: Message-ID: <3A860E7E.A58E37BC@ActiveState.com> Andy Robinson wrote: > > ... > > That's my concern, and the thing I want to poll people on. 
> If Python "just works" for these users, and if we already offer > Unicode strings and a good codec library for people to use when they > want to, is there really a need to go further? Let me point out again that while I don't want to discount the needs of these people, the fact is that over here in the West we need to use Unicode ourselves! I've already figured out how the Unicode works and how it interacts with "ordinary strings" but I don't think that everybody I hire to work at ActiveState should have to figure that out themselves. Obviously the Unicode source file issue is separate but the "Unicode as basic string literal" helps all of us. In a year, a lot of my work will involve XML on a Unicode-enabled operating system. I'll only have to think about 8-bit extended ASCII because Python forces me to sometimes. Now I know most people are not going to be moving to full Unicode as quickly as I am but that is the future and we need to start laying the groundwork now. >... > (2) slightly corrupt data: Let's say you are dealing with files > or database fields containing some truncated kanji. If you > use 8-bit-clean strings and no conversion, the data will not > be corrupted or changed; if you try to magically convert > it to Unicode you will get error messages or possibly even > more corruption. I think we've all agreed that Python should never, ever, magically convert binary data to Unicode. I think that most people's fears about Unicode are precisely that it will some day magically covert binary data to Unicode. But we all agree that that should never happen. Even in my original proposal when I said that the standard string should be widened to Unicode, I never, ever, suggested that binary data should be converted to Unicode. Rather I said that in some cases Unicode characters could be a transport -- a representation layer -- for binary data. Just as in some cases integers are a transport for characters or (shudder pointers). > Suddenly upgrading to a new version of Python where all > your data undergoes invisible transformations to Unicode > and back is going to cause trouble for quite a few people. But I do not believe that anyone has ever suggested that! I understand where the misunderstanding comes from but it is nevertheless a misunderstanding. Paul Prescod From brian@tomigaya.shibuya.tokyo.jp Sun Feb 11 05:58:44 2001 From: brian@tomigaya.shibuya.tokyo.jp (Brian Takashi Hooper) Date: Sun, 11 Feb 2001 14:58:44 +0900 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? In-Reply-To: <3A8593BF.8AFCEBB3@ActiveState.com> References: <3A8593BF.8AFCEBB3@ActiveState.com> Message-ID: <20010211140545.49DF.BRIAN@tomigaya.shibuya.tokyo.jp> Hi there, Brian in Tokyo again, On Sat, 10 Feb 2001 11:17:19 -0800 Paul Prescod wrote: > Andy, I think that part of the reason that Westerners push harder for > Unicode than Japanese is because we are pressured (rightly) to right > software that works world-wide and it is simply not sane to try to do > that by supporting multiple character sets. Multiple encodings maybe. > Multiple character sets? Forget it. I think this is a true and valid point (that Westerners are more likely to want to make internationalized software), but it sounds here like because Westerners want to make it easier to internationalize software, that that is a valid reason to make it harder to make software that has no particular need for internationalization, in non-Western languages, and change the _meaning_ of such a basic data type as the Python string. 
If in fact, as the proposal proposes, usage of open() without an encoding, for example, is at some point deprecated, then if I am manipulating non-Unicode data in "" strings, then I think I _do_ at some point have to port them over. b"" then becomes different from "", because "" is now automatically being interpreted behind the scenes into an internal Unicode representation. If the blob of binary data actually happened to be in Unicode, or some Unicode-favored representation (like UTF-8), then I might be happy about this - but if it wasn't, I think that this result would instead be rather dismaying. The current Unicode support is more explicit about this - the meaning of the string literal itself has not changed, so I can continue to ignore Unicode in cases where it serves no useful purpose. I realize that it would be nicer from a design perspective, more consistent, to have Python string mean only character data, but right now, it does sometimes mean binary and sometimes mean characters. The only one who can distinguish which is the programmer - if at some point "" means only Unicode character strings, then the programmer _does_, I think, have to go through all their programs looking for places where they are using strings to hold non-Unicode character data, or binary data, and explicitly convert them over. I have difficulty seeing how we would be able to provide a smooth upgrade path - maybe a command-line backwards compatibility option? Maybe defaults? I've heard a lot of people voicing dislike for default encodings, but from my perspective, something like ISO-Latin-1, or UTF-8, or even ASCII (EUC-JP and SJIS are, strictly speaking, not supersets of ASCII because the ASCII ranges are usually interpreted as JIS-Roman, which contains about 4 different characters) is functionally a default encoding... Requiring encoding declarations, as the proposal suggests, is nice for people working in the i18n domain, but is an unnecessary inconvenience for those who are not. > > I don't know of any commercial software written in Japan but used in the > west so I think that they probably have less I18N pressure than we do. > Unicode is only interesting when you want the same software to run in > multiple character set environments! That's exactly true. The point I would like to make is that a lot, probably the majority of Python software and libraries that are out there today, don't have any need to run in multiple character set environments. Python is useful for a lot more things than just for commercial development of products designed for international markets. > > Andy Robinson wrote: > > > > ... > > > > 2. I have been told that there are angry mumblings on the > > Python-Japan mailing list that such a change would break all > > their existing Python programs; I'm trying to set up my tools to > > ask out loud in that forum. > > I don't think it is posssible to say in the abstract that a move to > Unicode would break code. Depending on implementation strategy it might. > But I can't imagine there is really a ton of code that would break > merely from widening the character. See above. I think there is, at least outside of Europe. Is it a higher priority for Python to make it easier for Western users to internationalize, or to save people who currently use Python strings to manipulate binary data the trouble of having to port their applications to support the new conventions? 
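To make the porting burden concrete, here is a rough sketch of the kind of code I have in mind (the byte values are Shift-JIS purely for illustration, and the b"" and explicit-decode spellings are of course the proposals, not current Python):

    # today: a plain string literal quietly holding Shift-JIS encoded text
    s = "\x93\xfa\x96\x7b\x8c\xea"    # "Nihongo" as raw Shift-JIS bytes
    print len(s)                      # 6 octets, not 3 characters

    # under the proposals this would eventually have to become either
    s = b"\x93\xfa\x96\x7b\x8c\xea"   # explicitly binary data, or
    s = unicode("\x93\xfa\x96\x7b\x8c\xea", "shift-jis")
    # explicitly decoded text, assuming a Shift-JIS codec is installed
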
I guess my own personal preference is not to change things too much, because from my perspective, the Unicode support is fine - if it's not broken, don't fix it. Maybe it would be instructive to take the current proposal and any others that come out, and without actually implementing, pretend-apply the changes to parts of the existing code base to try to see how big the effect would be? That way, neither of us has to accept just on faith that changing so-and-so would or would not break existing code... --Brian From frank63@ms5.hinet.net Sun Feb 11 13:20:10 2001 From: frank63@ms5.hinet.net (Frank Chen) Date: Sun, 11 Feb 2001 13:20:10 -0000 Subject: [I18n-sig] Re: Strawman Proposal: Encoding Declaration V2 Message-ID: <200102110601.OAA14610@ms5.hinet.net> > Date: Sat, 10 Feb 2001 07:58:22 -0800 > From: Paul Prescod > Organization: ActiveState > To: "i18n-sig@python.org" > Subject: [I18n-sig] Strawman Proposal: Encoding Declaration V2 > > > A source file with an encoding declaration must only use non-ASCII bytes > in places that can legally support Unicode characters. In Python 2.x the > only place is within a Unicode literal. This restriction may be lifted > in future versions of Python. So, if one day I declare Big5 as the encoding, I cannot use any ASCII character in my Python script? Does it mean that if I set a = "characters='abc'", it won't work in the future? Do I need to use Big5 characters as identifiers and also as the contents of strings when the encoding declaration is set to Big5? > > The encoding declaration must be found before the first statement in the > source file. The declaration is not a pragma. It does not show up in the > parse tree and has no semantic meaning for the compiler itself. It is > conceptually handled in a pre-compile "encoding sniffing" step. This > step is also done using the ASCII encoding. > Like a preprocessor, to convert local encoding characters into Unicode first? And then feed it to the compiler? Frank Chen From frank63@ms5.hinet.net Sun Feb 11 14:02:45 2001 From: frank63@ms5.hinet.net (Frank Chen) Date: Sun, 11 Feb 2001 14:02:45 -0000 Subject: [I18n-sig] Re: All this Unicode discussion Message-ID: <200102110601.OAA14633@ms5.hinet.net> > Brian and I are worried about all these proposals flying around. > Americans seem to feel that having Unicode everywhere is > 'the right thing'. But we have not heard from enough people > in Japan or in Chinese-speaking countries, and the list has > NEVER had input from e.g. Arabic speakers or Eastern Europe. > In fact, it looks like some people in mainland China strongly object to Unicode in Chinese software. To them, the Han Unification for CJK reveals a lack of understanding of CJK ideography. If, in the future, UCS-4 can provide a complete allocation area for each written language, especially for CJK, I think it is fine to use Unicode as the internal data type. I am even wondering whether there is a chance to bring ancient Egyptian hieroglyphics into Unicode, though it is a dead script. > Is it really desirable, long term, to have Unicode strings as the > default > type in Python? Do we need separate Unicode file and Binary > file and socket types? Or are we better with what we have now - > no fundamental changes, but with codecs and Unicode strings > when you want them? Looking at the proposal, it seems not to treat Unicode as the pivot internally, but as an add-on when an encoding declaration is set. If there is no encoding declaration setting, it should function like before, right?
Or if it is set to Latin-1, it should work like current Python, right? For now, I can put Big5 characters in Python strings, and the Windows or Chinese emulator can interpret Big5 strings correctly when Python displays them on the screen. I think the future version should keep this alive. But I am worries about the conversion time when mapping to Unicode. The Python start-up time for initialization may take too long. > > In addition, are there any benefits or problems when you > deal with double-byte data in Java, VB, or any other languages > you are familiar with? > I think the reason that Java or Windows use Unicode in internal processing is mainly for quick universal delivering. And the reason why Unicode raises is the same, for many local encodings slow down the productivity when the product is world-widely spreaded. So, if Python wants to ship with i18n & 10n (then it can display local encoding message with its environment in different areas and the like), it surely can use Unicode for delivering efficiency. Frank Chen From paulp@ActiveState.com Sun Feb 11 07:05:03 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 23:05:03 -0800 Subject: [I18n-sig] Re:Strawman Proposal: Encoding Declaration V2 References: <200102110601.OAA14610@ms5.hinet.net> Message-ID: <3A86399F.F4B2C6E5@ActiveState.com> Frank Chen wrote: > > ... > > So, if one day I declare Big5 as the encoding, I cannot use any ASCII > character in my Python script? > Does it mean this? > if I set a = "characters='abc'", in the future it doesn't work? I need to > use Big5 characters > as identifiers and also the contents of strings when encoding declaraction > is set to Big5? I'm pretty sure that ASCII characters are Big5 characters and they are encoded in the same way as in pure ASCII. So yes, you can continue to use ASCII characters in Big5-encoded scripts. The current proposal only has any "effect" on Unicode literals anyhow. The only danger is that just as today you must not use a Big5 character with a second byte that would confuse an ASCII-based parser. The second byte must never equate to ASCII "\" or '"'. I presume you are already careful about that. > ... > > Like a preprocessor, to convert local encoding characters into Unicode > first? > And then feed it to the compiler? *Conceptually* this is how I think of it. That *could* one day allow identifiers to be in any language. It also means that we could one day get rid of the silly restrictions on the second byte of two-byte characters. Others think of it as just a post-parse transformation on ONLY Unicode (u"") literals. Until the issue of non-ASCII identifiers comes up, there is no practical difference. So you can think of it either way. The first implementation will likely be a post-parse transformation because it is easier to implement in a non-Unicode parser. Paul Prescod From paulp@ActiveState.com Sun Feb 11 08:05:22 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 00:05:22 -0800 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? References: <3A8593BF.8AFCEBB3@ActiveState.com> <20010211140545.49DF.BRIAN@tomigaya.shibuya.tokyo.jp> Message-ID: <3A8647C2.398822CB@ActiveState.com> Brian Takashi Hooper wrote: > > ... 
> > I think this is a true and valid point (that Westerners are more likely > to want to make internationalized software), but it sounds here like > because Westerners want to make it easier to internationalize software, > that that is a valid reason to make it harder to make software that has > no particular need for internationalization, in non-Western languages, > and change the _meaning_ of such a basic data type as the Python string. I do not think that any of the proposals make it much harder to make non-internationlized software. We are merely asking people to be explicit about their assumptions so that code will have a better chance of working on other people's computers. That means adding an encoding declaration here, prepending a "b" prefix there and so forth. Asians understand encoding issues and I do not think that they will be confused by these changes. If you ask an Asian "what is Python's character set" they will either answer Latin 1 (which looks bad) or "Python has no native character set, only binary strings of bytes." If they think of strings as strings of bytes then what is the harm in prefixing a "b" to make that assumption explicit? > If in fact, as the proposal proposes, usage of open() without an > encoding, for example, is at some point deprecated, then if I am > manipulating non-Unicode data in "" strings, then I think I _do_ at some > point have to port them over. No, those would be two unrelated changes. In order to get open() to have its old behavior you would say something like: open( "filename", "raw") or open( "filename", "binary") > b"" then becomes > different from "", because "" > is now automatically being interpreted behind the scenes into an > internal Unicode representation. Yes, this is a separate proposal for some time down the road. Sometime down the road is likely at least two years because the deployment of new versions of Python is very slow and it would be wrong to quickly deprecate a usage which is "recommended practice" in Python 2.x. > If the blob of binary data actually > happened to be in Unicode, or some Unicode-favored representation (like > UTF-8), then I might be happy about this - but if it wasn't, I think > that this result would instead be rather dismaying. The vast majority of the world's encodings are "Unicode-favored" at some level. As long as the character set is compatible with Unicode and you add an encoding declaration, everything should just work. If you do NOT want to work with Unicode then you would have to prepend a "b" prefix to your literal strings. As I've described, you will have several years to choose which path you want to take. And the "fixups" are easy. I don't see why this is a cause for alarm. > The current Unicode support is more explicit about this - the meaning of > the string literal itself has not changed, so I can continue to ignore > Unicode in cases where it serves no useful purpose. Python is EXPLICIT about the fact that the character set is NOT Unicode. Python is NOT explicit about the fact that the character set is Latin 1 or "binary data" -- depending on your point of view. If you take the former point of view then Python is Western centric. If you take the latter point of view then it is just plain confusing to use the term "character string" as the name for your "binary data" container. 
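Here is a tiny illustration of the ambiguity I mean (the file name is made up, and the b"" literal is the proposed spelling, not current Python):

    data = open("report.dat").read()   # text in some encoding, or raw bytes?  The type doesn't say.
    t = u"some text"                   # unambiguously characters
    b = b"\x00\x01\xff"                # unambiguously bytes (proposed)
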
You acknowledge this below: > I realize that it > would be nicer from a design perspective, more consistent, to have > Python string mean only character data, but right now, it does sometimes > mean binary and sometimes mean characters. The only one who can > distinguish which is the programmer - if at some point "" means only > Unicode character strings, then the programmer _does_, I think, have to > go through all their programs looking for places where they are using > strings to hold non-Unicode character data, or binary data, and > explicitly convert them over. I have difficulty seeing how we would be > able to provide a smooth upgrade path - maybe a command-line backwards > compatibility option? It is my personal opinion that time itself is an "upgrade path." If you tell people where things are going then in the course of basic software maintenance they will change their software. This is how we managed the transition from K&R C to ANSI C to C++. Yes, a command-line backwards compatibility option is another way of extending the amount of "change-over" time people have. > Maybe defaults? I've heard a lot of people > voicing dislike for default encodings, but from my perspective, > something like ISO-Latin-1, or UTF-8, or even ASCII (EUC-JP and SJIS are, > strictly speaking, not supersets of ASCII because the ASCII ranges are > usually interpreted as JIS-Roman, which contains about 4 different > characters) is functionally a default encoding... Requiring encoding > declarations, as the proposal suggests, is nice for people working in > the i18n domain, but is an unnecessary inconvenience for those who are > not. One of the things I like about Python is that it encourages me to write software in ways that allow my simple scripts to grow into complex programs. Perl programmers consider many of these "encouragements" to be unnecessary inconveniences. Similarly, I think Python should help me (and encourage me) to write software that works on computers that are configured differently than mine. Think of it also as an investment in the unification of the Python world. Wouldn't it be great if Chinese programmers could email Guido and say: "Here's a cool Python program I wrote. Give it a whirl?" Is it possible that we duplicate more code than we need to because it is too hard to share programs right now? Obviously spoken language barriers are not going away but at least our code can be portable. Also, think of all of the great software being written in Python. Maybe the next killer Python app will work better in Japan and China because we made it easier to internationalize code. And if Python itself can distinguish between textual and binary information then we can do a lot of things more intelligently: coercions, exceptions, concatenations, extension library integration etc. Explicit is better than implicit! Finally, I think it is in the best interests of even people who do not want i18n to have the Python language be more explicit and consistent. When Python is taught in a Japananese school they can say: "See, this character 'b' means that the string contains binary data. We choose to use a binary string for reason X, Y and Z." or "See, this string contains Unicode characters. That means len() works as you would expect on a per-character basis and the software works just as well with Chinese text as Japanese text and ..." > > I don't think it is posssible to say in the abstract that a move to > > Unicode would break code. Depending on implementation strategy it might. 
> > But I can't imagine there is really a ton of code that would break > > merely from widening the character. > See above. I think there is, at least outside of Europe. Note that we are discussing three or four or five different proposals as if they are one. I think it would be easy to demonstrate that there is little code that would break based ONLY on the change that Python strings could contain characters with ordinals greater than 255. If we added a single character to the range at position 256, would that break much Python code? Ignore Unicode. Just extend the range by one character. Now keep extending it until you get to the size of Unicode. The separate proposal that tries to clean up the interpretation of literals with non-Unicode bytes WOULD break code (if only some time far in the future and after a long changeover period). > ... > Maybe it would be instructive to take the current proposal and any > others that come out, and without actually implementing, pretend-apply > the changes to parts of the existing code base to try to see how big the > effect would be? That way, neither of us has to accept just on faith > that changing so-and-so would or would not break existing code... Python changes are always implemented as patches which are tested and then backed-out if they break things. Nevertheless, you are right that there are some of us with the goal of having string literals directly contain Unicode characters one day. Guido may or may not have an opinion on the issue. Either way, Guido wouldn't make the change if it were going to break a lot of code. So the immediate issue is whether the explicitness requirements of b"" strings and an encoding declaration are too onerous. Anyhow, at this point we are not even talking about adding any mandatory features or turning new features into recommended practice. We are just talking about ALLOWING people to be explicit about the distinction between binary and text data and ALLOWING people to directly enter Unicode text data. I haven't tried to hide where I think things should go but still these new features deserve to be evaluated on their own. They are good ideas even if we never deprecate the other ways of doing things. I know I started this discussion with my single big-bang proposal but I'd like to take a more incremental approach now. I don't think that the current proposals make anyone's life harder yet. Paul Prescod From paulp@ActiveState.com Sun Feb 11 08:16:50 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 00:16:50 -0800 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> Message-ID: <3A864A72.B18E5C31@ActiveState.com> "M.-A. Lemburg" wrote: > > ... 
> > Any reason why we cannot use a keyword argument for encoding > and put it at the end of the argument list ? The result is: > > 1. no ambiguity > 2. backward compatibility > 3. good visibility of what the argument stands for (without having > to look up the manual for e.g. the meaning of 'mbcs') I would like to have the option of one day making it a required argument without having to also make mode and bytes required. Mode would be a minor inconvenience but bytes would be major. Paul Prescod From andy@reportlab.com Sun Feb 11 08:22:44 2001 From: andy@reportlab.com (Andy Robinson) Date: Sun, 11 Feb 2001 08:22:44 -0000 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? In-Reply-To: <3A85C163.4CFAAE4@lemburg.com> Message-ID: [Marc-Andre] > Note that we are not moving to *one* new string type, but instead > make use of object orientation and fit the current use of strings > into different subclasses of a binary string type: > > binary data string *) > | > | > text data string > | | > | | > Unicode string encoded 8-bit string (with encoding > *) information !) > > *) these are implemented in Python 1.6-2.1. > > The basic idea here is to differentiate between text data and > binary data. > Thanks. It's finally starting to make sense to me. - Andy From tim.one@home.com Sun Feb 11 08:34:32 2001 From: tim.one@home.com (Tim Peters) Date: Sun, 11 Feb 2001 03:34:32 -0500 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: <3A855B89.459A18E4@ActiveState.com> Message-ID: [Paul Prescod] > I'm not personally willing to design in such a limitiation. I have seen > a lot of code that mixes other languages with English. e.g.: > > http://starship.python.net/pipermail/python-de/2000q3/000597.html > > I don't think this guy is doing anything wrong. If a Japansese person > asks me if they could do the same I would say: "Not now, but hopefully > someday." But of course they could: "this guy" you point to as evidence used plain 7-bit ASCII, writing an approximation to German in that. *That's* certainly widespread, in and out of the Python world. But more than that isn't. Again, pick a language that already supports what you suggest and find some evidence that it's *used*. As I said before, I've seen no evidence that it is, and the evidence of languages designed by non-Euros suggests it's rare even for them to cater to these complications (and, yes, the Java Character class's .isIdentifierIgnorable(), .isUnicodeIdentifierPart(), .isUnicodeIdentifierStart() etc methods are indeed complications: write a regexp to match a valid Unicode identifier; write a UserDict that manages to collapse valid Unicode identifiers that differ only in ignorable characters into a single key; etc; explain to users that their little source-munging tools need to take all of that into account in the New World). > ... > People keep bringing up this issue of keywords. I've never disputed that > the keywords should always be English. What about the names of builtins and std library names and the names of classes and functions and methods and attributes in the std libraries? I mentioned keywords in the context of all of those. > There are a lot of people who write code that will never be > seen by a speaker of an ASCII-compatible language. Why should they be > forced to write it in ASCII? "Forced" presumes it's against their will. That's what I question. There is nothing more Eurocentric than to embark on unilateral crusades for the purported benefit of non-Euros who aren't asking for help <0.7 wink>. 
it's-a-programming-language-not-a-word-processor-ly y'rs - tim From tim.one@home.com Sun Feb 11 08:50:17 2001 From: tim.one@home.com (Tim Peters) Date: Sun, 11 Feb 2001 03:50:17 -0500 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? In-Reply-To: <3A8647C2.398822CB@ActiveState.com> Message-ID: [Paul Prescod] > ... > If you ask an Asian "what is Python's character set" they will either > answer Latin 1 (which looks bad) or "Python has no native character set, > only binary strings of bytes." The Python Reference Manual says (chapter 2, "Lexical analysis"): Python uses the 7-bit ASCII character set for program text and string literals. That was Guido's intent, and it's actually a bug that the parser uses isalpha() etc (it wasn't intended to vary according to locale; locale was an ANSI invention Guido didn't have in mind when that stuff was coded; and, e.g., in some locales even characters like "|" meet the isalpha() test). From andy@reportlab.com Sun Feb 11 09:18:57 2001 From: andy@reportlab.com (Andy Robinson) Date: Sun, 11 Feb 2001 09:18:57 -0000 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) In-Reply-To: <3A864A72.B18E5C31@ActiveState.com> Message-ID: > > Any reason why we cannot use a keyword argument for encoding > > and put it at the end of the argument list ? The result is: > > > > 1. no ambiguity > > 2. backward compatibility > > 3. good visibility of what the argument stands for (without having > > to look up the manual for e.g. the meaning of 'mbcs') > > I would like to have the option of one day making it a > required argument > without having to also make mode and bytes required. Mode would be a > minor inconvenience but bytes would be major. > > Paul Prescod I can see three separate proposals going on here. Here's what I think: (1) introduce b"whatever". I'm 100% in favour - breaks nothing, adds clarity, and having it early may ease the pain if we ever do break old code in a few years. (2) widen the string representation so they can hold single or multi-byte data but without implying their semantics. I'm not sure on this one - it goes further than any other language and the extra power may lead to new classes of errors. Alongside the proposal, we need a bunch of examples of how this could be used, and of how it could be abused, and then I think we all need to sit on it for a while. Which is what you've been saying too. (3) changing open(). This should be contingent on (2). As long as u"hello" and "hello" have a different type, our current solution is exactly right - we have wrappers classes around files which handle Unicode strings, but files themselves always do I/O in bytes. We've actually got the explicit position you favour right now - to write Unicode to a file, I need to explicitly create a wrapper with an encoding. If you go to (2), it becomes possible to write a string containing unicode straight to a file object, and therefore it is desirable to let the file object handle conversion, so you need a way to specify it etc. I am still not sure this is right. The stackable streams concept is well understood from Java and gives a lot of power. - Andy From andy@reportlab.com Sun Feb 11 09:23:52 2001 From: andy@reportlab.com (Andy Robinson) Date: Sun, 11 Feb 2001 09:23:52 -0000 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: Message-ID: > "Forced" presumes it's against their will. That's what I > question. 
There > is nothing more Eurocentric than to embark on unilateral > crusades for the > purported benefit of non-Euros who aren't asking for help > <0.7 wink>. > Beatifully put. This is the empirical question and one I am determined to get real answers to. - Andy From paulp@ActiveState.com Sun Feb 11 09:34:31 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 01:34:31 -0800 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? References: Message-ID: <3A865CA7.249910F1@ActiveState.com> Tim Peters wrote: > > > ... > > The Python Reference Manual says (chapter 2, "Lexical analysis"): > > Python uses the 7-bit ASCII character set for program text and > string literals. > > That was Guido's intent, That may be the rule but try enforcing it. It is so widely violated as to be irrelevant. I would love it if you did try to enforce it in Python 2.1. You would take the heat for breaking everyone's non-ASCII programs and then I could come in and propose the draconian rule be eased with the encoding declaration. The wide violation of this rule should inform our discussions about where Python source code is going in the future... Paul Prescod From paulp@ActiveState.com Sun Feb 11 09:46:18 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 01:46:18 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: Message-ID: <3A865F6A.6CC12CC3@ActiveState.com> Tim Peters wrote: > > ... > > Again, pick a language that already supports what you suggest and find some > evidence that it's *used*. We will see. Before Unicode it would have been very hard to do this and yet achieve source code portability between systems. Unicode and the tools and languages that use it are just being deployed. There is no need to move aggressively in that direction. But I'll say again that I think it would be a big mistake to add any further impedements to getting there. > it's-a-programming-language-not-a-word-processor-ly y'rs - tim I don't understand your fundamental point. We agree that German people want to use German variable names. If it was *just as easy* for them to use non-ASCII German characters, why wouldn't they? What's magical about ASCII? And if Japanese people are more like German people than they are different from them (carbon based, bipedal, etc.) then why wouldn't they want to write code using their special characters? Why would they choose to approximate and translate? I'm not claiming it's a burning need, but I don't see why a Japanese teenager learning to program for the first time would choose to use a language that requires English variable names over one that offered choice. There's nothing magical about ASCII. Hell, American teenagers would probably love to put happy faces and summation signs into their variable names. I use a teenager as an example of a person coming to the computer world fresh without ASCII brain-damage. Where's Greg Wilson when I need him? Paul Prescod From paulp@ActiveState.com Sun Feb 11 09:59:08 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 01:59:08 -0800 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: Message-ID: <3A86626C.AFFF32B0@ActiveState.com> Andy Robinson wrote: > > .... > > I can see three separate proposals going on here. Here's what I > think: > > (1) introduce b"whatever". > > I'm 100% in favour - breaks nothing, adds clarity, and having it early > may ease the pain if we ever do break old code in a few years. 
> > (2) widen the string representation so they can hold single or > multi-byte > data but without implying their semantics. This is not a short-term proposal because it involves more implementation work than the others. > I'm not sure on this one - it goes further than any other language > and the extra power may lead to new classes of errors. Actually, the way you describe it, it sounds a lot like wchar. > (3) changing open(). > > This should be contingent on (2). As long as u"hello" and "hello" > have a different type, our current solution is exactly right - > we have wrappers classes around files which handle Unicode strings, > but files themselves always do I/O in bytes. We've actually got > the explicit position you favour right now - to write Unicode to a > file, I need to explicitly create a wrapper with an encoding. I don't follow why this should be contingent on widening the basic string representation! Given a Unicode type, we need to read and write Unicode data today. In my personal opinion, wrappers are too obscure and too optional. The average programmer is not going to even know they exist. > If you go to (2), it becomes possible to write a string containing > unicode straight to a file object, and therefore it is desirable > to let the file object handle conversion, so you need a way to > specify it etc. We already have Unicode strings that we need to write to files! > I am still not sure this is right. The stackable > streams concept is well understood from Java and gives a lot of > power. The stackable streams will still exist. But Python is "flatter" than Java in general. Java's IO libraries are in my opinion almost incomprehensible. Yes, very powerful once you understand them, but a lot to learn to do basic things. I would not be embarrassed to tell a newbie Python programmer that they should write: file = open("/etc/passwd.txt", "ASCII") It's pretty clear what's going on and they don't need any understanding of Unicode. What's the Java equivalent? Paul Prescod From fredrik@effbot.org Sun Feb 11 10:14:25 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 11:14:25 +0100 Subject: [I18n-sig] Re: Pre-PEP: Python Character Model Message-ID: <012101c09413$a5b5d2e0$e46940d5@hagrid> (trying to catch up from the archives; just realized that I wasn't subscribed to i18n) > > I'm lost here. Let's say I'm using Python 1.5. I have some KOI8-R data > > in a string literal. PythonWin and Tk expect Unicode. How could they > > display the characters correctly? > > No, PythonWin and Tk both tell apart Unicode and byte strings > (although Tk uses quite a funny algorithm to do so). If they see a > byte string, they convert it using the platform encoding (which is > user-settable on both Windows and Unix) to a Unicode string, and > display that. Not quite true for Tk: Tcl's 8-bit to Unicode conversion expects UTF-8. When it sees a lead byte with not enough trail bytes, the lead byte is copied as is. Naked trail bytes are also copied as is. Under Latin-1, the following three Python strings all result in the same Tcl string value: str = "åäö" str = u"åäö".encode("utf-8") str = u"åäö" But under a hypothetical platform encoding where "å" looks like a UTF-8 lead byte, and "ä" like a trail byte, this will fail (if you think that's unlikely, feel free to replace "å" and "ä" with other characters...).
Cheers /F From fredrik@effbot.org Sun Feb 11 10:34:32 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 11:34:32 +0100 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model Message-ID: <013401c09416$881b0f40$e46940d5@hagrid> > > In my opinion there should be *no* encoding default. New code should > > always specify an encoding. Old code should continue to work the same. > > However, matter-of-factually, you propose that ISO-8859-1 is the > default encoding, as this is the encoding that is used when converting > character strings to char* in the C API. I'd certainly call it a > default. It's not an encoding. It's the subset of Unicode that you can store in an 8-bit character. (If you have a problem with that, complain to the Unicode designers) Cheers /F From fredrik@effbot.org Sun Feb 11 10:46:09 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 11:46:09 +0100 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model Message-ID: <013c01c09417$e11c49a0$e46940d5@hagrid> > I really like the idea of the > > b"..." prefix > > Is anyone opposed? yes. > 1. [file]?open(filename, encoding, ...) you mean (?:file)?open, right? I still think we can reuse the builtin "open" primitive (and don't forget the text vs. binary mode issue -- binary files never have encodings). > 2. b"..." -0 (I'm sceptical) > 3. an encoding declaration at the top of files +1 > 4. that concatenating Python strings and Unicode strings should do the > "obvious" thing for characters from 127-255 and nothing for characters > beyond. +1 > 5. a bytestring type that behaves in every way shape and form like our > current string type but has a different type() and repr(). almost: it shouldn't implement text-related methods. isupper, upper, etc. don't make sense here. (but like in SRE, the *source* code should be reused) Cheers /F From fredrik@effbot.org Sun Feb 11 10:51:29 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 11:51:29 +0100 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model Message-ID: <014d01c09419$07dfbad0$e46940d5@hagrid> > >I would want to avoid the need for a 2.0-style 'default encoding', so I > >suggest it shouldn't be possible to mix this type with other strings: > > > >>>> "1"+b"2" > >Traceback (most recent call last): > > File "", line 1, in ? > >TypeError: cannot add type "binary" to string > >>>> "3"==b"3" > >0 a more pragmatic approach would be to assume ASCII encodings for binary data, and choke on non-ASCII chars. >>> "1" + b"2" 12 >>> "1" + buffer("2") 12 >>> "1" + b"\xff" ValueError: ASCII decoding error: ordinal not in range(128) Cheers /F From fredrik@effbot.org Sun Feb 11 11:00:44 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 12:00:44 +0100 Subject: [I18n-sig] Re: Strawman Proposal: Binary Strings Message-ID: <017f01c0941f$4b13a6d0$e46940d5@hagrid> > About changing .encode() or the existing codecs to return binary > strings instead of normal strings: I'm -1 on this one since it > will break existing code. -1. core features shouldn't return binary data in text strings. foo.upper() shouldn't work if "foo" isn't known to contain text. if this breaks code (not sure it does), the binary data type needs more work. > Instead, strings should probably carry along the encoding > information in an additional attribute (it is not always useful, > but can help in a few situations) provided that it is known. -1. evil.
Cheers /F From fredrik@effbot.org Sun Feb 11 11:08:08 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 12:08:08 +0100 Subject: [I18n-sig] Re: Strawman Proposal (2): Encoding attributes Message-ID: <018001c0941f$4c1140b0$e46940d5@hagrid> > > > Ah, ok. The encoding information will only be applied to literal > > > Unicode strings (u"text"), right ? > > > > No, that's very different than what I am suggesting. > > > > The encoding is applied to the *text file*. > > -1 and -1 on your -1. MAL, you're stuck in a "unicode strings are something special" modus operandi. the goal should be to get rid of u"foo" strings, not continue to make Python more and more dependent on this artificial distinction. > The result would be way to much breakage. I doubt it. Cheers /F From fredrik@effbot.org Sun Feb 11 11:20:09 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 12:20:09 +0100 Subject: [I18n-sig] Re: Strawman Proposal (2): Encoding attributes Message-ID: <018801c0941f$4ce42110$e46940d5@hagrid> > "Forced" presumes it's against their will. That's what I question. There > is nothing more Eurocentric than to embark on unilateral crusades for the > purported benefit of non-Euros who aren't asking for help <0.7 wink>. if you think that ASCII is good enough for european languages, or that europeans like having to use an approximation of their own language just because american programmers are lazy, I'm not sure you should be on this list at all . Cheers /F From fredrik@effbot.org Sun Feb 11 11:22:58 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 12:22:58 +0100 Subject: [I18n-sig] Re: Strawman Proposal (2): Encoding attributes Message-ID: <018901c0941f$4d3e4f00$e46940d5@hagrid> > > If it works and it is easy, there should not be a problem! > > This is how I started into the Unicode debate (making UTF-8 the default > encoding). It doesn't work out... let's not restart that discussion. this is not the same discussion. Cheers /F From fredrik@effbot.org Sun Feb 11 11:29:27 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 12:29:27 +0100 Subject: [I18n-sig] Re: Strawman Proposal: Smart String Test Message-ID: <018a01c0941f$4d9b8a30$e46940d5@hagrid> > type(foo)==type("") any reason we cannot just make this work, whether foo contains 8-bit or 16-bit data? btw, the preferred syntax is: isinstance(foo, type("")) I think it's okay only the latter works, for now (which can be solved by a simple and stupid hack, while waiting for a real type hierarchy...) Cheers /F From fredrik@effbot.org Sun Feb 11 11:34:12 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 12:34:12 +0100 Subject: [I18n-sig] Re: Strawman Proposal: Encoding Declaration V2 Message-ID: <018b01c0941f$4dd733a0$e46940d5@hagrid> > A source file with an encoding declaration must only use non-ASCII bytes > in places that can legally support Unicode characters. In Python 2.x the > only place is within a Unicode literal make that "in a string literal". if an encoding directive is present, the *entire* file should be assumed to use that encoding. this applies to comments, 8-bit string literals, and 16-bit string literals. Cheers /F From fredrik@effbot.org Sun Feb 11 11:53:50 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 12:53:50 +0100 Subject: [I18n-sig] Re: Python and Unicode == Britain and the Euro? 
Message-ID: <019e01c09421$c402bfc0$e46940d5@hagrid> tim wrote: > The Python Reference Manual says (chapter 2, "Lexical analysis"): > > Python uses the 7-bit ASCII character set for program text and > string literals. ...and then says "8-bit characters may be used in string literals and comments but their interpretation is platform dependent". for a non-ASCII programmer, that pretty much means "no native character set". Cheers /F From mal@lemburg.com Sun Feb 11 13:13:23 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sun, 11 Feb 2001 14:13:23 +0100 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> <3A860AA3.655F4207@ActiveState.com> Message-ID: <3A868FF3.45EEC501@lemburg.com> [Paul, it would help if you wouldn't always remove important parts of the quoted messages... people who don't read the whole thread won't have a chance to follow up] Paul Prescod wrote: > > Guido van Rossum wrote: > > > > ... > > > open(filename, encoding, [[mode], bytes]) > > > > > > And the documentation would say: > > > > > > "There is an obsolete variant that does not require an encoding string. > > > This may cause a warning in future versions of Python and be removed > > > sometime after that." > > > > I am appalled at this lack of respect for existing conventions, > > You're the one who told everyone to move from string functions to string > methods. This is a move of similar scope but for a much more important > purpose than merely changing coding style. > > > when a > > simple and obvious alternative (see below) is easily available. I > > will have a hard time not to take this into account when I finally get > > to reading up on your proposals. > > There is an important reason that we did not use a keyword argument. > > We (at least some subset of the people in the i18n-sig) want every > single new instance of the "open" function to declare an encoding. This doesn't make sense: not all uses of open() target text information. What encoding information would you put into an open() which wants to read a JPEG image from a file ? > Right > now we allow a lot of "ambiguous data" into the system. We do not know > whether the user meant it to be binary or textual data and so we don't > know the correct/valid coercions, conversions and operations. We are > trying to retroactively make an open function that strongly encourages > (and perhaps finally forces) people to make their intent known. > > The open extension is a backwards compatible way to allow people to move > from the "old" ambiguous form to the new form. I considered it pretty > well thought out in terms of backwards and forwards compatibility. 
We > could also just invent a new function like "file" or "fileopen" but > upgrading "open" seemed to show the *most* respect for existing > conventions (and clutters up builtins the least). We cannot turn override the mode parameter with an encoding parameter... why do you believe that this is backwards compatible in any way ? (Note that mode is an optional parameter!) The keyword argument approach gives us a much better way to integrate a new argument into the open() call: f = open(filename, encoding='mbcs', mode='w') or f = open(filename, 'w', encoding='mbcs') There's a little more typing required, but the readability is unbeatable... -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sun Feb 11 13:22:53 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sun, 11 Feb 2001 14:22:53 +0100 Subject: [I18n-sig] Random thoughts on Unicode and Python References: <14981.45051.945099.633730@cymru.basistech.com> <14982.4009.542031.914222@cymru.basistech.com> Message-ID: <3A86922D.AB5AB78E@lemburg.com> Tom Emerson wrote: > > Andy Robinson writes: > > (1) user defined characters: the big three Japanese encodings > > use the Kuten space of 94x94 characters. There are lots of slight > > venddor variations on the basic JIS0208 character set, as well > > as people adding new Gaiji in their office workgroups. Generic > > conversion routines from, say, EUC to Shift-JIS still work > > perfectly whether you use Shift-JIS, cp932, or cp932 plus > > ten extra in-house characters. Conversions to Unicode involve > > selecting new codecs, or even making new ones, for all these > > situations. > > There is no reason that we couldn't provide a set of unified codecs > for EUC-JP, Shift JIS, ISO-2022-JP, and CP932 that provide appropriate > mappings between the EUDC sections in the legacy character sets and > the PUA of Unicode, such that these conversions work. Right. > > (2) slightly corrupt data: Let's say you are dealing with files > > or database fields containing some truncated kanji. If you > > use 8-bit-clean strings and no conversion, the data will not > > be corrupted or changed; if you try to magically convert > > it to Unicode you will get error messages or possibly even > > more corruption. Maybe you're writing an app whose job is > > to get text from machine A to machine B without changing it; > > suddenly it will stop working. I know people who spent > > weeks debugging a VB print spooler which was cutting up > > Postscript files containing kanji. > > Yes, this is a problem that I cannot suggest a good answer to: reality > raises its ugly head. We won't be introducing new magic... > > Suddenly upgrading to a new version of Python where all > > your data undergoes invisible transformations to Unicode > > and back is going to cause trouble for quite a few people. > > Absolutely. ...and the move will be slow one for sure :-) I think that a lot of small steps are required to finally get there and I don't want to rush anything. Still, I believe that talking about all this now is not such a bad idea, even though it may cause some concern about the future direction of Python. Python's history has shown that the developers have always tried to maintain backward compatibility whereever possibleand feasable. 
This won't change, since it is one of the most important factors in Python's success story and there are enough people on python-dev who care about this a lot. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fredrik@effbot.org Sun Feb 11 13:34:26 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 14:34:26 +0100 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> <3A860AA3.655F4207@ActiveState.com> <3A868FF3.45EEC501@lemburg.com> Message-ID: <000701c0942f$5d08b780$e46940d5@hagrid> mal wrote: > > We (at least some subset of the people in the i18n-sig) want every > > single new instance of the "open" function to declare an encoding. > > This doesn't make sense: not all uses of open() target text > information. What encoding information would you put into an > open() which wants to read a JPEG image from a file ? how about: file = open("image.jpg", encoding="image/jpeg") image = file.read() # return a PIL image object or perhaps better: file = open("image.jpg", encoding="image/*") image = file.read() > We cannot turn override the mode parameter with an encoding > parameter... why do you believe that this is backwards compatible > in any way ? (Note that mode is an optional parameter!) instead of overriding, why not append the encoding to the mode parameter: "r" # default, read text file, unknown encoding "rb" # read binary file, no encoding" "r,utf-8" # read text file, utf-8 encoding "rb,ascii" # illegal mode (this is in line with C's fopen) Cheers /F From barry@digicool.com Sun Feb 11 14:30:13 2001 From: barry@digicool.com (Barry A. 
Warsaw) Date: Sun, 11 Feb 2001 09:30:13 -0500 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> <3A860AA3.655F4207@ActiveState.com> Message-ID: <14982.41461.889514.547839@anthem.wooz.org> >>>>> "PP" == Paul Prescod writes: PP> We (at least some subset of the people in the i18n-sig) want PP> every single new instance of the "open" function to declare an PP> encoding. I've barely followed this discussion at all, but what you say here causes my greatest nagging concern to bubble to the surface. I write lots of programs for which i18n isn't a requirement, and may never be. It seems like you saying that you want me to have to confront issues like encodings, character sets, unicode, multiplicity of string types, etc. in even the simplest, most xenophobic programs I write. That would be, IMO, a loss of epic proportions to the simplicity and "brain fitting" nature of Python. I have no problems, and in fact encourage, facilities in Python to help me i18n-ify my programs when I'm ready and need to. But not before. I really hope I'm misunderstanding. -Barry From mal@lemburg.com Sun Feb 11 14:33:48 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sun, 11 Feb 2001 15:33:48 +0100 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? References: <3A8593BF.8AFCEBB3@ActiveState.com> <20010211140545.49DF.BRIAN@tomigaya.shibuya.tokyo.jp> Message-ID: <3A86A2CC.BB64149B@lemburg.com> Brian Takashi Hooper wrote: > > Hi there, Brian in Tokyo again, > > On Sat, 10 Feb 2001 11:17:19 -0800 > Paul Prescod wrote: > > > Andy, I think that part of the reason that Westerners push harder for > > Unicode than Japanese is because we are pressured (rightly) to right > > software that works world-wide and it is simply not sane to try to do > > that by supporting multiple character sets. Multiple encodings maybe. > > Multiple character sets? Forget it. > I think this is a true and valid point (that Westerners are more likely > to want to make internationalized software), but it sounds here like > because Westerners want to make it easier to internationalize software, > that that is a valid reason to make it harder to make software that has > no particular need for internationalization, in non-Western languages, > and change the _meaning_ of such a basic data type as the Python string. > > If in fact, as the proposal proposes, usage of open() without an > encoding, for example, is at some point deprecated, then if I am > manipulating non-Unicode data in "" strings, then I think I _do_ at some > point have to port them over. b"" then becomes > different from "", because "" > is now automatically being interpreted behind the scenes into an > internal Unicode representation. 
If the blob of binary data actually > happened to be in Unicode, or some Unicode-favored representation (like > UTF-8), then I might be happy about this - but if it wasn't, I think > that this result would instead be rather dismaying. We are certainly not goind to make the encoding parameter mandatory for open(). What type the .read() method returns for a file opened using an encoding is dependent on the codec in use, e.g. a Unicode codec would return Unicod, but other codecs may choose to return an encoded 8-bit string instead (with encoding attribute set accordingly). There's still much to do down that road and I wouldn't take the current proposals too seriously yet. We are still in the idea gathering phase... > The current Unicode support is more explicit about this - the meaning of > the string literal itself has not changed, so I can continue to ignore > Unicode in cases where it serves no useful purpose. I realize that it > would be nicer from a design perspective, more consistent, to have > Python string mean only character data, but right now, it does sometimes > mean binary and sometimes mean characters. The only one who can > distinguish which is the programmer - if at some point "" means only > Unicode character strings, then the programmer _does_, I think, have to > go through all their programs looking for places where they are using > strings to hold non-Unicode character data, or binary data, and > explicitly convert them over. I have difficulty seeing how we would be > able to provide a smooth upgrade path - maybe a command-line backwards > compatibility option? Maybe defaults? I've heard a lot of people > voicing dislike for default encodings, but from my perspective, > something like ISO-Latin-1, or UTF-8, or even ASCII (EUC-JP and SJIS are, > strictly speaking, not supersets of ASCII because the ASCII ranges are > usually interpreted as JIS-Roman, which contains about 4 different > characters) is functionally a default encoding... Requiring encoding > declarations, as the proposal suggests, is nice for people working in > the i18n domain, but is an unnecessary inconvenience for those who are > not. First, I think that most string literals in programs are in fact text data, so switching to a text data type for "" wouldn't be such a big change. For those few cases, where these literals are used for binary data, switching to b"" doesn't really hurt. Of course, the programmer will have to rethink text vs. binary data, but this is what we are aiming at after all. Since this step can be too much of a burden for the programmer, we'll have to come up with a way which allows Python to maintain the old style behaviour, e.g. by telling Python to use a codec which returns a normal 8-bit string object instead of Unicode... #?encoding="old-style-strings" at the top of the source code would then do the trick. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sun Feb 11 18:26:04 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sun, 11 Feb 2001 19:26:04 +0100 Subject: [I18n-sig] Re: Strawman Proposal: Binary Strings References: <017f01c0941f$4b13a6d0$e46940d5@hagrid> Message-ID: <3A86D93C.66BB0233@lemburg.com> Fredrik Lundh wrote: > > > About changing .encode() or the existing codecs to return binary > > strings instead of normal strings: I'm -1 on this one since it > > will break existing code. > > -1. 
core features shouldn't return binary data in text strings. > foo.upper() shouldn't work if "foo" isn't known to contain text. > if this breaks code (not sure it does), the binary data type > needs more work. > > > Instead, strings should probably carry along the encoding > > information in an additional attribute (it is not always useful, > > but can help in a few situations) provided that it is known. > > -1. evil. Care to explain why ? (I think that such an attribute could be put to some good use in (re-)unifying strings and Unicode). -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sun Feb 11 18:33:19 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sun, 11 Feb 2001 19:33:19 +0100 Subject: [I18n-sig] Re: Strawman Proposal (2): Encoding attributes References: <018001c0941f$4c1140b0$e46940d5@hagrid> Message-ID: <3A86DAEE.74F44655@lemburg.com> Fredrik Lundh wrote: > > > > > Ah, ok. The encoding information will only be applied to literal > > > > Unicode strings (u"text"), right ? > > > > > > No, that's very different than what I am suggesting. > > > > > > The encoding is applied to the *text file*. > > > > -1 > > and -1 on your -1. > > MAL, you're stuck in a "unicode strings are something special" modus > operandi. the goal should be to get rid of u"foo" strings, not continue > to make Python more and more dependent on this artificial distinction. Unicode strings *are* special: they can only be used for text data. I we were to decode the whole source code file using some encoding, then use of binary data in standard ""-literals could and probably would lead to decoding errors. Some encodings even play with ASCII-characters (just take a look at the codecs in encodings/), so these would break standard program text as well. > > The result would be way to much breakage. > > I doubt it. Anyway, the two bullets I suggested on this thread implement a subset of what you (Paul and Fredrik) have in mind, so I believe it's a good compromise. We can always extend this to full text file decoding at some later stage, if that should become necessary, which I doubt ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tim.one@home.com Sun Feb 11 21:32:07 2001 From: tim.one@home.com (Tim Peters) Date: Sun, 11 Feb 2001 16:32:07 -0500 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? In-Reply-To: <3A865CA7.249910F1@ActiveState.com> Message-ID: [Tim quotes the Ref Man] > Python uses the 7-bit ASCII character set for program text and > string literals. > > That was Guido's intent, ... [Paul Prescod] > That may be the rule but try enforcing it. It is so widely violated > as to be irrelevant. Not news -- why do you suppose it isn't enforced ? > I would love it if you did try to enforce it in Python 2.1. You > would take the heat for breaking everyone's non-ASCII programs > and then I could come in and propose the draconian rule be eased with > the encoding declaration. Your life would indeed be easier then. > The wide violation of this rule should inform our discussions about > where Python source code is going in the future... 
In theory you'd hope it would aid your case, but in practice I'm afraid it works against you: people with 8-bit character sets covered by C locale gimmicks seemed happier before Unicode was added. Also not news, of course -- Unicode irritates everyone, because it's nobody's national encoding scheme. From tim.one@home.com Sun Feb 11 21:32:08 2001 From: tim.one@home.com (Tim Peters) Date: Sun, 11 Feb 2001 16:32:08 -0500 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) In-Reply-To: <000701c0942f$5d08b780$e46940d5@hagrid> Message-ID: [/F] > ... > instead of overriding, why not append the encoding to > the mode parameter: Bingo. > "r" # default, read text file, unknown encoding > "rb" # read binary file, no encoding" > "r,utf-8" # read text file, utf-8 encoding > "rb,ascii" # illegal mode Don't know why the last should be illegal; whether I want line-end translation done, or want Ctrl-Z to signify EOF, or etc (all the goofy x-platform distinctions made by binary vs text modes) seems independent of how character data is encoded. From tim.one@home.com Sun Feb 11 21:43:41 2001 From: tim.one@home.com (Tim Peters) Date: Sun, 11 Feb 2001 16:43:41 -0500 Subject: [I18n-sig] Re: Python and Unicode == Britain and the Euro? In-Reply-To: <019e01c09421$c402bfc0$e46940d5@hagrid> Message-ID: >> The Python Reference Manual says (chapter 2, "Lexical analysis"): >> >> Python uses the 7-bit ASCII character set for program text and >> string literals. [/F] > ...and then says "8-bit characters may be used in string literals > ad comments but their interpretation is platform dependent". > > for a non-ASCII programmer, that pretty much means "no native > character set". Absolutely. That's why the Ref Man also says: the proper way to insert 8-bit characters in string literals is by using octal or hexadecimal escape sequences Note too that Python opens Python source files in C text mode, and C doesn't guarantee that high-bit characters can be faithfully written to or read back from text-mode files either. What's the point? As I said before, the *intent* was that Python source code use 7-bit ASCII. All we're demonstrating here is the various ways in which the Ref Man is consistent with that intent. Go beyond that, and if "it works" you're seeing a platform accident, albeit a reliable accident on the major Python platforms. From paulp@ActiveState.com Sun Feb 11 21:49:38 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 13:49:38 -0800 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> <3A860AA3.655F4207@ActiveState.com> <14982.41461.889514.547839@anthem.wooz.org> Message-ID: <3A8708F2.669B0A2C@ActiveState.com> "Barry A. Warsaw" wrote: > > ... 
> > I've barely followed this discussion at all, but what you say here > causes my greatest nagging concern to bubble to the surface. I write > lots of programs for which i18n isn't a requirement, and may never be. > It seems like you saying that you want me to have to confront issues > like encodings, character sets, unicode, multiplicity of string types, > etc. in even the simplest, most xenophobic programs I write. That > would be, IMO, a loss of epic proportions to the simplicity and "brain > fitting" nature of Python. file = open("/etc/passwd", "r", "ASCII") Surely that is not such a terrible burden in the interests of making the world a little bit less xenophobic! Once you do that, everything else "just works" and when your program encounters data it can't handle in a text file it will crash in a predictable way at a logical point (the read function) instead of in an unpredictable way at an illogical point (some random string coercion or API call). Paul Prescod From paulp@ActiveState.com Sun Feb 11 21:57:09 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 13:57:09 -0800 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model References: <013c01c09417$e11c49a0$e46940d5@hagrid> Message-ID: <3A870AB5.AE1BE6A2@ActiveState.com> Fredrik Lundh wrote: > > > I really like the idea of the > > > > b"..." prefix > > > > Is anyone opposed? > > yes. Could you please describe your problem? We almost had total agreement on this feature. It was a near miracle! As you probably know, the idea behind it is to allow people to continue to put binary data (especially native encoding data) in some form of string literal and to manipulate that data as binary "automatically." > almost: it shouldn't implement text-related method. isupper, upper, > etc doesn't make sense here. Agree. > (but like in SRE, the *source* code should be reused) Agree. Paul Prescod From paulp@ActiveState.com Sun Feb 11 22:05:00 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 14:05:00 -0800 Subject: [I18n-sig] Re: Python and Unicode == Britain and the Euro? References: Message-ID: <3A870C8C.78066BC@ActiveState.com> Tim Peters wrote: > > ... > > What's the point? As I said before, the *intent* was that Python source > code use 7-bit ASCII. All we're demonstrating here is the various ways in > which the Ref Man is consistent with that intent. Go beyond that, and if > "it works" you're seeing a platform accident, albeit a reliable accident on > the major Python platforms. I still don't understand the point... It's like saying that Vancouver doesn't need drug rehab clinics because drugs are illegal here. We can't move to all-ASCII text files at this point even if we have legal/historical justifications for doing so. The best we can do is try to limit the damage of having the non-ASCII stuff floating around without labels. Paul Prescod From tim.one@home.com Sun Feb 11 22:04:32 2001 From: tim.one@home.com (Tim Peters) Date: Sun, 11 Feb 2001 17:04:32 -0500 Subject: [I18n-sig] Re: Strawman Proposal (2): Encoding attributes In-Reply-To: <018801c0941f$4ce42110$e46940d5@hagrid> Message-ID: [Tim] > "Forced" presumes it's against their will. That's what I > question. There is nothing more Eurocentric than to embark > on unilateral crusades for the purported benefit of non- > Euros who aren't asking for help <0.7 wink>. [/F] > if you think that ASCII is good enough for european languages, No, but programming language identifiers are an artificial language. 
Python isn't it itself a word processor, and you may as well complain that Python requires "." in numeric literals (rather than ",", or an American Indian glyph meaning "sacred fork between the mighty Integer and Fractional rivers" <0.9 wink>). > or that europeans like having to use an approximation of their > own language Ditto. > just because american programmers are lazy, It's really that Euros are too lazy to learn English . > I'm not sure you should be on this list at all . Unclear whether you're arguing to allow full Unicode in Python identifiers (which is all I'm talking about). You really want getattr() to sort out Unicode in full generality (thinking specifically of "ignorable" characters -- if you don't ignore them, you're screwing somebody else's native tongue) at runtime? I don't want to see Python get anywhere that mess. If you're implementing a word processor *in* Python, fine, you can deal with it and Python should support you. It doesn't need to complicate its own artificial language to do so. From paulp@ActiveState.com Sun Feb 11 22:21:18 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 14:21:18 -0800 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model References: <014d01c09419$07dfbad0$e46940d5@hagrid> Message-ID: <3A87105E.BE3F41D5@ActiveState.com> Fredrik Lundh wrote: > > ... > > a more pragmatic approach would be to assume ASCII en- > codings for binary data, and choke on non-ASCII chars. > > >>> "1" + b"2" > 12 > >>> "1" + buffer("2") > 12 > >>> "1" + b"\xff" > ValueError: ASCII decoding error: ordinal not in range(128) I think that that is the most consistent approach. We should define a "string type" as one that has compatible with the regular expression engine, has some defined set of string-like methods and allows conversion of ordinals less than 128 according to ASCII rules. Paul Prescod From paulp@ActiveState.com Sun Feb 11 22:24:21 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 14:24:21 -0800 Subject: [I18n-sig] Re: Strawman Proposal: Smart String Test References: <018a01c0941f$4d9b8a30$e46940d5@hagrid> Message-ID: <3A871115.FF91B436@ActiveState.com> Fredrik Lundh wrote: > > ... > > isinstance(foo, type("")) > > I think it's okay only the latter works, for now (which can > be solved by a simple and stupid hack, while waiting for a > real type hierarchy...) I have two concerns. First I'm not thrilled with having isinstance have specific knowledge of string types. People will ask us: "How do I set up a type hierarchy like the string hierarchy?" And they can't...an isstring() function is clear about the fact that it is special. My second concern is that this might break a little bit of code. For instance something like this: if issinstance(foo, type("")): print foo elif issinstance(foo, type(u"")): print foo.encode("UTF-8") Paul Prescod From paulp@ActiveState.com Sun Feb 11 22:31:12 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 14:31:12 -0800 Subject: [I18n-sig] Re: Strawman Proposal: Encoding Declaration V2 References: <018b01c0941f$4dd733a0$e46940d5@hagrid> Message-ID: <3A8712B0.E45A503D@ActiveState.com> Fredrik Lundh wrote: > > > A source file with an encoding declaration must only use non-ASCII bytes > > in places that can legally support Unicode characters. In Python 2.x the > > only place is within a Unicode literal > > make that "in a string literal". Yes, I think you're right. 
If a person needs to get at a Latin 1 character in a string literal they should be able to do so using > if an encoding directive is present, the *entire* file should be > assumed to use that encoding. this applies to comments, 8-bit > string literals, and 16-bit string literals. I've backed off somewhat on having the file be pre-decoded in the short term. My major conceptual problem is if we decode to Unicode-escaped ASCII or something then we mess up the column numbers and the syntax errors will not be right. We might really need to have a Unicode-aware parser before we can do this... Paul Prescod From paulp@ActiveState.com Sun Feb 11 22:35:42 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 14:35:42 -0800 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> <3A860AA3.655F4207@ActiveState.com> <3A868FF3.45EEC501@lemburg.com> Message-ID: <3A8713BE.9AC80542@ActiveState.com> "M.-A. Lemburg" wrote: > > [Paul, it would help if you wouldn't always remove important parts > of the quoted messages... people who don't read the whole thread > won't have a chance to follow up] I think we have different interpretations of important... > Paul Prescod wrote: > > > > ... > > There is an important reason that we did not use a keyword argument. > > > > We (at least some subset of the people in the i18n-sig) want every > > single new instance of the "open" function to declare an encoding. > > This doesn't make sense: not all uses of open() target text > information. What encoding information would you put into an > open() which wants to read a JPEG image from a file ? "binary" or "raw" > f = open(filename, 'w', encoding='mbcs') > > There's a little more typing required, but the readability is > unbeatable... Why not go all the way: f = open(filename=filename, mode='w', encoding='mbcs') Keyword attributes are great for optional parameters. I don't see encoding as optional. Anyhow, I like Fredrick's idea of extending the mode string. Paul Prescod From tim.one@home.com Sun Feb 11 22:42:22 2001 From: tim.one@home.com (Tim Peters) Date: Sun, 11 Feb 2001 17:42:22 -0500 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: <3A865F6A.6CC12CC3@ActiveState.com> Message-ID: [Tim, continues to question whether Unicode identifiers are market-driven or head-driven] [Paul Prescod] > We will see. Before Unicode it would have been very hard to do this and > yet achieve source code portability between systems. Unicode and the > tools and languages that use it are just being deployed. Java has supported Unicode identifiers since its start, and is far more widely used than Python. 
If you can't find supporting evidence of actual user demand there (I failed to) ... > ... > But I'll say again that I think it would be a big mistake to add > any further impedements to getting there. Who has proposed adding an impediment? If someone did, I missed it. >> it's-a-programming-language-not-a-word-processor-ly y'rs - tim > I don't understand your fundamental point. Simplicity. I like the ECMAScript (nee JavaScript) rule: identifiers are Unicode. But only a subset of the first 128 Unicode characters are allowed <0.9 wink>. > ... > I'm not claiming it's a burning need, but I don't see why a Japanese > teenager learning to program for the first time would choose to use a > language that requires English variable names over one that offered > choice. Try asking one? For example, ask Yukihiro Matsumoto why Ruby's set of allowed identifiers is the same as Python's. If a Japanese language designer sees no need to support Japanese identifiers, I'm not going to presume I know Japanese programmer needs better than him -- or that you do either. > ... > Where's Greg Wilson when I need him? Doubt he's on this SIG; mailto:gvwilson@nevex.com. From mal@lemburg.com Sun Feb 11 22:47:46 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sun, 11 Feb 2001 23:47:46 +0100 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> <3A860AA3.655F4207@ActiveState.com> <3A868FF3.45EEC501@lemburg.com> <3A8713BE.9AC80542@ActiveState.com> Message-ID: <3A871692.3370A468@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > > Paul Prescod wrote: > > > > > > ... > > > There is an important reason that we did not use a keyword argument. > > > > > > We (at least some subset of the people in the i18n-sig) want every > > > single new instance of the "open" function to declare an encoding. > > > > This doesn't make sense: not all uses of open() target text > > information. What encoding information would you put into an > > open() which wants to read a JPEG image from a file ? > > "binary" or "raw" I'm -1 on enforcing this. Encoding is optional and has to be, since 1. existing programs don't provide the parameter and would break 2. the user can't know in advance if the file to be opened is of a certain encoding or type (e.g. image or sound file) 3. not all files contain encoded data for which Python provides a codec for decoding 4. it may not be in the programers intent to have the file decoded even though it uses a certain encoding > > f = open(filename, 'w', encoding='mbcs') > > > > There's a little more typing required, but the readability is > > unbeatable... > > Why not go all the way: > > f = open(filename=filename, mode='w', encoding='mbcs') Now you're being silly... 
> Keyword attributes are great for optional parameters. I don't see > encoding as optional. Anyhow, I like Fredrick's idea of extending the > mode string. I'm not sure I like it -- it looks like a hack to me and I don't really see what's so bad about an optional keyword argument for open(). At least noone has yet convinced me of any problems with it. Note that Fredrik's idea doesn't make the encoding parameter a requirement either (and this is Goodness). -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From paulp@ActiveState.com Sun Feb 11 22:56:22 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 14:56:22 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: Message-ID: <3A871896.2246587B@ActiveState.com> Tim Peters wrote: > > ... > > Java has supported Unicode identifiers since its start, and is far more > widely used than Python. If you can't find supporting evidence of actual > user demand there (I failed to) ... Java is a programming language for professional programmers. They think it is natural to compare two strings with the "isequal" method. Anyone in that mindset would find romanji natural too! > > ... > > But I'll say again that I think it would be a big mistake to add > > any further impedements to getting there. > > Who has proposed adding an impediment? If someone did, I missed it. There was a suggestion of having the encoding declaration only apply to unicode strings. Special characters in comments and literal strings would be interpreted as Latin 1. Now several years from now we'd have to invent another encoding declaration for non-string stuff. > Try asking one? For example, ask Yukihiro Matsumoto why Ruby's set of > allowed identifiers is the same as Python's. If a Japanese language > designer sees no need to support Japanese identifiers, I'm not going to > presume I know Japanese programmer needs better than him -- or that you do > either. I don't presume to know what they want but I do know that people's needs change and anticipating that is part of systems design in general and language design in particular. > > ... > > Where's Greg Wilson when I need him? > > Doubt he's on this SIG; mailto:gvwilson@nevex.com. Twas more of a joke... Paul Prescod From brian@tomigaya.shibuya.tokyo.jp Mon Feb 12 01:27:23 2001 From: brian@tomigaya.shibuya.tokyo.jp (Brian Takashi Hooper) Date: Mon, 12 Feb 2001 10:27:23 +0900 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? In-Reply-To: <3A86A2CC.BB64149B@lemburg.com> References: <20010211140545.49DF.BRIAN@tomigaya.shibuya.tokyo.jp> <3A86A2CC.BB64149B@lemburg.com> Message-ID: <20010212100732.81C6.BRIAN@tomigaya.shibuya.tokyo.jp> Thanks for the clarifications, Marc-Andre. I have no problem with following new conventions, when they are decided upon, for new programs - I just don't want old programs to break _too_ much; measures like the encoding directives, if they are implemented properly, are needed I feel to ease a transition to a new paradigm. Also, I hadn't yet read Paul's "Guidelines for Language Evolution" PEP, which I just now _did_ read, if the change is gradual and provides warning messages for deprecated constructs, then that makes this proposal seem less scary (does this mean that it might also be time to start thinking about the workings of a "deprecation and warning facility" as described in that document, also?) 
--Brian On Sun, 11 Feb 2001 15:33:48 +0100 "M.-A. Lemburg" wrote: > Brian Takashi Hooper wrote: > > > > Hi there, Brian in Tokyo again, > > > > On Sat, 10 Feb 2001 11:17:19 -0800 > > Paul Prescod wrote: > > > > > Andy, I think that part of the reason that Westerners push harder for > > > Unicode than Japanese is because we are pressured (rightly) to right > > > software that works world-wide and it is simply not sane to try to do > > > that by supporting multiple character sets. Multiple encodings maybe. > > > Multiple character sets? Forget it. > > I think this is a true and valid point (that Westerners are more likely > > to want to make internationalized software), but it sounds here like > > because Westerners want to make it easier to internationalize software, > > that that is a valid reason to make it harder to make software that has > > no particular need for internationalization, in non-Western languages, > > and change the _meaning_ of such a basic data type as the Python string. > > > > If in fact, as the proposal proposes, usage of open() without an > > encoding, for example, is at some point deprecated, then if I am > > manipulating non-Unicode data in "" strings, then I think I _do_ at some > > point have to port them over. b"" then becomes > > different from "", because "" > > is now automatically being interpreted behind the scenes into an > > internal Unicode representation. If the blob of binary data actually > > happened to be in Unicode, or some Unicode-favored representation (like > > UTF-8), then I might be happy about this - but if it wasn't, I think > > that this result would instead be rather dismaying. > > We are certainly not goind to make the encoding parameter > mandatory for open(). What type the .read() method returns for > a file opened using an encoding is dependent on the codec in > use, e.g. a Unicode codec would return Unicod, but other codecs > may choose to return an encoded 8-bit string instead (with encoding > attribute set accordingly). > > There's still much to do down that road and I wouldn't take the > current proposals too seriously yet. We are still in the idea > gathering phase... > > > The current Unicode support is more explicit about this - the meaning of > > the string literal itself has not changed, so I can continue to ignore > > Unicode in cases where it serves no useful purpose. I realize that it > > would be nicer from a design perspective, more consistent, to have > > Python string mean only character data, but right now, it does sometimes > > mean binary and sometimes mean characters. The only one who can > > distinguish which is the programmer - if at some point "" means only > > Unicode character strings, then the programmer _does_, I think, have to > > go through all their programs looking for places where they are using > > strings to hold non-Unicode character data, or binary data, and > > explicitly convert them over. I have difficulty seeing how we would be > > able to provide a smooth upgrade path - maybe a command-line backwards > > compatibility option? Maybe defaults? I've heard a lot of people > > voicing dislike for default encodings, but from my perspective, > > something like ISO-Latin-1, or UTF-8, or even ASCII (EUC-JP and SJIS are, > > strictly speaking, not supersets of ASCII because the ASCII ranges are > > usually interpreted as JIS-Roman, which contains about 4 different > > characters) is functionally a default encoding... 
Requiring encoding > > declarations, as the proposal suggests, is nice for people working in > > the i18n domain, but is an unnecessary inconvenience for those who are > > not. > > First, I think that most string literals in programs are > > in fact text data, so switching to a text data type for "" > > wouldn't be such a big change. For those few cases, where > > these literals are used for binary data, switching to b"" > > doesn't really hurt. > > > > Of course, the programmer will have to rethink text vs. binary > > data, but this is what we are aiming at after all. > > > > Since this step can be too much of a burden for the programmer, > > we'll have to come up with a way which allows Python to maintain the > > old style behaviour, e.g. by telling Python to use a codec which > > returns a normal 8-bit string object instead of Unicode... > > > > #?encoding="old-style-strings" > > > > at the top of the source code would then do the trick. > > > > -- > > Marc-Andre Lemburg > > ______________________________________________________________________ > > Company: http://www.egenix.com/ > > Consulting: http://www.lemburg.com/ > > Python Pages: http://www.lemburg.com/python/ -- Brian Takashi Hooper From tdickenson@geminidataloggers.com Mon Feb 12 08:01:58 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Mon, 12 Feb 2001 08:01:58 +0000 Subject: [I18n-sig] Strawman Proposal: Smart String Test In-Reply-To: References: <4o588t4683cpu32srcmp928b1m5dr003i3@4ax.com> Message-ID: On Fri, 9 Feb 2001 08:40:28 -0800 (PST), Paul Prescod wrote: >On Fri, 9 Feb 2001, Toby Dickenson wrote: >> Paul Prescod wrote: >> >Is there a practical problem with this solution? >> >> def isstring(obj): >> return type(obj) in (StringType, UnicodeType) or isinstance(obj, >> UserString) > >Are you saying that there is a problem with isstring? Or proposing a >slightly different formulation? At the moment we don't have a tight definition of the 'string interface'. While I think we can agree that old code which uses type(x)==StringType is probably wrong, I'm not sure we can agree what that code should be using without examining that code. Note that several similar interface-testing functions are very rarely used (operator.isNumberType, operator.isMappingType), and Python doesn't have functions for other more popular interfaces (no isFileType, for example). Toby Dickenson tdickenson@geminidataloggers.com From tim.one@home.com Mon Feb 12 08:10:01 2001 From: tim.one@home.com (Tim Peters) Date: Mon, 12 Feb 2001 03:10:01 -0500 Subject: [I18n-sig] RE: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <01a401c090fd$5165b700$0900a8c0@SPIFF> Message-ID: [Neil Hodgson] > Matz: "We don't believe there can be any single character > encoding that encompasses all the world's languages. We want > to handle multiple encodings at the same time (if you want to). [/F] > neither do the unicode designers, of course: the point > is that unicode only deals with glyphs, not languages. > > most existing japanese encodings also include language info, > and if you don't understand the difference, it's easy to think > that unicode sucks... It would be helpful to read Matz's quote in context: http://www.deja.com/getdoc.xp?AN=705520466&fmt=text The "encompasses all the world's languages" business was taken verbatim from the question to which he was replying.
His concerns for Unicoded Japanese are about time efficiency for conversions from ubiquitous national encodings; relative (lack of) space efficiency for UTF-8 storage of Unicoded Japanese (unclear why he's hung up on UTF-8, though -- but it's an ongoing theme in c.l.ruby); and that Unicode (including surrogates) is too small and too late for parts of his market: I was thinking of applications that process big character set (e.g. Mojikyo set) which is not covered by Unicode. I don't know exactly how many code points it has. But I've heard it's pretty big, possibly consumes half of surrogate space. And they want to process them now. I think they don't want to wait Unicode consortium to assign code points for their characters. The first hit I found on Mojikyo was for a freely downloadable "Mojikyo Font Set", containing about 50,000 Chinese glyphs beyond those covered by Unicode, + about 20,000 more from other Asian languages. Python better move fast lest it lose the Oracle Bone market to Ruby . a-2-byte-encoding-space-was-too-small-the-day-unicode-was-conceived- and-20-bits-won't-last-either-ly y'rs - tim From tdickenson@geminidataloggers.com Mon Feb 12 08:11:55 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Mon, 12 Feb 2001 08:11:55 +0000 Subject: [I18n-sig] Strawman Proposal: Binary Strings In-Reply-To: <3A85568A.5B694917@lemburg.com> References: <3A830091.3D855EDD@ActiveState.com> <3A85568A.5B694917@lemburg.com> Message-ID: On Sat, 10 Feb 2001 15:56:10 +0100, "M.-A. Lemburg" wrote: >Note that changing e.g. .encode('latin-1') to return a binary string >doesn't really make sense, since here we know the encoding ! Instead, >strings should probably carry along the encoding information in an >additional attribute (it is not always useful, but can help in >a few situations) provided that it is known. To what use would that encoding attribute be put? surely not to provide automatic encoding when these tagged strings interact with unicode strings (Thats back towards the solution that I think we already ruled out) If .encode('latin1') or .encode('utf8') are going to return anything tagged with an encoding, then surely it should be a tagged binary string? Toby Dickenson tdickenson@geminidataloggers.com From andy@reportlab.com Mon Feb 12 08:19:09 2001 From: andy@reportlab.com (Andy Robinson) Date: Mon, 12 Feb 2001 08:19:09 -0000 Subject: [I18n-sig] Random thoughts on Unicode and Python In-Reply-To: <3A86922D.AB5AB78E@lemburg.com> Message-ID: > -----Original Message----- > From: M.-A. Lemburg [mailto:mal@lemburg.com] > Sent: 11 February 2001 13:23 > To: tree@basistech.com > Cc: Andy Robinson; i18n-sig@python.org > Subject: Re: [I18n-sig] Random thoughts on Unicode and Python > > > Tom Emerson wrote: > > > > Andy Robinson writes: > > > (1) user defined characters: the big three Japanese encodings > > > use the Kuten space of 94x94 characters. There are lots > of slight > > > venddor variations on the basic JIS0208 character set, as well > > > as people adding new Gaiji in their office workgroups. Generic > > > conversion routines from, say, EUC to Shift-JIS still work > > > perfectly whether you use Shift-JIS, cp932, or cp932 plus > > > ten extra in-house characters. Conversions to Unicode involve > > > selecting new codecs, or even making new ones, for all these > > > situations. 
> > > > There is no reason that we couldn't provide a set of > unified codecs > > for EUC-JP, Shift JIS, ISO-2022-JP, and CP932 that > provide appropriate > > mappings between the EUDC sections in the legacy > character sets and > > the PUA of Unicode, such that these conversions work. > > Right. Exactly. Both the problems I mentioned can and should be solved properly with Unicode. I'm just noting that a while bunch of people have solved them without Unicode in the past and that's where to look for code that will break. - Andy p.s. and yes, I'm working on those extended codecs now. From mal@lemburg.com Mon Feb 12 10:24:32 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 12 Feb 2001 11:24:32 +0100 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? References: <20010211140545.49DF.BRIAN@tomigaya.shibuya.tokyo.jp> <3A86A2CC.BB64149B@lemburg.com> <20010212100732.81C6.BRIAN@tomigaya.shibuya.tokyo.jp> Message-ID: <3A87B9E0.D9A2A598@lemburg.com> Brian Takashi Hooper wrote: > > Thanks for the clarifications, Marc-Andre. > > I have no problem with following new conventions, when they are decided > upon, for new programs - I just don't want old programs to break _too_ much; > measures like the encoding directives, if they are implemented properly, > are needed I feel to ease a transition to a new paradigm. Good to have you back on board :-) > Also, I hadn't yet read Paul's "Guidelines for Language Evolution" PEP, > which I just now _did_ read, if the change is gradual and provides > warning messages for deprecated constructs, then that makes this > proposal seem less scary (does this mean that it might also be time to > start thinking about the workings of a "deprecation and warning facility" > as described in that document, also?) Right. The warning facility is already in place in 2.1: Guido added a complete warning framework which is currently used to warn about deprecated module usage like e.g. regex, regsub, etc. > --Brian > > On Sun, 11 Feb 2001 15:33:48 +0100 > "M.-A. Lemburg" wrote: > > > Brian Takashi Hooper wrote: > > > > > > Hi there, Brian in Tokyo again, > > > > > > On Sat, 10 Feb 2001 11:17:19 -0800 > > > Paul Prescod wrote: > > > > > > > Andy, I think that part of the reason that Westerners push harder for > > > > Unicode than Japanese is because we are pressured (rightly) to right > > > > software that works world-wide and it is simply not sane to try to do > > > > that by supporting multiple character sets. Multiple encodings maybe. > > > > Multiple character sets? Forget it. > > > I think this is a true and valid point (that Westerners are more likely > > > to want to make internationalized software), but it sounds here like > > > because Westerners want to make it easier to internationalize software, > > > that that is a valid reason to make it harder to make software that has > > > no particular need for internationalization, in non-Western languages, > > > and change the _meaning_ of such a basic data type as the Python string. > > > > > > If in fact, as the proposal proposes, usage of open() without an > > > encoding, for example, is at some point deprecated, then if I am > > > manipulating non-Unicode data in "" strings, then I think I _do_ at some > > > point have to port them over. b"" then becomes > > > different from "", because "" > > > is now automatically being interpreted behind the scenes into an > > > internal Unicode representation. 
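As a hedged illustration of how that framework is used (the message text and module name below are only examples, not the exact code shipped with 2.1):

import warnings

# A deprecated module can flag its own use on import:
warnings.warn("the regsub module is deprecated; use the re module instead",
              DeprecationWarning)

# Code that is not ready to migrate can silence the category explicitly:
warnings.filterwarnings("ignore", category=DeprecationWarning)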
If the blob of binary data actually > > > happened to be in Unicode, or some Unicode-favored representation (like > > > UTF-8), then I might be happy about this - but if it wasn't, I think > > > that this result would instead be rather dismaying. > > > > We are certainly not goind to make the encoding parameter > > mandatory for open(). What type the .read() method returns for > > a file opened using an encoding is dependent on the codec in > > use, e.g. a Unicode codec would return Unicod, but other codecs > > may choose to return an encoded 8-bit string instead (with encoding > > attribute set accordingly). > > > > There's still much to do down that road and I wouldn't take the > > current proposals too seriously yet. We are still in the idea > > gathering phase... > > > > > The current Unicode support is more explicit about this - the meaning of > > > the string literal itself has not changed, so I can continue to ignore > > > Unicode in cases where it serves no useful purpose. I realize that it > > > would be nicer from a design perspective, more consistent, to have > > > Python string mean only character data, but right now, it does sometimes > > > mean binary and sometimes mean characters. The only one who can > > > distinguish which is the programmer - if at some point "" means only > > > Unicode character strings, then the programmer _does_, I think, have to > > > go through all their programs looking for places where they are using > > > strings to hold non-Unicode character data, or binary data, and > > > explicitly convert them over. I have difficulty seeing how we would be > > > able to provide a smooth upgrade path - maybe a command-line backwards > > > compatibility option? Maybe defaults? I've heard a lot of people > > > voicing dislike for default encodings, but from my perspective, > > > something like ISO-Latin-1, or UTF-8, or even ASCII (EUC-JP and SJIS are, > > > strictly speaking, not supersets of ASCII because the ASCII ranges are > > > usually interpreted as JIS-Roman, which contains about 4 different > > > characters) is functionally a default encoding... Requiring encoding > > > declarations, as the proposal suggests, is nice for people working in > > > the i18n domain, but is an unnecessary inconvenience for those who are > > > not. > > > > First, I think that most string literals in programs are > > in fact text data, so switching to a text data type for "" > > wouldn't be such a big change. For those few cases, where > > these literals are used for binary data, switching to b"" > > doesn't really hurt. > > > > Of course, the programmer will have to rethink text vs. binary > > data, but this is what we are aiming at after all. > > > > Since this step can be too much of a burden for the programmer, > > we'll have to come up with a way which allows Python to maintain the > > old style behaviour, e.g. by telling Python to use a codec which > > returns a normal 8-bit string object instead of Unicode... > > > > #?encoding="old-style-strings" > > > > at the top of the source code would then do the trick. 
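(As a side note on the point above about what .read() returns: for Unicode codecs the codecs module already behaves this way today; a minimal sketch, assuming a UTF-8 encoded file -- the file name is invented:)

    import codecs

    f = codecs.open('data.txt', 'r', 'utf-8')   # wraps the file in a stream reader
    u = f.read()                                # a unicode object, decoded from UTF-8
    f.close()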
> > > > -- > > Marc-Andre Lemburg > > ______________________________________________________________________ > > Company: http://www.egenix.com/ > > Consulting: http://www.lemburg.com/ > > Python Pages: http://www.lemburg.com/python/ > > -- > Brian Takashi Hooper -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Mon Feb 12 10:39:13 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 12 Feb 2001 11:39:13 +0100 Subject: [I18n-sig] RE: [Python-Dev] Pre-PEP: Python Character Model References: Message-ID: <3A87BD51.3088DACA@lemburg.com> Tim Peters wrote: > > [Neil Hodgson] > > Matz: "We don't believe there can be any single characer- > > encoding that encompasses all the world's languages. We want > > to handle multiple encodings at the same time (if you want to). > > [/F] > > neither does the unicode designers, of course: the point > > is that unicode only deals with glyphs, not languages. > > > > most existing japanese encodings also include language info, > > and if you don't understand the difference, it's easy to think > > that unicode sucks... > > It would be helpful to read Matz's quote in context: > > http://www.deja.com/getdoc.xp?AN=705520466&fmt=text > > The "encompasses all the world's languages" business was taken verbatim from > the question to which he was replying. His concerns for Unicoded Japanese > are about time efficiency for conversions from ubiquitous national > encodings; relative (lack of) space efficiency for UTF-8 storage of Unicoded > Japanese (unclear why he's hung up on UTF-8, though -- but it's an ongoing > theme in c.l.ruby); and that Unicode (including surrogates) is too small and > too late for parts of his market: > > I was thinking of applications that process big character > set (e.g. Mojikyo set) which is not covered by Unicode. I > don't know exactly how many code points it has. But I've > heard it's pretty big, possibly consumes half of surrogate > space. And they want to process them now. I think they > don't want to wait Unicode consortium to assign code points > for their characters. > > The first hit I found on Mojikyo was for a freely downloadable "Mojikyo Font > Set", containing about 50,000 Chinese glyphs beyond those covered by > Unicode, + about 20,000 more from other Asian languages. Python better move > fast lest it lose the Oracle Bone market to Ruby . > > a-2-byte-encoding-space-was-too-small-the-day-unicode-was-conceived- > and-20-bits-won't-last-either-ly y'rs - tim Has anyone ever considered the problems this causes for type designers ? Who is going to do the job of designing 2^20 character glyphs to all match the same font design guidelines ? Perhaps I'm missing something here, but this sounds like Just is going to have a bright future ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Mon Feb 12 10:53:55 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 12 Feb 2001 11:53:55 +0100 Subject: [I18n-sig] string encoding attribute (Strawman Proposal: Binary Strings) References: <3A830091.3D855EDD@ActiveState.com> <3A85568A.5B694917@lemburg.com> Message-ID: <3A87C0C3.D27F6FF8@lemburg.com> Toby Dickenson wrote: > > On Sat, 10 Feb 2001 15:56:10 +0100, "M.-A. 
Lemburg" > wrote: > > >Note that changing e.g. .encode('latin-1') to return a binary string > >doesn't really make sense, since here we know the encoding ! Instead, > >strings should probably carry along the encoding information in an > >additional attribute (it is not always useful, but can help in > >a few situations) provided that it is known. > > To what use would that encoding attribute be put? The lack of encoding information is the cause of all the problems related to coercing 8-bit strings to Unicode. If we had this information on a per-string basis, then we could do a *much* better job. > surely not to > provide automatic encoding when these tagged strings interact with > unicode strings (Thats back towards the solution that I think we > already ruled out) Depends on who "we" is ;-) I believe that we should reconsider the idea on different grounds. Back when this was discussed on python-dev, the main argument against adding such an attribute was that the its value would be coerced to 'binary' much too fast to be of any value. That was certainly true at the time, but the current ideas tossed around on this list suggest that we are moving towards a clearer distinction between binary and text data. In the current context, the attribute could well be used to avoid using magic when it comes to guessing the encoding of 8-bit strings at coercion time. > If .encode('latin1') or .encode('utf8') are going to return anything > tagged with an encoding, then surely it should be a tagged binary > string? No. The encoding attribute would then return 'latin-1' and 'utf-8' resp. -- that's the point of the attribute: it should store the encoding information in case it is available. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@reportlab.com Mon Feb 12 11:06:25 2001 From: andy@reportlab.com (Andy Robinson) Date: Mon, 12 Feb 2001 11:06:25 -0000 Subject: [I18n-sig] RE: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A87BD51.3088DACA@lemburg.com> Message-ID: > a-2-byte-encoding-space-was-too-small-the-day-unicode-was-conceived- > > and-20-bits-won't-last-either-ly y'rs - tim > > Has anyone ever considered the problems this causes for type > designers ? Who is going to do the job of designing 2^20 character > glyphs to all match the same font design guidelines ? Perhaps > I'm missing something here, but this sounds like Just is going > to have a bright future ;-) > Work has been going on on this glyph set for many years. And the font vendors for Japan can charge VERY high prices. Needless to say they are not big fans of Open Source. - Andy From walter@livinglogic.de Mon Feb 12 11:27:16 2001 From: walter@livinglogic.de (=?us-ascii?Q?=22Walter_D=F6rwald=22?=) Date: Mon, 12 Feb 2001 12:27:16 +0100 Subject: [I18n-sig] Random thoughts on Unicode and Python In-Reply-To: <3A85B8F9.1F494BF8@lemburg.com> References: <14981.45051.945099.633730@cymru.basistech.com> <3A85B8F9.1F494BF8@lemburg.com> Message-ID: <200102121227160015.004DA31F@mail.livinglogic.de> On 10.02.01 at 22:56 M.-A. Lemburg wrote: > [...] > We are trying to tell people that storing text data is better > done in Unicode than in a raw data buffer like Python's current > string data type. 
It's not enough to tell people, you actually have to make sure that storing unicode text data is better and more convenient than plain old strings; this means that Unicode text must be usable in: open(u"foo.txt") urllib.open(u"foo.txt") s = eval(u"\"\\u3042\"") exec(u"s = \"\\u3042\"") os.stat(u"foo.txt") os.system(u"foo -x \u3042") os.popen2(u"foo -x \u3042",u"r") and thousands of others. I think that the first step should be to make Unicode usable everywhere. As a first step this can be done by converting to the default encoding internally (as e.g. eval and exec do now). There may be OS services (e.g. file i/o) that are not Unicode aware. For these services converting to the default encoding is all that can be done, but when the OS supports Unicode, it should be used (for example Unicode filenames on NT/2000). The next step should be to switch to Unicode internally, i.e. use Unicode for Python variable names, module names, source code, etc. > [...] Just my $0.02! Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From tdickenson@geminidataloggers.com Mon Feb 12 13:28:29 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Mon, 12 Feb 2001 13:28:29 -0000 Subject: [I18n-sig] string encoding attribute (Strawman Proposal: Binary Strings) Message-ID: <9FC702711D39D3118D4900902778ADC83244AA@JUPITER> > > If .encode('latin1') or .encode('utf8') are going to return anything > > tagged with an encoding, then surely it should be a tagged binary > > string? > > No. The encoding attribute would then return 'latin-1' and 'utf-8' > resp. -- that's the point of the attribute: it should store the > encoding information in case it is available. I think you misread. I said.... "A tagged binary string" not "Tagged as a binary string" In other words, at the moment I don't see much distinction between a binary string, and a text string tagged with an encoding. Indeed the only distinction is the presence of a tag. Is that sufficient distinction to make them different types? Why can't I tag a binary string to say it contains a jpeg? From barry@digicool.com Mon Feb 12 14:14:23 2001 From: barry@digicool.com (Barry A. Warsaw) Date: Mon, 12 Feb 2001 09:14:23 -0500 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> <3A860AA3.655F4207@ActiveState.com> <14982.41461.889514.547839@anthem.wooz.org> <3A8708F2.669B0A2C@ActiveState.com> Message-ID: <14983.61375.407675.822695@anthem.wooz.org> >>>>> "PP" == Paul Prescod writes: PP> file = open("/etc/passwd", "r", "ASCII") PP> Surely that is not such a terrible burden in the interests of PP> making the world a little bit less xenophobic!
Once you do PP> that, everything else "just works" and when your program PP> encounters data it can't handle in a text file it will crash PP> in a predictable way at a logical point (the read function) PP> instead of in an unpredictable way at an illogical point (some PP> random string coercion or API call). Requiring the encoding imposes too much burden on the newbie learning the language, IMHO. It seems obvious that if you're going to open something, you've got to specify what your opening (i.e. open() makes no sense without the filename parameter). I think you can easily explain the difference between opening for reading and opening for writing, although the myriad other mode options are pushing it (e.g. the difference b/w r+, w+, and a+ are quite subtle and not described sufficiently). Now to require the encoding either forces you to ask the user to trust you ("most of you will just want `ascii' for the encoding parameter, don't worry about what that means"), or to go into /some/ explanation of what encodings are, what the possible legal values are, what the difference between "ascii" and "raw" are and when you want to use them, what can happen if you misspell an encoding, how to guess the encoding of the file you're about to open, What can happen if you guess incorrectly, etc. etc. If you care about file encodings, you've got to learn all that anyway. Fine, but that's a heavy burden to place on a new convert. I'm convinced Guido felt that open() would be used very early on in a newbie's experience and wanted to make it as simple as possible. That's why it's a built-in. Other messages in this thread seem to agree that /if/ open() were to grow an encoding argument, it should be optional. -Barry From tree@basistech.com Mon Feb 12 14:38:46 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 12 Feb 2001 09:38:46 -0500 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) In-Reply-To: <14983.61375.407675.822695@anthem.wooz.org> References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> <3A860AA3.655F4207@ActiveState.com> <14982.41461.889514.547839@anthem.wooz.org> <3A8708F2.669B0A2C@ActiveState.com> <14983.61375.407675.822695@anthem.wooz.org> Message-ID: <14983.62838.503700.118150@cymru.basistech.com> barry@digicool.com writes: [...] > Other messages in this thread seem to agree that /if/ open() were to > grow an encoding argument, it should be optional. What if it were possible to specify the "default" encoding at configure time, while keeping the argument to open() optional. Ruby does this, as does MySQL, so there *is* precedent. -tree -- Tom Emerson Basis Technology Corp. 
Stringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From barry@digicool.com Mon Feb 12 14:57:20 2001 From: barry@digicool.com (Barry A. Warsaw) Date: Mon, 12 Feb 2001 09:57:20 -0500 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> <3A860AA3.655F4207@ActiveState.com> <14982.41461.889514.547839@anthem.wooz.org> <3A8708F2.669B0A2C@ActiveState.com> <14983.61375.407675.822695@anthem.wooz.org> <14983.62838.503700.118150@cymru.basistech.com> Message-ID: <14983.63952.207294.647978@anthem.wooz.org> >>>>> "TE" == Tom Emerson writes: TE> What if it were possible to specify the "default" encoding at TE> configure time, while keeping the argument to open() TE> optional. Ruby does this, as does MySQL, so there *is* TE> precedent. That's a little scary because then Python programs may cease to be portable. Moderately better would be an API to select the default encoding at runtime, but that's still worrisome. -Barry From mal@lemburg.com Mon Feb 12 16:15:21 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 12 Feb 2001 17:15:21 +0100 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> <3A860AA3.655F4207@ActiveState.com> <14982.41461.889514.547839@anthem.wooz.org> <3A8708F2.669B0A2C@ActiveState.com> <14983.61375.407675.822695@anthem.wooz.org> <14983.62838.503700.118150@cymru.basistech.com> <14983.63952.207294.647978@anthem.wooz.org> Message-ID: <3A880C19.7DD8FB80@lemburg.com> "Barry A. Warsaw" wrote: > > >>>>> "TE" == Tom Emerson writes: > > TE> What if it were possible to specify the "default" encoding at > TE> configure time, while keeping the argument to open() > TE> optional. Ruby does this, as does MySQL, so there *is* > TE> precedent. > > That's a little scary because then Python programs may cease to be > portable. Moderately better would be an API to select the default > encoding at runtime, but that's still worrisome. A default value for encoding wouldn't work, since not all files you open are text files. 
The only reasonable default for the encoding parameter is 'binary'. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tim.one@home.com Mon Feb 12 20:38:00 2001 From: tim.one@home.com (Tim Peters) Date: Mon, 12 Feb 2001 15:38:00 -0500 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) In-Reply-To: <14983.61375.407675.822695@anthem.wooz.org> Message-ID: [Barry A. Warsaw] > ... > Requiring the encoding imposes too much burden on the newbie learning > the language, IMHO. Indeed it does, no matter how strongly an advocate may believe users "should" be aware of i18n issues. By the same token, you could get yourself into a world of trouble by coding x = float(y) + z unless you're careful to first specify the hardware rounding mode, values for the 5 IEEE-754 exception masks, and the HW precision control setting if you're running on a Pentium (also HW range control if running on Itanium). And, someday, Python will probably grow ways to specify all that stuff. If, at that time, I suggest everyone *must* specify them before doing any fp arithmetic, I hope someone has the good taste to just shoot me . BTW, Python should drop C's text mode, because it's feeble and ill-defined across platforms. just-thought-i'd-liven-it-up-ly y'rs - tim From paulp@ActiveState.com Mon Feb 12 21:26:24 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Mon, 12 Feb 2001 13:26:24 -0800 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: Message-ID: <3A885500.68F9D984@ActiveState.com> Tim Peters wrote: > > [Barry A. Warsaw] > > ... > > Requiring the encoding imposes too much burden on the newbie learning > > the language, IMHO. > > Indeed it does, no matter how strongly an advocate may believe users > "should" be aware of i18n issues. It has nothing to do with awareness of il8n issues. The fundamental question is whether you expect to get text back from a read() or binary. If you open with ASCII you get text coercions, text conversions and other text semantics. If you open with binary you don't. I do not think it too much to ask for people to know the difference between text and binary data! Paul Prescod From tim.one@home.com Mon Feb 12 22:52:14 2001 From: tim.one@home.com (Tim Peters) Date: Mon, 12 Feb 2001 17:52:14 -0500 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) In-Reply-To: <3A885500.68F9D984@ActiveState.com> Message-ID: [Paul Prescod] > It has nothing to do with awareness of il8n issues. The fundamental > question is whether you expect to get text back from a read() or binary. C already addresses that distinction ("r" vs "rb" open modes). From paulp@ActiveState.com Tue Feb 13 00:17:11 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Mon, 12 Feb 2001 16:17:11 -0800 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: Message-ID: <3A887D07.220D5904@ActiveState.com> Tim Peters wrote: > > [Paul Prescod] > > It has nothing to do with awareness of il8n issues. The fundamental > > question is whether you expect to get text back from a read() or binary. > > C already addresses that distinction ("r" vs "rb" open modes). Python is documented as only using the distinction to handle line ends. We want to create totally different object types based on the flag. 
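(To make the shape of the disagreement concrete, here is a sketch of the behaviour being proposed in this thread -- the three-argument open() is the proposal, not current Python, and the file names are invented:)

    f = open('notes.txt', 'r', 'ascii')   # hypothetical encoding argument
    text = f.read()                       # would return a character (Unicode) string

    g = open('photo.jpg', 'rb')           # no encoding given: raw bytes
    data = g.read()                       # would return a byte string, as today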
Paul Prescod From tim.one@home.com Tue Feb 13 01:45:30 2001 From: tim.one@home.com (Tim Peters) Date: Mon, 12 Feb 2001 20:45:30 -0500 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) In-Reply-To: <3A887D07.220D5904@ActiveState.com> Message-ID: [Paul Prescod] > It has nothing to do with awareness of il8n issues. The fundamental > question is whether you expect to get text back from a read() > or binary. [Tim] > C already addresses that distinction ("r" vs "rb" open modes). [Paul] > Python is documented as only using the distinction to handle line > ends. Where? Not in the open() docs. They're uselessly vague about the differences between 'r' and 'rb' (and don't mention line ends at all -- you're hallucinating that), because C is too and Python's semantics *are* C's here. Nevertheless, "When opening a binary file, you should append 'b' to the mode value for improved portability. (It's useful even on systems which don't treat binary and text files differently, where it serves as documentation.)". True enough, and good enough for newbies. Although, as I said before, I think Python should drop C's notion of text mode in favor of its own (because C's notion is wildly platform-dependent). > We want to create totally different object types based on the flag. Which flag? "b"? Fine by me -- but that's what I said at the start. From tim.one@home.com Tue Feb 13 04:05:24 2001 From: tim.one@home.com (Tim Peters) Date: Mon, 12 Feb 2001 23:05:24 -0500 Subject: [I18n-sig] RE: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A87BD51.3088DACA@lemburg.com> Message-ID: [Tim] >> a-2-byte-encoding-space-was-too-small-the-day-unicode-was-conceived- >> and-20-bits-won't-last-either-ly y'rs - tim [MAL] > Has anyone ever considered the problems this causes for type > designers ? LOL! I'm picturing Guido going back a few thousand years in his time machine, to civilization after civilization on the verge of literacy, asking "Haven't you foolish people ever considered the problems this silly picture-writing will cause for type designers someday? Now grow up and use 7-bit ASCII." . The same inconsiderate bastards made computer speech recognition a lot harder than it could have been, too. Not to mention computerized inter-language translation, and whether or not it's polite or a mortal offense to point with your foot. > Who is going to do the job of designing 2^20 character glyphs > to all match the same font design guidelines ? No problem -- at Earth's current population, we can assign about 5,000 people to work on each glyph . size-is-a-relative-thing-ly y'rs - tim From mal@lemburg.com Tue Feb 13 08:01:48 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 13 Feb 2001 09:01:48 +0100 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A887D07.220D5904@ActiveState.com> Message-ID: <3A88E9EC.EADC6A3A@lemburg.com> Paul Prescod wrote: > > Tim Peters wrote: > > > > [Paul Prescod] > > > It has nothing to do with awareness of il8n issues. The fundamental > > > question is whether you expect to get text back from a read() or binary. > > > > C already addresses that distinction ("r" vs "rb" open modes). > > Python is documented as only using the distinction to handle line ends. > We want to create totally different object types based on the flag. Two things: 1. the difference between "r" and "rb" only exists on some non-Unix platforms (e.g. Windows) 2. 
the codec decides which type of object to return for .read() -- this has nothing to do with the file mode, but instead is dependent on the encoding used, e.g. encoding='binary' would return a binary string, encoding='ascii' results in Unicode and encoding='pil-image' could produce a PIL image object... Paul, you ought to write up a PEP about this subject discussing all the different issues with adding more optional parameters (encoding and errors, possibly more) to open(). It should also include a discussion about the implications using an encoding would have w/r to the applications relying on getting a real file object from the builtin open(). -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Thu Feb 15 18:49:22 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 15 Feb 2001 19:49:22 +0100 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model In-Reply-To: <013401c09416$881b0f40$e46940d5@hagrid> (fredrik@effbot.org) References: <013401c09416$881b0f40$e46940d5@hagrid> Message-ID: <200102151849.f1FInMt02227@mira.informatik.hu-berlin.de> > > However, matter-of-factually, you propose that ISO-8859-1 is the > > default encoding, as this is the encoding that is used when converting > > character strings to char* in the C API. I'd certainly call it a > > default. > > It's not an encoding. It's the subset of Unicode that you can store > in an 8-bit character. No, it is not *the* subset of Unicode that you can store in an 8-bit character. You can store any subset of Unicode with a cardinality <256 in a single octet. Latin-1 is group 0, plane 0, row 0. Why is it any better than any other plane or row? Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Feb 15 18:39:44 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 15 Feb 2001 19:39:44 +0100 Subject: [I18n-sig] Random thoughts on Unicode and Python In-Reply-To: References: Message-ID: <200102151839.f1FIdi002179@mira.informatik.hu-berlin.de> > That's my concern, and the thing I want to poll people on. > If Python "just works" for these users, and if we already offer > Unicode strings and a good codec library for people to use when they > want to, is there really a need to go further? My simple answer is: no, not at the moment. I can surely think of things that ought to work with the Unicode type and which currently don't, but most of them are a matter of fixing libraries. Regards, Martin From barry@scottb.demon.co.uk Sun Feb 18 13:01:06 2001 From: barry@scottb.demon.co.uk (Barry Scott) Date: Sun, 18 Feb 2001 13:01:06 -0000 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: Message-ID: <001001c099aa$daebf240$060210ac@private> > Here's a thought. How about BinaryFile/BinarySocket/ByteArray which > do Files and sockets often contain a both string and binary data. Having StringFile and BinaryFile seems the wrong split. I'd think being able to write string and binary data is more useful for example having methods on file and socket like file.writetext, file.writebinary. NOw I can use the writetext to write the HTTP headers and writebinary to write the JPEG image say. 
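(A minimal sketch of how such methods could be layered over today's file objects; the names writetext/writebinary follow the message above, and the encoding choice and file names are assumptions:)

    class MixedModeFile:
        # wrapper sketch: explicit text vs. binary writes on one stream
        def __init__(self, fileobj, encoding='ascii'):
            self.fileobj = fileobj
            self.encoding = encoding
        def writetext(self, text):
            # accept a unicode (or plain) string and encode it explicitly
            self.fileobj.write(unicode(text).encode(self.encoding))
        def writebinary(self, data):
            # raw bytes pass through untouched
            self.fileobj.write(data)

    out = MixedModeFile(open('reply.http', 'wb'), 'latin-1')
    out.writetext(u'Content-Type: image/jpeg\r\n\r\n')
    out.writebinary(open('photo.jpg', 'rb').read())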
Barry From brian@tomigaya.shibuya.tokyo.jp Tue Feb 20 10:16:09 2001 From: brian@tomigaya.shibuya.tokyo.jp (Brian Takashi Hooper) Date: Tue, 20 Feb 2001 19:16:09 +0900 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (2 of 4) Message-ID: <20010220190538.F94A.BRIAN@tomigaya.shibuya.tokyo.jp> Here's the second message, from Tamito Kajiyama, contributor of the SJIS and EUC-JP codecs: ---- On Sun, 11 Feb 2001 20:18:51 +0900 Brian Takashi Hooper wrote: > Hi there, > > What does everyone think of the Proposed Character Model? I was also one of the people that Andy asked to contribute an opinion, so after reviewing the thread, here's what I have to say: I understand Paul's Pre-PEP as raising the following three points: 1. Deprecate the usage of the present string type as containing a sequence of bytes, and instead interpret string literals as containing Unicode characters. (Unify the present character strings and Unicode strings.) 2. Introduce a new data type (byte strings) for expressing an uninterpreted byte sequence. 3. Add a convention for specifying the encoding of a source file. In Python 2.0, there are separate data types for non-Unicode strings and Unicode character strings. The proposals 1. and 2. are essentially to replace these data types with the (Unicode) character sequence and byte sequence data types. Personally, I am opposed to the proposals 1. and 2. for the following two reasons: (1) The string types in Python 2.0 and the new string types proposed in the pre-PEP have a relationship something like this:

    Python 2.0                             Pre-PEP
    string ""            (byte sequence)   byte string b""
    Unicode string u""                     string ""  (Unicode character sequence)

In general, the before- and after-PEP Pythons above have essentially no difference in expressiveness, and therefore it's hard to see what merit there might be in swapping the data types. On the other hand, I believe that swapping byte sequence and character sequence data types as described above has several serious demerits for Japanese Python developers. Japanese programmers have a regular need to handle legacy encodings such as EUC-JP and Shift JIS in their programs. Regular conversion back-and-forth between Unicode and legacy encodings introduces a significant cost in terms of resource usage and performance. Moreover, there is the problem of incompatibilities between different Unicode conversion tables. Furthermore, Japanese programmers are accustomed to dealing with Japanese strings as byte sequences. Japanese users have a real motivation to manipulate Japanese character strings as sequences of bytes. Regardless of whether Unicode is supported or not, the byte sequence data type is necessary in order to represent Japanese characters. The present implementation of strings in Python, where a string represents a sequence of bytes, is one feature that makes Python easy for Japanese developers to use. Changing strings to contain Unicode character data would impose a heavy burden for development and maintenance on Japanese Python programmers. Therefore, I'm against swapping byte string and character (Unicode) string types. (2) It is not always possible to unambiguously interpret string literals as Unicode character data. As you know, in Japanese-encoded byte strings, 2 bytes often represent 1 character. Therefore, the position of characters is expressed in terms of bytes, not characters. Because of this, if a Japanese-encoded byte string is interpreted as-is as a Unicode character string, indexes into the string would no longer be interpreted the same way.
For example, in the below code snippet the substring is output differently depending on whether the string literal is interpreted as a byte sequence or Unicode character sequence: s = "これは日本語の文字列です。" print s[6:12] Hard coding of slices as with the above is a common practice, I believe. Paul has asserted that no serious problems will occur if existing byte sequences are interpreted as Unicode, but I disagree with him on this. Due to the above two reasons, I cannot agree with the pre-PEP's first two proposals (1. and 2.). However, I believe the 3rd proposal to explicitly specify source file encoding is a necessary improvement, leaving aside for the moment the question of implementation. In Python 2.0, if a program is written containing Japanese strings in Shift-JIS, Python may raise parser errors. As many of you may know, in Shift-JIS encoded strings the second byte of some Japanese characters may be a backslash (ASCII 0x5c), and this conflicts with the backslash escaping in the string literal. As far as I know, this is also the case with the Chinese encoding Big 5. One way to solve this problem is to apply Ishimoto-san's Shift-JIS patch [1] to Python, but I feel that a more desirable solution is to allow Python itself to handle files with different source encodings. However, the intent of Paul's 3rd suggestion seems directed at solving a different problem than that of allowing specification of an encoding for byte strings. On the other hand, Marc-Andre's proposal [2] is to use the source file encoding only for the decoding of non-Unicode characters in character strings, without touching the contents of byte strings. While I prefer Marc-Andre's proposal since it seems to be a straightforward extension of Python 2.0's current Unicode support, it doesn't address the aforementioned problem with the usage of Shift-JIS and Big 5 in Python programs. Concerning this point, I think there is a need to start another discussion aside from Paul's pre-PEP. [1] http://www.gembook.org/python/ http://www.gembook.org/python/python20-sjis-20001202.zip [2] http://mail.python.org/pipermail/i18n-sig/2001-February/000756.html ---------------------------------------------------------------------- -- KAJIYAMA, Tamito From brian@tomigaya.shibuya.tokyo.jp Tue Feb 20 10:16:07 2001 From: brian@tomigaya.shibuya.tokyo.jp (Brian Takashi Hooper) Date: Tue, 20 Feb 2001 19:16:07 +0900 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (1 of 4) Message-ID: <20010220185630.F948.BRIAN@tomigaya.shibuya.tokyo.jp> Hi there, this is Brian Hooper in Tokyo. The proposed character model thread seems to have simmered down so I don't know how interested people will be in this, but I gathered a few comments about the Pre-PEP from the Japanese Python mailing list, and translated the responses - I think there were some very good points brought up, and I'd like to add the messages I received (with the permission of their authors) to the discussion. I've got four messages to post; I'm not such a fast translator so I'll post the two I have now, and the other two as I finish them. Here is Atsuo Ishimoto's post - Ishimoto-san wrote and contributed the CP 932 codec. --- On Sun, 11 Feb 2001 20:18:51 +0900 Brian Takashi Hooper wrote: > Hi there, > > What does everyone think of the Proposed Character Model? I'm opposed to it in its present form. Putting aside for the moment any criticisms of Unicode itself, building extension modules for Python would become more difficult and problematic (as Suzuki also pointed out). 
For example, given: PyObject *simple(PyObject *o, PyObject *args) { char *filename; if (!PyArg_ParseTuple(args, "s", filename)) return NULL; File *f = fopen(filename, "w"); if (!f) return NULL; fprintf(f, "spam"); fclose(f); Py_INCREF(Py_None); return Py_None; } from Python you can write: sample.simple("日本語ファイル名") and it will work as is in almost any platform and language environment. It works because in the present implementation of CPython, the input data string is treated as simply data by the extension module, which simply passes it along to the underlying OS or library without interpreting the content of the data. However, consider the same extension module in the case where all character sequences are handled by Python internally as Unicode. PyArg_ParseTuple() has no way of automatically knowing how to change Unicode characters with an ordinal value greater than 0xff into the encoding currently supported on the platform. In this case, sample.simple("日本語ファイル名") becomes an error. At present, most of Python's extension modules can be used without having to explicitly add CJK support - however if this PEP is implemented then most of these modules will become unusable in their present form. So, is there any solution for this? Well, we could take care when writing our Python scripts only to use strings in such a way that PyArg_ParseTuple() does not cause an error. There are two ways to do this: a. Use byte strings Instead of using a character string, we could call our function as sample.simple(b"日本語ファイル名") and everything then works fine. However, if we always have to use byte strings when interacting with extension libraries, then we haven't really achieved any real improvement in terms of internationalization, and there's not much point to implementing the PEP in that case... b. We could use an 8-bit character encoding such as ISO-8859. Suppose we use ISO-8859-1 instead of Shift-JIS or EUC-JP when creating the character string. Since the value of ord() for each character in the string is always <= 255, PyArg_ParseTuple() will have no problem with it, but in having to treat legacy encoded data as a different encoding, we haven't really made it easier to write programs which handle CJK data, or improved the situation for i18n either. It could be argued that Unicode strings could be used everywhere else, and either a. or b. above only when calling legacy code through extension modules like with simple() above. However, in the above case, it becomes necessary for the programmer to be aware of whether the function they are calling is implemented in legacy C code or not, which isn't really an improvement on the current state of things. Moreover, because in converting to Unicode we lose information about the original string encoding, automatically converting back to the original string encoding (for example in order to make the distinction between Unicode supported and non-supported libraries -B) becomes impossible. Use of a default encoding is discouraged in the PEP, but this is one example of why it may be necessary. So, returning to the extension module example above, we've seen that managing the problem on the Python script side is difficult. Another approach might be to change our extension module to support Unicode: PyObject *simple(PyObject *o, PyObject *args) { Py_UNICODE *filename; if (!PyArg_ParseTuple(args, "u", filename)) return NULL; File *f = ... 
:-P If the platform being used has a version of fopen() which has Unicode support, then there's no problem, but if not, then it's necessary to first convert the Unicode string to an encoding which _is_ supported on the platform: PyObject *simple(PyObject *o, PyObject *args) { Py_UNICODE *filename; char native_filename[MAX_FILE]; if (!PyArg_ParseTuple(args, "u", filename)) return NULL; #IF SJIS /* SJISに変換 */ #ELSE /* EUCに変換 */ #ENDIF FILE *f = fopen(....) I don't think anyone really wants to write code like this. Besides adding complexity, it is also hard to ignore the additional processing cost added by having to convert incoming Unicode arguments. Furthermore, adding this kind of support isn't likely to be provided by European or American programmers, since the coincidence of the ISO-8859-1 with the <= 255 range of Unicode makes such explicit support unnecessary for applications which only use Latin-1 or ASCII. (So: Non-American/European programmers will have to add support for libraries they want to use) One of Python's strong points is that it makes it easy to wrap and use existing C libraries - however, the great majority of these C libraries are still not Unicode compliant. In that case, it becomes necessary to add Unicode->native encoding support for all such C modules one-by-one, as described above. It's difficult to see what would be good about that. Some might react to the above by insisting, "These are just transitional problems which will soon be solved. If we restrict things to just a few main platforms, then it won't become a big problem." This position is, however, flawed. For example, in Windows 95, to say nothing of UNIX-based OS's, Unicode support is only partial, and there is no Unicode version of fopen(). Considering the huge number of non-Unicode supported systems currently in use around the world, we cannot ignore the importance of continuing to support them. In conclusion, supposing that Python strings are made to hold only character data as proposed in the pre-PEP, use of extension modules from non-European languages becomes much more difficult, and explicit encoding support has to be added in many cases. Python's current string implementation has important implications for its use as a glue language in non-internationalized environments. -Atsuo Ishimoto The Japanese (original) version of this opinion is available at http://www.gembook.org/moin/moin.cgi/OpinionForPepPythonCharacterModel Comments / feedback appreciated. P.S. I wonder what Tcl does with this? From tdickenson@geminidataloggers.com Tue Feb 20 14:22:43 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Tue, 20 Feb 2001 14:22:43 +0000 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (1 of 4) In-Reply-To: <20010220185630.F948.BRIAN@tomigaya.shibuya.tokyo.jp> References: <20010220185630.F948.BRIAN@tomigaya.shibuya.tokyo.jp> Message-ID: <44t49tk2ea60f47uccfo8cll3a53lgtjmc@4ax.com> On Tue, 20 Feb 2001 19:16:07 +0900, Brian Takashi Hooper wrote: >Hi there, this is Brian Hooper in Tokyo. >The proposed character model thread seems to have simmered down so I >don't know how interested people will be in this, but I gathered a few >comments about the Pre-PEP from the Japanese Python mailing list, and >translated the responses - I think there were some very good points >brought up, and I'd like to add the messages I received (with the >permission of their authors) to the discussion. Thank you for this effort.
>For example, given: > >PyObject *simple(PyObject *o, PyObject *args) >{ > char *filename; > if (!PyArg_ParseTuple(args, "s", filename)) > return NULL; > File *f = fopen(filename, "w"); > if (!f) > return NULL; > fprintf(f, "spam"); > fclose(f); > Py_INCREF(Py_None); > return Py_None; >} > >from Python you can write: > >sample.simple("????????") > >and it will work as is in almost any platform and language environment. If those ??? are anything other than ASCII characters, then it doesn't work *predictably* today. (assuming the requirement that the file name is correct when viewed using the platform's native file browser) >Well, we could take care when writing our Python scripts only to use strings >in such a way that PyArg_ParseTuple() does not cause an error. Sticking with the fopen example, I had assumed it is desirable to get an error if a script tries to create a file whose name contains Japanese characters, on a filesystem that does not support that. >Use byte strings > >Instead of using a character string, we could call our function as > >sample.simple(b"????????") > >and everything then works fine. However, if we always have to use byte >strings when interacting with extension libraries, then we haven't really >achieved any real improvement in terms of internationalization, and there's >not much point to implementing the PEP in that case... If this is a legacy extension library then a byte string is all it expects. You could call this function as sample.simple(u"????????".encode('encoding_expected_by_sample_dot_simple')) I agree we need to provide a simpler interface to new extensions. >PyObject *simple(PyObject *o, PyObject *args) >{ > Py_UNICODE *filename; > char native_filename[MAX_FILE]; > > if (!PyArg_ParseTuple(args, "u", filename)) > return NULL; > >#IF SJIS > /* SJIS??? */ >#ELSE > /* EUC??? */ >#ENDIF > > FILE *f = fopen(....) > >I don't think anyone really wants to write code like this. I think those ifdefs could be replaced by one call to PyUnicode_Encode >Furthermore, adding this kind of support isn't likely to be provided by >European or American programmers, since the coincidence of the ISO-8859-1 >with the <= 255 range of Unicode makes such explicit support unnecessary >for applications which only use Latin-1 or ASCII. (So: Non-American/ >European programmers will have to add support for libraries they want to >use) As a European native-English speaker, I don't think this is true so long as we preserve the ASCII default encoding. An application that stores latin-1 data in a mix of unicode and plain strings will quickly trigger an exception (as soon as a unicode string mixes with a plain string containing a non-ASCII byte). A useful counterexample may be Mark Hammond's extensions for supporting win32 and com. They have always included explicit support for automatic encoding of unicode parameters on platforms where win32 uses 8-bit strings, and automatic decoding of plain strings when used with COM, which is always unicode. Toby Dickenson tdickenson@geminidataloggers.com From ishimoto@gembook.org Tue Feb 20 17:35:23 2001 From: ishimoto@gembook.org (Atsuo Ishimoto) Date: Wed, 21 Feb 2001 02:35:23 +0900 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (1 of 4) In-Reply-To: <44t49tk2ea60f47uccfo8cll3a53lgtjmc@4ax.com> References: <20010220185630.F948.BRIAN@tomigaya.shibuya.tokyo.jp> <44t49tk2ea60f47uccfo8cll3a53lgtjmc@4ax.com> Message-ID: <20010221023442.EE11.ISHIMOTO@gembook.org> Brian, Thanks for your effort to translate our comment.
On Tue, 20 Feb 2001 14:22:43 +0000 Toby Dickenson wrote: > > If those ??? are anything other than ASCII characters, then it doesnt > work *predictably* today. (assuming the requirement that the file name > is correct when viewed using the platforms native file browser) > If the filename is illegal for the platform, fopen() may returns error. Why should we check whether filename is valid or not? Current python doesn't check if filename contains illegal letters, such as ':' on Win32. This is because platform knows their job and character set. We don't have to bother them to work. > >Well, we could take care when writing our Python scripts only to use strings > >in such a way that PyArg_ParseTuple() does not cause an error. > > Sticking with the fopen example; I had assumed it is desirable to get > an error if a script tries to create a file whose name contains > japanse characters, on a filesystem that does not support that. > You can get an error from platform-depend fopen(). Python or extension module don't have to check this. > If this is a legacy extension library then a byte string is all it > expects. You could call this function as > > sample.simple(u"????????".encode('encoding_expected_by_sample_dot_simple')) > > I agree we need to provide a simpler interface to new extensions. I don't believe this make people happy, even if interface is simplified. It is hard work to remember given function is Python script, legacy extension or Unicode-aware extension. > >#IF SJIS > > /* SJIS??? */ > >#ELSE > > /* EUC??? */ > >#ENDIF > > > > FILE *f = fopen(....) > > > >I don't think anyone really wants to write code like this. > > I think those ifdefs could be replaced by one call to PyUnicode_Encode May be. But to encode, you need to know the possible character set of incoming Unicode string and it's encoding, and specify them explicitly. Platform depended default encoding may eliminate hard coded encoding name, but I'm afraid of performance penalty for really long strings. > > As a European native-English speaker, I dont think this is true so > long as we preserve the ASCII default encoding. An application that > stores latin-1 data in a mix of unicode and plain strings will quickly > trigger an exception (as soon as a unicode string mixes with a plain > string containing a non-ASCII byte). > This means a lot of existing extension modules should be updated. It is hard for me to believe this is good idea. > A useful counterexample may be Mark Hammond's extensions for > supporting win32 and com. They have always included explicit support > for automatic encoding of unicode parameters on platforms where win32 > uses 8-bit strings, and automatic decoding of plain strings when used > with COM, which is always unicode. win32com works fine because COM is the Unicode world. But Python should live in the Unicode hostile land, I believe. Wishing you can read my English.... -------------------------- Atsuo Ishimoto ishimoto@gembook.org Homepage:http://www.gembook.org From guido@digicool.com Tue Feb 20 19:36:35 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 20 Feb 2001 14:36:35 -0500 Subject: [I18n-sig] How does Python Unicode treat surrogates? Message-ID: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> On the XML sig the following exchange happened. I don't know enough about the issues to investigate, but I'm sure that someone here can provide insight? It seems to boil down to whether or not surrogates may get transposed when between platforms. 
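(For reference, the surrogate pairing at issue is fixed arithmetic, and UTF-16 specifies the order -- high surrogate first -- independently of platform byte order; a small sketch with an arbitrary example code point:)

    c = 0x10330                            # arbitrary code point outside the BMP
    hi = 0xD800 + ((c - 0x10000) >> 10)    # high (leading) surrogate
    lo = 0xDC00 + ((c - 0x10000) & 0x3FF)  # low (trailing) surrogate
    print hex(hi), hex(lo)                 # 0xd800 0xdf30 -- the order is part of the encoding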
--Guido van Rossum (home page: http://www.python.org/~guido/) ------- Forwarded Message Date: Tue, 20 Feb 2001 11:54:34 -0700 From: Uche Ogbuji To: Guido van Rossum cc: Lars Marius Garshol , xml-sig@python.org Subject: Re: [XML-SIG] DC DOM tests (Was: Roadmap document - finally!) > > > > - DOMString and text manipulating interface methods are not > > > > tested beyond ASCII text due to an implementation limitation > > > > of ParsedXML.DOM. So, implementations will not be tested if > > > > text is correctly treated when multi-byte UTF-16 characters > > > > are involved. > > > > > > By "multi-byte UTF-16 characters" I assume you mean Unicode > > > characters outside the BMP that are represented using two > > > surrogates? > > > > I wonder if that's what Martijn means. I've read that most Java > > implementations have trouble with characters outside the BMP. I > > wonder if Python handles these properly. > > Depends on what you call properly. Can you elaborate on what you > would call proper treatment here? Sure. I admit it's hearsay, but I thought I'd read that because Java Unicode is or was underspecified, that there was the possibility of transposition of the high-surrogate with the low-surrogate character between Java implementations or platforms. Now I don't exactly write XML dissertations on "Hello Kitty" , so I'm not likely to run into this myself, but I was wondering whether Python handles surrogate blocks appropriately across platforms and implementations (I guess including cpyhton -> Jpython). -- Uche Ogbuji Principal Consultant uche.ogbuji@fourthought.com +1 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python ------- End of Forwarded Message From paulp@ActiveState.com Tue Feb 20 21:46:35 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 20 Feb 2001 13:46:35 -0800 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model References: <013401c09416$881b0f40$e46940d5@hagrid> <200102151849.f1FInMt02227@mira.informatik.hu-berlin.de> Message-ID: <3A92E5BB.38D4FB0B@ActiveState.com> "Martin v. Loewis" wrote: > > > ... > > > > It's not an encoding. It's the subset of Unicode that you can store > > in an 8-bit character. > > No, it is not *the* subset of Unicode that you can store in an 8-bit > character. You can store any subset of Unicode with a cardinality <256 > in a single octet. > > Latin-1 is group 0, plane 0, row 0. Why is it any better than any > other plane or row? I don't know. You tell me. >>> "a"==u"a"==chr(97) 1 It looks like we've already decided that group 0, plane 0, row 0 is special. A better question is why if the first half of group 0, plane 0, row 0 better than the last half? >>> unichr(160)==chr(160) Traceback (most recent call last): File "", line 1, in ? UnicodeError: ASCII decoding error: ordinal not in range(128) The Unicode guys made group 0, plane 0, row 0 Latin-1 for a reason. It's not just an accident. I don't think it makes sense for us to agree with them "halfway"...especially when this half-way agreement causes all kinds of nasty problems like forcing Python to raise exceptions in places that are really surprising like equality tests and sort functions. -- Vote for Your Favorite Python & Perl Programming Accomplishments in the first Active Awards! 
http://www.ActiveState.com/Awards From paulp@ActiveState.com Tue Feb 20 21:56:40 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 20 Feb 2001 13:56:40 -0800 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (2 of 4) References: <20010220190538.F94A.BRIAN@tomigaya.shibuya.tokyo.jp> Message-ID: <3A92E818.6FFACF04@ActiveState.com> Thanks for the translation Brian! That must have been a ton of work but it strikes me as very important work! > > ... > > Python 2.0 Pre-PEP > string "" (byte sequence) byte string b"" > Unicode string u"" (Unicode string "" > character sequence) > > In general, the before- and after-PEP Pythons above have essentially no > difference in expressiveness, and therefore it's hard to see what merit > there might be in swapping the data types. I think that there is an important issue here. Python is documented as having character strings. The minimal unit of a string is supposed to be a character. "Literal" strings are documented as being strings of characters. People expect this of a modern, high-level, user-centric language. Bytes are no more interesting to your average programmer than are DWORDs. We aren't going to start teaching people about bytes in introductory Python classes. More and more, people are going to find it bizarre to make a distinction between the 128 characters that happen to have lived in a quickly-becoming-obsolete American standard and the other 65,000 characters that we can use in word processors, web pages, search engines and so forth. You don't have to be Asian to see the distinction as arbitrary and historical. What if you want to insert a trademark (tm) or copyright (c) in your software? It is certainly too early for Python to abandon the one-byte centric view of the world. It is NOT too early to start putting into place a transition plan to the future world that we will all be forced to live in. Part of that transition is teaching people that literal strings may one day allow characters greater than 128 (perhaps directly, perhaps through an escape mechanism). > ... > Furthermore, Japanese programmers are accustomed to dealing with Japanese > strings as byte sequences. Japanese users have a real motivation to > manipulate Japanese character strings as sequences of bytes. Regardless > of whether Unicode is supported or not, the byte sequence data type is > necessary in order to represent Japanese characters. An explicit part of every proposal has been a continued support for rich, expressive byte-sequence manipulation. > The present implementation of strings in Python, where a string represents > a sequence of bytes, is one feature that makes Python easy for Japanese > developers to use. If Japanese programmers understand the difference between a byte and a character (which they must!), why would they be opposed to making that distinction explicit in code? > As you know, in Japanese-encoded byte strings, 2 bytes often represent > 1 character. Therefore, the position of characters is expressed in terms > of bytes, not characters. Because of this, if a Japanese-encoded byte > string is interpreted as-is as a Unicode character string, indexes into > the string would no longer be interpreted the same way. For example, in > the below code snippet the substring is output differently depending on > whether the string literal is interpreted as a byte sequence or Unicode > character sequence: > > s = "これは日本語の文字列です。" > print s[6:12] > > Hard coding of slices as with the above is a common practice, > I believe. 
Paul has asserted that no serious problems will occur if > existing byte sequences are interpreted as Unicode, but I disagree with > him on this. I still assert that the interpretation will not change. If you have no encoding declaration then the only rational choice is to treat each byte as a character. Therefore the indexes would work exactly as they do today. -- Vote for Your Favorite Python & Perl Programming Accomplishments in the first Active Awards! http://www.ActiveState.com/Awards From guido@digicool.com Tue Feb 20 21:54:25 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 20 Feb 2001 16:54:25 -0500 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model In-Reply-To: Your message of "Tue, 20 Feb 2001 13:46:35 PST." <3A92E5BB.38D4FB0B@ActiveState.com> References: <013401c09416$881b0f40$e46940d5@hagrid> <200102151849.f1FInMt02227@mira.informatik.hu-berlin.de> <3A92E5BB.38D4FB0B@ActiveState.com> Message-ID: <200102202154.QAA06554@cj20424-a.reston1.va.home.com> > "Martin v. Loewis" wrote: > > Latin-1 is group 0, plane 0, row 0. Why is it any better than any > > other plane or row? > > I don't know. You tell me. > > >>> "a"==u"a"==chr(97) > 1 > > It looks like we've already decided that group 0, plane 0, row 0 is > special. A better question is why if the first half of group 0, plane 0, > row 0 better than the last half? > > >>> unichr(160)==chr(160) > Traceback (most recent call last): > File "", line 1, in ? > UnicodeError: ASCII decoding error: ordinal not in range(128) > > The Unicode guys made group 0, plane 0, row 0 Latin-1 for a reason. It's > not just an accident. I don't think it makes sense for us to agree with > them "halfway"...especially when this half-way agreement causes all > kinds of nasty problems like forcing Python to raise exceptions in > places that are really surprising like equality tests and sort > functions. This has been hashed to death many times before. We have absolutely no guarantee that the files from which Python strings are read are encoded in Latin-1, but we do know pretty sure that they are an ASCII superset (if they represent characters at all). Using the locale module the user can (implicitly) indicate what the character set is, and this may not be Latin-1. Since s.islower() and other similar functions are locale-sensitive, it would be inconsistent to declare that 8-bit strings are always encoded in Latin-1. This is historical baggage that cannot easily be fixed without breaking lots of code handling character data using legacy encodings (and typically, such code is not served by a switch to Unicode). It's possible to change locales in mid-execution, but for various reasons it's bad to change the default encoding in mid-execution, so the best we can do is assume ASCII as the default encoding. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Tue Feb 20 22:02:01 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 20 Feb 2001 17:02:01 -0500 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (2 of 4) In-Reply-To: Your message of "Tue, 20 Feb 2001 13:56:40 PST." <3A92E818.6FFACF04@ActiveState.com> References: <20010220190538.F94A.BRIAN@tomigaya.shibuya.tokyo.jp> <3A92E818.6FFACF04@ActiveState.com> Message-ID: <200102202202.RAA06643@cj20424-a.reston1.va.home.com> > I think that there is an important issue here. Python is documented as > having character strings. The minimal unit of a string is supposed to be > a character. 
"Literal" strings are documented as being strings of > characters. Sorry, you're reading way too much into the words here. When I wrote that, in my brain there was absolutely no difference between characters and bytes, and in C the type name for a byte is 'char', so I wrote 'character' -- but I was thinking '8-bit quantity'. [starry-eyed romantic idealism skipped] > It is certainly too early for Python to abandon the one-byte centric > view of the world. It is NOT too early to start putting into place a > transition plan to the future world that we will all be forced to live > in. Part of that transition is teaching people that literal strings may > one day allow characters greater than 128 (perhaps directly, perhaps > through an escape mechanism). No objection here. > > ... > > Furthermore, Japanese programmers are accustomed to dealing with Japanese > > strings as byte sequences. Japanese users have a real motivation to > > manipulate Japanese character strings as sequences of bytes. Regardless > > of whether Unicode is supported or not, the byte sequence data type is > > necessary in order to represent Japanese characters. > > An explicit part of every proposal has been a continued support for > rich, expressive byte-sequence manipulation. > > > The present implementation of strings in Python, where a string represents > > a sequence of bytes, is one feature that makes Python easy for Japanese > > developers to use. > > If Japanese programmers understand the difference between a byte and a > character (which they must!), why would they be opposed to making that > distinction explicit in code? Maybe because, like me, they're thinking in historical terms where 'char' is just another word for byte? --Guido van Rossum (home page: http://www.python.org/~guido/) From martin@loewis.home.cs.tu-berlin.de Tue Feb 20 22:11:22 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 20 Feb 2001 23:11:22 +0100 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model In-Reply-To: <3A92E5BB.38D4FB0B@ActiveState.com> (message from Paul Prescod on Tue, 20 Feb 2001 13:46:35 -0800) References: <013401c09416$881b0f40$e46940d5@hagrid> <200102151849.f1FInMt02227@mira.informatik.hu-berlin.de> <3A92E5BB.38D4FB0B@ActiveState.com> Message-ID: <200102202211.f1KMBMl01756@mira.informatik.hu-berlin.de> > A better question is why if the first half of group 0, plane 0, > row 0 better than the last half? Well, because it is ASCII, and because ASCII is a subset of most encodings - so assuming that an octet string is meant as ASCII when compared to a Unicode object has a high probability of being a good guess. The same is not true if there are octets >128. > > >>> unichr(160)==chr(160) > Traceback (most recent call last): > File "", line 1, in ? > UnicodeError: ASCII decoding error: ordinal not in range(128) > > The Unicode guys made group 0, plane 0, row 0 Latin-1 for a reason. Sure: to allow easy conversion between Latin-1 documents and Unicode. > It's not just an accident. I don't think it makes sense for us to > agree with them "halfway"... We agree with them all the way. The codec that deals with Latin-1 is hard-coded in _codecs, whereas the other single-byte encodings require dictionaries for operation. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Feb 20 22:21:25 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Tue, 20 Feb 2001 23:21:25 +0100 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (2 of 4) In-Reply-To: <3A92E818.6FFACF04@ActiveState.com> (message from Paul Prescod on Tue, 20 Feb 2001 13:56:40 -0800) References: <20010220190538.F94A.BRIAN@tomigaya.shibuya.tokyo.jp> <3A92E818.6FFACF04@ActiveState.com> Message-ID: <200102202221.f1KMLPV01849@mira.informatik.hu-berlin.de> > I still assert that the interpretation will not change. If you have no > encoding declaration then the only rational choice is to treat each byte > as a character. Therefore the indexes would work exactly as they do > today. I'm not surprised that this assertion does not convince people too much. Again, I doubt that theoretical discussion of the issue does not bring it much further. What is needed is an actual patch to Python so people can see what exactly you are proposing, and in what way it would affect their code. I'm still pretty sure that any patch that changes string literals to be interpreted as wide strings, using the Unicode charset, would break loads of existing applications. Regards, Martin From guido@digicool.com Tue Feb 20 22:26:36 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 20 Feb 2001 17:26:36 -0500 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (2 of 4) In-Reply-To: Your message of "Tue, 20 Feb 2001 23:21:25 +0100." <200102202221.f1KMLPV01849@mira.informatik.hu-berlin.de> References: <20010220190538.F94A.BRIAN@tomigaya.shibuya.tokyo.jp> <3A92E818.6FFACF04@ActiveState.com> <200102202221.f1KMLPV01849@mira.informatik.hu-berlin.de> Message-ID: <200102202226.RAA07034@cj20424-a.reston1.va.home.com> > Again, I doubt that theoretical discussion of the issue does not bring > it much further. What is needed is an actual patch to Python so people > can see what exactly you are proposing, and in what way it would > affect their code. Yes! > I'm still pretty sure that any patch that changes > string literals to be interpreted as wide strings, using the Unicode > charset, would break loads of existing applications. Note that this can already be approximated with the -U option. It might be a good idea to present the patch as an extension of what -U does (I believe -U currently *only* changes all string literals to Unicode -- but that's already very pervasive...). --Guido van Rossum (home page: http://www.python.org/~guido/) From paulp@ActiveState.com Tue Feb 20 23:04:17 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 20 Feb 2001 15:04:17 -0800 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model References: <013401c09416$881b0f40$e46940d5@hagrid> <200102151849.f1FInMt02227@mira.informatik.hu-berlin.de> <3A92E5BB.38D4FB0B@ActiveState.com> <200102202154.QAA06554@cj20424-a.reston1.va.home.com> Message-ID: <3A92F7F1.77AFE3FD@ActiveState.com> Guido van Rossum wrote: > > ... > > This has been hashed to death many times before. We have absolutely > no guarantee that the files from which Python strings are read are > encoded in Latin-1, but we do know pretty sure that they are an ASCII > superset (if they represent characters at all). Using the locale > module the user can (implicitly) indicate what the character set is, > and this may not be Latin-1. Since s.islower() and other similar > functions are locale-sensitive, it would be inconsistent to declare > that 8-bit strings are always encoded in Latin-1. So the problem is that s.islower() might in some circumstances not equal unicode(s).islower()? 
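Something like this, presumably -- a sketch, where the locale name is platform-specific and the Latin-1 locale has to be installed for the setlocale() call to succeed:

>>> import locale
>>> s = "\xe9"                        # e-acute in Latin-1, but just a byte to Python
>>> s.islower()                       # under the default "C" locale
0
>>> locale.setlocale(locale.LC_CTYPE, "de_DE.ISO8859-1")
'de_DE.ISO8859-1'
>>> s.islower()                       # the same byte now counts as a lower-case letter
1
>>> unicode(s, "latin-1").islower()   # the Unicode answer does not depend on the locale
1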
Is this really a bigger deal than the fact that in some circumstances comparisons between 8-bit strings and Unicode strings will cause an exception, depending on the contents of the 8-bit string. Or that sorts could throw exceptions? Or concatenations can fail? The only arguments I have heard for the need for the builtin function "unichr" are based on the danger of concatenation failures in the 127-255 range. The price of this consistency is very high IMO! -- Vote for Your Favorite Python & Perl Programming Accomplishments in the first Active Awards! http://www.ActiveState.com/Awards From paulp@ActiveState.com Tue Feb 20 23:20:35 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 20 Feb 2001 15:20:35 -0800 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (2 of 4) References: <20010220190538.F94A.BRIAN@tomigaya.shibuya.tokyo.jp> <3A92E818.6FFACF04@ActiveState.com> <200102202221.f1KMLPV01849@mira.informatik.hu-berlin.de> <200102202226.RAA07034@cj20424-a.reston1.va.home.com> Message-ID: <3A92FBC3.E8484C0B@ActiveState.com> Guido van Rossum wrote: > > > Again, I doubt that theoretical discussion of the issue does not bring > > it much further. What is needed is an actual patch to Python so people > > can see what exactly you are proposing, and in what way it would > > affect their code. > > Yes! The pre-PEP proposed roughly several month's work in terms of new types, extended functions, encoding changes and so forth to be implemented over several years. But if we don't agree on the direction of movement straight then we aren't going to move anywhere ever! The central proposal is that "Python strings" could allow characters with ordinal values higher than 255. I absolutely cannot see how this could break Python code. It is a loosening of a restriction! The trick (which may or may not be possible) is working with extension modules which have assumptions about the underlying bit-representation of strings. The only way out from under that weight is to start distinguishing between logical character strings and physical byte strings now, so that we do not have this same "legacy extension code" issue five years from now. -- Vote for Your Favorite Python & Perl Programming Accomplishments in the first Active Awards! http://www.ActiveState.com/Awards From guido@digicool.com Tue Feb 20 23:53:14 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 20 Feb 2001 18:53:14 -0500 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (2 of 4) In-Reply-To: Your message of "Tue, 20 Feb 2001 15:20:35 PST." <3A92FBC3.E8484C0B@ActiveState.com> References: <20010220190538.F94A.BRIAN@tomigaya.shibuya.tokyo.jp> <3A92E818.6FFACF04@ActiveState.com> <200102202221.f1KMLPV01849@mira.informatik.hu-berlin.de> <200102202226.RAA07034@cj20424-a.reston1.va.home.com> <3A92FBC3.E8484C0B@ActiveState.com> Message-ID: <200102202353.SAA07769@cj20424-a.reston1.va.home.com> > The pre-PEP proposed roughly several month's work in terms of new types, > extended functions, encoding changes and so forth to be implemented over > several years. But if we don't agree on the direction of movement > straight then we aren't going to move anywhere ever! > > The central proposal is that "Python strings" could allow characters > with ordinal values higher than 255. I absolutely cannot see how this > could break Python code. It is a loosening of a restriction! It will probably require changes to C APIs, so it will break extensions. If some extensions aren't ported, that will in turn break 3rd party code. 
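For example, wrapper and glue code routinely type-checks for the 8-bit string type, and every such check breaks the moment literals become Unicode objects. A sketch (send_bytes is a made-up helper, not taken from any real extension):

>>> import types
>>> def send_bytes(s):
...     assert type(s) is types.StringType   # "must be an 8-bit string"
...     return len(s)
...
>>> send_bytes("abc")      # fine while literals are 8-bit strings
3
>>> send_bytes(u"abc")     # what every call site would pass once literals are Unicode
Traceback (most recent call last):
  ...
AssertionError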
Also, if you want to see what could break, try running the test suite with python -U. > The trick (which may or may not be possible) is working with extension > modules which have assumptions about the underlying bit-representation > of strings. The only way out from under that weight is to start > distinguishing between logical character strings and physical byte > strings now, so that we do not have this same "legacy extension code" > issue five years from now. Sorry, I don't understand what you're proposing here. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Wed Feb 21 00:00:55 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 20 Feb 2001 19:00:55 -0500 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model In-Reply-To: Your message of "Tue, 20 Feb 2001 15:04:17 PST." <3A92F7F1.77AFE3FD@ActiveState.com> References: <013401c09416$881b0f40$e46940d5@hagrid> <200102151849.f1FInMt02227@mira.informatik.hu-berlin.de> <3A92E5BB.38D4FB0B@ActiveState.com> <200102202154.QAA06554@cj20424-a.reston1.va.home.com> <3A92F7F1.77AFE3FD@ActiveState.com> Message-ID: <200102210000.TAA07907@cj20424-a.reston1.va.home.com> > Guido van Rossum wrote: > > > > ... > > > > This has been hashed to death many times before. We have absolutely > > no guarantee that the files from which Python strings are read are > > encoded in Latin-1, but we do know pretty sure that they are an ASCII > > superset (if they represent characters at all). Using the locale > > module the user can (implicitly) indicate what the character set is, > > and this may not be Latin-1. Since s.islower() and other similar > > functions are locale-sensitive, it would be inconsistent to declare > > that 8-bit strings are always encoded in Latin-1. > > So the problem is that s.islower() might in some circumstances not equal > unicode(s).islower()? > > Is this really a bigger deal than the fact that in some circumstances > comparisons between 8-bit strings and Unicode strings will cause an > exception, depending on the contents of the 8-bit string. Or that sorts > could throw exceptions? Or concatenations can fail? Yes, it is a bigger deal, because it is a clear indication that assuming Latin-1 is simply WRONG. > The only arguments I have heard for the need for the builtin function > "unichr" are based on the danger of concatenation failures in the > 127-255 range. The price of this consistency is very high IMO! --Guido van Rossum (home page: http://www.python.org/~guido/) From kajiyama@pseudo.grad.sccs.chukyo-u.ac.jp Wed Feb 21 05:30:15 2001 From: kajiyama@pseudo.grad.sccs.chukyo-u.ac.jp (Tamito Kajiyama) Date: Wed, 21 Feb 2001 14:30:15 +0900 (JST) Subject: [I18n-sig] Japanese commentary on the Pre-PEP (2 of 4) In-Reply-To: <3A92E818.6FFACF04@ActiveState.com> (message from Paul Prescod on Tue, 20 Feb 2001 13:56:40 -0800) References: <3A92E818.6FFACF04@ActiveState.com> <200102202202.RAA06643@cj20424-a.reston1.va.home.com> Message-ID: <200102210530.OAA11470@pseudo.grad.sccs.chukyo-u.ac.jp> Brian, thank you for the great translation! Paul Prescod wrote: | | It is certainly too early for Python to abandon the one-byte centric | view of the world. It is NOT too early to start putting into place a | transition plan to the future world that we will all be forced to live | in. Part of that transition is teaching people that literal strings may | one day allow characters greater than 128 (perhaps directly, perhaps | through an escape mechanism). I agree. 
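(A small sketch of what 2.0 already offers: the escape mechanism exists today, but only inside u"" literals, not in plain ones:)

>>> u"caf\u00e9"             # \u escapes name a character, independently of any encoding
u'caf\xe9'
>>> unichr(0xe9) == u"\u00e9"
1
>>> "caf\xe9"                # in a plain literal, \xe9 is just a byte with no character identity
'caf\xe9'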
| > The present implementation of strings in Python, where a string represents | > a sequence of bytes, is one feature that makes Python easy for Japanese | > developers to use. | | If Japanese programmers understand the difference between a byte and a | character (which they must!), why would they be opposed to making that | distinction explicit in code? They are not opposed to the distinction, I believe. In fact, Python 2.0 makes such a distinction since it has the byte string and Unicode string data types. The present two distinct data types are necessary and sufficient, I think. Guido van Rossum wrote: | | Maybe because, like me, they're thinking in historical terms where | 'char' is just another word for byte? Paul Prescod wrote: | | I still assert that the interpretation will not change. If you have no | encoding declaration then the only rational choice is to treat each byte | as a character. Therefore the indexes would work exactly as they do | today. As Guido pointed out, Japanese programmers are thinking that 'char' in Python (and C) is another word of 'byte'. Therefore, to treat each byte as a character is not rational at least in Japanese text processing. I'm quite sure that tons of existing programs will break if the semantics of the byte string and Unicode string are swapped. Regards, -- KAJIYAMA, Tamito From andy@reportlab.com Wed Feb 21 06:04:57 2001 From: andy@reportlab.com (Andy Robinson) Date: Wed, 21 Feb 2001 06:04:57 -0000 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (1 of 4) In-Reply-To: <20010220185630.F948.BRIAN@tomigaya.shibuya.tokyo.jp> Message-ID: > I've got four messages to post; I'm not such a fast > translator so I'll > post the two I have now, and the other two as I finish them. > Many thanks to everyone on python-ml-jp for your thoughtful answers. And Brian, thank you very much for these translations; I know you have put a lot of time and effort into them. - Andy Robinson From andy@reportlab.com Wed Feb 21 06:04:59 2001 From: andy@reportlab.com (Andy Robinson) Date: Wed, 21 Feb 2001 06:04:59 -0000 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model In-Reply-To: <3A92E5BB.38D4FB0B@ActiveState.com> Message-ID: Paul Prescod wrote: > It looks like we've already decided that group 0, plane 0, row 0 is > special. A better question is why if the first half of > group 0, plane 0, row 0 better than the last half? Because the first half is compatible with just about every native encoding on the planet. The last half is just Latin-1, and byte values above 127 are different in just about every native encoding on the planet. - Andy Robinson From martin@loewis.home.cs.tu-berlin.de Wed Feb 21 09:13:30 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 21 Feb 2001 10:13:30 +0100 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (2 of 4) In-Reply-To: <3A92FBC3.E8484C0B@ActiveState.com> (message from Paul Prescod on Tue, 20 Feb 2001 15:20:35 -0800) References: <20010220190538.F94A.BRIAN@tomigaya.shibuya.tokyo.jp> <3A92E818.6FFACF04@ActiveState.com> <200102202221.f1KMLPV01849@mira.informatik.hu-berlin.de> <200102202226.RAA07034@cj20424-a.reston1.va.home.com> <3A92FBC3.E8484C0B@ActiveState.com> Message-ID: <200102210913.f1L9DUh00845@mira.informatik.hu-berlin.de> > The pre-PEP proposed roughly several month's work in terms of new types, > extended functions, encoding changes and so forth to be implemented over > several years. 
But if we don't agree on the direction of movement > straight then we aren't going to move anywhere ever! > > The central proposal is that "Python strings" could allow characters > with ordinal values higher than 255. I absolutely cannot see how this > could break Python code. It is a loosening of a restriction! If you are convinced that your approach works, but cannot afford to implement it all, then specify it in a PEP. That might reduce the amount of work that you have to do, but will increase the amount of work that others have to do: I'd have to study it, try to understand it, then point out places where it is imprecise. After that, I'd have to figure out mentally how to implement it, and point to the places that are unimplementable. Finally, I'd have to look around for code and study how it would operate under your proposal. I seriously doubt that "direction of movement" is a meaningful term here. It all depends on the details, not the grand picture. Regards, Martin From mal@lemburg.com Wed Feb 21 12:39:26 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 21 Feb 2001 13:39:26 +0100 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> Message-ID: <3A93B6FE.842A4BAF@lemburg.com> Guido van Rossum wrote: > > On the XML sig the following exchange happened. I don't know enough > about the issues to investigate, but I'm sure that someone here can > provide insight? It seems to boil down to whether or not surrogates > may get transposed when between platforms. The Python Unicode implementation assumes that the internal storage is using UTF-16 *without* surrogates. As a result the storage scheme is the same as UCS2. This is per design since surrogates introduce a whole new can of worms (making UTF-16 a variable length encoding). Still, there are some codecs (utf-8, utf-16, unicode-escape) which try to handle surrogates properly. The support for surrogates is not complete though, so I wouldn't rely on it. Note that UTF-16 surrogates are only needed to reach Unicode code points beyond BMP. AFAIK, there are plans to fill this area in the next Unicode version, but the designers are very well aware of the issues this imposes on the existing implementations: Windows and Java are Unicode 2.0 based, which is not capable of handling character points outside BMP. Does this answer your question? > --Guido van Rossum (home page: http://www.python.org/~guido/) > > ------- Forwarded Message > > Date: Tue, 20 Feb 2001 11:54:34 -0700 > From: Uche Ogbuji > To: Guido van Rossum > cc: Lars Marius Garshol , xml-sig@python.org > Subject: Re: [XML-SIG] DC DOM tests (Was: Roadmap document - finally!) > > > > > > - DOMString and text manipulating interface methods are not > > > > > tested beyond ASCII text due to an implementation limitation > > > > > of ParsedXML.DOM. So, implementations will not be tested if > > > > > text is correctly treated when multi-byte UTF-16 characters > > > > > are involved. > > > > > > > > By "multi-byte UTF-16 characters" I assume you mean Unicode > > > > characters outside the BMP that are represented using two > > > > surrogates? > > > > > > I wonder if that's what Martijn means. I've read that most Java > > > implementations have trouble with characters outside the BMP. I > > > wonder if Python handles these properly. > > > > Depends on what you call properly. Can you elaborate on what you > > would call proper treatment here? > > Sure.
I admit it's hearsay, but I thought I'd read that because Java > Unicode is or was underspecified, that there was the possibility of > transposition of the high-surrogate with the low-surrogate character > between Java implementations or platforms. > > Now I don't exactly write XML dissertations on "Hello Kitty" , so > I'm not likely to run into this myself, but I was wondering whether > Python handles surrogate blocks appropriately across platforms and > implementations (I guess including cpyhton -> Jpython). > > -- > Uche Ogbuji Principal Consultant > uche.ogbuji@fourthought.com +1 303 583 9900 x 101 > Fourthought, Inc. http://Fourthought.com > 4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA > Software-engineering, knowledge-management, XML, CORBA, Linux, Python > > ------- End of Forwarded Message > > _______________________________________________ > I18n-sig mailing list > I18n-sig@python.org > http://mail.python.org/mailman/listinfo/i18n-sig -- Marc-Andre Lemburg ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Pages: http://www.lemburg.com/python/ From fw@deneb.enyo.de Thu Feb 22 16:38:26 2001 From: fw@deneb.enyo.de (Florian Weimer) Date: 22 Feb 2001 17:38:26 +0100 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <3A93B6FE.842A4BAF@lemburg.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <3A93B6FE.842A4BAF@lemburg.com> Message-ID: <87hf1moepp.fsf@deneb.enyo.de> "M.-A. Lemburg" writes: > Note that UTF-16 surrogates are only needed to reach Unicode > code points beyond BMP. AFAIK, there are plans to fill this > area in the next Unicode version, but the designers are very > well aware of the issues this imposes on the existing implementations: > Windows and Java are Unicode 2.0 based which is not capable of > handling character points outside BMP. And so is Ada. However, a few useful extensions are planned for the next Unicode revisions: several mathematical alphabets and language tags come to my mind immediately. It's certainly no longer true that non-BMP characters are going to be used only by scholars (as it seemed a few years ago).
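To see what this means for the current implementation, whose internal storage is UCS-2 as described above: a character beyond the BMP (U+1D400 is used below purely as an arbitrary example) can only be written as a surrogate pair, i.e. two storage units for one logical character. A sketch of the arithmetic:

>>> cp = 0x1D400                             # hypothetical non-BMP code point
>>> hi = 0xD800 + ((cp - 0x10000) >> 10)     # high (leading) surrogate
>>> lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)   # low (trailing) surrogate
>>> hex(hi), hex(lo)
('0xd835', '0xdc00')
>>> s = unichr(hi) + unichr(lo)              # one logical character, two code units
>>> len(s)                                   # UCS-2 storage counts code units, not characters
2

Whether the utf-8 or utf-16 codecs join such a pair back into a single code point is exactly the incomplete surrogate support mentioned earlier, which is why the arithmetic is spelled out by hand here.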