[I18n-sig] Japanese commentary on the Pre-PEP (2 of 4)

Paul Prescod paulp@ActiveState.com
Tue, 20 Feb 2001 13:56:40 -0800


Thanks for the translation, Brian! That must have been a ton of work,
but it strikes me as very important work!


> 
> ...
> 
>       Python 2.0                                  Pre-PEP
>       string "" (byte sequence)                   byte string b""
>       Unicode string u"" (character sequence)     string ""
>
>   In general, the before- and after-PEP Pythons above have essentially no
>   difference in expressiveness, and therefore it's hard to see what merit
>   there might be in swapping the data types.

I think that there is an important issue here. Python is documented as
having character strings. The minimal unit of a string is supposed to be
a character. "Literal" strings are documented as being strings of
characters. People expect this of a modern, high-level, user-centric
language. Bytes are no more interesting to your average programmer than
are DWORDs. We aren't going to start teaching people about bytes in
introductory Python classes.
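To make that distinction concrete, here is a minimal sketch in the
""/b"" notation the Pre-PEP proposes (the same notation later Python
versions adopted); the variable names are only illustrative:

    text = "caf\u00e9"            # a string of 4 characters
    data = text.encode("utf-8")   # a byte string of 5 bytes

    print(len(text))              # 4 -- one entry per character
    print(len(data))              # 5 -- the accented e is 2 bytes in UTF-8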

More and more, people are going to find it bizarre to make a distinction
between the 128 characters that happen to have lived in a
quickly-becoming-obsolete American standard and the other 65,000
characters that we can use in word processors, web pages, search engines
and so forth. You don't have to be Asian to see the distinction as
arbitrary and historical. What if you want to insert a trademark sign
(tm) or a copyright sign (c) into your software?

It is certainly too early for Python to abandon the one-byte centric
view of the world. It is NOT too early to start putting into place a
transition plan to the future world that we will all be forced to live
in. Part of that transition is teaching people that literal strings may
one day allow characters beyond the 128 ASCII characters (perhaps
directly, perhaps through an escape mechanism).
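Python 2.0's u"" literals already offer exactly such an escape
mechanism (\uXXXX), and under the Pre-PEP a plain literal would work
the same way. A tiny sketch in the Pre-PEP's notation (the product
name is made up):

    notice = "\u00a9 2001 FooSoft\u2122"   # (c) and (tm) via \u escapes
    print(notice)                          # © 2001 FooSoft™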

> ...
>   Furthermore, Japanese programmers are accustomed to dealing with Japanese
>   strings as byte sequences.  Japanese users have a real motivation to
>   manipulate Japanese character strings as sequences of bytes.  Regardless
>   of whether Unicode is supported or not, the byte sequence data type is
>   necessary in order to represent Japanese characters.

An explicit part of every proposal has been a continued support for
rich, expressive byte-sequence manipulation.
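For instance, the familiar sequence operations all carry over to the
byte string type. A minimal sketch in the Pre-PEP's b"" notation (the
packet contents are invented):

    packet = b"\x02\x00HELLO"   # 7 raw bytes
    print(len(packet))          # 7
    print(packet[2:])           # b'HELLO' -- slicing stays byte-oriented
    print(packet + b"\x03")     # concatenation yields bytes, not text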

>   The present implementation of strings in Python, where a string
>   represents a sequence of bytes, is one feature that makes Python easy
>   for Japanese developers to use.

If Japanese programmers understand the difference between a byte and a
character (which they must!), why would they be opposed to making that
distinction explicit in code?

>   As you know, in Japanese-encoded byte strings, 2 bytes often represent
>   1 character.  Therefore, the position of a character is expressed in
>   terms of bytes, not characters.  Because of this, if a Japanese-encoded
>   byte string is interpreted as-is as a Unicode character string, indexes
>   into the string would no longer be interpreted the same way.  For
>   example, in the code snippet below, the substring that is output differs
>   depending on whether the string literal is interpreted as a byte
>   sequence or a Unicode character sequence:
>
>     s = "これは日本語の文字列です。"
>     print s[6:12]
>
>   Hard-coding slice indexes as above is, I believe, a common practice.
>   Paul has asserted that no serious problems will occur if existing byte
>   sequences are interpreted as Unicode, but I disagree with him on this.

I still assert that the interpretation will not change. If you have no
encoding declaration, then the only rational choice is to treat each
byte as a character. Therefore the indexes would work exactly as they
do today.
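A sketch of both readings, written in the Pre-PEP's notation (runnable
as modern Python 3; EUC-JP is assumed as the Japanese encoding, two
bytes per character):

    s = "これは日本語の文字列です。"   # 13 characters
    data = s.encode("euc-jp")          # 26 bytes, 2 per character

    print(s[6:12])       # character slice: の文字列です
    print(data[6:12])    # byte slice: the 6 bytes encoding 日本語

    # With no declared encoding, each byte is read as one character,
    # so indexes line up with today's byte indexes exactly:
    assert data.decode("latin-1")[6:12] == data[6:12].decode("latin-1")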

-- 
Vote for Your Favorite Python & Perl Programming  
Accomplishments in the first Active Awards! 
http://www.ActiveState.com/Awards