[I18n-sig] Python and Unicode == Britain and the Euro?

Paul Prescod paulp@ActiveState.com
Sun, 11 Feb 2001 00:05:22 -0800


Brian Takashi Hooper wrote:
> 
> ...
>
> I think this is a true and valid point (that Westerners are more likely
> to want to make internationalized software), but it sounds here like
> because Westerners want to make it easier to internationalize software,
> that that is a valid reason to make it harder to make software that has
> no particular need for internationalization, in non-Western languages,
> and change the _meaning_ of such a basic data type as the Python string.

I do not think that any of the proposals make it much harder to make
non-internationalized software. We are merely asking people to be
explicit about their assumptions so that code will have a better chance
of working on other people's computers. That means adding an encoding
declaration here, prepending a "b" prefix there and so forth. Asians
understand encoding issues and I do not think that they will be confused
by these changes.

If you ask an Asian programmer "what is Python's character set?" they
will either answer "Latin 1" (which looks bad) or "Python has no native
character set, only binary strings of bytes." If they think of strings as strings of
bytes then what is the harm in prefixing a "b" to make that assumption
explicit?
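
That distinction did eventually land in the language; a minimal sketch
in today's Python, where the "b" prefix marks a literal as bytes (the
Shift-JIS byte values below are just an illustrative choice):

```python
# A "b" literal is a string of uninterpreted bytes; an unprefixed
# literal is a string of characters. They are distinct types.
raw = b"\x82\xa0"   # two bytes (the Shift-JIS encoding of the hiragana あ)
text = "abc"        # three characters

assert isinstance(raw, bytes)
assert isinstance(text, str)
assert raw[0] == 0x82     # indexing bytes yields integers
assert text[0] == "a"     # indexing text yields characters
```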

> If in fact, as the proposal proposes, usage of open() without an
> encoding, for example, is at some point deprecated, then if I am
> manipulating non-Unicode data in "" strings, then I think I _do_ at some
> point have to port them over.  

No, those would be two unrelated changes. In order to get open() to have
its old behavior you would say something like:

open("filename", "raw")

or

open("filename", "binary")
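
For comparison, the distinction eventually shipped with different
spellings: modern Python uses a "b" mode flag for raw bytes and an
encoding= argument for text. A sketch (the file name is illustrative):

```python
import os
import tempfile

# Write some EUC-JP bytes, then read them back both ways.
path = os.path.join(tempfile.mkdtemp(), "blob.dat")
with open(path, "wb") as f:
    f.write("日本".encode("euc-jp"))

with open(path, "rb") as f:                     # "raw": bytes in, bytes out
    blob = f.read()

with open(path, "r", encoding="euc-jp") as f:   # explicit encoding: characters out
    chars = f.read()

assert isinstance(blob, bytes)
assert chars == "日本"
```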

> b"<blob of binary data>" then becomes
> different from "<blob of binary data>", because "<blob of binary data>"
> is now automatically being interpreted behind the scenes into an
> internal Unicode representation.  

Yes, this is a separate proposal for some time down the road. "Some time
down the road" likely means at least two years, because new versions of
Python are deployed very slowly and it would be wrong to quickly
deprecate a usage which is "recommended practice" in Python 2.x.

> If the blob of binary data actually
> happened to be in Unicode, or some Unicode-favored representation (like
> UTF-8), then I might be happy about this - but if it wasn't, I think
> that this result would instead be rather dismaying.

The vast majority of the world's encodings are "Unicode-favored" at some
level. As long as the character set is compatible with Unicode and you
add an encoding declaration, everything should just work. If you do NOT
want to work with Unicode then you would have to prepend a "b" prefix to
your literal strings.
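
The "compatible with Unicode" point can be sketched concretely: bytes in
a legacy encoding whose character set maps into Unicode round-trip
losslessly once the codec is named (EUC-JP here is just an example):

```python
# A legacy-encoded byte string survives a round trip through Unicode
# as long as its encoding is declared rather than guessed.
legacy_bytes = "文字化け".encode("euc-jp")    # bytes in a non-Unicode encoding
as_unicode = legacy_bytes.decode("euc-jp")    # interpret the bytes as characters
back = as_unicode.encode("euc-jp")            # and encode them again

assert back == legacy_bytes
assert as_unicode == "文字化け"
```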

As I've described, you will have several years to choose which path you
want to take. And the "fixups" are easy. I don't see why this is a cause
for alarm.

> The current Unicode support is more explicit about this - the meaning of
> the string literal itself has not changed, so I can continue to ignore
> Unicode in cases where it serves no useful purpose.  

Python is EXPLICIT about the fact that the character set is NOT Unicode.

Python is NOT explicit about the fact that the character set is Latin 1
or "binary data" -- depending on your point of view. If you take the
former point of view then Python is Western centric. If you take the
latter point of view then it is just plain confusing to use the term
"character string" as the name for your "binary data" container. You
acknowledge this below:

> I realize that it
> would be nicer from a design perspective, more consistent, to have
> Python string mean only character data, but right now, it does sometimes
> mean binary and sometimes mean characters. The only one who can
> distinguish which is the programmer - if at some point "" means only
> Unicode character strings, then the programmer _does_, I think, have to
> go through all their programs looking for places where they are using
> strings to hold non-Unicode character data, or binary data, and
> explicitly convert them over.  I have difficulty seeing how we would be
> able to provide a smooth upgrade path - maybe a command-line backwards
> compatibility option?  

It is my personal opinion that time itself is an "upgrade path." If you
tell people where things are going then in the course of basic software
maintenance they will change their software. This is how we managed the
transition from K&R C to ANSI C to C++. Yes, a command-line backwards
compatibility option is another way of extending the amount of
"change-over" time people have.

> Maybe defaults?  I've heard a lot of people
> voicing dislike for default encodings, but from my perspective,
> something like ISO-Latin-1, or UTF-8, or even ASCII (EUC-JP and SJIS are,
> strictly speaking, not supersets of ASCII because the ASCII ranges are
> usually interpreted as JIS-Roman, which contains about 4 different
> characters) is functionally a default encoding...  Requiring encoding
> declarations, as the proposal suggests, is nice for people working in
> the i18n domain, but is an unnecessary inconvenience for those who are
> not.

One of the things I like about Python is that it encourages me to write
software in ways that allow my simple scripts to grow into complex
programs. Perl programmers consider many of these "encouragements" to be
unnecessary inconveniences. Similarly, I think Python should help me
(and encourage me) to write software that works on computers that are
configured differently than mine.

Think of it also as an investment in the unification of the Python
world. Wouldn't it be great if Chinese programmers could email Guido and
say: "Here's a cool Python program I wrote. Give it a whirl?" Is it
possible that we duplicate more code than we need to because it is too
hard to share programs right now? Obviously spoken language barriers are
not going away but at least our code can be portable.

Also, think of all of the great software being written in Python. Maybe
the next killer Python app will work better in Japan and China because
we made it easier to internationalize code.

And if Python itself can distinguish between textual and binary
information then we can do a lot of things more intelligently:
coercions, exceptions, concatenations, extension library integration
etc. Explicit is better than implicit!
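
Modern Python did make exactly this distinction, and one payoff is that
mixing the two kinds of string is an explicit error rather than a silent
guess. A sketch of the behaviour:

```python
# Once text and binary are distinct types, an attempted coercion
# between them is refused, not guessed at.
try:
    "text" + b"bytes"
except TypeError:
    caught = True
else:
    caught = False

assert caught
```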

Finally, I think it is in the best interests of even people who do not
want i18n to have the Python language be more explicit and consistent.
When Python is taught in a Japanese school they can say: "See, this
character 'b' means that the string contains binary data. We choose to
use a binary string for reasons X, Y and Z." or "See, this string
contains Unicode characters. That means len() works as you would expect
on a per-character basis and the software works just as well with
Chinese text as Japanese text and ..."
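
The len() point is easy to demonstrate in modern Python, where strings
really are character strings:

```python
# With genuine character strings, len() counts characters, not bytes,
# so Japanese text behaves just like English text.
jp = "日本語"
assert len(jp) == 3                    # three characters
assert len(jp.encode("utf-8")) == 9    # but nine bytes in UTF-8
```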

> > I don't think it is possible to say in the abstract that a move to
> > Unicode would break code. Depending on implementation strategy it might.
> > But I can't imagine there is really a ton of code that would break
> > merely from widening the character.
> See above.  I think there is, at least outside of Europe.  

Note that we are discussing three or four or five different proposals as
if they are one. I think it would be easy to demonstrate that there is
little code that would break based ONLY on the change that Python
strings could contain characters with ordinals greater than 255.

If we added a single character to the range at position 256, would that
break much Python code? Ignore Unicode. Just extend the range by one
character. Now keep extending it until you get to the size of Unicode.
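
This thought experiment is exactly how things turned out: in today's
Python, characters past position 255 exist and every string operation
treats them like any other character. A sketch:

```python
# Extending the character range past 255 changes nothing about how
# string operations behave; characters simply get larger ordinals.
wide = chr(256)          # the first character beyond Latin-1
assert ord(wide) == 256
assert len(wide) == 1    # still a single character
assert ("a" + wide)[1] == wide   # concatenation and indexing unchanged
```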

The separate proposal that tries to clean up the interpretation of
literals with non-Unicode bytes WOULD break code (if only some time far
in the future and after a long changeover period). 

> ...
> Maybe it would be instructive to take the current proposal and any
> others that come out, and without actually implementing, pretend-apply
> the changes to parts of the existing code base to try to see how big the
> effect would be?  That way, neither of us has to accept just on faith
> that changing so-and-so would or would not break existing code...

Python changes are always implemented as patches which are tested and
then backed-out if they break things. Nevertheless, you are right that
there are some of us with the goal of having string literals directly
contain Unicode characters one day. Guido may or may not have an opinion
on the issue.

Either way, Guido wouldn't make the change if it were going to break a
lot of code. So the immediate issue is whether the explicitness
requirements of b"" strings and an encoding declaration are too onerous.

Anyhow, at this point we are not even talking about adding any mandatory
features or turning new features into recommended practice. We are just
talking about ALLOWING people to be explicit about the distinction
between binary and text data and ALLOWING people to directly enter
Unicode text data. 

I haven't tried to hide where I think things should go but still these
new features deserve to be evaluated on their own. They are good ideas
even if we never deprecate the other ways of doing things. I know I
started this discussion with my single big-bang proposal but I'd like to
take a more incremental approach now. I don't think that the current
proposals make anyone's life harder yet.

 Paul Prescod