[Python-Dev] Re: Unicode debate

Guido van Rossum guido@python.org
Fri, 28 Apr 2000 10:10:27 -0400


[GvR]
> >- We need a way to indicate the encoding of Python source code.
> >(Probably a "magic comment".)

[JvR]
> How will other parts of a program know which encoding was used for
> non-unicode string literals?
> 
> It seems to me that an encoding attribute for 8-bit strings solves this
> nicely. The attribute should only be set automatically if the encoding of
> the source file was specified or when the string has been encoded from a
> unicode string. The attribute should *only* be used when converting to
> unicode. (Hm, it could even be used when calling unicode() without the
> encoding argument.) It should *not* be used when comparing (or adding,
> etc.) 8-bit strings to each other, since they still may contain binary
> goop, even in a source file with a specified encoding!

Marc-Andre took this idea a bit further, but I think it's not
practical given the current implementation: there are too many places
where the C code would have to be changed in order to propagate the
string encoding information, and there are too many sources of strings
with unknown encodings to make it very useful.  Plus, it would slow
down 8-bit string ops.
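
To make the propagation problem concrete, here is a minimal sketch of the
encoding-attribute idea. It is written in present-day Python 3 syntax purely
for illustration; the class name TaggedBytes and its to_unicode() method are
hypothetical and not part of any proposal in this thread.

```python
class TaggedBytes(bytes):
    """Hypothetical 8-bit string that remembers an optional encoding."""

    def __new__(cls, data, encoding=None):
        obj = super().__new__(cls, data)
        obj.encoding = encoding  # None means "unknown encoding"
        return obj

    def to_unicode(self, encoding=None):
        # The attribute is consulted only when converting to Unicode.
        codec = encoding or self.encoding
        if codec is None:
            raise ValueError("no encoding known for this string")
        return self.decode(codec)


tagged = TaggedBytes(b"caf\xe9", "latin-1")
text = tagged.to_unicode()

# Comparison ignores the attribute: only the bytes are compared.
same = TaggedBytes(b"abc", "latin-1") == TaggedBytes(b"abc", "utf-8")

# Concatenation falls back to the base type, so the tag is lost here.
joined = tagged + b"!"
```

Note that the result of the concatenation is a plain bytes object without the
attribute. Every such operation would need changing to carry the tag along,
which is the kind of cost described above.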

I have a better idea: rather than carrying around 8-bit strings with
an encoding, use Unicode literals in your source code.  If the source
encoding is known, these will be converted using the appropriate
codec.
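
As a rough sketch of what "converted using the appropriate codec" could mean,
the snippet below decodes the raw bytes of a literal with the encoding declared
for the source file. It uses present-day Python 3 syntax for illustration only,
and the helper name decode_literal is an assumption, not an existing API.

```python
def decode_literal(raw_bytes, source_encoding):
    """Decode the bytes of a u"..." literal with the source file's codec."""
    return raw_bytes.decode(source_encoding)


# Bytes as they would appear in a Latin-1 source file.
latin1_text = decode_literal(b"caf\xe9", "latin-1")

# Bytes as they would appear in a Shift-JIS source file.
sjis_text = decode_literal(b"\x93\xfa\x96\x7b", "shift_jis")
```

Without a known source encoding, neither byte sequence can be interpreted
reliably, which is why the magic comment matters.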

If you object to having to write u"..." all the time, we could say
that "..." is a Unicode literal if it contains any characters with the
top bit on (of course the source file encoding would be used just like
for u"...").

But I think this should be enabled by a separate pragma -- people who
want to write Unicode-unaware code manipulating 8-bit strings in their
favorite encoding (e.g. Shift-JIS or Latin-1) should not silently get
Unicode strings.
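
The rule could be sketched roughly as follows, again in present-day Python 3
syntax for illustration. The function name classify_literal and the
pragma_enabled flag are hypothetical stand-ins; no pragma syntax has been
decided.

```python
def classify_literal(raw_bytes, source_encoding, pragma_enabled):
    """Return a Unicode string or the untouched 8-bit string for "..."."""
    # "Top bit on" means some byte has the value 0x80 or higher.
    has_high_bit = any(byte & 0x80 for byte in raw_bytes)
    if pragma_enabled and has_high_bit:
        # Treated like u"...": decode with the source file's encoding.
        return raw_bytes.decode(source_encoding)
    # Pure ASCII, or the pragma is off: stays an 8-bit string.
    return raw_bytes


ascii_literal = classify_literal(b"abc", "latin-1", True)
promoted = classify_literal(b"caf\xe9", "latin-1", True)
unchanged = classify_literal(b"caf\xe9", "latin-1", False)
```

With the pragma off, the last call leaves the bytes alone, so Unicode-unaware
code keeps working on its 8-bit strings.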

(I thought about an option to make *all strings* (not just literals)
Unicode, but the current implementation would require too much
hacking.  This is what JPython does, and maybe it should be what
Python 3000 does; I don't see it as a realistic option for the 1.x
series.)

--Guido van Rossum (home page: http://www.python.org/~guido/)