[I18n-sig] Unicode strings: an alternative

Guido van Rossum guido@python.org
Fri, 05 May 2000 12:07:10 -0400


> > I'll just say that I am very happy with ASCII as the default.
> 
> It's better than UTF-8, but 8bit Unicode would be better, because
> it's the least surprising alternative.
> 
> People who use Python with "funny" languages, are already used to
> converting their strings around, and they treat their Python
> strings as byte arrays anyway. With Python 1.6 they can start
> to switch to Python's Unicode strings without any problems.
> That isn't so with UTF-8. I wonder how it will work with ASCII.
> Will this ASCII restriction only be enforced when converting
> to Unicode, or will the string type itself be restricted to
> ASCII?

No, 8-bit strings will always be 8-bit clean, of course!  The ASCII
restriction is only used for conversion to Unicode when no explicit
encoding is given.  For example, "abc" + u"xyz" is u"abcxyz", but "èé"
+ u"xyz" raises an exception.  However you can write
unicode("èé","latin-1") and it will yield u"\350\351".
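
The conversion rule described above can be sketched roughly as follows.
This is only an illustration written for present-day Python 3, not the
Python 1.6 code under discussion: there, bytes objects stand in for the
8-bit string type, mixing bytes and str is refused outright rather than
converted implicitly, and so the ASCII default has to be spelled out
with an explicit decode("ascii").  The variable names are made up for
the example.

```python
# Sketch of the ASCII-default rule, emulated in Python 3.
# Assumption: bytes plays the role of the 8-bit string type.

ascii_bytes = b"abc"
latin1_bytes = b"\xe8\xe9"  # the bytes \350\351 (octal), i.e. "èé" in Latin-1

# Pure ASCII data converts cleanly under the ASCII default.
combined = ascii_bytes.decode("ascii") + "xyz"

# Non-ASCII bytes are rejected when ASCII is the assumed encoding.
try:
    latin1_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Naming the encoding explicitly makes the conversion succeed.
decoded = latin1_bytes.decode("latin-1")

# The 8-bit data itself is never restricted; only the conversion is.
still_eight_bit = latin1_bytes == bytes([0xE8, 0xE9])
```

The point of the sketch is that the restriction applies at the
conversion step only: the byte string holding 0xE8 0xE9 is perfectly
legal on its own, and fails only when it is decoded as ASCII.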

> IMHO the long term goal should be to have only one string type
> (being Unicode) and one byte array type (being our current string 
> type?)

The byte array type should not support string literals at all.  The
Java model is right.

--Guido van Rossum (home page: http://www.python.org/~guido/)