[I18n-sig] Unicode debate

Guido van Rossum guido@python.org
Fri, 05 May 2000 11:05:36 -0400


[Moving this discussion to i18n-sig, where it belongs]

> At 11:02 PM +0200 04-05-2000, Fredrik Lundh wrote:
> >Henry S. Thompson <ht@cogsci.ed.ac.uk> wrote:
> >> I think I hear a moderate consensus developing that the 'ASCII
> >> proposal' is a reasonable compromise given the time constraints.
> >
> >agreed.
> 
> This makes no sense: implementing the 7-bit proposal takes more or less
> the same time as implementing 8-bit downcasting. Or is it just the
> bickering that's too time consuming? ;-)

Sort of.  The 8-bit proposal has too much opposition, and other
(possibly better) proposals would take too long to implement.  The
7-bit proposal takes away the biggest problem with the current UTF-8
version (a character is always a byte -- a byte isn't always a
character though) and doesn't back us into a corner we can't get
out of later.
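
To make the rule concrete, here is a minimal sketch of the 7-bit idea.
It assumes a present-day Python in which bytes and str are separate
types, which is not what the 1.6 tree looks like, and the helper names
coerce_ascii and try_coerce are hypothetical:

```python
def coerce_ascii(data: bytes) -> str:
    """Hypothetical helper: promote 8-bit data to Unicode only if it is pure 7-bit ASCII."""
    # Strict ASCII decoding raises UnicodeDecodeError on any byte >= 0x80
    # instead of guessing an encoding.
    return data.decode("ascii")


def try_coerce(data: bytes):
    """Return the decoded text, or None if the data is not pure ASCII."""
    try:
        return coerce_ascii(data)
    except UnicodeDecodeError:
        return None


plain = try_coerce(b"hello")           # 7-bit data converts cleanly
rejected = try_coerce(b"caf\xc3\xa9")  # UTF-8 bytes with the high bit set are refused

# In UTF-8 one character may occupy several bytes, so byte counts
# and character counts disagree for non-ASCII text.
utf8_len = len("caf\u00e9".encode("utf-8"))  # 5 bytes for 4 characters
```

Refusing anything outside ASCII keeps the later choice of a default
encoding open, which is the "not backed into a corner" part.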

> I worry that if the current implementation goes into 1.6 more or less as it
> is now there's no way we can ever go back (before P3K). Or will Unicode
> support be marked "experimental" in 1.6? This is not so much about the
> 7-bit/8-bit proposal but about the dubious unicode() and unichr() functions
> and the u"" notation:
> 
> - unicode() only takes strings, so is effectively a method of the string type.

Not true.  It takes anything that supports the buffer interface:

  >>> from array import array
  >>> a = array('b', "hello world")
  >>> unicode(a)
  u'hello world'
  >>> 

The best way to look at it is to view unicode() as a constructor for
Unicode objects.
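
The constructor view can also be sketched in a present-day Python,
where str takes that role when given an encoding. This assumes a
Python 3 interpreter and is not the 1.6 session shown above:

```python
from array import array

# Any object exposing the buffer interface can be decoded,
# not only a plain byte string.
a = array("b", b"hello world")
from_array = str(a, "ascii")
from_view = str(memoryview(b"hello world"), "ascii")
from_bytearray = str(bytearray(b"hello world"), "ascii")
```

All three calls build the same text object from different buffer
providers, so the function is not tied to one input type.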

> - if narrow and wide strings are meant to be as similar as possible,
> chr(256) should just return a wide char
> - similarly, why is the u"" notation at all needed?

Many extensions don't do the right thing with Unicode string objects,
and there's not enough time to fix them all.  So my (indeed
experimental and temporary -- at the worst until Py3k) solution is to
require people to be explicit about when they want to use wide
strings.  Very similar to what Python does with 32-bit vs. long ints.
Not ideal in the long run, and to be fixed in Py3k, but (in my view)
unavoidable right now given that Python interfaces to so many
real-world systems where the distinction is important.

If in the future we'll be more automatic, we will support but ignore
the u prefix on string literals, for backward compatibility -- just
like we will support but ignore the L suffix on numeric literals once
ints and longs have been unified.
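
For illustration only, this is roughly what "support but ignore" looks
like from the user's side. The sketch assumes a Python 3.3 or later
interpreter, where the u prefix is accepted and has no effect and there
is a single integer type. It says nothing about how 1.6 behaves:

```python
# The u prefix is accepted but makes no difference to the result.
prefixed = u"hello"
bare = "hello"
same_type = type(prefixed) is type(bare)

# A single integer type covers both small and arbitrarily large values.
small = 1
big = 2 ** 100
same_int_type = type(small) is type(big)

# chr() returns a one-character string for code points above 255.
wide = chr(256)
```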

> The current design is more complex than needed, and still offers plenty of
> surprises. Making it simpler (without integrating the two string types) is
> not a huge effort. Seeing the wide string type as independent of Unicode
> takes no physical effort at all, as it's just in our heads.

What do you propose to make it simpler?  Your last implementation
proposal would require starting all over from scratch.

> Fixing str() so it can return wide strings might be harder, and can wait
> until later. Would be too bad, though.

Agreed on both counts (harder, and too bad).

--Guido van Rossum (home page: http://www.python.org/~guido/)