[I18n-sig] Unicode compromise?

Paul Prescod paul@prescod.net
Tue, 02 May 2000 14:01:32 -0500


Guido van Rossum wrote:
> 
> >     No automatic conversions between 8-bit "strings" and Unicode strings.
> >
> > If you want to turn UTF-8 into a Unicode string, say so.
> > If you want to turn Latin-1 into a Unicode string, say so.
> > If you want to turn ISO-2022-JP into a Unicode string, say so.
> > Adding a Unicode string and an 8-bit "string" gives an exception.
> 
> I'd accept this, with one change: mixing Unicode and 8-bit strings is
> okay when the 8-bit strings contain only ASCII (byte values 0 through
> 127).  

I could live with this compromise as long as we document that a future
version may use the "character is a character" model. I just don't want
people to start depending on a catchable exception being thrown because
that would stop us from ever unifying unmarked literal strings and
Unicode strings.
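
To make the rule concrete, my reading of the compromise is behaviour
roughly like the following (the unicode() spelling and the codec names
are just my reading of the proposal, not settled API, and an
ISO-2022-JP codec would have to be available separately):

    # explicit conversions: say which encoding you mean
    # ('data' stands for an ordinary 8-bit string in that encoding)
    u1 = unicode(data, 'utf-8')        # UTF-8 -> Unicode
    u2 = unicode(data, 'latin-1')      # Latin-1 -> Unicode
    u3 = unicode(data, 'iso-2022-jp')  # ISO-2022-JP -> Unicode

    # mixing is allowed only when the 8-bit string is pure ASCII
    u"abc" + "def"       # fine: every byte value is 0 through 127
    u"abc" + "caf\xe9"   # raises an exception under the compromise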

--

Are there any steps we could take to make a future divorce of strings
and byte arrays easier? What if we added a 

binary_read()

function that returned some form of byte array? The byte array type
could be just like today's string type, except that its type object
would be distinct, it wouldn't have as many string-ish methods, and it
wouldn't have any auto-conversion to Unicode at all.

People could start transitioning code that reads non-ASCII data to the
new function. We could put big warning labels on read() stating that it
might not always be able to read data outside some small set of
recognized encodings (probably UTF-8 and UTF-16).

Or perhaps binary_open(). Or perhaps both.
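
To be clear about what I mean, here is a purely hypothetical sketch --
none of these names exist today, and the byte array type itself is the
part that would need real design work:

    # hypothetical: a byte-oriented file wrapper
    def binary_open(filename):
        return BinaryFile(open(filename, 'rb'))

    # hypothetical: convenience form that slurps a whole file
    def binary_read(filename):
        f = open(filename, 'rb')
        data = f.read()
        f.close()
        return data   # would be the byte array type, not a string

    class BinaryFile:
        def __init__(self, fileobj):
            self.fileobj = fileobj
        def read(self, size=-1):
            # would return the proposed byte array type: distinct type
            # object, fewer string-ish methods, no auto-conversion to
            # Unicode
            if size < 0:
                return self.fileobj.read()
            return self.fileobj.read(size)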

I do not suggest just using the text/binary flag on the existing open
function because we cannot immediately change its behavior without
breaking code.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html