[Python-Dev] re: Unicode as argument for 8-bit strings

Bill Tutt billtut@microsoft.com
Fri, 7 Apr 2000 23:24:06 -0700


> From: Fredrik Lundh [mailto:fredrik@pythonware.com]
> 
> Bill Tutt wrote:
> > > There has been a bug report about the treatment of Unicode
> > > objects together with 8-bit format strings. The current
> > > implementation converts the Unicode object to UTF-8 and then
> > > inserts this value in place of the %s.... 
> > > 
> > > I'm inclined to change this to have '...%s...' % u'abc'
> > > return u'...abc...' since this is just another case of
> > > coercing data to the "bigger" type to avoid information loss.
> > > 
> > > Thoughts ?
> > 
> > Suddenly returning a Unicode string from an operation that was an
> > 8-bit string is likely to give some code extreme fits of despondency.
> 
> why is this different from returning floating point values from
> operations involving integers and floats?
> 
> > Converting to UTF-8 didn't give you any data loss, however it
> > certainly might be unexpected to now find UTF-8 characters in what
> > the user originally thought was a binary string containing whatever
> > they had wanted it to contain.
> 
> the more I've played with this, the stronger my opinion that
> the "now it's an ordinary string, now it's a UTF-8 string, now
> it's an ordinary string again" approach doesn't work.  more on
> this in a later post.
> 

Well, unicode string/UTF-8 string, but I definitely agree with you. Pick one
or the other and make the user convert betwixt the two.
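
To make "convert betwixt the two" concrete, here is a minimal sketch. It
assumes modern Python 3 names (str as the Unicode type, bytes as the 8-bit
type), which are not the API under discussion in this thread; the variable
names are only illustrative.

```python
# Sketch only: Python 3 str/bytes stand in for the Unicode and 8-bit types.
text = "\u1234"                      # a Unicode string

# Unicode -> 8-bit: the caller picks the encoding explicitly.
data = text.encode("utf-8")

# 8-bit -> Unicode: again an explicit, caller-chosen decoding.
roundtrip = data.decode("utf-8")

# Formatting then stays within a single type on each side.
as_bytes = b"...%s..." % data        # 8-bit template, 8-bit argument
as_text = "...%s..." % text          # Unicode template, Unicode argument
```

Nothing is converted behind the user's back here; every crossing between
the two types is spelled out at the call site.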

> (am I the only one here that has actually tried to write code
> that handles both unicode strings and ordinary strings?  if not,
> can anyone tell me what I'm doing wrong?)
> 

In C++, yes. :) Autoconverting into or out of Unicode is bound to lead to
trouble for someone. Look at the various messes that misused C++ operator
overloading can get you into. That holds whether it's the code that wasn't
expecting UTF-8 in a normal string type, or a formatting operation that used
to return a normal string type and now returns a Unicode string.

> > Throwing an exception would at the very least force the user to
> > make a decision one way or the other about what they want to do
> > with the data. They might want to do a codepage translation, or
> > something else. (aka Hey, here's a bug I just found for you!)
> 
> > In what other cases are you suddenly returning a Unicode string
> > object from an operation which previously returned a string object?
> 
> if unicode is ever to be a real string type in python, and not just a
> nifty extension type, it must be okay to return a unicode string from
> any operation that involves a unicode argument...

Err. I'm not sure what you're getting at here. If you're saying that it'd be
nice if we could ditch the current string type and just use the Unicode
string type, then I agree with you. However, that doesn't mean you should
change the semantics of an operation that existed before Unicode came into
the picture, since it would break backward compatibility.

+1 for '%s' % u'\u1234' throwing a TypeError exception.
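
A rough sketch of what that would look like from the caller's side, again
assuming Python 3 str/bytes as stand-ins for the two string types.
format_8bit is a hypothetical helper written for illustration, not an
existing or proposed function.

```python
def format_8bit(template: bytes, value) -> bytes:
    """Hypothetical helper: format into an 8-bit template, refusing Unicode."""
    if isinstance(value, str):
        # Force the caller to decide how the Unicode data should be encoded.
        raise TypeError(
            "cannot format a Unicode string into an 8-bit string; "
            "encode it explicitly first"
        )
    return template % value


# Passing Unicode straight in is rejected instead of silently converted.
try:
    format_8bit(b"...%s...", "\u1234")
    outcome = "no error"
except TypeError:
    outcome = "TypeError"

# The caller makes the decision and encodes explicitly.
encoded = format_8bit(b"...%s...", "\u1234".encode("utf-8"))
```

The exception is the point: the user finds out about the mixed types at
the formatting call, not three functions later.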

Bill