[Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

Tue Feb 14 19:36:26 CET 2006

On Feb 14, 2006, at 11:25 AM, Phillip J. Eby wrote:
> At 11:08 AM 2/14/2006 -0500, James Y Knight wrote:
>> I like it, it makes sense. Unicode strings are simply not allowed as
>> arguments to the bytes constructor. Thinking about it, why would it be
>> otherwise? And if you're mixing str-strings and unicode-strings, that
>> means the str-strings you're sometimes giving are actually not byte
>> strings, but character strings anyhow, so you should be encoding
>> those too. bytes(s_or_U.encode('utf-8')) is a perfectly good  
>> spelling.
> Actually, I think you mean:
>
>     if isinstance(s_or_U, str):
>         s_or_U = s_or_U.decode('utf-8')
>
>     b = bytes(s_or_U.encode('utf-8'))
>
> Or maybe:
>
>     if isinstance(s_or_U, unicode):
>         s_or_U = s_or_U.encode('utf-8')
>
>     b = bytes(s_or_U)
>
> Which is why I proposed that the boilerplate logic get moved *into*  
> the bytes constructor.  I think this use case is going to be common  
> in today's Python, but in truth I'm not as sure what bytes() will  
> get used *for* in today's Python.  I'm probably overprojecting  
> based on the need to use str objects now, but bytes aren't going to  
> be a replacement for str for a good while anyway.

I most certainly *did not* mean that. If you are mixing together str
and unicode instances, the str instances _must be_ in the default
encoding (ascii). Otherwise, you are bound for failure anyhow, e.g.
''.join(['\x95', u'1']) raises UnicodeDecodeError. str is used for
two things right now: 1) a byte string, 2) a unicode string restricted
to 7-bit ASCII. These two uses are separate and you cannot mix them
without causing disaster.

You've created an interface which can take either a UTF-8 byte string
or a unicode character string. But that's wrong and can only cause
problems. It should take either an encoded byte string or a unicode
character string, not both. If it takes a unicode character string,
there are two ways of spelling that in current Python: a "str" object
with only ASCII in it, or a "unicode" object with arbitrary
characters in it. bytes(s_or_U.encode('utf-8')) works correctly with
both.
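
To make that concrete, here is a minimal sketch of what
s_or_U.encode('utf-8') does with each spelling. Assumptions, since none
of this is in the proposal: encode_text is a made-up helper name, the
code is written for an interpreter where the byte-string type is spelled
"bytes" (so bytes stands in for today's str), and the explicit
decode('ascii') stands in for the implicit default-encoding conversion
that str.encode performs today.

```python
def encode_text(s_or_u):
    """Hypothetical helper: encode a character string to UTF-8 bytes."""
    if isinstance(s_or_u, bytes):
        # A byte string only counts as text if it is pure ASCII; decoding
        # with the ascii codec raises UnicodeDecodeError otherwise.
        s_or_u = s_or_u.decode('ascii')
    return s_or_u.encode('utf-8')


ascii_text = encode_text(b'abc')         # ASCII-only byte string: accepted
unicode_text = encode_text(u'caf\xe9')   # unicode string: accepted

try:
    encode_text(b'\x95')                 # raw non-ASCII bytes: not text
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

The third call fails up front rather than letting bytes in an unknown
encoding through, which is the point of the argument above.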

James