"Decoding unicode is not supported" in unusual situation

Sun Mar 11 01:12:49 EST 2012

On 3/9/2012 4:57 PM, Steven D'Aprano wrote:
> On Fri, 09 Mar 2012 10:11:58 -0800, John Nagle wrote:
> This demonstrates a gross confusion about both Unicode and Python. John,
> I honestly don't mean to be rude here, but if you actually believe that
> (rather than merely expressing yourself poorly), then it seems to me that
> you are desperately misinformed about Unicode and are working on the
> basis of some serious misapprehensions about the nature of strings.
>
> In Python 2.6/2.7, there is no ambiguity between str/bytes. The two names
> are aliases for each other. The older name, "str", is a misnomer, since
> it *actually* refers to bytes (and always has, all the way back to the
> earliest days of Python). At best, it could be read as "byte string" or
> "8-bit string", but the emphasis should always be on the *bytes*.

    There's an inherent ambiguity in that "bytes" and "str" are really
the same type in Python 2.6/2.7.  That's a hack for backwards
compatibility, and it goes away in 3.x.  The notes for PEP 358
admit this.

    Allowing

	unicode(s)

with no encoding, on type "str", carries an implicit
assumption that s is ASCII.  Arguably, "unicode()" should
have required an encoding in all cases.
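
A rough sketch of that implicit assumption, written for Python 3 so
it can be run today.  The helper name py2_style_unicode is made up
for illustration, and it assumes the stock Python 2 default encoding
of ASCII; it is not the actual implementation of unicode().

```python
def py2_style_unicode(s, encoding=None):
    """Hypothetical stand-in for Python 2's unicode(s) on a byte string.

    Assumption: the default encoding is ASCII, as in a stock Python 2
    install.
    """
    if encoding is None:
        encoding = "ascii"  # the implicit assumption discussed above
    return s.decode(encoding)


# Pure ASCII bytes decode without anyone naming an encoding.
print(py2_style_unicode(b"hello"))

# UTF-8 bytes for "cafe" with an accented e: the ASCII assumption fails.
try:
    py2_style_unicode(b"caf\xc3\xa9")
except UnicodeDecodeError as exc:
    print("implicit ASCII decode failed:", exc.reason)

# Naming the encoding removes the guesswork.
print(ascii(py2_style_unicode(b"caf\xc3\xa9", "utf-8")))
```

The failure only shows up once non-ASCII data arrives, which is why
requiring the encoding up front would arguably have been safer.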

Or "str" and "bytes" should have been made separate types in
Python 2.7, in which case unicode() of a str would be a safe
ASCII to Unicode translation, and unicode() of a bytes object
would require an encoding.  But that would break too much old code.
So we have an ambiguity and a hack.
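
For contrast, a small sketch of how 3.x behaves once the two types
really are separate.  The helper join_text is hypothetical and only
there to show the explicit decode step.

```python
def join_text(raw, text, encoding):
    """Hypothetical helper: decode bytes explicitly, then concatenate."""
    return raw.decode(encoding) + text


# In Python 3, bytes and str are distinct types, not aliases.
print(bytes is str)

# Mixing them is refused instead of being coerced through ASCII.
try:
    b"abc" + "def"
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)

# The caller has to say which encoding the bytes are in.
print(join_text(b"abc", "def", "ascii"))
```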

"While Python 2 also has a unicode string type, the fundamental 
ambiguity of the core string type, coupled with Python 2's default 
behavior of supporting automatic coercion from 8-bit strings to unicode 
objects when the two are combined, often leads to UnicodeErrors"
- PEP 404

				John Nagle