"Decoding unicode is not supported" in unusual situation

Fri Mar 9 19:57:50 EST 2012

On Fri, 09 Mar 2012 10:11:58 -0800, John Nagle wrote:

> On 3/8/2012 2:58 PM, Prasad, Ramit wrote:
>>>      Right. The real problem is that Python 2.7 doesn't have distinct
>>> "str" and "bytes" types.  type(bytes() returns<type 'str'> "str" is
>>> assumed to be ASCII 0..127, but that's not enforced. "bytes" and "str"
>>> should have been distinct types, but that would have broken much old
>>> code.  If they were distinct, then constructors could distinguish
>>> between string type conversion (which requires no encoding
>>> information) and byte stream decoding.
>>>
>>>      So it's possible to get junk characters in a "str", and they
>>> won't convert to Unicode.  I've had this happen with databases which
>>> were supposed to be ASCII, but occasionally a non-ASCII character
>>> would slip through.
>>
>> bytes and str are just aliases for each other.
> 
>     That's true in Python 2.7, but not in 3.x.  From 2.6 forward,
> "bytes" and "str" were slowly being separated.  See PEP 358. Some of the
> problems in Python 2.7 come from this ambiguity. Logically, "unicode" of
> "str" should be a simple type conversion from ASCII to Unicode, while
> "unicode" of "bytes" should require an encoding.  But because of the
> bytes/str ambiguity in Python 2.6/2.7, the behavior couldn't be
> type-based.

This demonstrates a gross confusion about both Unicode and Python. John, 
I honestly don't mean to be rude here, but if you actually believe that 
(rather than merely expressing yourself poorly), then it seems to me that 
you are desperately misinformed about Unicode and are working on the 
basis of some serious misapprehensions about the nature of strings.

I recommend you start with this:

http://www.joelonsoftware.com/articles/Unicode.html

In Python 2.6/2.7, there is no ambiguity between str/bytes. The two names 
are aliases for each other. The older name, "str", is a misnomer, since 
it *actually* refers to bytes (and always has, all the way back to the 
earliest days of Python). At best, it could be read as "byte string" or 
"8-bit string", but the emphasis should always be on the *bytes*.

str is NOT "assumed to be ASCII 0..127", and it never has been. Python's 
str prior to version 3.0 has *always* been bytes; it just never used that 
name. For example, in Python 2.4, help(chr) explicitly documents support 
for characters with ordinals 0 through 255:

Help on built-in function chr in module __builtin__:

chr(...)
    chr(i) -> character

    Return a string of one character with ordinal i; 0 <= i < 256.
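
Here is a rough sketch of the same point that should run on 2.6, 2.7, or 
3.x. The variable names are mine and purely illustrative. It checks 
whether bytes and str are the same object on the running interpreter, 
builds a byte string containing every value from 0 to 255, and counts how 
many of those bytes would survive an ASCII decode:

```python
import sys

on_py2 = sys.version_info[0] == 2

# True on Python 2.6/2.7 (bytes is just another name for str),
# False on Python 3 (the two are separate types).
is_alias = bytes is str

# A byte string holding every possible byte value, 0 through 255.
# Going through bytearray keeps this line valid on both 2.6+ and 3.x.
all_bytes = bytes(bytearray(range(256)))

# Count how many single bytes survive an ASCII decode.
ascii_ok = 0
for value in range(256):
    try:
        bytes(bytearray([value])).decode('ascii')
        ascii_ok += 1
    except UnicodeDecodeError:
        pass

print("python2: %s, alias: %s, bytes held: %d, valid ASCII: %d"
      % (on_py2, is_alias, len(all_bytes), ascii_ok))
```

On either version the last two numbers should come out as 256 and 128: 
the byte string happily holds all 256 values, and only half of them are 
ASCII.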

I can go all the way back to Python 0.9, which was so primitive it didn't 
even accept "" as string delimiters, and the str type was still based on 
bytes, with explicit support for non-ASCII values:

steve@runes:~/Downloads/python-0.9.1$ ./python0.9.1 
>>> print 'This is *not* ASCII \xCA see the non-ASCII byte.'
This is *not* ASCII � see the non-ASCII byte.

Any conversion from bytes (including Python 2 strings) to Unicode is 
ALWAYS a decoding operation. It can't possibly be anything else. If you 
think that it can be, you don't understand the relationship between 
strings, Unicode and bytes.
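
To make that concrete, here is a minimal sketch reusing the byte string 
from the 0.9.1 session above. The two non-ASCII encodings are picked 
only for illustration; any pair that disagrees about byte 0xCA would do. 
The same bytes become different text depending on which encoding you 
name, and the ASCII codec refuses them outright:

```python
raw = b'This is *not* ASCII \xCA see the non-ASCII byte.'

# Same bytes, two different encodings, two different pieces of text.
as_latin1 = raw.decode('latin-1')   # byte 0xCA becomes U+00CA
as_cp1251 = raw.decode('cp1251')    # byte 0xCA becomes U+041A

# ASCII only covers 0..127, so it cannot decode byte 0xCA at all.
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(as_latin1 != as_cp1251, ascii_failed)
```

There is no encoding-free way to do this step. Calling unicode() on a 
Python 2 str without naming an encoding just means "decode using the 
default codec", which is normally ASCII.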

-- 
Steven