harmful str(bytes)

Fri Oct 8 09:31:27 EDT 2010

Arnaud Delobelle writes:
>Hallvard B Furuseth <h.b.furuseth at usit.uio.no> writes:
>> I've been playing a bit with Python3.2a2, and frankly its charset
>> handling looks _less_ safe than in Python 2.
>> (...)
>> With 2.<late> conversion Unicode <-> string the equivalent operation did
>> not silently produce garbage: it raised UnicodeError instead.  With old
>> raw Python strings that was not a problem in applications which did not
>> need to convert any charsets, with python3 they can break.
>>
>> I really wish bytes.__str__ would at least by default fail.
>
> I think you misunderstand the purpose of str().  It is to provide a
> (unicode) string representation of an object and has nothing to do with
> converting it to unicode:

That's not the point - the point is that for 2.* code which _uses_ str
vs unicode, the equivalent 3.* code uses str vs bytes.  Yet not the
same way - a 2.* 'str' will sometimes be 3.* bytes, sometimes str.  So
upgraded old code will have to expect both str and bytes.

In 2.*, str<->unicode conversion failed or produced the equivalent
character/byte data.  Yes, there could be charset problems if the
defaults were set up wrong, but that's a smaller problem than in 3.*.
In 3.*, the bytes->str conversion via str() always _silently_ produces
garbage: the repr of the bytes object, such as "b'abc'", not the text.
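
To make the difference concrete, here is a minimal sketch for Python 3,
assuming the interpreter runs without the -b option.  The sample value
and the variable names are only illustrations.

```python
# Sample data: the UTF-8 encoding of "café" (illustrative value only).
data = "caf\u00e9".encode("utf-8")

# str() on a bytes object returns its repr; it does not decode anything
# and it does not raise.
as_str = str(data)

# Explicit decoding returns the text when the charset is right...
decoded = data.decode("utf-8")

# ...and raises when it is wrong, which is roughly what 2.* did for
# implicit str <-> unicode conversion of non-ASCII data.
try:
    data.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(repr(as_str), repr(decoded), ascii_failed)
```

CPython's -b option makes str(bytes) issue a BytesWarning, and -bb
turns it into an error, but neither is the default.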

And lots of code uses both, and needs to convert back and forth.  In
particular 3.* code converted from 2.*, or using modules converted
from 2.*.  There's a lot of such code, and there will be for a long time.
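
One way such code could protect itself is to route every conversion
through a helper that decodes explicitly and fails loudly.  This is
only a sketch: to_text() is a hypothetical name, and UTF-8 as the
default encoding is an assumption, not something Python provides.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding bytes explicitly (hypothetical helper)."""
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray)):
        # Raises UnicodeDecodeError on bad data instead of returning a repr.
        return bytes(value).decode(encoding)
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


print(to_text("abc"), to_text(b"abc"))
```

Unlike str(), this raises for bytes that do not decode and for objects
that are neither str nor bytes.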

-- 
Hallvard