harmful str(bytes)

Fri Oct 8 09:45:58 EDT 2010

Antoine Pitrou writes:
>Hallvard B Furuseth <h.b.furuseth at usit.uio.no> wrote:
>> The offender is bytes.__str__: str(b'foo') == "b'foo'".
>> It's often not clear from looking at a piece of code whether
>> some data is treated as strings or bytes, particularly when
>> translating from old code.  Which means one cannot see from
>> context if str(s) or "%s" % s will produce garbage.
>
> This probably comes from overuse of str(s) and "%s". They can be useful
> to produce human-readable messages, but you shouldn't have to use them
> very often.

Maybe Python 3 has something better, but they could be hard to avoid in
Python 2.  And certainly our site has plenty of code using them, whether
we should have avoided them or not.

>> I really wish bytes.__str__ would at least by default fail.
>
> Actually, the implicit contract of __str__ is that it never fails, so
> that everything can be printed out (for debugging purposes, etc.).

Nope:

$ python2 -c 'str(u"\u1000")'
Traceback (most recent call last):
  File "<string>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u1000' in position 0: ordinal not in range(128)

And the equivalent in the other direction:

$ python2 -c 'unicode("\xA0")'
Traceback (most recent call last):
  File "<string>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

In Python 2, these two Unicode errors made our data safe from code
which used str and unicode objects without checking too carefully which
was which.  Code which did not sort the types out carefully enough
would fail.

In Python 3, that safety only exists for bytes(str), not str(bytes).
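
To make the asymmetry concrete, here is a minimal Python 3 sketch.  The
variable names are mine and purely illustrative, and it assumes the
interpreter is run without -b or -bb, which warn or raise on str(bytes):

```python
# Python 3 sketch; all names here are illustrative only.
data = b"foo"

# str(bytes) does not fail: it returns the repr, prefix and quotes included.
as_text = str(data)
formatted = "%s" % data

# bytes(str) without an encoding does fail, much like the Python 2
# conversions shown above.
try:
    bytes("foo")
    bytes_failed = False
except TypeError:
    bytes_failed = True

# Naming the encoding explicitly works in both directions.
decoded = data.decode("ascii")
encoded = "foo".encode("ascii")

print(repr(as_text), repr(formatted), bytes_failed)
```

So a stray str(s) on bytes silently yields "b'foo'" where Python 2 code
would have produced "foo" or raised.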

-- 
Hallvard