harmful str(bytes)

Fri Oct 8 16:53:28 EDT 2010

On Fri, 08 Oct 2010 15:45:58 +0200
Hallvard B Furuseth <h.b.furuseth at usit.uio.no> wrote:
> Antoine Pitrou writes:
> >Hallvard B Furuseth <h.b.furuseth at usit.uio.no> wrote:
> >> The offender is bytes.__str__: str(b'foo') == "b'foo'".
> >> It's often not clear from looking at a piece of code whether
> >> some data is treated as strings or bytes, particularly when
> >> translating from old code.  Which means one cannot see from
> >> context if str(s) or "%s" % s will produce garbage.
> >
> > This probably comes from overuse of str(s) and "%s". They can be useful
> > to produce human-readable messages, but you shouldn't have to use them
> > very often.
> 
> Maybe Python 3 has something better, but they could be hard to avoid in
> Python 2.  And certainly our site has plenty of code using them, whether
> we should have avoided them or not.

It's difficult to give a more precise answer without knowing exactly
what you're doing. But if you already have str objects, you don't need
to call str() on them or format them using "%s", so implicit __str__
calls are avoided.
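
To illustrate, here is a minimal Python 3 sketch. The helper name
to_text and the choice of UTF-8 are assumptions made for the example,
not something from the code under discussion. It decodes bytes
explicitly where str() or "%s" would otherwise produce the
"b'...'" representation:

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return value as str, decoding bytes explicitly."""
    if isinstance(value, (bytes, bytearray)):
        # Explicit decode instead of str(value), which would give "b'...'".
        return bytes(value).decode(encoding)
    return value if isinstance(value, str) else str(value)

name = "caf\u00e9"
raw = name.encode("utf-8")

# Already-str data needs neither str() nor "%s".
plain = "hello " + name

# Formatting bytes with "%s" calls bytes.__str__ implicitly.
implicit = "hello %s" % raw

# Decoding first gives the intended text.
explicit = "hello %s" % to_text(raw)
```

Running Python 3 with the -b option makes such implicit str(bytes)
calls emit a BytesWarning (-bb turns it into an error), which can help
locate them in translated code.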

> > Actually, the implicit contract of __str__ is that it never fails, so
> > that everything can be printed out (for debugging purposes, etc.).
> 
> Nope:
> 
> $ python2 -c 'str(u"\u1000")'
> Traceback (most recent call last):
[...]
> $ python2 -c 'unicode("\xA0")'
> Traceback (most recent call last):

Sure, but so what? This mainly shows that Unicode support was broken in
Python 2, because:
1) it tried to do implicit bytes<->unicode coercion using a
   process-wide default encoding
2) some unicode objects didn't have a successful str()

Python 3 fixes both these issues. Fixing 1) means there's no automatic
coercion when trying to mix bytes and unicode. Try for example:

[Python 2] >>> u"a" + "b"
u'ab'
[Python 3] >>> "a" + b"b"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Can't convert 'bytes' object to str implicitly

And fixing 2) means bytes objects get a meaningful str() in all
circumstances, which is much better for debug output.
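
A small Python 3 sketch of both points, using made-up variable names.
The exact wording of the TypeError message differs between Python 3
versions, so the example only checks the exception type:

```python
data = b"\xa0"  # a non-ASCII byte, as in the unicode("\xA0") example

# Point 2): str() on bytes never fails; it returns the repr-style form.
shown = str(data)

# The same holds when the bytes end up in a debug or log message.
log_line = "received %s" % (data,)

# Point 1): mixing str and bytes is refused instead of silently coerced.
try:
    "a" + data
except TypeError as exc:
    mixing_error = exc
else:
    mixing_error = None
```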

If you don't think that 2) is important, then perhaps you don't deal
with non-ASCII data a lot. Failure to print out exception messages (or
log entries, etc.) containing non-ASCII characters is a big annoyance
with Python 2 for many people (including me).

> In Python 2, these two UnicodeEncodeErrors made our data safe from code
> which used str and unicode objects without checking too carefully which
> was which.

That's false, since implicit coercion can actually happen everywhere.
And it only fails when there's non-ASCII data involved, meaning the
unsuspecting Anglo-Saxon developer doesn't understand why their users
complain.
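
The failure mode can be emulated in Python 3. This is only a sketch:
py2_style_concat is a hypothetical function that imitates Python 2's
implicit coercion by decoding with an assumed "ascii" default encoding.
It is not how Python 2 is implemented.

```python
def py2_style_concat(text, raw, default_encoding="ascii"):
    """Hypothetical emulation of Python 2's u"..." + "..." coercion."""
    # Python 2 decoded the byte string with the process-wide default
    # encoding before concatenating; "ascii" is assumed here.
    return text + raw.decode(default_encoding)

# ASCII-only data: the coercion goes unnoticed by the developer.
ascii_result = py2_style_concat("name: ", b"Smith")

# Non-ASCII data from a user: the same code path now raises.
failure = None
try:
    py2_style_concat("name: ", "M\u00fcller".encode("utf-8"))
except UnicodeDecodeError as exc:
    failure = exc
```

So the error depends on the data rather than on the code, which is why
it tends to show up only once real users are involved.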

Regards

Antoine.