harmful str(bytes)

Mon Oct 11 15:50:32 EDT 2010

Antoine Pitrou writes:
>Hallvard B Furuseth <h.b.furuseth at usit.uio.no> wrote:
>>Antoine Pitrou writes:
>>>Hallvard B Furuseth <h.b.furuseth at usit.uio.no> wrote:
>>>> The offender is bytes.__str__: str(b'foo') == "b'foo'".
>>>> It's often not clear from looking at a piece of code whether
>>>> some data is treated as strings or bytes, particularly when
>>>> translating from old code.  Which means one cannot see from
>>>> context if str(s) or "%s" % s will produce garbage.
>>>
>>> This probably comes from overuse of str(s) and "%s". They can be useful
>>> to produce human-readable messages, but you shouldn't have to use them
>>> very often.
>> 
>> Maybe Python 3 has something better, but they could be hard to avoid in
>> Python 2.  And certainly our site has plenty of code using them, whether
>> we should have avoided them or not.
>
> It's difficult to answer more precisely without knowing what you're
> doing precisely.

I'd just posted an example in article <hbf.20101008cg74 at bombur.uio.no>:

urllib.parse.urlunparse(('', '', '/foo', b'bar', '', '')) returns
"/foo;b'bar'" instead of raising an exception or returning 2.6's correct
"/foo;bar".

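The underlying behaviour can be reproduced without urllib. A minimal sketch, assuming Python 3 semantics and an interpreter started without the -b option; the variable names are illustrative only, and ASCII is an assumed encoding:

```python
# Python 3: "%s" and str() never decode bytes, they fall back to repr().
path, params = '/foo', b'bar'

formatted = "%s;%s" % (path, params)
print(formatted)        # /foo;b'bar'
print(str(params))      # b'bar'

# Decoding explicitly gives the result the caller presumably wanted.
# ASCII is an assumption; use whatever encoding the data really has.
decoded = "%s;%s" % (path, params.decode('ascii'))
print(decoded)          # /foo;bar
```

No exception is raised at any point, so the caller gets no hint that a bytes value slipped through.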
> But if you already have str objects, you don't have to
> call str() or format them using "%s", so implicit __str__ calls are
> avoided.

Except it's quite normal to output strings with %s.  Above, a library
did it for me.  It is also common to rely on str.__str__() being a
no-op, so that one can call str() just in case some variable needs to
be converted to a plain string.  urllib.parse is an example of that too.
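
One way to keep the "call str() just in case" idiom without the silent repr() is a small wrapper. This is only a sketch: to_text is a hypothetical helper, not something from the standard library or from urllib.parse, and the choice to raise TypeError is an assumption about what a caller would want.

```python
def to_text(value, encoding=None):
    """Return value as str, but never repr() bytes behind the caller's back.

    Hypothetical helper for illustration; not part of any library.
    """
    if isinstance(value, (bytes, bytearray)):
        if encoding is None:
            # Fail loudly instead of producing "b'...'".
            raise TypeError(
                "refusing implicit str() of %s" % type(value).__name__)
        return value.decode(encoding)
    # For str this is a no-op; other objects use their own __str__.
    return str(value)


print(to_text('bar'))             # bar
print(to_text(42))                # 42
print(to_text(b'bar', 'ascii'))   # bar
try:
    to_text(b'bar')
except TypeError as exc:
    print("rejected:", exc)
```

Code that really wants the repr of a bytes object can still call repr() directly.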

>>> Actually, the implicit contract of __str__ is that it never fails, so
>>> that everything can be printed out (for debugging purposes, etc.).
>> 
>> Nope:
>> 
>> $ python2 -c 'str(u"\u1000")'
>> Traceback (most recent call last):
> [...]
>> $ python2 -c 'unicode("\xA0")'
>> Traceback (most recent call last):
>
> Sure, but so what?

So your statement above, which you made in response to my suggested
solution, was wrong.

> This mainly shows that unicode support was broken in
> Python 2, because:

...because Python 2 was designed so that there was no way to avoid poor
unicode support one way or another.  Python 3 has not fixed this; it has
just moved the problem elsewhere.

> 1) it tried to do implicit bytes<->unicode coercion by using some
> process-wide default encoding

I had completely forgotten that.  I've been lucky (with my sysadmins
maybe:-) and lived with ASCII default encoding.  Checking around, I now
see that Python 2's site.py used my locale for the encoding, as if that
had any relevance for my data...

> 2) some unicode objects didn't have a successful str()
>
> Python 3 fixes both these issues. Fixing 1) means there's no automatic
> coercion when trying to mix bytes and unicode.

Fine, so programs will have to do it themselves...

> (...)
> And fixing 2) means bytes objects get a meaningful str() in all
> circumstances, which is much better for debug output.

Except str() on such data has a different meaning than it did before, so
equivalent programs *silently* produce different results.  Which is why
I started this thread.
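
For what it is worth, the Python 3 interpreter has a -b option that warns when bytes are converted with str() without an encoding, and -bb turns that warning into an error. The sketch below starts a child interpreter with -bb to show the effect; run_strict is a hypothetical helper name, and the approach only helps for code paths that actually get exercised under that flag.

```python
import subprocess
import sys


def run_strict(code):
    """Run code in a child interpreter started with -bb (hypothetical helper)."""
    return subprocess.run(
        [sys.executable, "-bb", "-c", code],
        capture_output=True,
        text=True,
    )


result = run_strict("print(str(b'foo'))")
# A non-zero exit status: the conversion raised instead of printing b'foo'.
print(result.returncode)
# The last line of the traceback names the BytesWarning.
print(result.stderr.strip().splitlines()[-1])
```

Without the flag the same command prints b'foo' and exits normally, which is the silent change in behaviour described above.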

> If you don't think that 2) is important, then perhaps you don't deal
> with non-ASCII data a lot. Failure to print out exception messages (or
> log entries, etc.) containing non-ASCII characters is a big annoyance
> with Python 2 for many people (including me).

I'm Norwegian.  I do deal with non-ASCII and I agree failures in error
messages are annoying.

OTOH if the same bug that previously caused an error within an error
message instead quietly munges my data, that's worse than annoying.
I've dealt with that too, and the fix is to use another tool.
(Ironically, in one case it meant moving from Perl to Python, and now
Python has followed Perl...)

>> In Python 2, these two UnicodeEncodeErrors made our data safe from code
>> which used str and unicode objects without checking too carefully which
>> was which.
>
> That's false, since implicit coercion can actually happen everywhere.

Right, it was true as long as my encoding was ASCII.

> And it only fails when there's non-ASCII data involved, meaning the
> unsuspecting Anglo-saxon developer doesn't understand why his/her users
> complain.

-- 
Hallvard