harmful str(bytes)

Thu Oct 7 17:58:42 EDT 2010

Hallvard B Furuseth <h.b.furuseth at usit.uio.no> writes:

> I've been playing a bit with Python3.2a2, and frankly its charset
> handling looks _less_ safe than in Python 2.
>
> The offender is bytes.__str__: str(b'foo') == "b'foo'".
> It's often not clear from looking at a piece of code whether
> some data is treated as strings or bytes, particularly when
> translating from old code.  Which means one cannot see from
> context if str(s) or "%s" % s will produce garbage.
>
> With 2.<late> conversion Unicode <-> string the equivalent operation did
> not silently produce garbage: it raised UnicodeError instead.  With old
> raw Python strings that was not a problem in applications which did not
> need to convert any charsets, with python3 they can break.
>
> I really wish bytes.__str__ would at least by default fail.

I think you misunderstand the purpose of str().  Its job is to provide a
(unicode) string representation of an object; it has nothing to do with
decoding bytes into unicode:

>>> b = b"\xc2\xa3"
>>> str(b)
"b'\\xc2\\xa3'"
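
If you would rather have str() on bytes fail than return the repr, the
interpreter's -b option makes it issue a BytesWarning, and -bb turns that
warning into an error.  A minimal sketch, assuming you can launch a child
interpreter via subprocess (the variable name result is just for
illustration):

```python
import subprocess
import sys

# Start a child interpreter with -bb, so that str() on a bytes object
# raises BytesWarning instead of silently returning "b'foo'".
result = subprocess.run(
    [sys.executable, "-bb", "-c", "print(str(b'foo'))"],
    capture_output=True,
    text=True,
)

# The child dies with an uncaught exception, so the exit status is non-zero
# and the traceback on stderr names BytesWarning.
print(result.returncode)
print(result.stderr.strip().splitlines()[-1])
```

This only affects code run under that flag, so it is probably more useful
as a check while porting than as a default.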

If you want to *decode* a bytes string, use its decode method and you
get a unicode string (provided the bytes are valid in the given encoding):

>>> b = b"\xc2\xa3"
>>> b.decode('utf8')
'£'
>>> b.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
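
To make the contrast with one-argument str() concrete, here is a small
sketch.  It relies on the two-argument form str(data, encoding), which
decodes just like data.decode(encoding), and on the errors argument of
decode.  The helper decode_or_none is a hypothetical name of mine, not
something from the standard library:

```python
raw = b"\xc2\xa3"

# Two-argument str() decodes; one-argument str() only gives the repr.
decoded = str(raw, "utf-8")
same_as_method = raw.decode("utf-8")


def decode_or_none(data, encoding="utf-8"):
    """Hypothetical helper: return decoded text, or None if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


ok = decode_or_none(raw)              # the pound sign, '\xa3'
failed = decode_or_none(raw, "ascii") # None, since 0xc2 is not ASCII

# errors="replace" substitutes U+FFFD for each undecodable byte
# instead of raising.
lossy = raw.decode("ascii", errors="replace")
```

Whether you want the exception, None, or a lossy replacement depends on
the application; the point is that the choice is explicit.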

If you want to *encode* a (unicode) string, use its encode method and you
get a bytes string (provided your string can be encoded using the given
encoding):

>>> s = "€"
>>> s.encode('utf8')
b'\xe2\x82\xac'
>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in position 0: ordinal not in range(128)
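
If raising is not what you want, encode also takes an errors argument.
A short sketch using the same euro sign as s above (the variable names
are mine):

```python
text = "\u20ac"  # the euro sign

# Encoding and then decoding with the same codec gives back the original.
encoded = text.encode("utf-8")
round_trip = encoded.decode("utf-8")

# The errors argument picks a fallback instead of raising UnicodeEncodeError.
replaced = text.encode("ascii", errors="replace")           # b'?'
escaped = text.encode("ascii", errors="backslashreplace")   # b'\\u20ac'
xml_ref = text.encode("ascii", errors="xmlcharrefreplace")  # b'&#8364;'
```

Note that "replace" loses information, while the other two handlers keep
the code point recoverable in an escaped form.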

-- 
Arnaud