harmful str(bytes)
Arnaud Delobelle
arnodel at gmail.com
Thu Oct 7 17:58:42 EDT 2010
Hallvard B Furuseth <h.b.furuseth at usit.uio.no> writes:
> I've been playing a bit with Python3.2a2, and frankly its charset
> handling looks _less_ safe than in Python 2.
>
> The offender is bytes.__str__: str(b'foo') == "b'foo'".
> It's often not clear from looking at a piece of code whether
> some data is treated as strings or bytes, particularly when
> translating from old code. Which means one cannot see from
> context if str(s) or "%s" % s will produce garbage.
>
> With 2.<late> conversion Unicode <-> string the equivalent operation did
> not silently produce garbage: it raised UnicodeError instead. With old
> raw Python strings that was not a problem in applications which did not
> need to convert any charsets, with python3 they can break.
>
> I really wish bytes.__str__ would at least by default fail.
I think you misunderstand the purpose of str(). It provides a
(unicode) string representation of an object; it has nothing to do with
decoding bytes into unicode:
>>> b = b"\xc2\xa3"
>>> str(b)
"b'\\xc2\\xa3'"
If you want to *decode* a bytes string, use its decode method and you
get a unicode string (provided the bytes are valid in the given encoding):
>>> b = b"\xc2\xa3"
>>> b.decode('utf8')
'£'
>>> b.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
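For code that may receive either type, the distinction between str() and decode() can be folded into a small helper. This is only a sketch: to_text is a hypothetical name, not a standard library function, and the UTF-8 default is an assumption.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str; to_text is a hypothetical helper, not a stdlib API."""
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray)):
        # Equivalent to value.decode(encoding); raises UnicodeDecodeError
        # if the bytes are not valid in that encoding.
        return str(value, encoding)
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


data = b"\xc2\xa3"  # UTF-8 encoding of the pound sign

as_repr = str(data)      # the representation, same as repr(data)
as_text = to_text(data)  # the decoded one-character string
```

Passing an encoding to str() makes it decode rather than produce the representation, so a failure shows up as an exception instead of a stray b'...' in the output.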
If you want to *encode* a (unicode) string, use its encode method and you
get a bytes string (provided your string can be encoded using the given
encoding):
>>> s = "€"
>>> s.encode('utf8')
b'\xe2\x82\xac'
>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in position 0: ordinal not in range(128)
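As a further sketch of the encoding side, the errors argument of encode controls what happens when a character cannot be represented. The variable names here are illustrative only.

```python
s = "\u20ac"  # the euro sign

# Encoding and decoding with the same codec gives back the original string.
encoded = s.encode("utf-8")
round_trip = encoded.decode("utf-8")

# With the default errors="strict", s.encode("ascii") raises
# UnicodeEncodeError. Other handlers substitute something instead.
replaced = s.encode("ascii", errors="replace")          # question mark
escaped = s.encode("ascii", errors="backslashreplace")  # Python escape sequence
```

The strict default is usually the safer choice, since the other handlers lose or alter data without raising.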
--
Arnaud