[Python-ideas] duck typing for io write methods

Fri Jun 14 02:14:59 CEST 2013

From: Wolfgang Maier <wolfgang.maier at biologie.uni-freiburg.de>

Sent: Thursday, June 13, 2013 7:41 AM

> Nick Coghlan <ncoghlan at ...> writes:
> 
>> 
>>  > Oscar Benjamin <oscar.j.benjamin <at> ...> writes:
>>  >>
>>  >> On 13 June 2013 13:24, Nick Coghlan <ncoghlan <at> ...> wrote:
>>  >> > If your type is acceptable input to operator.index(), you'll get the
>>  >> > "initialised array of bytes" behaviour
>>  >>
>>  >> I only recently discovered this. What was the rationale for that change?
>>  >>
>>  >> $ py -2.7 -c 'print(repr(bytes(4)))'
>>  >> '4'
>>  >>
>>  >> $ py -3.3 -c 'print(repr(bytes(4)))'
>>  >> b'\x00\x00\x00\x00'
>>  >>
>>  >> I can't really see why anyone would want the latter behaviour (when
>>  >> you can already do b'\x00' * 4).
>>  >>
>>  >> Oscar
>>  >>
>>  >
>>  > It's funny you mention that difference since that was how I came across my
>>  > issue. I was looking for a way to get back the Python 2.7 behaviour
>>  > bytes('1234')
>>  > '1234'
>> 
>>  You mean other than using the bytes literal b'1234' instead of a
>>  string literal? Bytes and text are different things in Python 3,
>>  whereas the 2.x "bytes" was just an alias for "str".
>> 
> 
> Well, I was illustrating the case with a literal integer, but, of course, I
> was thinking of cases with references:
> a=1234
> str(a).encode() # gives b'1234' in Python3, but converting your int to str
> first, just to encode it again to bytes seems weird

Conceptually, it makes perfect sense. b'1234' isn't a string holding the canonical numeral representation of 1234; it's a sequence of bytes that happens to be one particular (unspecified) encoding of such a string.
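
To make that concrete, here is a small sketch (Python 3; the variable names are only for illustration). The same numeral gives different bytes depending on which encoding you pick, which is why going through str first isn't really redundant:

```python
a = 1234

# Step 1: the canonical numeral, as text.
text = str(a)

# Step 2: choose an encoding explicitly; each one yields different bytes.
as_ascii = text.encode('ascii')
as_utf16 = text.encode('utf-16-le')

print(as_ascii)   # b'1234'
print(as_utf16)   # b'1\x002\x003\x004\x00'
```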

The docs (http://docs.python.org/3.3/library/functions.html#bytes) explicitly say a bytes object:

> is an immutable sequence of integers in the range 0 <= x < 256. bytes is an immutable version of bytearray

Practically, you often want to use bytes as "ASCII strings", and you often can get away with it. It works for literals, some but not all methods, and of course everything that strings inherit from sequences (concatenation, slicing, etc.). 

But often you can't get away with it. It doesn't work for formatting, anything strings do differently from sequences (notably indexing), some methods, most functions that special-case on strings, type-checking (there's no basestring in 3.x), etc.
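
A rough illustration of where the "ASCII string" view holds and where it breaks. This thread is about 3.3; %-formatting for bytes was added later (3.5, PEP 461), so the sketch sticks to behaviour that is the same across 3.x:

```python
data = b'spam'

# Sequence behaviour carries over: concatenation and slicing give bytes.
joined = data + b' eggs'      # b'spam eggs'
head = data[:1]               # b's'

# Indexing does not: it gives an int, where str would give a 1-char str.
first = data[0]               # 115
first_str = 'spam'[0]         # 's'

# Some str methods exist on bytes, others do not.
upper = data.upper()                    # b'SPAM'
has_format = hasattr(bytes, 'format')   # False

# bytes is not a str, and there is no shared base class to check against.
is_text = isinstance(data, str)         # False
```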

Likewise, the bytes() constructor doesn't work quite like str(), and there's no bytes equivalent of repr().
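
For example, here is roughly how the constructor differs from str(). The brepr helper at the end is hypothetical, not anything in the stdlib; it is just one way you could build such a thing yourself if you assume ASCII output is what you want:

```python
# An int argument means "this many zero bytes", not "the numeral".
zeros = bytes(4)                      # b'\x00\x00\x00\x00'

# An iterable of ints gives those byte values.
values = bytes([49, 50, 51, 52])      # b'1234'

# A str needs an explicit encoding; without one, bytes() refuses.
try:
    bytes('1234')
except TypeError:
    needs_encoding = True
else:
    needs_encoding = False

encoded = bytes('1234', 'ascii')      # b'1234'


def brepr(obj):
    """Hypothetical helper: an ASCII-only repr, returned as bytes."""
    # ascii() is like repr() but escapes non-ASCII characters,
    # so the encode step cannot fail.
    return ascii(obj).encode('ascii')
```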

Obviously, there's a tradeoff behind all of those decisions. It wouldn't have been hard to put bytes.__mod__, bytes.format, basestring, etc. into Python 3, or to make b'a'[0] return b'a' instead of 97, or to make bytes(x) work more like str(x), or to add a brepr or similar function, etc. But it would make bytes less useful as a sequence of 8-bit integers. And, more importantly, it would be an attractive nuisance, making a lot of common errors more common (as they were in 2.x). As the docs (http://docs.python.org/3.3/library/stdtypes.html#bytes) put it:

> This is done deliberately to emphasise that while many binary formats include ASCII based elements and can be usefully manipulated with some text-oriented algorithms, this is not generally the case for arbitrary binary data (blindly applying text processing algorithms to binary data formats that are not ASCII compatible will usually lead to data corruption).
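
A small sketch of the kind of corruption the docs are warning about, assuming UTF-16-LE data (my choice of example, not theirs): splitting raw bytes on the ASCII newline cuts a character in half.

```python
# 'Ċ' is U+010A; in UTF-16-LE its bytes are 0x0A 0x01,
# and 0x0A is also the ASCII newline.
text = 'a\u010ab'
data = text.encode('utf-16-le')       # b'a\x00\n\x01b\x00'

# A text-style "split into lines" on the raw bytes cuts a character in half.
pieces = data.split(b'\n')            # [b'a\x00', b'\x01b\x00']

try:
    pieces[1].decode('utf-16-le')
    corrupted = False
except UnicodeDecodeError:
    # The second piece has an odd number of bytes, so it cannot be decoded.
    corrupted = True

# Decoding first and then splitting the text leaves the data intact.
lines = data.decode('utf-16-le').splitlines()
```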

Anyway, why do you actually want a bytes here? Maybe there's a better design for what you're trying to do that would make this whole issue irrelevant to your code.