Python 3.5, bytes, and %-interpolation (aka PEP 461)

Mon Feb 24 18:55:36 EST 2014

On Tue, 25 Feb 2014 00:18:53 +0200, Marko Rauhamaa wrote:

> random832 at fastmail.us:
> 
>> On Mon, Feb 24, 2014, at 15:46, Marko Rauhamaa wrote:
>>> That is:
>>> 
>>>  1. inefficient (encode/decode shuffle)
>>> 
>>>  2. unnatural (strings usually have no place in protocols)
>>
>> That's not at all clear. Why _aren't_ these protocols considered text
>> protocols? Why can't you add a string directly to headers?

You cannot mix text strings and byte strings in Python 3. Python 2 allows 
you to do so, and it leads to hard-to-diagnose bugs and confusing 
behaviour. This is why Python 3 insists on a strict separation between 
the two.
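
A minimal sketch of that separation, assuming any Python 3 interpreter
(the variable names are purely illustrative):

```python
# Python 3 refuses to guess an encoding when text and bytes meet.
header = b'Host: '              # byte string
hostname = 'icap.example.org'   # text (Unicode) string

try:
    line = header + hostname    # bytes + str is not allowed
except TypeError as err:
    mixed_error = err
else:
    mixed_error = None

# Encoding explicitly states the intent and makes the operation legal.
line = header + hostname.encode('ascii')
```

The failure happens at the point where the two types meet, rather than
somewhere downstream when non-ASCII data finally turns up.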

But of course you can add *byte* strings directly to byte headers. Just 
prefix your strings with a b, as in b'Header' instead of 'Header', and it 
will work fine.
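
For instance, a rough sketch (the header name and value here are just
example data, not taken from any particular program):

```python
# Every operand is a bytes literal, so plain concatenation works.
name = b'Host'
value = b'icap.example.org'

header_line = name + b': ' + value + b'\r\n'
```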

However, you don't really want to be adding large numbers of byte strings
together, since each concatenation builds a new bytes object and copies
the data. Better to use % interpolation to insert them all at once. Hence
the push to add % to bytes in Python 3.
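
A sketch of what that looks like, assuming an interpreter in which
PEP 461 is implemented (Python 3.5 or later); the template and values are
made up for illustration:

```python
# Requires bytes %-interpolation (PEP 461, Python 3.5+).
template = b'%s: %s\r\n'
host_line = template % (b'Host', b'icap.example.org')

# %d accepts an int and inserts its ASCII digits.
encap_line = (b'Encapsulated: req-hdr=%d, res-hdr=%d, res-body=%d\r\n'
              % (0, 137, 296))
```

Note that %s here expects bytes-like operands; passing a text string
raises TypeError, so the separation between text and bytes is preserved.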

Marko replied:
> Text expresses a written human language. In prosaic terms, a Python
> string is a sequence of ISO 10646 characters, whose codepoints are not
> octets.

Almost correct, but not quite. Python strings are Unicode, not ISO-10646. 
The two are not the same.

http://www.unicode.org/faq/unicode_iso.html

> Most network protocols are defined in terms of octets, although many of
> them can carry textual, audio or video payloads (among others). So when
> RFC 3507 (ICAP) shows an example starting:
> 
>    RESPMOD icap://icap.example.org/satisf ICAP/1.0
>    Host: icap.example.org
>    Encapsulated: req-hdr=0, res-hdr=137, res-body=296
> 
> it consists of 8-bit octets and not some human language.

Not really relevant. In practical terms, whether they are implemented as
octets or not, the sequence "Host" *is* human language: specifically, it
is the English word Host that just happens to be encoded in ASCII.
Likewise the sequence "Encapsulated" *is* the English word Encapsulated
encoded in ASCII.

> In practical terms, you get the bytes off the socket as, well, bytes. It
> makes little sense to "decode" those bytes into a string for
> manipulation. Manipulating bytes directly is both more efficient and
> more natural from the point of view of the standard.

But not necessarily more natural from the point of view of the
programmer, which is what matters.

I agree that if you don't need to interpret the data as Unicode text, 
then there's no real benefit to decoding to text. (In fact, if your data 
can contain arbitrary bytes, you may not be able to decode to text, since 
not all byte sequences are legal UTF-8.)
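
A small sketch of that failure mode; the payload below is an arbitrary,
made-up byte sequence:

```python
# 0xFF can never appear in well-formed UTF-8, so decoding must fail.
raw = b'\xff\xfe\x00\x01'

try:
    text = raw.decode('utf-8')
except UnicodeDecodeError as err:
    decode_error = err
else:
    decode_error = None
```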

> Many internet protocols happen to look like text. It makes it nicer for
> human network programmers to work with them. However, they are primarily
> meant for computers, and the message formats are really a form of binary
> code.

The subject header line in emails, say, starts with the word "Subject"
rather than some arbitrary binary code because it is intended to be
human-readable. Not just human-readable, but *semantically meaningful*.
That's why the subject line is labelled "Subject" rather than "Field 23"
or "SJT".

Fortunately, such headers are usually (always?) ASCII, and byte strings 
in Python privilege ASCII-encoded text. When you write b'Subject', you 
get the same sequence of bytes as 'Subject'.encode('ascii').
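
This is easy to check at the interactive prompt; a quick sketch, with
illustrative variable names:

```python
literal = b'Subject'
encoded = 'Subject'.encode('ascii')

same = (literal == encoded)   # identical byte sequences

# Indexing a bytes object gives an integer: the ASCII code of 'S'.
first = literal[0]
```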

-- 
Steven