[Python-Dev] PEP 460: allowing %d and %f and mojibake

Sun Jan 12 02:01:01 CET 2014

Hi,

2014/1/11 Antoine Pitrou <solipsis at pitrou.net>:
>> b'x=%s' % 10 is well defined, it's pure bytes.
>
> It is well-defined? Then please explain me what the general case of
>   b'%s' % x
> is supposed to call:
>
> - does it call x.__bytes__? int.__bytes__ doesn't exist
> - does it call bytes(x)? bytes(10) gives
>   b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
> - does it call x.__str__? you've reintroduced the Python 2 behaviour of
>   conflating bytes and unicode

I don't want bytes % args to call any method; only the Py_buffer API
would be used. So the pseudo-code becomes:

- try to get Py_buffer
- on failure, check if it's an int: yes? ok, format it as decimal
- otherwise, raise an error

Or:

- is the object an int? yes, format it as decimal. no, use Py_buffer
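
The second variant can be sketched in Python 3 as below. This is only an
illustration, not the proposed implementation: format_arg is a
hypothetical helper name, and memoryview() stands in for the C-level
Py_buffer lookup. Since int does not export a buffer, both orderings of
the checks give the same result for integers.

```python
def format_arg(value):
    """Convert one argument of a hypothetical bytes % args to bytes."""
    if isinstance(value, int):
        # Integers are formatted as ASCII decimal digits.
        return ('%d' % value).encode('ascii')
    try:
        # memoryview() only accepts objects exporting the buffer protocol,
        # which is the Python-level view of "try to get a Py_buffer".
        return bytes(memoryview(value))
    except TypeError:
        raise TypeError('%s does not support the buffer protocol'
                        % type(value).__name__)
```

With this sketch, b'x=' + format_arg(10) gives b'x=10', and passing a
str raises TypeError because no method such as __str__ is ever called.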

--

I talked with Antoine to try to understand how and why we disagree.

Antoine prefers a pure API, whereas I'm trying to figure out if it
would be possible to write code compatible with Python 2 and Python 3.

With Antoine's PEP, it's possible to write code that works on both
Python 2 and Python 3, as long as it only manipulates byte strings.

The problem is that it's a pain to write code that works on both Python
versions when an argument is an integer. For example, the Python 2
code "Content-Length: %s\r\n" % 123 has to be written ("Content-Length:
%s\r\n" % 123).encode('ascii') in Python 3. So the Python 2 and Python 3
code differ.

Supporting integer formatting would make it possible to write
b"Content-Length: %s\r\n" % 123, which would work on both Python 2 and
Python 3.

(u'Content-Length: %s\r\n' % 123).encode('ascii') works on both Python
versions, but porting existing Python 2 code to this form may require
more work.
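
For illustration, the portable form can be wrapped in a small helper.
The function name content_length_header is made up for this example, and
the u'' prefix assumes Python 2 or Python 3.3 and later.

```python
def content_length_header(length):
    """Build a Content-Length header line as bytes on Python 2 and 3."""
    # Format as text first, then encode: the result is pure ASCII.
    return (u'Content-Length: %s\r\n' % length).encode('ascii')
```

Here content_length_header(123) returns b'Content-Length: 123\r\n'. The
cost is that every formatting site in existing Python 2 code has to be
rewritten to go through text and an explicit encode.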

--

Now I'm trying to find use cases in the Mercurial and Twisted source
code to see which features are required. First, I'm looking for a
function that needs to format a number in decimal into a byte string.

In issue #3982, I saw:

"""
'HTTP chunking' uses ASCII mixed with binary (octets). With 2.6 you could write:

def chunk(block):
    return b'{0:x}\r\n{1}\r\n'.format(len(block), block)
"""

and

"""
'Content-length: {}\r\n'.format(length)
"""

But are these examples real use cases, or artificial ones?
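
For comparison, here is a sketch of the chunk example written for
Python 3 without any bytes formatting: the length is formatted as text
and encoded, then concatenated with the payload. It keeps the chunk name
from the issue; the rest is my own rewrite, not code from the issue.

```python
def chunk(block):
    """Frame one block of bytes as an HTTP chunk (hex size, CRLF, data, CRLF)."""
    # The chunk size is written in hexadecimal, as ASCII digits.
    size = format(len(block), 'x').encode('ascii')
    return size + b'\r\n' + block + b'\r\n'
```

For instance, chunk(b'hello') returns b'5\r\nhello\r\n'. It works, but
the structure of the message is harder to see than in the format string.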

--

Augie Fackler gave an example from Mercurial:
"""
sys.stdout.write('%(state)s %(path)s\n' % {'state': 'M', 'path':
'some/filesystem/path'})

except we don't know the encoding of the filesystem path (Hi unix!) so
we have to treat the whole thing as opaque bytes.  It's even more fun
for 'log', because then it's got localized strings in it as well.
"""

But here I disagree with the design of Mercurial: filenames should be
treated as text. If a filename were pure binary, you should not write
it to a terminal. Displaying binary data usually leads to displaying
random characters and changing terminal options (e.g. text starts
blinking or is displayed in bold!?) :-)
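
As a sketch of what "treated as text" can look like in Python 3, the
surrogateescape error handler (PEP 383) decodes a filename of unknown
encoding to str without losing the undecodable bytes. The helper names
below are hypothetical, and UTF-8 is only an assumed encoding.

```python
def filename_to_text(raw, encoding='utf-8'):
    """Decode a filename; undecodable bytes become lone surrogates."""
    return raw.decode(encoding, 'surrogateescape')


def text_to_filename(text, encoding='utf-8'):
    """Encode back to the exact original bytes."""
    return text.encode(encoding, 'surrogateescape')


def printable(text):
    """Escape anything that is not ASCII before writing to a terminal."""
    return text.encode('ascii', 'backslashreplace').decode('ascii')


raw = b'some/filesystem/path\xff'
path = filename_to_text(raw)
line = '%(state)s %(path)s\n' % {'state': 'M', 'path': printable(path)}
```

The round trip text_to_filename(path) == raw holds, so the filename can
still be passed back to the OS, while the status line itself stays text.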

As for the localized strings: again, it's a design issue in my
opinion. A localized string is text, not binary data :-)

--

Another possibility is that I cannot find use cases because there are
none for PEP 460, and the PEP is useless :-)

Victor