[Python-checkins] peps: PEP 467: descope dramatically based on Guido's feedback

Thu Apr 3 14:33:47 CEST 2014

http://hg.python.org/peps/rev/435fa0278b73
changeset:   5452:435fa0278b73
user:        Nick Coghlan <ncoghlan at gmail.com>
date:        Thu Apr 03 22:33:36 2014 +1000
summary:
  PEP 467: descope dramatically based on Guido's feedback

files:
  pep-0467.txt |  303 ++++++++++++--------------------------
  1 files changed, 95 insertions(+), 208 deletions(-)

diff --git a/pep-0467.txt b/pep-0467.txt
--- a/pep-0467.txt
+++ b/pep-0467.txt
@@ -22,28 +22,35 @@
 
 This PEP proposes a number of small adjustments to the APIs of the ``bytes``
 and ``bytearray`` types to make their behaviour more internally consistent
-and to make it easier to operate entirely in the binary domain.
+and to make it easier to operate entirely in the binary domain, as well as
+changes to their documentation to make it easier to grasp their dual roles
+as containers of "arbitrary binary data" and "binary data with ASCII
+compatible segments".
 
 
 Background
 ==========
 
-Over the course of Python 3's evolution, a number of adjustments have been
-made to the core ``bytes`` and ``bytearray`` types as additional practical
-experience was gained with using them in code beyond the Python 3 standard
-library and test suite. However, to date, these changes have been made
-on a relatively ad hoc tactical basis as specific issues were identified,
-rather than as part of a systematic review of the APIs of these types. This
-approach has allowed inconsistencies to creep into the API design as to which
-input types are accepted by different methods. Additional inconsistencies
-linger from an earlier pre-release design where there was *no* separate
+To simplify the task of writing the Python 3 documentation, the ``bytes``
+and ``bytearray`` types were documented primarily in terms of the way they
+differed from the Unicode based Python 3 ``str`` type. Even when I
+`heavily revised the sequence documentation
+<http://hg.python.org/cpython/rev/463f52d20314>`__ in 2012, I retained that
+simplifying shortcut.
+
+However, it turns out that this approach to the documentation of these types
+has a problem: it doesn't adequately introduce users to their hybrid nature,
+where they can be manipulated *either* as a "sequence of integers" type,
+*or* as ``str``-like types that assume ASCII compatible data.
+
+In addition to the documentation issues, there are some lingering design
+quirks from an earlier pre-release design where there was *no* separate
 ``bytearray`` type, and instead the core ``bytes`` type was mutable (with
-no immutable counterpart), as well as from the origins of these types in
-the text-like behaviour of the Python 2 ``str`` type.
+no immutable counterpart).
 
-This PEP aims to provide the missing systematic review, with the goal of
-ensuring that wherever feasible (given backwards compatibility constraints)
-these current inconsistencies are addressed for the Python 3.5 release.
+Finally, additional experience with using the existing Python 3 binary
+sequence types in real world applications has suggested it would be
+beneficial to make it easier to convert integers to length 1 bytes objects.
 
 
 Proposals
@@ -55,10 +62,13 @@
 factors:
 
 * removing remnants of the original design of ``bytes`` as a mutable type
-* more consistently accepting length 1 ``bytes`` objects as input where an
-  integer between ``0`` and ``255`` inclusive is expected, and vice-versa
-* allowing users to easily convert integer output to a length 1 ``bytes``
+* allowing users to easily convert integer values to a length 1 ``bytes``
   object
+* consistently applying the following analogies to the type API designs
+  and documentation:
+
+  * ``bytes``: tuple of integers, with additional str-like methods
+  * ``bytearray``: list of integers, with additional str-like methods
 
 
 Alternate Constructors
@@ -83,95 +93,69 @@
     b'\x00\x00\x00'
 
 This PEP proposes that the current handling of integers in the bytes and
-bytearray constructors by deprecated in Python 3.5 and removed in Python
-3.6, being replaced by two more type appropriate alternate constructors
-provided as class methods. The initial python-ideas thread [ideas-thread1]_
-that spawned this PEP was specifically aimed at deprecating this constructor
-behaviour.
+bytearray constructors be deprecated in Python 3.5 and targeted for
+removal in Python 3.7, being replaced by two more explicit alternate
+constructors provided as class methods. The initial python-ideas thread
+[ideas-thread1]_ that spawned this PEP was specifically aimed at deprecating
+this constructor behaviour.
 
-For ``bytes``, a ``byte`` constructor is proposed that converts integers
-(as indicated by ``operator.index``) in the appropriate range to a ``bytes``
-object, converts objects that support the buffer API to bytes, and also
-passes through length 1 byte strings unchanged::
+Firstly, a ``byte`` constructor is proposed that converts integers
+in the range 0 to 255 (inclusive) to a ``bytes`` object::
 
     >>> bytes.byte(3)
     b'\x03'
-    >>> bytes.byte(bytearray(bytes([3])))
-    b'\x03'
-    >>> bytes.byte(memoryview(bytes([3])))
-    b'\x03'
-    >>> bytes.byte(bytes([3]))
-    b'\x03'
+    >>> bytearray.byte(3)
+    bytearray(b'\x03')
     >>> bytes.byte(512)
     Traceback (most recent call last):
       File "<stdin>", line 1, in <module>
     ValueError: bytes must be in range(0, 256)
-    >>> bytes.byte(b"ab")
-    Traceback (most recent call last):
-      File "<stdin>", line 1, in <module>
-    TypeError: bytes.byte() expected a byte, but buffer of length 2 found
 
 One specific use case for this alternate constructor is to easily convert
 the result of indexing operations on ``bytes`` and other binary sequences
 from an integer to a ``bytes`` object. The documentation for this API
 should note that its counterpart for the reverse conversion is ``ord()``.
+The ``ord()`` documentation will also be updated to note that while
+``chr()`` is the counterpart for ``str`` input, ``bytes.byte`` and
+``bytearray.byte`` are the counterparts for binary input.
 
-For ``bytearray``, a ``from_len`` constructor is proposed that preallocates
-the buffer filled with a particular value (default to ``0``) as a direct
+Secondly, a ``zeros`` constructor is proposed that serves as a direct
 replacement for the current constructor behaviour, rather than having to use
 sequence repetition to achieve the same effect in a less intuitive way::
 
-    >>> bytearray.from_len(3)
+    >>> bytes.zeros(3)
+    b'\x00\x00\x00'
+    >>> bytearray.zeros(3)
     bytearray(b'\x00\x00\x00')
-    >>> bytearray.from_len(3, 6)
-    bytearray(b'\x06\x06\x06')
 
-This part of the proposal was covered by an existing issue
-[empty-buffer-issue]_ and a variety of names have been proposed
-(``empty_buffer``, ``zeros``, ``zeroes``, ``allnull``, ``fill``). The
-specific name currently proposed was chosen by analogy with
-``dict.fromkeys()`` and ``itertools.chain.from_iter()`` to be completely
-explicit that it is an alternate constructor rather than an in-place
-mutation, as well as how it differs from the standard constructor.
+The chosen name here is taken from the corresponding initialisation function
+in NumPy (although, as these are sequence types rather than N-dimensional
+matrices, the constructors take a length as input rather than a shape tuple).
 
-
-Open questions
-^^^^^^^^^^^^^^
-
-* Should ``bytearray.byte()`` also be added? Or is
-  ``bytearray(bytes.byte(x))`` sufficient for that case?
-* Should ``bytes.from_len()`` also be added? Or is sequence repetition
-  sufficient for that case?
-* Should ``bytearray.from_len()`` use a different name?
-* Should ``bytes.byte()`` raise ``TypeError`` or ``ValueError`` for binary
-  sequences with more than one element? The ``TypeError`` currently proposed
-  is copied (with slightly improved wording) from the behaviour of ``ord()``
-  with sequences containing more than one code point, while ``ValueError``
-  would be more consistent with the existing handling of out-of-range
-  integer values.
-* ``bytes.byte()`` is defined above as accepting length 1 binary sequences
-  as individual bytes, but this is currently inconsistent with the main
-  ``bytes`` constructor::
-
-      >>> bytes([b"a", b"b", b"c"])
-      Traceback (most recent call last):
-        File "<stdin>", line 1, in <module>
-      TypeError: 'bytes' object cannot be interpreted as an integer
-
-  Should the ``bytes`` constructor be changed to accept iterables of length 1
-  bytes objects in addition to iterables of integers? If so, should it
-  allow a mixture of the two in a single iterable?
+While ``bytes.byte`` and ``bytearray.zeros`` are expected to be the more
+useful duo amongst the new constructors, ``bytes.zeros`` and
+``bytearray.byte`` are provided in order to maintain API consistency between
+the two types.
 
 
 Iteration
 ---------
 
-Iteration over ``bytes`` objects and other binary sequences produces
-integers. Rather than proposing a new method that would need to be added
-not only to ``bytes``, ``bytearray`` and ``memoryview``, but potentially
-to third party types as well, this PEP proposes that iteration to produce
-length 1 ``bytes`` objects instead be handled by combining ``map`` with
-the new ``bytes.byte()`` alternate constructor proposed above::
+While iteration over ``bytes`` objects and other binary sequences produces
+integers, it is sometimes desirable to iterate over length 1 bytes objects
+instead.
+
+To handle this situation more obviously (and more efficiently) than would be
+the case with the ``map(bytes.byte, data)`` construct enabled by the above
+constructor changes, this PEP proposes the addition of a new ``iterbytes``
+method to ``bytes``, ``bytearray`` and ``memoryview``::
+
+    for x in data.iterbytes():
+        # x is a length 1 ``bytes`` object, rather than an integer
+
+Third party types and arbitrary containers of integers that lack the new
+method can still be handled by combining ``map`` with the new
+``bytes.byte()`` alternate constructor proposed above::
 
     for x in map(bytes.byte, data):
         # x is a length 1 ``bytes`` object, rather than an integer
@@ -179,139 +163,42 @@
         # 0 to 255 inclusive
 
 
-Consistent support for different input types
---------------------------------------------
+Open questions
+^^^^^^^^^^^^^^
 
-The ``bytes`` and ``bytearray`` methods inspired by the Python 2 ``str``
-type generally expect to operate on binary subsequences: other objects
-implementing the buffer API. By contrast, the mutating APIs added to
-the ``bytearray`` interface expect to operate on individual elements:
-integer in the range 0 to 255 (inclusive).
+* The fallback case above suggests that this could perhaps be better handled
+  as an ``iterbytes(data)`` *builtin*, that used ``data.__iterbytes__()``
+  if defined, but otherwise fell back to ``map(bytes.byte, data)``::
 
-In Python 3.3, the binary search operations (``in``, ``count()``,
-``find()``, ``index()``, ``rfind()`` and ``rindex()``) were updated to
-accept integers in the range 0 to 255 (inclusive) as their first argument,
-in addition to the existing support for binary subsequences.
+    for x in iterbytes(data):
+        # x is a length 1 ``bytes`` object, rather than an integer
+        # This works with *any* container of integers in the range
+        # 0 to 255 inclusive
 
-This results in behaviour like the following in Python 3.3+::
 
-    >>> data = bytes([1, 2, 3, 4])
-    >>> 3 in data
-    True
-    >>> b"\x03" in data
-    True
-    >>> data.count(3)
-    1
-    >>> data.count(b"\x03")
-    1
+Documentation clarifications
+----------------------------
 
-    >>> data.replace(3, 4)
-    Traceback (most recent call last):
-      File "<stdin>", line 1, in <module>
-    TypeError: expected bytes, bytearray or buffer compatible object
-    >>> data.replace(b"\x03", b"\x04")
-    b'\x01\x02\x04\x04'
+In an attempt to clarify the `documentation
+<https://docs.python.org/dev/library/stdtypes.html#binary-sequence-types-bytes-bytearray-memoryview>`__
+of the ``bytes`` and ``bytearray`` types, the following changes are
+proposed:
 
-    >>> mutable = bytearray(data)
-    >>> mutable
-    bytearray(b'\x01\x02\x03\x04')
-    >>> mutable.append(b"\x05")
-    Traceback (most recent call last):
-      File "<stdin>", line 1, in <module>
-    TypeError: an integer is required
-    >>> mutable.append(5)
-    >>> mutable
-    bytearray(b'\x01\x02\x03\x04\x05')
+* the documentation of the *sequence* behaviour of each type is moved to the
+  section for that individual type. These sections will be updated to
+  explicitly make the ``tuple of integers`` and ``list of integers``
+  analogies, as well as to make it clear that these parts of the API work
+  with arbitrary binary data.
+* the current "Bytes and bytearray operations" section will be updated to
+  "Handling binary data with ASCII compatible segments", and will explicitly
+  list *all* of the methods that are included.
+* clarify that due to their origins in the API of the immutable ``str``
+  type, even the ``bytearray`` versions of these methods do *not* operate
+  in place, but instead create a new object.
 
-
-This PEP proposes extending the behaviour of accepting integers as being
-equivalent to the corresponding length 1 binary sequence to several other
-``bytes`` and ``bytearray`` methods that currently expect a ``bytes``
-object for certain parameters. In essence, if a value is an acceptable
-input to the new ``bytes.byte`` constructor defined above, then it would
-be acceptable in the roles defined here (in addition to any other already
-supported inputs):
-
-* ``startswith()`` prefix(es)
-* ``endswith()`` suffix(es)
-
-* ``center()`` fill character
-* ``ljust()`` fill character
-* ``rjust()`` fill character
-
-* ``strip()`` character to strip
-* ``lstrip()`` character to strip
-* ``rstrip()`` character to strip
-
-* ``partition()`` separator argument
-* ``rpartition()`` separator argument
-
-* ``split()`` separator argument
-* ``rsplit()`` separator argument
-
-* ``replace()`` old value and new value
-
-In addition to the consistency motive, this approach also makes it easier
-to work with the indexing behaviour , as the result of an indexing operation
-can more easily be fed back in to other methods.
-
-For ``bytearray``, some additional changes are proposed to the current
-integer based operations to ensure they remain consistent with the proposed
-constructor changes::
-
-* ``append()``: updated to be consistent with ``bytes.byte()``
-* ``remove()``: updated to be consistent with ``bytes.byte()``
-* ``+=``: updated to be consistent with ``bytes()`` changes (if any)
-* ``extend()``: updated to be consistent with ``bytes()`` changes (if any)
-
-The general principle behind these changes is to restore the flexible
-"element-or-subsequence" behaviour seen in the ``str`` API, even though
-Python 3 actually represents subsequences and individual elements as
-distinct types in the binary domain.
-
-
-Acknowledgement of surprising behaviour of some ``bytearray`` methods
----------------------------------------------------------------------
-
-Several of the ``bytes`` and ``bytearray`` methods have their origins in the
-Python 2 ``str`` API. As ``str`` is an immutable type, all of these
-operations are defined as returning a *new* instance, rather than operating
-in place. This contrasts with methods on other mutable types like ``list``,
-where ``list.sort()`` and ``list.reverse()`` operate in-place and return
-``None``, rather than creating a new object.
-
-Backwards compatibility constraints make it impractical to change this
-behaviour at this point, but it may be appropriate to explicitly call out
-this quirk in the documentation for the ``bytearray`` type. It affects the
-following methods that could reasonably be expected to operate in-place on
-a mutable type:
-
-* ``center()``
-* ``ljust()``
-* ``rjust()``
-* ``strip()``
-* ``lstrip()``
-* ``rstrip()``
-* ``replace()``
-* ``lower()``
-* ``upper()``
-* ``swapcase()``
-* ``title()``
-* ``capitalize()``
-* ``translate()``
-* ``expandtabs()``
-* ``zfill()``
-
-Note that the following ``bytearray`` operations *do* operate in place, as
-they're part of the mutable sequence API in ``bytearray``, rather than being
-inspired by the immutable Python 2 ``str`` API:
-
-* ``+=``
-* ``append()``
-* ``extend()``
-* ``reverse()``
-* ``remove()``
-* ``pop()``
+A patch for at least this part of the proposal will be prepared before
+submitting the PEP for approval, as writing out these docs completely may
+suggest additional opportunities for API consistency improvements.
 
 
 References

-- 
Repository URL: http://hg.python.org/peps
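
The semantics proposed in the revised text can be tried on an unpatched
interpreter with a rough sketch like the one below. It is not part of the
changeset. The names ``byte``, ``zeros`` and ``iterbytes`` are hypothetical
module-level stand-ins for the proposed ``bytes.byte``, ``bytes.zeros`` and
``iterbytes`` APIs, and they rely only on what the existing ``bytes``
constructor already does.

```python
import operator


def byte(value):
    """Hypothetical stand-in for the proposed bytes.byte().

    Converts an integer in range(0, 256) to a length 1 bytes object.
    """
    # operator.index() rejects non-integers with TypeError;
    # bytes([n]) rejects out-of-range integers with ValueError.
    return bytes([operator.index(value)])


def zeros(length):
    """Hypothetical stand-in for the proposed bytes.zeros().

    Uses the integer-argument constructor behaviour that the PEP
    proposes to deprecate in favour of an explicit alternate constructor.
    """
    return bytes(operator.index(length))


def iterbytes(data):
    """Hypothetical stand-in for the proposed iterbytes() method.

    Yields length 1 bytes objects instead of integers. This is the
    map-based fallback form described in the PEP text.
    """
    return map(byte, data)


if __name__ == "__main__":
    print(byte(3))
    print(zeros(3))
    print(list(iterbytes(b"abc")))
```

The out-of-range ``ValueError`` here comes straight from the existing
``bytes`` constructor, so this sketch adds no range checking of its own.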