[Python-Dev] Adding bytes.frombuffer() constructor to PEP 467 (was: [Python-ideas] Adding bytes.frombuffer() constructor

Fri Oct 21 02:48:53 EDT 2016

On 19 October 2016 at 01:28, Chris Barker - NOAA Federal
<chris.barker at noaa.gov> wrote:
>>    >>> def get_builtin_methods():
>>    ...     return [(name, method_name)
>>    ...             for name, obj in get_builtin_types().items()
>>    ...             for method_name, method in vars(obj).items()
>>    ...             if not method_name.startswith("__")]
>>    ...
>>    >>> len(get_builtin_methods())
>>    230
>
> So what? No one looks in all the methods of builtins at once.

Yes, Python implementation developers do, which is why it's a useful
part of defining the overall "size" of Python and how that is growing
over time.
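
For anyone who wants to reproduce that kind of count, here is a
minimal self-contained sketch. The quoted snippet does not show
get_builtin_types(), so the definition below is an assumption (every
public, non-exception class in the builtins module), and the number it
prints will vary by Python version and need not match the 230 above.

```python
import builtins


def get_builtin_types():
    # Assumption: "builtin types" means public classes in the builtins
    # module, excluding the exception hierarchy.
    return {
        name: obj
        for name, obj in vars(builtins).items()
        if isinstance(obj, type)
        and not name.startswith("_")
        and not issubclass(obj, BaseException)
    }


def get_builtin_methods():
    # Every non-dunder attribute defined directly on each builtin type.
    return [
        (name, method_name)
        for name, obj in get_builtin_types().items()
        for method_name, method in vars(obj).items()
        if not method_name.startswith("__")
    ]


print(len(get_builtin_methods()))
```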

When we define a new standard library module (particularly pure Python
ones) rather than new methods on builtin types, we create
substantially less additional work for other implementations, and we
make it easier for educators to decide whether or not they should be
introducing their students to the new capabilities.

That latter aspect is important, as providing functionality as
separate modules means we also gain an enhanced ability to explain
"What is this *for*?", which is something we regularly struggle with
when making changes to the core language to better support relatively
advanced domain specific use cases (see
http://learning-python.com/books/python-changes-2014-plus.html for one
generalist author's perspective on the vast gulf that can arise
between "What professional programmers want" and "What's relevant to
new programmers").

> If we
> have anything like an OO System (and python builtins only sort of
> do...) then folks look for a built in that they need, and only then
> look at its methods.
>
> If you need to work with bytes, you'll look at the bytes object and
> bytearray object. Having to go find some helper function module to know
> to efficiently do something with bytes is VERY non-discoverable!

Which is more comprehensible and discoverable, dict.setdefault(), or
collections.defaultdict()?

Micro-optimisations like dict.setdefault() typically don't make sense
in isolation - they only make sense in the context of a particular
pattern of thought. Now, one approach to such patterns is to say "We
just need to do a better job of teaching people to recognise and use
the pattern!". This approach tends not to work very well - you're
often better off extracting the entire pattern out to a higher level
construct, giving that construct a name, and teaching that, and
letting people worry about how it works internally later.
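
To make the comparison concrete, here is a small sketch of the same
grouping task written both ways. The word list and variable names are
made up for illustration.

```python
from collections import defaultdict

words = ["apple", "avocado", "banana", "blueberry", "cherry"]

# The micro-optimisation: the reader has to know that setdefault()
# returns the existing list, or inserts and returns the new empty one.
by_letter_setdefault = {}
for word in words:
    by_letter_setdefault.setdefault(word[0], []).append(word)

# The named higher level construct: "a dict whose missing values are lists".
by_letter_defaultdict = defaultdict(list)
for word in words:
    by_letter_defaultdict[word[0]].append(word)

# Both approaches build the same mapping.
assert by_letter_setdefault == dict(by_letter_defaultdict)
```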

(For a slightly different example, consider the rationale for adding
the `secrets` module, even though it's mostly just a collection of
relatively thin wrappers around `os.urandom()`.)
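
As a rough illustration of that point (a sketch only, and not a
description of how `secrets` is implemented internally), the two calls
below hand back the same kind of data, but the second one says what the
bytes are *for*:

```python
import os
import secrets

# Low-level primitive: 16 random bytes from the operating system.
raw = os.urandom(16)

# Purpose-named helpers: the same kind of data, but the module and
# function names tell the reader this is security-sensitive material.
token = secrets.token_bytes(16)
hex_token = secrets.token_hex(16)  # 16 random bytes as 32 hex digits

print(len(raw), len(token), len(hex_token))
```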

> bytes and bytearray are already low-level objects -- adding low-level
> functionality to them makes perfect sense.

They're not really that low level. They're *relatively* low level
(especially for Python), but they're still a long way away from the
kind of raw control over memory layout that a language like C or Rust
can give you.

> And no, this is not just for asyncio at all -- it's potentially useful
> for any byte manipulation.

Yes, which is why I think the end goal should be a public `iobuffers`
module in the standard library. Doing IO buffer manipulation
efficiently is a complex topic, but it's also one where there are:

- many repeatable patterns for managing IO buffers efficiently that
aren't necessarily applicable to manipulating arbitrary binary data
(ring buffers, ropes, etc)
- many operating system level utilities available to make it even more
efficient that we currently don't use (since we only have general
purpose "bytes" and "bytearray" objects with no "iobuffer" specific
abstraction that could take advantage of those use case specific
features)
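
As one example of the first kind of pattern, here is a minimal ring
buffer sketch built on a plain bytearray. The RingBuffer class and its
method names are hypothetical; no `iobuffers` module exists, and a real
design would need to cover far more (blocking, views instead of
copies, OS level support).

```python
class RingBuffer:
    """Fixed-capacity byte ring buffer (hypothetical illustration)."""

    def __init__(self, capacity):
        # Assumes capacity > 0.
        self._data = bytearray(capacity)
        self._start = 0  # index of the oldest unread byte
        self._size = 0   # number of unread bytes

    def __len__(self):
        return self._size

    def write(self, chunk):
        """Store as much of chunk as fits; return the number of bytes stored."""
        capacity = len(self._data)
        count = min(len(chunk), capacity - self._size)
        end = (self._start + self._size) % capacity
        first = min(count, capacity - end)  # bytes that fit before wrapping
        self._data[end:end + first] = chunk[:first]
        self._data[0:count - first] = chunk[first:count]
        self._size += count
        return count

    def read(self, n):
        """Remove and return up to n of the oldest bytes."""
        capacity = len(self._data)
        count = min(n, self._size)
        first = min(count, capacity - self._start)
        out = bytes(self._data[self._start:self._start + first])
        out += bytes(self._data[:count - first])
        self._start = (self._start + count) % capacity
        self._size -= count
        return out
```

The underlying bytearray is never resized, which is the point of the
pattern: steady-state reads and writes reuse the same memory.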

> +1 on a frombuffer() method.

Still -1 in the absence of evidence that a good IO buffer abstraction
for asyncio and the standard library can't be written without it
(where the evidence I'll accept is "We already wrote the abstraction
layer, and not having this builtin feature necessarily introduces
inefficiencies or a lack of portability beyond CPython into our
implementation").

>> Putting special purpose functionality behind an import gate helps to
>> provide a more explicit context of use
>
> This is a fine argument for putting bytearray in a separate module --
> but that ship has sailed. The method to construct a bytearray from a
> buffer belongs with the bytearray object.

The bytearray constructor already accepts arbitrary bytes-like
objects. What this proposal is about is a way to *more efficiently*
snapshot a slice of a bytearray object for use in asyncio buffer
manipulation in cases where all of the following constraints apply:

- we don't want to copy the data twice
- we don't want to let a memoryview be cleaned up lazily
- we don't want to incur the readability penalty of explicitly
managing the memoryview
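
To illustrate, here is a sketch of the three spellings available today
that those constraints are weighing up. The buffer contents and
variable names are made up, and the proposed bytes.frombuffer() is
deliberately not called, since it does not exist.

```python
buf = bytearray(b"header:payload-bytes")
n = 6

# 1. Simple, but copies the data twice: the slice creates a temporary
#    bytearray, and bytes() then copies that temporary again.
snapshot_a = bytes(buf[:n])

# 2. Copies once, but the temporary memoryview is only released when it
#    is garbage collected. CPython's reference counting does that
#    promptly; other implementations may not, and a bytearray cannot be
#    resized while a view on it is still alive.
snapshot_b = bytes(memoryview(buf)[:n])

# 3. Copies once and releases the outer view deterministically, at the
#    cost of an extra line and an extra level of indentation.
with memoryview(buf) as view:
    snapshot_c = bytes(view[:n])

# With no views outstanding (as is the case on CPython at this point),
# the consumed prefix can be dropped from the buffer.
del buf[:n]
```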

For a great many use cases, we simply don't care about those
constraints (especially the last one), so adding `bytes.frombuffer` is
just confusing: we can readily predict that after adding it, a future
Stack Overflow question will be "When should I use bytes.frombuffer()
in Python instead of the normal bytes constructor?"

By contrast, if we instead say "We want Python to natively support
efficient, readily discoverable IO buffer manipulation", then folks
can ask "What's preventing us from providing an `iobuffers` module
today?" and start working towards that end goal (just as the
"selectors" module was added as an asyncio-independent abstraction
layer over select, epoll and kqueue, but probably wouldn't have been
without the asyncio use case to drive its design and implementation as
a standard library module).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia