[Python-Dev] Adding bytes.frombuffer() constructor to PEP 467 (was: [Python-ideas] Adding bytes.frombuffer() constructor

Wed Oct 12 05:34:18 EDT 2016

On Wed, Oct 12, 2016 at 2:07 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> I don't think it makes sense to add any more ideas to PEP 467. That
> needed to be a PEP because it proposed breaking backwards
> compatibility in a couple of areas, and because of the complex history
> of Python 3's "bytes-as-tuple-of-ints" and Python 2's "bytes-as-str"
> semantics.
>
> Other enhancements to the binary data handling APIs in Python 3 can be
> considered on their own merits.
>

I see.  Then my proposal should be a separate PEP (if a PEP is required at all).

>>
>> * It isn't "one obvious way": Developers including me may forget to
>> use context manager.
>>   And since it works on CPython, it's hard to point it out.
>
> To add to the confusion, there's also
> https://docs.python.org/3/library/stdtypes.html#memoryview.tobytes
> giving:
>
>     line = memoryview(buf)[:n].tobytes()
>
> However, folks *do* need to learn that many mutable data types will
> lock themselves against modification while you have a live memory view
> on them, so it's important to release views promptly and reliably when
> we don't need them any more.
>

I agree.
io.TextIOWrapper objects report a ResourceWarning for an unclosed file.
I think the same kind of warning for unreleased memoryview objects may help
developers.
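
To illustrate the locking behaviour Nick describes, here is a minimal sketch
(CPython behaviour assumed; the variable names are only for illustration).
While a memoryview is live, resizing the underlying bytearray raises
BufferError, and it works again once the view is released:

```python
buf = bytearray(b"foo\r\nbar\r\n")

m = memoryview(buf)
try:
    # Resizing is refused while the buffer is exported to a live view.
    buf.extend(b"baz\r\n")
    resized_while_locked = True
except BufferError:
    resized_while_locked = False
finally:
    # Release the view explicitly instead of waiting for garbage collection.
    m.release()

# With no live view, the bytearray can be resized again.
buf.extend(b"baz\r\n")
```

Without the explicit release (or a "with" block), when the lock goes away
depends on when the view object is collected, which differs between Python
implementations.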

>> Quick benchmark:
>>
>> (temporary bytes)
>> $ python3 -m perf timeit -s 'buf =
>> bytearray(b"foo\r\nbar\r\nbaz\r\n")' -- 'bytes(buf)[:3]'
>> ....................
>> Median +- std dev: 652 ns +- 19 ns
>>
>> (temporary memoryview without "with")
>> $ python3 -m perf timeit -s 'buf =
>> bytearray(b"foo\r\nbar\r\nbaz\r\n")' -- 'bytes(memoryview(buf)[:3])'
>> ....................
>> Median +- std dev: 886 ns +- 26 ns
>>
>> (temporary memoryview with "with")
>> $ python3 -m perf timeit -s 'buf = bytearray(b"foo\r\nbar\r\nbaz\r\n")' -- '
>> with memoryview(buf) as m:
>>     bytes(m[:3])
>> '
>> ....................
>> Median +- std dev: 1.11 us +- 0.03 us
>
> This is normal though, as memory views trade lower O(N) costs (reduced
> data copying) for higher O(1) setup costs (creating and managing the
> view, indirection for data access).

Yes.  When the data is small, the benefit of copying less is easily hidden
by the setup cost.

One big difficulty for I/O frameworks like asyncio is that we can't assume
anything about the data size.
A framework should be optimized both for many small chunks and for large data.

With memoryview, when we optimize for large data (e.g. downloading a large
file), performance for huge numbers of small messages (e.g. a small JSON API)
becomes worse.

Actually, one pull request gave up on using memoryview because of this.

https://github.com/python/asyncio/pull/395#issuecomment-249044218
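
A rough way to see this trade-off on your own machine is sketched below.
The function names and buffer sizes are only illustrative assumptions, and
the timings will vary by machine and Python version, so no numbers are
claimed here:

```python
import timeit


def copy_via_bytes(buf, n):
    # Copies the whole buffer first, then slices: cost grows with len(buf).
    return bytes(buf)[:n]


def copy_via_memoryview(buf, n):
    # Copies only n bytes, but pays a fixed cost to create and release the view.
    with memoryview(buf) as m:
        return bytes(m[:n])


def compare(size, n=3, number=1000):
    """Return (seconds via bytes, seconds via memoryview) for a buffer of `size` bytes."""
    buf = bytearray(b"x" * size)
    t_bytes = timeit.timeit(lambda: copy_via_bytes(buf, n), number=number)
    t_view = timeit.timeit(lambda: copy_via_memoryview(buf, n), number=number)
    return t_bytes, t_view


if __name__ == "__main__":
    for size in (15, 1_000_000):
        t_bytes, t_view = compare(size)
        print(f"size={size}: bytes={t_bytes:.6f}s memoryview={t_view:.6f}s")
```

The expectation, following the benchmark quoted above, is that the plain
bytes() copy is competitive for tiny buffers while the memoryview path pays
off as the buffer grows.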

>
>> Proposed solution
>> ===============
>>
>> Adding one more constructor to bytes:
>>
>>     # when length=-1 (default), use until end of *byteslike*.
>>     bytes.frombuffer(byteslike, length=-1, offset=0)
>>
>> With this API
>>
>>     with memoryview(buf) as m:
>>         line = bytes(m[:n])
>>
>> becomes
>>
>>     line = bytes.frombuffer(buf, n)
>
> Does that need to be a method on the builtin rather than a separate
> helper function, though? Once you define:
>
>     def snapshot(buf, length=None, offset=0):
>         with memoryview(buf) as m:
>             return m[offset:length].tobytes()
>
> then that can be replaced by a more optimised C implementation without
> users needing to care about the internal details.

I'm thinking about adding such a helper function to an asyncio speedup C
extension.  But there are other non-blocking I/O frameworks too: Tornado,
Twisted, and curio.

Also, relying on a C extension makes it harder to optimize for other Python
implementations.
If it is in the standard library, PyPy and other Python implementations can
optimize it.
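
For reference, a pure-Python sketch of the semantics proposed for
bytes.frombuffer() could look like the following.  This is only an
illustration under stated assumptions: frombuffer() here is a plain,
hypothetical function (no such bytes method exists), and it assumes a
non-negative offset and a one-byte item size, as with bytes and bytearray:

```python
def frombuffer(byteslike, length=-1, offset=0):
    """Hypothetical stand-in for the proposed bytes.frombuffer().

    Copies `length` bytes starting at `offset`; length=-1 means
    "until the end of byteslike".
    """
    # Convert (offset, length) into a slice stop index.
    stop = None if length < 0 else offset + length
    with memoryview(byteslike) as m:
        # Release the sliced view as well, so nothing keeps the buffer locked.
        with m[offset:stop] as part:
            return part.tobytes()


buf = bytearray(b"foo\r\nbar\r\nbaz\r\n")
first_line = frombuffer(buf, 3)
second_line = frombuffer(buf, 3, 5)
```

Note that this computes the stop index as offset + length, whereas a plain
m[offset:length] slice treats length as an end index.  Both views are
released before the function returns, so the caller can resize buf
immediately afterwards.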

>
> That is, getting back to a variant on one of Serhiy's suggestions in
> the last PEP 467 discussion, it may make sense for us to offer a
> "buffertools" library that's specifically aimed at supporting
> efficient buffer manipulation operations that minimise data copying.
> The pure Python implementations would work entirely through
> memoryview, but we could also have selected C accelerated operations
> if that showed a noticeable improvement on asyncio's benchmarks.
>

That seems like a nice idea.  I'll read the discussion.

> Regards,
> Nick.
>
> P.S. The length/offset API design is also problematic due to the way
> it differs from range() & slice(), but I don't think it makes sense to
> get into that kind of detail before discussing the larger question of
> adding a new helper module for working efficiently with memory buffers
> vs further widening the method API for the builtin bytes type
>
> --
> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

I avoided a slice-like API intentionally: if it looks like a slice,
someone will propose adding `step` support just for consistency.

But, as Serhiy said, consistency with the old buffer API is nice.

-- 
INADA Naoki  <songofacandy at gmail.com>