[Cython] buffer syntax vs. memory view syntax

Tue May 8 10:49:56 CEST 2012

Dag Sverre Seljebotn, 08.05.2012 10:36:
> On 05/08/2012 10:18 AM, Stefan Behnel wrote:
>> Dag Sverre Seljebotn, 08.05.2012 09:57:
>>> On 05/07/2012 11:21 PM, mark florisson wrote:
>>>> On 7 May 2012 19:40, Dag Sverre Seljebotn wrote:
>>>>> mark florisson wrote:
>>>>>> On 7 May 2012 17:00, Dag Sverre Seljebotn wrote:
>>>>>>> On 05/07/2012 04:16 PM, Stefan Behnel wrote:
>>>>>>>> Stefan Behnel, 07.05.2012 15:04:
>>>>>>>>> Dag Sverre Seljebotn, 07.05.2012 13:48:
>>>>>>>>>> BTW, with the coming of memoryviews, me and Mark talked about just
>>>>>>>>>> deprecating the "mytype[...]" meaning buffers, and rather treat it
>>>>>>>>>> as np.ndarray, array.array etc. being some sort of "template types".
>>>>>>>>>> That is, we disallow "object[int]" and require some special
>>>>>>>>>> declarations in the relevant pxd files.
>>>>>>>>>
>>>>>>>>> Hmm, yes, it's unfortunate that we have two different types of
>>>>>>>>> syntax now, one that declares the item type before the brackets and
>>>>>>>>> one that declares it afterwards.
>>>>>>>> Should we consider the
>>>>>>>> buffer interface syntax deprecated and focus on the memory view
>>>>>>>> syntax?
>>>>>>>
>>>>>>> I think that's the very-long-term intention. Then again, it may be
>>>>>>> too early to really tell yet, we just need to see how the memory
>>>>>>> views play out in real life and whether they'll be able to replace
>>>>>>> np.ndarray[double] among real users. We don't want to shove things
>>>>>>> down users' throats.
>>>>>>>
>>>>>>> But the use of the trailing-[] syntax needs some cleaning up. Me and
>>>>>>> Mark agreed we'd put this proposal forward when we got around to it:
>>>>>>>
>>>>>>>    - Deprecate the "object[double]" form, where [dtype] can be stuck on
>>>>>>>    any extension type
>>>>>>>
>>>>>>>    - But, do NOT (for the next year at least) deprecate
>>>>>>>    np.ndarray[double], array.array[double], etc. Basically, there
>>>>>>>    should be a magic flag in extension type declarations saying
>>>>>>>    "I can be a buffer".
>>>>>>>
>>>>>>> For one thing, that is sort of needed to open up things for templated
>>>>>>> cdef classes/fused types cdef classes, if that is ever implemented.
>>>>>>
>>>>>> Deprecating is definitely a good start. I think if you only allow
>>>>>> two types as buffers it will be at least reasonably clear when
>>>>>> one is dealing with fused types or buffers.
>>>>>>
>>>>>> Basically, I think memoryviews should live up to demands of the users,
>>>>>> which would mean there would be no reason to keep the buffer syntax.
>>>>>
>>>>> But they are different approaches -- use a different type/API, or just
>>>>> try to speed up parts of NumPy..
>>>>>
>>>>>> One thing to do is make memoryviews coerce cheaply back to the
>>>>>> original objects if wanted (which is likely). Writing
>>>>>> np.asarray(mymemview) is kind of annoying.
>>>>>
>>>>> It is going to be very confusing to have type(mymemview),
>>>>> repr(mymemview), and so on come out as NumPy arrays, but not have the
>>>>> full API of NumPy. Unless you auto-convert on getattr to...
>>>>
>>>> Yeah, the idea is very simple, as you mention: just keep the object
>>>> around cached, and when you slice, construct one lazily.
>>>>
>>>>> If you want to eradicate the distinction between the backing array and
>>>>> the memory view and make it transparent, I really suggest you kick back
>>>>> alive np.ndarray (it can exist in some 'unrealized' state with delayed
>>>>> construction after slicing, and so on). Implementation much the same
>>>>> either way, it is all about how it is presented to the user.
>>>>
>>>> You mean the buffer syntax?
>>>>
>>>>> Something like mymemview.asobject() could work though, and while not
>>>>> much shorter, it would have some polymorphism that np.asarray does not
>>>>> have (based probably on some custom PEP 3118 extension)
>>>>
>>>> I was thinking you could allow the user to register a callback, and
>>>> use that to coerce from a memoryview back to an object (given a
>>>> memoryview object). For numpy this would be np.asarray, and the
>>>> implementation is allowed to cache the result (which it will).
>>>> It may be too magicky though... but it will be convenient. The
>>>> memoryview will act as a subclass, meaning that any of its methods
>>>> will override methods of the converted object.
>>>
>>> My point was that this seems *way* too magicky.
>>>
>>> Beyond "confusing users" and so on that are sort of subjective, here's a
>>> fundamental problem for you: We're making it very difficult to type-infer
>>> memoryviews. Consider:
>>>
>>> cdef double[:] x = ...
>>> y = x
>>> print y.shape
>>>
>>> Now, because y is not typed, you're semantically throwing in a conversion
>>> on line 2, so that line 3 says that you want the attribute access to be
>>> invoked on "whatever object x coerced back to". And we have no idea what
>>> kind of object that is.
>>>
>>> If you don't transparently convert to object, it'd be safe to automatically
>>> infer y as a double[:].
>>
>> Why can't y be inferred as the type of x due to the assignment?
>>
>>
>>> On a related note, I've said before that I dislike the notion of
>>>
>>> cdef double[:] mview = obj
>>>
>>> I'd rather like
>>>
>>> cdef double[:] mview = double[:](obj)
>>
>> Why? We currently allow
>>
>>      cdef char* s = some_py_bytes_string
>>
>> Auto-coercion is a serious part of the language, and I don't see the
>> advantage of requiring the redundancy in the case above. It's clear enough
>> to me what the typed assignment is intended to mean: get me a buffer view
>> on the object, regardless of what it is.
>>
>>
>>> I support Robert in that "np.ndarray[double]" is the syntax to use when you
>>> want this kind of transparent "be an object when I need to and a memory
>>> view when I need to".
>>>
>>> Proposal:
>>>
>>>   1) We NEVER deprecate "np.ndarray[double]", we commit to keeping that in
>>> the language. It means exactly what you would like double[:] to mean, i.e.
>>> a variable that is memoryview when you need to and an object otherwise.
>>> When you use this type, you bear the consequences of early-binding things
>>> that could in theory be overridden.
>>>
>>>   2) double[:] is for when you want to access data of *any* Python
>>> object in a generic way. Raw PEP 3118. In those situations, access to
>>> the underlying object is much less useful.
>>>
>>>    2a) Therefore we require that you do "mview.asobject()" manually; doing
>>> "mview.foo()" is a compile-time error
>>
>> Sounds good. I think that would clean up the current syntax overlap very
>> nicely.
>>
>>
>>>    2b) To drive the point home among users, and aid type inference and
>>> overall language clarity, we REMOVE the auto-acquisition and require that
>>> you do
>>>
>>>      cdef double[:] mview = double[:](obj)
>>
>> I don't see the point, as noted above. Either "obj" is statically typed and
>> the bare assignment becomes a no-op, or it's not typed and the assignment
>> coerces by creating a view. As with all other typed assignments.
>>
>>
>>>    2c) Perhaps: Do not even coerce to a Python memoryview and disallow
>>> "print mview"; instead require that you do "print mview.asmemoryview()" or
>>> "print memoryview(mview)" or somesuch.
>>
>> This seems to depend on 2b.
> 
> This I don't understand. The question of 2c) is the analogue to
> auto-coercion of "char*" to bytes; approving 2c) would put memoryviews in
> line with char*.
> 
> Then again, we could in future auto-coerce char* to a ctypes pointer, and
> in that case, coercing a memoryview to an object representing that
> memoryview would be OK.
> 
> Either way, you would never get back the same object that you coerced from!

Ah, that's what you meant. I thought you were referring to getting a
memoryview from an object.

I agree that a buffer view shouldn't auto-coerce back to its owner (or to a
Python object in general), that's the whole point of the syntax cleanup.

In simple cases, buffer.obj would be the thing to talk to, except for
memory views, where only the view knows the mapped memory layout but the
underlying exporter has the methods to deal with the buffer. In that case,
we may really want to leave it to the user to handle this. I don't think
the compiler can do the right thing in all cases, and the user is really
the only one who knows what kind of object should be used or even
instantiated to wrap a buffer. Nothing we can do is shorter or more clearly
readable than np.asarray() or whatever function a specific library has for
this.

So, what about just keeping buffer.obj visible and leaving everything else
to users?
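
As a rough illustration of that split, here is a minimal sketch in plain
Python rather than Cython. It assumes the stdlib memoryview as the view type
and array.array as a stand-in exporter; the helper name rewrap() is
hypothetical and exists only for this sketch. The view exposes its owner
through .obj, and turning a sliced view back into a full object is left to
user code.

```python
from array import array


def rewrap(view):
    """Hypothetical user-side helper: copy a 1-D view into a new array.array.

    Stands in for np.asarray() or whatever function a specific library
    offers; unlike np.asarray(), this sketch copies the data.
    """
    return array(view.format, view.tolist())


exporter = array('d', [1.0, 2.0, 3.0, 4.0])
view = memoryview(exporter)   # explicit acquisition of a buffer view
sub = view[1:3]               # only the view knows this sliced layout

# The owner stays reachable, but nothing coerces back to it implicitly.
owner = sub.obj

# The user decides what kind of object should wrap the viewed data.
wrapped = rewrap(sub)
```

Note that sub.obj still refers to the whole exporter, not to the two-element
slice, which is why the wrapping step cannot simply be "hand back .obj".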

Stefan