[Cython] buffer syntax vs. memory view syntax

Tue May 8 12:35:13 CEST 2012

On 8 May 2012 10:47, Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:
> On 05/08/2012 11:30 AM, Dag Sverre Seljebotn wrote:
>>
>> On 05/08/2012 11:22 AM, mark florisson wrote:
>>>
>>> On 8 May 2012 09:36, Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:
>>>>
>>>> On 05/08/2012 10:18 AM, Stefan Behnel wrote:
>>>>>
>>>>>
>>>>> Dag Sverre Seljebotn, 08.05.2012 09:57:
>>>>>>
>>>>>>
>>>>>> On 05/07/2012 11:21 PM, mark florisson wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 7 May 2012 19:40, Dag Sverre Seljebotn wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> mark florisson wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 7 May 2012 17:00, Dag Sverre Seljebotn wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 05/07/2012 04:16 PM, Stefan Behnel wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Stefan Behnel, 07.05.2012 15:04:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Dag Sverre Seljebotn, 07.05.2012 13:48:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> BTW, with the coming of memoryviews, Mark and I talked about just
>>>>>>>>>>>>> deprecating the "mytype[...]" meaning buffers, and rather treating
>>>>>>>>>>>>> np.ndarray, array.array etc. as some sort of "template types".
>>>>>>>>>>>>> That is, we disallow "object[int]" and require some special
>>>>>>>>>>>>> declarations in the relevant pxd files.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hmm, yes, it's unfortunate that we have two different types of
>>>>>>>>>>>> syntax now, one that declares the item type before the brackets and
>>>>>>>>>>>> one that declares it afterwards.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Should we consider the
>>>>>>>>>>> buffer interface syntax deprecated and focus on the memory view
>>>>>>>>>>> syntax?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I think that's the very-long-term intention. Then again, it may be
>>>>>>>>>> too early to really tell yet; we just need to see how the memory
>>>>>>>>>> views play out in real life and whether they'll be able to replace
>>>>>>>>>> np.ndarray[double] among real users. We don't want to shove things
>>>>>>>>>> down users' throats.
>>>>>>>>>>
>>>>>>>>>> But the use of the trailing-[] syntax needs some cleaning up. Mark
>>>>>>>>>> and I agreed we'd put this proposal forward when we got around to it:
>>>>>>>>>>
>>>>>>>>>> - Deprecate the "object[double]" form, where [dtype] can be stuck on
>>>>>>>>>>   any extension type
>>>>>>>>>>
>>>>>>>>>> - But, do NOT (for the next year at least) deprecate
>>>>>>>>>>   np.ndarray[double], array.array[double], etc. Basically, there
>>>>>>>>>>   should be a magic flag in extension type declarations saying "I can
>>>>>>>>>>   be a buffer".
>>>>>>>>>>
>>>>>>>>>> For one thing, that is sort of needed to open up things for templated
>>>>>>>>>> cdef classes/fused types cdef classes, if that is ever implemented.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Deprecating is definitely a good start. I think if you only allow two
>>>>>>>>> types as buffers it will be at least reasonably clear when one is
>>>>>>>>> dealing with fused types or buffers.
>>>>>>>>>
>>>>>>>>> Basically, I think memoryviews should live up to the demands of the
>>>>>>>>> users, which would mean there would be no reason to keep the buffer
>>>>>>>>> syntax.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> But they are different approaches -- use a different type/API, or
>>>>>>>> just try to speed up parts of NumPy.
>>>>>>>>
>>>>>>>>> One thing to do is make memoryviews coerce cheaply back to the
>>>>>>>>> original objects if wanted (which is likely). Writing
>>>>>>>>> np.asarray(mymemview) is kind of annoying.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> It is going to be very confusing to have type(mymemview),
>>>>>>>> repr(mymemview), and so on come out as NumPy arrays, but not have the
>>>>>>>> full API of NumPy. Unless you auto-convert on getattr to...
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Yeah, the idea is very simple: as you mention, just keep the object
>>>>>>> around cached, and when you slice, construct one lazily.
>>>>>>>
>>>>>>>> If you want to eradicate the distinction between the backing array
>>>>>>>> and the memory view and make it transparent, I really suggest you
>>>>>>>> bring np.ndarray back to life (it can exist in some 'unrealized' state
>>>>>>>> with delayed construction after slicing, and so on). The
>>>>>>>> implementation is much the same either way; it is all about how it is
>>>>>>>> presented to the user.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> You mean the buffer syntax?
>>>>>>>
>>>>>>>> Something like mymemview.asobject() could work though, and while not
>>>>>>>> much shorter, it would have some polymorphism that np.asarray does not
>>>>>>>> have (based probably on some custom PEP 3118 extension)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I was thinking you could allow the user to register a callback, and
>>>>>>> use that to coerce from a memoryview back to an object (given a
>>>>>>> memoryview object). For numpy this would be np.asarray, and the
>>>>>>> implementation is allowed to cache the result (which it will).
>>>>>>> It may be too magicky though... but it will be convenient. The
>>>>>>> memoryview will act as a subclass, meaning that any of its methods
>>>>>>> will override methods of the converted object.
>>>>>>
>>>>>>
>>>>>>
>>>>>> My point was that this seems *way* too magicky.
>>>>>>
>>>>>> Beyond "confusing users" and so on, which are sort of subjective,
>>>>>> here's a fundamental problem for you: We're making it very difficult
>>>>>> to type-infer memoryviews. Consider:
>>>>>>
>>>>>> cdef double[:] x = ...
>>>>>> y = x
>>>>>> print y.shape
>>>>>>
>>>>>> Now, because y is not typed, you're semantically throwing in a
>>>>>> conversion on line 2, so that line 3 says that you want the attribute
>>>>>> access to be invoked on "whatever object x coerced back to". And we
>>>>>> have no idea what kind of object that is.
>>>>>>
>>>>>> If you don't transparently convert to object, it'd be safe to
>>>>>> automatically infer y as a double[:].
>>>>>
>>>>>
>>>>>
>>>>> Why can't y be inferred as the type of x due to the assignment?
>>>>>
>>>>>
>>>>>> On a related note, I've said before that I dislike the notion of
>>>>>>
>>>>>> cdef double[:] mview = obj
>>>>>>
>>>>>> I'd rather like
>>>>>>
>>>>>> cdef double[:] mview = double[:](obj)
>>>>>
>>>>>
>>>>>
>>>>> Why? We currently allow
>>>>>
>>>>> cdef char* s = some_py_bytes_string
>>>>>
>>>>> Auto-coercion is a serious part of the language, and I don't see the
>>>>> advantage of requiring the redundancy in the case above. It's clear
>>>>> enough to me what the typed assignment is intended to mean: get me a
>>>>> buffer view on the object, regardless of what it is.
>>>>>
>>>>>
>>>>>> I support Robert in that "np.ndarray[double]" is the syntax to use
>>>>>> when you want this kind of transparent "be an object when I need to
>>>>>> and a memory view when I need to".
>>>>>>
>>>>>> Proposal:
>>>>>>
>>>>>> 1) We NEVER deprecate "np.ndarray[double]", we commit to keeping that
>>>>>> in the language. It means exactly what you would like double[:] to
>>>>>> mean, i.e. a variable that is a memoryview when you need it to be and
>>>>>> an object otherwise. When you use this type, you bear the consequences
>>>>>> of early-binding things that could in theory be overridden.
>>>>>>
>>>>>> 2) double[:] is for when you want to access data of *any* Python
>>>>>> object in a generic way. Raw PEP 3118. In those situations, access to
>>>>>> the underlying object is much less useful.
>>>>>>
>>>>>> 2a) Therefore we require that you do "mview.asobject()" manually;
>>>>>> doing "mview.foo()" is a compile-time error
>>>>>
>>>>>
>>>>>
>>>>> Sounds good. I think that would clean up the current syntax overlap
>>>>> very
>>>>> nicely.
>>>>>
>>>>>
>>>>>> 2b) To drive the point home among users, and aid type inference and
>>>>>> overall language clarity, we REMOVE the auto-acquisition and require
>>>>>> that you do
>>>>>>
>>>>>> cdef double[:] mview = double[:](obj)
>>>>>
>>>>>
>>>>>
>>>>> I don't see the point, as noted above. Either "obj" is statically
>>>>> typed and the bare assignment becomes a no-op, or it's not typed and
>>>>> the assignment coerces by creating a view. As with all other typed
>>>>> assignments.
>>>>>
>>>>>
>>>>>> 2c) Perhaps: Do not even coerce to a Python memoryview and disallow
>>>>>> "print mview"; instead require that you do
>>>>>> "print mview.asmemoryview()" or "print memoryview(mview)" or somesuch.
>>>>>
>>>>>
>>>>>
>>>>> This seems to depend on 2b.
>>>>
>>>>
>>>>
>>>> This I don't understand. The question of 2c) is the analogue to
>>>> auto-coercion of "char*" to bytes; approving 2c) would put memoryviews
>>>> in line with char*.
>>>>
>>>> Then again, we could in future auto-coerce char* to a ctypes pointer,
>>>> and in that case, coercing a memoryview to an object representing that
>>>> memoryview would be OK.
>>>
>>>
>>> Character pointers coerce to strings. Hell, even structs coerce to and
>>> from Python dicts, so disallowing the same for memoryviews would just
>>> be inconsistent and inconvenient.
>>
>>
>> OK, but even structs don't coerce back to some arbitrary type, it's
>> always a dict. I don't necessarily oppose coercing memoryviews to some
>> Python memoryview object (not necessarily the builtin).
>>
>> I agree that some mview.asobject() triggering a callback defined by some
>> CEP 1xxx ("cross-language CEP") would be really useful; and that could
>> form the basis of a new, improved np.ndarray[double] that allows fast
>> slicing etc. (where that is used automatically whenever needed).
>
>
> After some thinking I believe I can see more clearly where Mark is coming
> from. To sum up, it's either
>
> A) Keep both np.ndarray[double] and double[:] around, with clearly defined
> and separate roles. np.ndarray[double] implementation is revamped to allow
> fast slicing etc., based on the double[:] implementation.
>
> B) Deprecate np.ndarray[double] sooner rather than later, but make double[:]
> have functionality that is *really* close to what np.ndarray[double]
> currently does. In most cases one should be able to basically replace
> np.ndarray[double] with double[:] and the code should continue to work just
> like before; the difference is that if you pass in anything other than a
> NumPy array, it will likely fail with a runtime AttributeError at some point
> rather than fail a PyType_Check.

That's a good summary. I have a big preference for B here, but I agree
that treating a typed memoryview as both a user object (possibly
converted through a callback) and a typed memoryview "subclass" is quite
magicky. I wouldn't particularly mind something concise like 'm.obj'.
The AttributeError would then be raised as usual, when a Python object
doesn't have the right interface.
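
For illustration only, the explicit 'm.obj' idea could be sketched in
plain Python, with the builtin memoryview standing in for a Cython typed
memoryview. TypedView, register_coercer and the registry are hypothetical
names, not an existing Cython API. array.array plays the role that
np.asarray would play for NumPy, and the callback here copies the data
rather than wrapping it.

```python
from array import array

# Hypothetical registry: exporter type -> callback that turns a memoryview
# back into a full object (np.asarray would play this role for NumPy).
_coercers = {}


def register_coercer(exporter_type, callback):
    _coercers[exporter_type] = callback


class TypedView:
    """Illustrative stand-in for a typed memoryview such as double[:]."""

    def __init__(self, source):
        # Acquire a PEP 3118 buffer unless we were handed a view already.
        self._view = source if isinstance(source, memoryview) else memoryview(source)
        self._obj = None  # cache for the coerced object

    def __getitem__(self, index):
        item = self._view[index]
        # Slicing yields a new view; its object is only built on demand.
        return TypedView(item) if isinstance(item, memoryview) else item

    @property
    def shape(self):
        return self._view.shape

    @property
    def obj(self):
        # Explicit, cached coercion; no attribute lookup is forwarded implicitly.
        if self._obj is None:
            callback = _coercers.get(type(self._view.obj))
            self._obj = callback(self._view) if callback else self._view.obj
        return self._obj


# Assumed callback for array.array exporters (copies the viewed elements).
register_coercer(array, lambda mv: array(mv.format, mv.tolist()))

data = array("d", [1.0, 2.0, 3.0, 4.0])
tail = TypedView(data)[1:]
```

Nothing is forwarded implicitly in this sketch, so tail.count(2.0) raises
AttributeError while tail.obj.count(2.0) works. The object for a slice is
only built on first access to .obj and is cached afterwards.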

> Between those two I believe it's a matter of design taste, not so much
> rational argument, and I don't know where I stand yet. And I'm going to stop
> thinking about it until I see what Robert says...
>
>
> Dag