[Cython] Appetite for working with upstream to extend the buffer protocol?

Sat Jul 8 15:22:52 EDT 2023

So my superficial thoughts:

1. The buffer protocol has two bits. The first says "given this 
predictable memory layout, you can look up an item in memory with these 
rules"; the second describes what the items in memory are. I think 
you're only proposing to change the second part of it. I'd encourage you 
not to change the first part - the nice thing about the first part is 
that it's relatively simple and doesn't try to do too much. For example 
I'd be sceptical about trying to support ragged arrays.

2. As you identify, for a more advanced memoryview to be useful in 
Cython, Cython really has to be able to know an underlying C type for 
your data at compile-time and be able to validate that the buffer it's 
passed matches that C type at runtime. The validation could have varying 
degrees of strictness (i.e. in the worst case we could just check the 
size matches and trust the user). We already support that to some extent 
(packed structs with structured arrays), but that doesn't cover everything.

3. For your variable length string example, the C struct to use is 
fairly obvious (just your `struct ss`). The difficult bit is likely to 
be memory management of that. I'd kind of encourage you not to expect 
Cython to handle the memory management for this type of thing (i.e. it 
can expose the struct to the user, but it becomes the user's own problem 
to work out if they need to allocate memory when they modify the struct).

5. Things like the datetime for Pandas, or a way of having a float16 
type, seem like the sort of thing we should definitely be able to do.

6. In terms of Apache Arrow - if there was demand we probably could add 
support for it. Their documentation says: "The Arrow C data interface 
defines a very small, stable set of C definitions that can be easily 
/copied/ in any project’s source code" - so that suggests it need not be 
a dependency.

7. One of the points of the "typed memoryview" vs the older "np.ndarray" 
interface is that it was supposed to be more generally compatible. 
While we could extend it to match any non-standard additions that NumPy 
tries to make, that does feel dodgy and likely to conflict when other 
projects do their own thing. I think it would be better if the Python 
standard could be extended (even if it was just something like a code to 
indicate "mystery structure of size X").
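
To make the two halves in point 1 concrete, here is a minimal sketch 
using only the Python standard library. The helper name `item_offset` 
is made up for illustration, and it only covers plain strided buffers 
(no suboffsets); it is not how Cython implements this internally.

```python
import struct


def item_offset(index, strides):
    """Byte offset of an item in a plain strided buffer (no suboffsets)."""
    return sum(i * s for i, s in zip(index, strides))


# Six native C ints, viewed as a 2x3 C-contiguous array.
raw = struct.pack("6i", *range(6))
view = memoryview(raw).cast("i", shape=[2, 3])

# First half: where the item lives, from the strides alone.
offset = item_offset((1, 2), view.strides)

# Second half: what the item is, from the format string.
(value,) = struct.unpack_from(view.format, raw, offset)
```

Only the last step depends on the format string, which is why the 
item description can in principle be extended while the layout rules 
stay as they are.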

Don't know if these thoughts are useful. They're a bit scattered. I 
guess the summary is "we could definitely do more with custom data 
types, but don't break the things that made the buffer protocol nice".
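
As a rough illustration of the "varying degrees of strictness" in 
point 2, the sketch below compares a strict check (format strings must 
match) with the worst-case check (only the item size must match). The 
function `buffer_matches` is hypothetical and is not Cython's actual 
validation code; a real check would also have to handle byte-order 
prefixes and struct formats.

```python
import array
import struct


def buffer_matches(view, expected_format, strict=True):
    """Hypothetical runtime check of a buffer against an expected C type.

    strict=True  : the format strings must be identical.
    strict=False : only the item size must match (trust the user).
    """
    if strict:
        return view.format == expected_format
    return view.itemsize == struct.calcsize(expected_format)


# A buffer of C doubles, exposed through the buffer protocol.
doubles = memoryview(array.array("d", [1.0, 2.0]))

strict_double = buffer_matches(doubles, "d")               # same format
strict_int64 = buffer_matches(doubles, "q")                # different format
lax_int64 = buffer_matches(doubles, "q", strict=False)     # same size only
```

The size-only mode is roughly what a "mystery structure of size X" 
code would allow: the consumer can index and copy items safely without 
knowing what is inside them.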

David

On 06/07/2023 17:43, Nathan wrote:
> Hi all,
>
> I'm working on a new data type for numpy to represent arrays of 
> variable-width strings [1]. One limitation of using this data type 
> right now is that it's not possible to write idiomatic Cython code 
> operating on the array; instead one would need to use e.g. the NumPy 
> iterator API. It turns out this is a papercut that's been around for a 
> while and is most noticeable downstream because datetime arrays cannot 
> be passed to Cython.
>
> Here's an example of a downstream library working around lack of 
> support in Cython for datetimes using an iterator: [2]. Pandas works 
> around this by passing int64 views of the arrays to Cython. I think 
> this issue will become more problematic in the future when NumPy 
> officially ships the NEP 42 custom dtype API, which will make it much 
> easier to develop custom data types. It is also already an issue for 
> the legacy custom data types numpy already supports [3], but those 
> aren't very popular so it hasn't come up much.
>
> I'm curious if there's any appetite among the Cython developers to 
> ultimately make it easier to write cython code that works with numpy 
> arrays that have user-defined data types. Currently it's only possible 
> to write code using the numpy or typed memoryview interfaces for 
> arrays with datatypes that support the buffer protocol. See e.g. 
> https://github.com/numpy/numpy/issues/4983.
>
> One approach to fix this would be to either officially or unofficially 
> extend the buffer protocol to allow arbitrary typecodes to be sent in 
> the format string. Officially Python only supports format codes used 
> in the struct module, but in practice you can put any string in the 
> format code and memoryview will accept it. Of course for it actually 
> to be useful NumPy would need to create format codes that allow cython 
> to correctly read and reconstruct the type.
>
> Sebastian Berg proposed this on the CPython discussion forum [4] and 
> there hasn't been much response from upstream. In response to 
> Sebastian, Michael Droettboom suggested [5] using the Arrow data 
> format, which has rich support for various array memory layouts and 
> has support for exchanging custom extension types [6].
>
> The main problem with the buffer protocol approach is defining the 
> protocol in such a way that Cython can appropriately reconstruct the 
> memory layout for the data type (although only supporting strided 
> arrays at first makes a lot of sense) for an arbitrary user-defined 
> data type, ideally without needing to import any code defining the 
> data type.
>
> The main problem with the approach using Apache Arrow is that neither 
> Cython nor NumPy has any support for it, and I don't think either 
> library can depend on Arrow, so both would need to write custom 
> serializers and parsers, whereas Cython already has memoryviews fully 
> working.
>
> Guido van Rossum wanted some more discussion about this, so I'm 
> raising this as an issue here in case any Cython developers are 
> interested. Please chime in on the Python Discourse thread if so.
>
> -Nathan
>
> [1] https://github.com/numpy/numpy-user-dtypes/tree/main/stringdtype
> [2] https://github.com/scikit-hep/awkward/issues/367
> [3] https://github.com/numpy/numpy/issues/18442
> [4] https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256
> [5] https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256/3
> [6] https://arrow.apache.org/docs/format/Columnar.html
>
> _______________________________________________
> cython-devel mailing list
> cython-devel at python.org
> https://mail.python.org/mailman/listinfo/cython-devel
