From nathan.goldbaum at gmail.com Thu Jul 6 12:43:44 2023
From: nathan.goldbaum at gmail.com (Nathan)
Date: Thu, 6 Jul 2023 10:43:44 -0600
Subject: [Cython] Appetite for working with upstream to extend the buffer protocol?
Message-ID:

Hi all,

I'm working on a new data type for NumPy to represent arrays of
variable-width strings [1]. One limitation of using this data type
right now is that it's not possible to write idiomatic Cython code
operating on the array; instead one would need to use e.g. the NumPy
iterator API. It turns out this is a papercut that's been around for a
while and is most noticeable downstream because datetime arrays cannot
be passed to Cython.

Here's an example of a downstream library working around the lack of
support in Cython for datetimes using an iterator: [2]. Pandas works
around this by passing int64 views of the arrays to Cython. I think
this issue will become more problematic in the future when NumPy
officially ships the NEP 42 custom dtype API, which will make it much
easier to develop custom data types. It is also already an issue for
the legacy custom data types NumPy already supports [3], but those
aren't very popular so it hasn't come up much.

I'm curious if there's any appetite among the Cython developers to
ultimately make it easier to write Cython code that works with NumPy
arrays that have user-defined data types. Currently it's only possible
to write code using the NumPy or typed memoryview interfaces for
arrays with data types that support the buffer protocol. See e.g.
https://github.com/numpy/numpy/issues/4983.

One approach to fix this would be to either officially or unofficially
extend the buffer protocol to allow arbitrary typecodes to be sent in
the format string. Officially Python only supports format codes used
in the struct module, but in practice you can put any string in the
format code and memoryview will accept it.
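A minimal sketch of the point about format strings, using only the
standard library. It assumes a ctypes structure as a stand-in for a
custom data type; `Pair` and `describe_buffer` are hypothetical names
chosen for illustration and are not part of NumPy or Cython. The
exporter advertises a format string that the struct module rejects,
yet memoryview still wraps the buffer; it just cannot unpack items.

```python
import ctypes
import struct


class Pair(ctypes.Structure):
    # Hypothetical item type standing in for a user-defined dtype.
    _fields_ = [("key", ctypes.c_int32), ("value", ctypes.c_double)]


def describe_buffer(obj):
    """Report what memoryview exposes for a buffer exporter."""
    view = memoryview(obj)

    # Does the struct module understand the advertised format string?
    try:
        struct.calcsize(view.format)
        struct_ok = True
    except struct.error:
        struct_ok = False

    # Can memoryview itself unpack a single item?
    try:
        view[0]
        readable = True
    except NotImplementedError:
        readable = False

    return {
        "format": view.format,
        "itemsize": view.itemsize,
        "struct_ok": struct_ok,
        "readable": readable,
    }


pairs = (Pair * 4)()          # array of four structures
info = describe_buffer(pairs)
print(info)
```

The format reported here is a PEP 3118 style "T{...}" string, which is
outside what the struct module parses. A consumer such as Cython would
still need its own rules for mapping a string like this to a C type.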
Of course, for it to actually be useful, NumPy would need to create
format codes that allow Cython to correctly read and reconstruct the
type.

Sebastian Berg proposed this on the CPython discussion forum [4] and
there hasn't been much response from upstream. In response to
Sebastian, Michael Droettboom suggested [5] using the Arrow data
format, which has rich support for various array memory layouts and
has support for exchanging custom extension types [6].

The main problem with the buffer protocol approach is defining the
protocol in such a way that Cython can appropriately reconstruct the
memory layout for the data type (although only supporting strided
arrays at first makes a lot of sense) for an arbitrary user-defined
data type, ideally without needing to import any code defining the
data type.

The main problem with the approach using Apache Arrow is that neither
Cython nor NumPy has any support for it, and I don't think either
library can depend on Arrow, so both would need to write custom
serializers and parsers, whereas Cython already has memoryviews fully
working.

Guido van Rossum wanted some more discussion about this, so I'm
raising this as an issue here in case any Cython developers are
interested. Please chime in on the Python Discourse thread if so.

-Nathan

[1] https://github.com/numpy/numpy-user-dtypes/tree/main/stringdtype
[2] https://github.com/scikit-hep/awkward/issues/367
[3] https://github.com/numpy/numpy/issues/18442
[4] https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256
[5] https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256/3
[6] https://arrow.apache.org/docs/format/Columnar.html

From dw-git at d-woods.co.uk Sat Jul 8 15:22:52 2023
From: dw-git at d-woods.co.uk (da-woods)
Date: Sat, 8 Jul 2023 20:22:52 +0100
Subject: [Cython] Appetite for working with upstream to extend the buffer protocol?
In-Reply-To:
References:
Message-ID: <86e23d52-fa80-f38b-c743-44ac0ea09368@d-woods.co.uk>

So my superficial thoughts:

1. The buffer protocol has two bits. The first says "given this
   predictable memory layout, you can look up an item in memory with
   these rules"; the second describes what the items in memory are. I
   think you're only proposing to change the second part of it. I'd
   encourage you not to change the first part - the nice thing about
   the first part is that it's relatively simple and doesn't try to do
   anything. For example I'd be sceptical about trying to support
   ragged arrays.

2. As you identify, for a more advanced memoryview to be useful in
   Cython, Cython really has to be able to know an underlying C type
   for your data at compile-time and be able to validate that the
   buffer it's passed matches that C type at runtime. The validation
   could have varying degrees of strictness (i.e. in the worst case we
   could just check the size matches and trust the user). We already
   support that to an extent (packed structs with structured arrays)
   but that doesn't cover everything.

3. For your variable length string example, the C struct to use is
   fairly obvious (just your `struct ss`). The difficult bit is likely
   to be memory management of that. I'd kind of encourage you not to
   expect Cython to handle the memory management for this type of
   thing (i.e. it can expose the struct to the user, but it becomes
   the user's own problem to work out if they need to allocate memory
   when they modify the struct).

5. Things like the datetime for Pandas, or a way of having a float16
   type, seem like the sort of thing we should definitely be able to
   do.

6. In terms of Apache Arrow - if there was demand we probably could
   add support for it. Their documentation says: "The Arrow C data
   interface defines a very small, stable set of C definitions that
   can be easily /copied/ in any project's source code" - so that
   suggests it need not be a dependency.

7.
One of the points of the "typed memoryview" vs the older "np.ndarray"
interface is that it was supposed to be more generally compatible.
While we could extend it to match any non-standard additions that
NumPy tries to make, that does feel dodgy and likely to conflict when
other projects do their own thing. I think it would be better if the
Python standard could be extended (even if it was just something like
a code to indicate "mystery structure of size X").

Don't know if these thoughts are useful. They're a bit scattered. I
guess the summary is "we could definitely do more with custom data
types, but don't break the things that made the buffer protocol nice".

David

On 06/07/2023 17:43, Nathan wrote:
> Hi all,
>
> I'm working on a new data type for numpy to represent arrays of
> variable-width strings [1]. One limitation of using this data type
> right now is it's not possible to write idiomatic cython code
> operating on the array, instead one would need to use e.g. the NumPy
> iterator API. It turns out this is a papercut that's been around for a
> while and is most noticeable downstream because datetime arrays cannot
> be passed to Cython.
>
> Here's an example of a downstream library working around lack of
> support in Cython for datetimes using an iterator: [2]. Pandas works
> around this by passing int64 views of the arrays to Cython. I think
> this issue will become more problematic in the future when NumPy
> officially ships the NEP 42 custom dtype API, which will make it much
> easier to develop custom data types. It is also already an issue for
> the legacy custom data types numpy already supports [3], but those
> aren't very popular so it hasn't come up much.
>
> I'm curious if there's any appetite among the Cython developers to
> ultimately make it easier to write cython code that works with numpy
> arrays that have user-defined data types.
>
> -Nathan
>
> [1] https://github.com/numpy/numpy-user-dtypes/tree/main/stringdtype
> [2] https://github.com/scikit-hep/awkward/issues/367
> [3] https://github.com/numpy/numpy/issues/18442
> [4] https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256
> [5] https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256/3
> [6] https://arrow.apache.org/docs/format/Columnar.html

From stefan_ml at behnel.de Thu Jul 13 01:31:07 2023
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 13 Jul 2023 07:31:07 +0200
Subject: [Cython] Cython 3.0 RC 2 released
Message-ID:

Hi all,

after close to five long years, we're almost there - I've pushed a
release candidate for Cython 3.0 with a long list of bug fixes
(followed by a second one with one important fix).

https://cython.readthedocs.io/en/latest/src/changes.html

Please give it some final testing. Unless we find something really
serious in the RC2 release, the changes for the final release will be
very limited and safe, hopefully none at all.

The RC is just in time for this week's US-SciPy, and I'll make sure we
have a final release for next week's EuroPython in Praha.

Have fun,
Stefan

From stefan_ml at behnel.de Mon Jul 17 11:24:36 2023
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 17 Jul 2023 17:24:36 +0200
Subject: [Cython] Cython 3.0 final released
Message-ID:

Hi all,

after close to five long years, I'm proud to announce the release of
Cython 3.0. It's done. It's out. Finally!

The full list of improvements compared to the 0.29.x release series is
entirely incredible.

https://cython.readthedocs.io/en/latest/src/changes.html

Cython 3.0 is better than any other Cython release before, in all
aspects.
It's much more Python, integrates better with C and C++, supports more
Python implementations and configurations, provides many great new
language features - it's faster, safer and easier to use. It's simply
better.

New language features include:

- Python 3 syntax and semantics by default
- Cython type annotations in plain Python code
- automatic NumPy ufunc generation
- fast @dataclass and @total_ordering extension types
- safe exception propagation in C functions by default
- Unicode identifiers in Cython code

All of this wouldn't have been possible without the help of the many,
many people who contributed code and documentation, tested features,
found and described bugs, and helped debug problems. Those who started
using Cython in new environments, new build systems, new use cases,
and helped to get it working there. Who proposed new features or found
mismatches and gaps in the existing set of features. Thank you all,
you helped make Cython 3.0 an awesome language!

Along the way, we added two people to the list of Cython developers.

* David Woods has contributed a tremendous list of features and fixes
  to this release. It would honestly not have been possible without
  his efforts.

* Matúš Valo has put a lot of work into the documentation and the pure
  Python mode. He found many issues that make Cython now easier and
  more consistent to use from Python code.

Thank you both for your contributions. I'm happy to work together with
you.

Everyone, have fun using Cython 3.0, and whatever good comes after it.

Best,
Stefan