From nathan.goldbaum at gmail.com  Thu Jul  6 12:43:44 2023
From: nathan.goldbaum at gmail.com (Nathan)
Date: Thu, 6 Jul 2023 10:43:44 -0600
Subject: [Cython] Appetite for working with upstream to extend the buffer
 protocol?
Message-ID: <CAJXewOms2uh2fswbp837-4YMMm1XhMX4RZnc2GOBuyWTK1x9uA@mail.gmail.com>

Hi all,

I'm working on a new data type for numpy to represent arrays of
variable-width strings [1]. One limitation of using this data type right
now is that it's not possible to write idiomatic Cython code operating on the
array; instead one would need to use e.g. the NumPy iterator API. It turns
out this is a papercut that's been around for a while and is most
noticeable downstream because datetime arrays cannot be passed to Cython.

Here's an example of a downstream library working around lack of support in
Cython for datetimes using an iterator: [2]. Pandas works around this by
passing int64 views of the arrays to Cython. I think this issue will become
more problematic in the future when NumPy officially ships the NEP 42
custom dtype API, which will make it much easier to develop custom data
types. It is also already an issue for the legacy custom data types numpy
already supports [3], but those aren't very popular so it hasn't come up
much.

I'm curious if there's any appetite among the Cython developers to
ultimately make it easier to write cython code that works with numpy arrays
that have user-defined data types. Currently it's only possible to write
code using the numpy or typed memoryview interfaces for arrays with
datatypes that support the buffer protocol. See e.g.
https://github.com/numpy/numpy/issues/4983.

One approach to fix this would be to either officially or unofficially
extend the buffer protocol to allow arbitrary typecodes to be sent in the
format string. Officially Python only supports format codes used in the
struct module, but in practice you can put any string in the format code
and memoryview will accept it. Of course for it actually to be useful NumPy
would need to create format codes that allow cython to correctly read and
reconstruct the type.
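
A minimal stdlib-only sketch of that behaviour, assuming ctypes as a stand-in
exporter (NumPy is not involved here, and `Point` is a made-up type): ctypes
publishes a PEP 3118 "T{...}" struct code that the struct module cannot parse,
and memoryview carries it through unchanged without being able to unpack
items.

```python
import ctypes
import struct


class Point(ctypes.Structure):
    # Hypothetical element type, used only for this illustration.
    _fields_ = [("x", ctypes.c_int32), ("y", ctypes.c_double)]


points = (Point * 3)()        # a C array of three structs
view = memoryview(points)     # memoryview accepts the exporter's format as-is

print(view.format)            # a "T{...}" code, not a struct-module code
print(view.shape, view.itemsize)

# The struct module rejects the format string ...
try:
    struct.calcsize(view.format)
    struct_understands = True
except struct.error:
    struct_understands = False

# ... and memoryview cannot unpack individual items either.
try:
    view[0]
    can_index = True
except NotImplementedError:
    can_index = False

# The raw bytes are still reachable, so a consumer that knows the layout
# (as Cython would need to) could interpret them itself.
raw = view.cast("B")
```

So the format string travels fine; what is missing is an agreed meaning for
it on the consumer side.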

Sebastian Berg proposed this on the CPython discussion forum [4] and there
hasn't been much response from upstream. In response to Sebastian, Michael
Droettboom suggested [5] using the Arrow data format, which has rich
support for various array memory layouts and has support for exchanging
custom extension types [6].

The main problem with the buffer protocol approach is defining the protocol
in such a way that Cython can appropriately reconstruct the memory layout
for the data type (although only supporting strided arrays at first makes a
lot of sense) for an arbitrary user-defined data type, ideally without
needing to import any code defining the data type.
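
To make the strided case concrete, here is a small sketch (stdlib only; the
helper name `item_offset` and the 2x3 example are assumptions for
illustration). Locating an item needs only the strides; the data type only
matters when the bytes at that offset are decoded.

```python
import struct
from array import array


def item_offset(index, strides, offset=0):
    """Byte offset of the item at an N-dimensional index in a strided buffer."""
    return offset + sum(i * s for i, s in zip(index, strides))


# A 2x3 C-contiguous array of doubles, flattened into one buffer.
data = array("d", [0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
raw = memoryview(data).cast("B")
itemsize = data.itemsize
shape = (2, 3)
strides = (shape[1] * itemsize, itemsize)   # row-major strides in bytes

# Find element [1, 2] from the strides alone, then decode it as a double.
pos = item_offset((1, 2), strides)
(value,) = struct.unpack_from("d", raw, pos)
```

For a user-defined data type the lookup step would stay the same; only the
final decode step needs to know what the item is.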

The main problem with the approach using Apache Arrow is neither Cython nor
NumPy has any support for it and I don't think either library can depend on
Arrow so both would need to write custom serializers and parsers, whereas
Cython already has memoryviews fully working.
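
For context, the Arrow C data interface (as opposed to the full Arrow
libraries) describes an array's type with a single C struct, `ArrowSchema`.
Below is a ctypes transcription of that struct as I understand the published
spec; treat the field list as a sketch to check against the Arrow
documentation rather than a reference. The `b"l"` format code and the name
`ticks` are just example values.

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    pass  # fields are assigned below because the struct refers to itself


ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),        # type description string
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowSchema))),
    ("private_data", ctypes.c_void_p),  # opaque producer-specific data
]

# Describe a childless column; b"l" is used here as the int64 format code.
schema = ArrowSchema(format=b"l", name=b"ticks")
has_dictionary = bool(schema.dictionary)   # null pointer -> False
```

A producer and consumer would still each need code to fill in and read these
structs, which is the custom work referred to above.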

Guido van Rossum wanted some more discussion about this, so I'm raising
this as an issue here in case any Cython developers are interested. Please
chime in on the Python Discourse thread if so.

-Nathan

[1] https://github.com/numpy/numpy-user-dtypes/tree/main/stringdtype
[2] https://github.com/scikit-hep/awkward/issues/367
[3] https://github.com/numpy/numpy/issues/18442
[4]
https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256
[5]
https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256/3
[6] https://arrow.apache.org/docs/format/Columnar.html

From dw-git at d-woods.co.uk  Sat Jul  8 15:22:52 2023
From: dw-git at d-woods.co.uk (da-woods)
Date: Sat, 8 Jul 2023 20:22:52 +0100
Subject: [Cython] Appetite for working with upstream to extend the buffer
 protocol?
In-Reply-To: <f1190364-27b3-3da2-8ff9-b02100ff0589@d-woods.co.uk>
References: <f1190364-27b3-3da2-8ff9-b02100ff0589@d-woods.co.uk>
Message-ID: <86e23d52-fa80-f38b-c743-44ac0ea09368@d-woods.co.uk>

So my superficial thoughts:

1. The buffer protocol has two bits. The first says "given this 
predictable memory layout, you can look up an item in memory with these 
rules"; the second describes what the items in memory are. I think 
you're only proposing to change the second part of it. I'd encourage you 
not to change the first part - the nice thing about the first part is 
that it's relatively simple and doesn't try to do anything. For example 
I'd be sceptical about trying to support ragged arrays.

2. As you identify, for a more advanced memoryview to be useful in 
Cython, Cython really has to be able to know an underlying C type for 
your data at compile-time and be able to validate that the buffer it's 
passed matches that C type at runtime. The validation could have varying 
degrees of strictness (i.e. in the worst case we could just check the 
size matches and trust the user). We already support that to an extent 
(packed structs with structured arrays) but that doesn't cover everything.

3. For your variable length string example, the C struct to use is 
fairly obvious (just your `struct ss`). The difficult bit is likely to 
be memory management of that. I'd kind of encourage you not to expect 
Cython to handle the memory management for this type of thing (i.e. it 
can expose the struct to the user, but it becomes the user's own problem 
to work out if they need to allocate memory when they modify the struct).

5. Things like the datetime for Pandas, or a way of having a float16 
type seems like the sort of thing we should definitely be able to do.

6. In terms of Apache Arrow - if there was demand we probably could add 
support for it. Their documentation says: "The Arrow C data interface 
defines a very small, stable set of C definitions that can be easily 
/copied/ in any project's source code" - so that suggests it need not be 
a dependency.

7. One of the points of the "typed memoryview" vs the older "np.ndarray" 
interface is that it was supposed to be more generally compatible. 
While we could extend it to match any non-standard additions that Numpy 
tries to make, that does feel dodgy and likely to conflict when other 
projects do their own thing. I think it would be better if the Python 
standard could be extended (even if it was just something like a code to 
indicate "mystery structure of size X")
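
The "varying degrees of strictness" in point 2 could look roughly like the
sketch below. It is plain Python with ctypes standing in for the C type that
would be known at compile time; `Pair` and `check_buffer` are hypothetical
names, not Cython API.

```python
import ctypes
from array import array


class Pair(ctypes.Structure):
    # Hypothetical stand-in for the C type a function was compiled against.
    _fields_ = [("a", ctypes.c_double), ("b", ctypes.c_double)]


def check_buffer(view, ctype, expected_format=None):
    """Validate a buffer against a known C type.

    Lax mode (expected_format=None) only checks the item size and trusts
    the caller; strict mode also requires the format string to match.
    """
    if view.itemsize != ctypes.sizeof(ctype):
        raise TypeError(
            f"item size {view.itemsize} != sizeof({ctype.__name__}) "
            f"= {ctypes.sizeof(ctype)}"
        )
    if expected_format is not None and view.format != expected_format:
        raise TypeError(f"format {view.format!r} != {expected_format!r}")
    return True


pairs = memoryview((Pair * 4)())
doubles = memoryview(array("d", [1.0, 2.0]))

lax_ok = check_buffer(pairs, Pair)                    # size matches
strict_ok = check_buffer(pairs, Pair, pairs.format)   # format matches too

try:
    check_buffer(doubles, Pair)                       # 8-byte items vs 16
    mismatch_caught = False
except TypeError:
    mismatch_caught = True
```

A "mystery structure of size X" code as in point 7 would correspond to the
lax branch: the size check is all the consumer can verify.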

Don't know if these thoughts are useful. They're a bit scattered. I 
guess the summary is "we could definitely do more with custom data 
types, but don't break the things that made the buffer protocol nice".

David



From stefan_ml at behnel.de  Thu Jul 13 01:31:07 2023
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 13 Jul 2023 07:31:07 +0200
Subject: [Cython] Cython 3.0 RC 2 released
Message-ID: <e73a16ad-b573-9e6c-9c4a-7cb7f7dbbd3c@behnel.de>

Hi all,

after close to five long years, we're almost there - I've pushed a release 
candidate for Cython 3.0 with a long list of bug fixes (followed by a 
second one with one important fix).

https://cython.readthedocs.io/en/latest/src/changes.html

Please give it some final testing. Unless we find something really serious 
in the RC2 release, the changes for the final release will be very limited 
and safe, hopefully none at all.

The RC is just in time for this week's US-SciPy, and I'll make sure we have 
a final release for next week's EuroPython in Praha.

Have fun,
Stefan

From stefan_ml at behnel.de  Mon Jul 17 11:24:36 2023
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 17 Jul 2023 17:24:36 +0200
Subject: [Cython] Cython 3.0 final released
Message-ID: <afc3b27a-76c6-319f-c190-f6515299f74f@behnel.de>

Hi all,

after close to five long years, I'm proud to announce the release of
Cython 3.0. It's done. It's out. Finally!

The full list of improvements compared to the 0.29.x release series is 
entirely incredible.

https://cython.readthedocs.io/en/latest/src/changes.html

Cython 3.0 is better than any other Cython release before, in all aspects. 
It's much more Python, integrates better with C and C++, supports more 
Python implementations and configurations, provides many great new language 
features - it's faster, safer and easier to use. It's simply better.

New language features include:

- Python 3 syntax and semantics by default
- Cython type annotations in plain Python code
- automatic NumPy ufunc generation
- fast @dataclass and @total_ordering extension types
- safe exception propagation in C functions by default
- Unicode identifiers in Cython code

All of this wouldn't have been possible without the help of the many, many 
people who contributed code and documentation, tested features, found and 
described bugs, and helped debug problems. Those who started using Cython 
in new environments, new build systems, new use cases, and helped to get it 
working there. Who proposed new features or found mismatches and gaps in 
the existing set of features.

Thank you all, you helped make Cython 3.0 an awesome language!

Along the way, we added two people to the list of Cython developers.

* David Woods has contributed a tremendous list of features and fixes to 
this release. It would honestly not have been possible without his efforts.

* Matúš Valo has put a lot of work into the documentation and the pure 
Python mode. He found many issues that make Cython now easier and more 
consistent to use from Python code.

Thank you both for your contributions. I'm happy to work together with you.

Everyone, have fun using Cython 3.0, and whatever good comes after it.

Best,
Stefan