From melissawm at gmail.com  Mon Mar 2 09:30:41 2020
From: melissawm at gmail.com (Melissa Mendonça)
Date: Mon, 2 Mar 2020 11:30:41 -0300
Subject: [Numpy-discussion] Proposal to accept NEP #44: Restructuring the NumPy Documentation
In-Reply-To: <004c39a1-49dd-4a77-a6c1-3bdce2fbfad3@www.fastmail.com>
References: <004c39a1-49dd-4a77-a6c1-3bdce2fbfad3@www.fastmail.com>
Message-ID: 

Hello all,

Since there were no objections raised, I propose this NEP is accepted.
I'll be submitting the PR with the final links to this thread in a few
minutes.

Best,

Melissa

On Fri, Feb 21, 2020 at 3:30 PM Stefan van der Walt wrote:

> On Wed, Feb 19, 2020, at 03:58, Melissa Mendonça wrote:
>
> I am proposing the acceptance of NEP 44 - Restructuring the NumPy
> Documentation.
>
> https://numpy.org/neps/nep-0044-restructuring-numpy-docs.html
>
>
> Thanks, Melissa, for developing this NEP! The plan makes sense to me.
>
> Best regards,
> Stéfan
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>

--
Melissa Weber Mendonça

-- "Knowledge is knowing a tomato is a fruit; wisdom is knowing you
don't put tomato in fruit salad."

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sebastian at sipsolutions.net  Tue Mar 3 14:01:27 2020
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Tue, 03 Mar 2020 11:01:27 -0800
Subject: [Numpy-discussion] NumPy Community Meeting Wednesday, March 4
Message-ID: <83edf73a17cc3baf6ce565aae7bc980410401eb3.camel@sipsolutions.net>

Hi all,

There will be a NumPy Community meeting Wednesday March 4 at 11 am
Pacific Time. Everyone is invited and encouraged to join in and edit
the work-in-progress meeting topics and notes:
https://hackmd.io/76o-IxCjQX2mOXO_wwkcpg?both

Best wishes

Sebastian

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part
URL: 

From sebastian at sipsolutions.net  Tue Mar 3 18:34:00 2020
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Tue, 03 Mar 2020 15:34:00 -0800
Subject: [Numpy-discussion] NEP 37: A dispatch protocol for NumPy-like modules
In-Reply-To: <39784d9f-17c0-d5b4-4575-3ad2826ea3ce@gmail.com>
References: <21692339-9f4b-029c-d422-ea549acbe6c3@gmail.com>
 <1cfce715d48b847e91739c2a56b9750f15b1958f.camel@sipsolutions.net>
 <39784d9f-17c0-d5b4-4575-3ad2826ea3ce@gmail.com>
Message-ID: 

On Fri, 2020-02-28 at 11:28 -0500, Allan Haldane wrote:
> On 2/23/20 6:59 PM, Ralf Gommers wrote:
> > One of the main rationales for the whole NEP, and the argument in
> > multiple places
> > (https://numpy.org/neps/nep-0037-array-module.html#opt-in-vs-opt-out-for-users)
> > is that it's now opt-in while __array_function__ was opt-out. This
> > isn't really true - the problem is simply *moved*, from the duck
> > array libraries to the array-consuming libraries. The end user will
> > still see the backwards incompatible change, with no way to turn it
> > off. It will be easier with __array_module__ to warn users, but
> > this should be expanded on in the NEP.
>
> Might it be possible to flip this NEP back to opt-out while keeping
> the nice simplifications and configurable array-creation routines,
> relative to __array_function__?
>
> That is, what if we define two modules, "numpy" and "numpy_strict".
> "numpy_strict" would raise an exception on duck-arrays defining > __array_module__ (as numpy currently does). "numpy" would be a > wrapper > around "numpy_strict" that decorates all numpy methods with a call to > "get_array_module(inputs).func(inputs)". This would be possible, but I think we strongly leaned against the idea. Basically, if you have to opt-out, from a library perspective there may be `np.asarray` calls, which for example later call into C and expect arrays. So, I have large doubts that an opt-out solution works easily for library authors. Array function is opt-out, but effectively most clean library code already opted out... We had previously discussed the opposite, of having a namespace of implicit dispatching based on get_array_module, but if we keep array function around, I am not sure there is much reason for it. > > Then end-user code that did "import numpy as np" would accept > ducktypes > by default, while library developers who want to signal they don't > support ducktypes can opt-out by doing "import numpy_strict as np". > Issues with `np.as_array` seem mitigated compared to > __array_function__ > since that method would now be ducktype-aware. My tendency is that if we want to go there, we would need to push ahead with the `np.duckarray()` idea instead. To be clear: I currently very much prefer the get_array_module() idea. It just seems much cleaner for library authors, and they are the primary issue at the moment in my opinion. - Sebastian > -Allan > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From sebastian at sipsolutions.net Tue Mar 3 19:20:31 2020 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Tue, 03 Mar 2020 16:20:31 -0800 Subject: [Numpy-discussion] NEP 37: A dispatch protocol for NumPy-like modules In-Reply-To: References: <21692339-9f4b-029c-d422-ea549acbe6c3@gmail.com> <1cfce715d48b847e91739c2a56b9750f15b1958f.camel@sipsolutions.net> Message-ID: <90692cc4a2f009b061e55d8fe9b118f709ff1375.camel@sipsolutions.net> On Sun, 2020-02-23 at 22:44 -0800, Stephan Hoyer wrote: > On Sun, Feb 23, 2020 at 3:59 PM Ralf Gommers > wrote: > > > > On Sun, Feb 23, 2020 at 3:31 PM Stephan Hoyer > > wrote: > > > On Thu, Feb 6, 2020 at 12:20 PM Sebastian Berg < > > > sebastian at sipsolutions.net> wrote: > > > > > > I don't think NumPy needs to do anything about warnings. It is > > > straightforward for libraries that want to use use > > > get_array_module() to issue their own warnings before calling > > > get_array_module(), if desired. > > > > > > Or alternatively, if a library is about to add a new > > > __array_module__ method, it is straightforward to issue a warning > > > inside the new __array_module__ method before returning the NumPy > > > functions. > > > > > > > I don't think this is quite enough. Sebastian points out a fairly > > important issue. One of the main rationales for the whole NEP, and > > the argument in multiple places ( > > https://numpy.org/neps/nep-0037-array-module.html#opt-in-vs-opt-out-for-users > > ) is that it's now opt-in while __array_function__ was opt-out. > > This isn't really true - the problem is simply *moved*, from the > > duck array libraries to the array-consuming libraries. 
> > The end user will still see the backwards incompatible change,
> > with no way to turn it off. It will be easier with __array_module__
> > to warn users, but this should be expanded on in the NEP.
>
> Ralf, thanks for sharing your thoughts.
>
> I'm not quite sure I understand the concerns about backwards
> incompatibility:
> 1. The intention is that implementing a __array_module__ method
> should be backwards compatible with all current uses of NumPy. This
> satisfies backwards compatibility concerns for an array-implementing
> library like JAX.
> 2. In contrast, calling get_array_module() offers no guarantees about
> backwards compatibility. This seems nearly impossible, because the
> entire point of the protocol is to make it possible to opt-in to new
> behavior. So backwards compatibility isn't solved for Scikit-Learn
> switching to use get_array_module(), and after Scikit-Learn does so,
> adding __array_module__ to new types of arrays could potentially have
> backwards incompatible consequences for Scikit-Learn (unless sklearn
> uses default=None).
>
> Are you suggesting just adding something like what I'm writing here
> into the NEP? Perhaps along with advice to consider issuing warnings
> inside __array_module__ and falling back to legacy behavior when
> first implementing it on a new type?

I think that should be sufficient, personally. We could mention that
scikit-learn will likely use a context manager to do this. We can also
think about providing a global default (which sklearn can use as its
own default if they wish so, but that is reserved to the end-user).
That would be a small amendment, and I think we could add it even
after accepting the NEP as it is.

> We could also potentially make a few changes to make backwards
> compatibility even easier, by making the protocol less aggressive
> about assuming that NumPy is a safe fallback. Some non-exclusive
> options:
> a. We could switch the default value of "default" on
> get_array_module() to None, so an exception is raised if nothing
> implements __array_module__.

I am not sure that I feel switching the default to None makes much of
a difference to be honest. Unless we use it to signal a super strict
mode similar to b. below.

> b. We could include *all* argument types in "types", not just types
> that implement __array_module__. NumPy's ndarray.__array_module__
> could then recognize and refuse to return an implementation if there
> are other arguments that might implement __array_module__ in the
> future (e.g., anything outside the standard library?).

That is a good point, anything that is not NumPy recognized could
simply be rejected. It does mean that you have to call
`module.asarray()` manually more often though. For `list`, it could
also make sense to just add np.ndarray to types.

If we want to be conservative, maybe we could also just error before
calling `__array_module__`. Whenever there is something that we do not
know how to interpret, force the user to clarify?

> The downside of making either of these choices is that it would
> potentially make get_array_module() a bit less usable, because it is
> more likely to fail, e.g., if called on a float, or some custom type
> that should be treated as a scalar.

Right, although we could relax it later if it seems overly annoying.

> > Also, I'm still not sure I agree with the tone of the discussion
> > on this topic.
> > It's very heavily inspired by what the JAX devs are telling you
> > (the NEP still says PyTorch and scipy.sparse as well, but that's
> > not true in both cases). If you ask Dask and CuPy for example,
> > they're quite happy with __array_function__ and there haven't been
> > many complaints about backwards compat breakage.
>
> I'm linking to comments you wrote in reference to PyTorch and
> scipy.sparse in the current draft of the NEP, so I certainly want to
> make sure that you agree with my characterization :).
>
> Would it be fair to say:
> - JAX is reluctant to implement __array_function__ because of
> concerns about breaking existing code. JAX developers think that when
> users use NumPy functions on JAX arrays, they are explicitly choosing
> to convert from JAX to NumPy. This model is fundamentally
> incompatible with __array_function__, which we chose to override the
> existing numpy namespace.
> - PyTorch and scipy.sparse are not yet in position to implement
> __array_function__ (due to a lack of a direct implementation of
> NumPy's API), but these projects take backwards compatibility
> seriously.
>
> Does "take backwards compatibility seriously" sound about right to
> you? I'm very open to specific suggestions here. (TensorFlow could
> probably also be safely added to this second list.)

This will need input from Ralf; my personal main concern is backward
compatibility in libraries: I am pretty sure sklearn would only use a
potential `np.asduckarray` when the user opted in. But in that case my
personal feeling is that the `get_array_module` solution is cleaner
and makes it easier to expand functionality slowly (for libraries).

Two other points:

First, I am wondering if we should add something like a `__qualname__`
to the contract. I.e. a returned module must have a well defined
`module.__name__` (that is usually already correct), so that sklearn
could do:

    module = np.get_array_module(*arrays)
    if module.__name__ not in ("numpy", "sparse"):
        raise TypeError("Currently only numpy and sparse are supported")

if they wish so (that is trivial, but if you return a class acting as
a module it may be important).

Second, we have to make progress on whether or not the "restricted"
namespace idea should have priority. My personal opinion is tending
strongly towards no. The NumPy version should normally be older than
other libraries, and if NumPy updates the API so do the downstream
implementers. E.g. dask may have to provide multiple versions of the
same function depending on the installed NumPy version, but that seems
OK to me? It is just as downstream libraries currently have to support
multiple NumPy versions. We could add a contract that the first time
`get_array_module` is used to e.g. get the dask namespace and the
NumPy version is too new, a warning should be given. The practical
thing seems to me that we ignore this for the moment (as something we
can do later on)? If there is missing API, in most cases an
AttributeError will be raised, which could provide some additional
information to the user.

The only alternative seems to be the complete opposite: create a new
module, and make even NumPy only one of the implementers of that new
(restricted) module. That may be cleaner, but I fear that it is
impractical to be honest.

I will put this on the agenda for tomorrow, even if we discuss it only
very briefly. My feeling (and hope) is that we are nearing a point
where we can make a final decision.
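For anyone skimming the thread, here is a rough, self-contained sketch
of the dispatch that NEP 37 describes (simplified; the names follow the
NEP draft, and this is not the actual NumPy implementation):

    import numpy as np

    def get_array_module(*arrays, default=np):
        # Simplified sketch of the NEP 37 protocol, not the real code.
        # Collect the distinct types implementing the protocol,
        # keeping the order in which they first appear.
        implementing = []
        for array in arrays:
            t = type(array)
            if hasattr(t, "__array_module__") and t not in implementing:
                implementing.append(t)
        types = tuple(implementing)
        # Ask each type in turn; the first one that does not return
        # NotImplemented provides the namespace to use.
        for t in types:
            module = t.__array_module__(types)
            if module is not NotImplemented:
                return module
        # Nothing implements the protocol: fall back to plain NumPy.
        return default

A library function would then be written against the returned module:

    def library_mean(x):
        module = get_array_module(x)
        return module.mean(module.asarray(x))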
Best,

Sebastian

> > Best,
> Stephan
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part
URL: 

From arj.python at gmail.com  Thu Mar 5 02:14:53 2020
From: arj.python at gmail.com (Abdur-Rahmaan Janhangeer)
Date: Thu, 5 Mar 2020 11:14:53 +0400
Subject: [Numpy-discussion] Good use of __dunder__ methods in numpy
Message-ID: 

Greetings list,

I have a talk about dunder methods in Python
(https://conference.mscc.mu/speaker/67604187-57c3-4be6-987c-ea4bef388ad3)
and it would be nice to include NumPy in the mix. Can someone point me
to one or two use cases / file links where dunder methods help NumPy?

Thanks

Fun info: I am a tiny NumPy contributor with a one-line merge.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sebastian at sipsolutions.net  Thu Mar 5 13:31:30 2020
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Thu, 05 Mar 2020 10:31:30 -0800
Subject: [Numpy-discussion] Good use of __dunder__ methods in numpy
In-Reply-To: 
References: 
Message-ID: <97371f612474f98cd85df3a427acbbe7f6f78746.camel@sipsolutions.net>

Hi,

On Thu, 2020-03-05 at 11:14 +0400, Abdur-Rahmaan Janhangeer wrote:
> Greetings list,
>
> I have a talk about dunder methods in Python
> (https://conference.mscc.mu/speaker/67604187-57c3-4be6-987c-ea4bef388ad3)
> and it would be nice to include NumPy in the mix. Can someone point
> me to one or two use cases / file links where dunder methods help
> NumPy?

I am not sure in what sense you are looking for. NumPy has its own set
of dunder methods (some of which should not be used super much
probably), like `__array__`, `__array_interface__`, `__array_ufunc__`,
`__array_function__`, `__array_finalize__`, ...
So we are using `__array_*__` for numpy related dunders.

Of course we use most Python defined dunders, but I am not sure that
you are looking for that?

Best,

Sebastian

> Thanks
>
> Fun info: I am a tiny NumPy contributor with a one-line merge.
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part
URL: 

From arj.python at gmail.com  Thu Mar 5 14:01:10 2020
From: arj.python at gmail.com (Abdur-Rahmaan Janhangeer)
Date: Thu, 5 Mar 2020 23:01:10 +0400
Subject: [Numpy-discussion] Good use of __dunder__ methods in numpy
In-Reply-To: <97371f612474f98cd85df3a427acbbe7f6f78746.camel@sipsolutions.net>
References: <97371f612474f98cd85df3a427acbbe7f6f78746.camel@sipsolutions.net>
Message-ID: 

Ah, thank you for the info; yes, I'm talking about normal methods. If I
can get a link to a file that shows how dunder methods help with having
cool coding APIs, that would be great!

Thanks!
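For example, the kind of thing I mean, where dunder methods are what
make the API feel nice (a tiny illustration only, all of this works
today):

    import numpy as np

    a = np.array([[1.0, 2.0], [3.0, 4.0]])
    b = np.array([10.0, 20.0])

    a + b      # __add__, with broadcasting
    a @ b      # __matmul__, the PEP 465 "@" operator
    a[a > 2]   # __gt__ builds a boolean mask, __getitem__ applies it
    len(a)     # __len__, the length of the first axis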
Yours,

Abdur-Rahmaan Janhangeer
pythonmembers.club | github
Mauritius

On Thu, Mar 5, 2020 at 10:32 PM Sebastian Berg wrote:

> Hi,
>
> On Thu, 2020-03-05 at 11:14 +0400, Abdur-Rahmaan Janhangeer wrote:
> > Greetings list,
> >
> > I have a talk about dunder methods in Python
> > (https://conference.mscc.mu/speaker/67604187-57c3-4be6-987c-ea4bef388ad3)
> > and it would be nice to include NumPy in the mix. Can someone point
> > me to one or two use cases / file links where dunder methods help
> > NumPy?
>
> I am not sure in what sense you are looking for. NumPy has its own set
> of dunder methods (some of which should not be used super much
> probably), like `__array__`, `__array_interface__`, `__array_ufunc__`,
> `__array_function__`, `__array_finalize__`, ...
> So we are using `__array_*__` for numpy related dunders.
>
> Of course we use most Python defined dunders, but I am not sure that
> you are looking for that?
>
> Best,
>
> Sebastian
>
> > Thanks
> >
> > Fun info: I am a tiny NumPy contributor with a one-line merge.
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at python.org
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From grlee77 at gmail.com  Thu Mar 5 17:13:30 2020
From: grlee77 at gmail.com (Gregory Lee)
Date: Thu, 5 Mar 2020 17:13:30 -0500
Subject: [Numpy-discussion] Good use of __dunder__ methods in numpy
In-Reply-To: 
References: <97371f612474f98cd85df3a427acbbe7f6f78746.camel@sipsolutions.net>
Message-ID: 

On Thu, Mar 5, 2020 at 2:01 PM Abdur-Rahmaan Janhangeer
<arj.python at gmail.com> wrote:

> Ah, thank you for the info; yes, I'm talking about normal methods. If I
> can get a link to a file that shows how dunder methods help with having
> cool coding APIs, that would be great!
>
> Thanks!
>
> Yours,
>
> Abdur-Rahmaan Janhangeer
> pythonmembers.club | github
> Mauritius
>

You may want to take a look at PEP 465 as an example, then. If I recall
correctly, the __matmul__ method described in it was added to the
standard library largely with NumPy in mind.
https://www.python.org/dev/peps/pep-0465/

> On Thu, Mar 5, 2020 at 10:32 PM Sebastian Berg
> wrote:
>
>> Hi,
>>
>> On Thu, 2020-03-05 at 11:14 +0400, Abdur-Rahmaan Janhangeer wrote:
>> > Greetings list,
>> >
>> > I have a talk about dunder methods in Python
>> > (https://conference.mscc.mu/speaker/67604187-57c3-4be6-987c-ea4bef388ad3)
>> > and it would be nice to include NumPy in the mix. Can someone point
>> > me to one or two use cases / file links where dunder methods help
>> > NumPy?
>>
>> I am not sure in what sense you are looking for. NumPy has its own set
>> of dunder methods (some of which should not be used super much
>> probably), like `__array__`, `__array_interface__`, `__array_ufunc__`,
>> `__array_function__`, `__array_finalize__`, ...
>> So we are using `__array_*__` for numpy related dunders.
>>
>> Of course we use most Python defined dunders, but I am not sure that
>> you are looking for that?
>>
>> Best,
>>
>> Sebastian
>>
>> > Thanks
>> >
>> > Fun info: I am a tiny NumPy contributor with a one-line merge.
>> > _______________________________________________
>> > NumPy-Discussion mailing list
>> > NumPy-Discussion at python.org
>> > https://mail.python.org/mailman/listinfo/numpy-discussion
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From arj.python at gmail.com  Thu Mar 5 20:11:05 2020
From: arj.python at gmail.com (Abdur-Rahmaan Janhangeer)
Date: Fri, 6 Mar 2020 05:11:05 +0400
Subject: [Numpy-discussion] Good use of __dunder__ methods in numpy
In-Reply-To: 
References: <97371f612474f98cd85df3a427acbbe7f6f78746.camel@sipsolutions.net>
Message-ID: 

On Fri, 6 Mar 2020, 02:14 Gregory Lee, wrote:

>
> You may want to take a look at PEP 465 as an example, then. If I recall
> correctly, the __matmul__ method described in it was added to the
> standard library largely with NumPy in mind.
> https://www.python.org/dev/peps/pep-0465/
>

Hum, thanks, I just discovered a completely new feature of Py along the
way (the @).

Thanks!

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From gschwind at gnu-log.net  Tue Mar 10 12:25:17 2020
From: gschwind at gnu-log.net (Benoit Gschwind)
Date: Tue, 10 Mar 2020 17:25:17 +0100
Subject: [Numpy-discussion] Handle type conversion in C API
Message-ID: <0b8144d564328ae7a3d83ed6c70cbd079f0e9456.camel@gnu-log.net>

Hello,

I am writing a Python binding for one of our libraries. The binding is
intended to vectorize the function call. For example:

double foo(double, double) will be bound to a Python:

module.foo(<array0>, <array1>)

and the function foo will be called like:

    for (int i = 0; i < size; ++i)
        outarr[i] = foo(inarr0[i], inarr1[i]);

My question is about how I can handle type conversion of the input
arrays, preferably in an efficient manner, given that each input array
may require a different input type.

Currently I basically enforce the type and no type conversion is
performed. But I would like to relax that. I thought of several
possibilities, starting from the obvious solution consisting of
recasting in the inner loop, which would give:

    for (int i = 0; i < size; ++i)
        if (inarr0[i] needs recast)
            in0 = recast(inarr0[i])
        else
            in0 = inarr0[i]
        [... same for all input parameters ...]
        outarr[i] = foo(in0, in1, ...);

This solution is memory efficient, but not actually computationally
efficient.

The second solution is to copy & recast the entire input arrays, but in
that case it's not memory efficient. And my final thought is to mix the
first and the second by chunking the second method, i.e. converting N
inputs in a row, then applying the function to them and so on until all
of the array is processed.

Thus my questions are:
- is there another way to do what I want?
- is there an existing or recommended way to do it?

And a side question: I use PyArray_FROM_OTF, but I do not understand
its semantics well. If I pass a Python list, it is converted to the
desired type and requirements; when I pass a non-contiguous array it is
converted to a contiguous one; but when I pass a NumPy array of another
type than the one specified I do not get the conversion. Is there a
function that does the conversion unconditionally? Did I miss
something?

Thank you in advance for your help

Best regards

From ewm at redtetrahedron.org  Tue Mar 10 13:13:09 2020
From: ewm at redtetrahedron.org (Eric Moore)
Date: Tue, 10 Mar 2020 13:13:09 -0400
Subject: [Numpy-discussion] Handle type conversion in C API
In-Reply-To: <0b8144d564328ae7a3d83ed6c70cbd079f0e9456.camel@gnu-log.net>
References: <0b8144d564328ae7a3d83ed6c70cbd079f0e9456.camel@gnu-log.net>
Message-ID: 

Hi Benoit,

Since you have a function that takes two scalars to one scalar, it
sounds to me as though you would be best off creating a ufunc. This
will then handle the conversion to and looping over the arrays, etc.
for you. The documentation is available here:
https://numpy.org/doc/1.18/user/c-info.ufunc-tutorial.html.

Regards,

Eric

On Tue, Mar 10, 2020 at 12:28 PM Benoit Gschwind
<gschwind at gnu-log.net> wrote:

> Hello,
>
> I am writing a Python binding for one of our libraries. The binding is
> intended to vectorize the function call. For example:
>
> double foo(double, double) will be bound to a Python:
>
> module.foo(<array0>, <array1>)
>
> and the function foo will be called like:
>
>     for (int i = 0; i < size; ++i)
>         outarr[i] = foo(inarr0[i], inarr1[i]);
>
> My question is about how I can handle type conversion of the input
> arrays, preferably in an efficient manner, given that each input array
> may require a different input type.
>
> Currently I basically enforce the type and no type conversion is
> performed. But I would like to relax that. I thought of several
> possibilities, starting from the obvious solution consisting of
> recasting in the inner loop, which would give:
>
>     for (int i = 0; i < size; ++i)
>         if (inarr0[i] needs recast)
>             in0 = recast(inarr0[i])
>         else
>             in0 = inarr0[i]
>         [... same for all input parameters ...]
>         outarr[i] = foo(in0, in1, ...);
>
> This solution is memory efficient, but not actually computationally
> efficient.
>
> The second solution is to copy & recast the entire input arrays, but in
> that case it's not memory efficient. And my final thought is to mix the
> first and the second by chunking the second method, i.e. converting N
> inputs in a row, then applying the function to them and so on until all
> of the array is processed.
>
> Thus my questions are:
> - is there another way to do what I want?
> - is there an existing or recommended way to do it?
>
> And a side question: I use PyArray_FROM_OTF, but I do not understand
> its semantics well. If I pass a Python list, it is converted to the
> desired type and requirements; when I pass a non-contiguous array it is
> converted to a contiguous one; but when I pass a NumPy array of another
> type than the one specified I do not get the conversion. Is there a
> function that does the conversion unconditionally? Did I miss
> something?
>
> Thank you in advance for your help
>
> Best regards
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sebastian at sipsolutions.net  Tue Mar 10 13:57:17 2020
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Tue, 10 Mar 2020 10:57:17 -0700
Subject: [Numpy-discussion] NumPy Development Meeting - Triage Focus
Message-ID: 

Hi all,

(Note that the US has switched to daylight saving time)

Our bi-weekly triage-focused NumPy development meeting is tomorrow
(Wednesday, March 11) at 11 am Pacific Time. Everyone is invited to
join in and edit the work-in-progress meeting topics and notes:
https://hackmd.io/68i_JvOYQfy9ERiHgXMPvg

I encourage everyone to notify us of issues or PRs that you feel
should be prioritized or simply discussed briefly. Just comment on it
so we can label it, or add your PR/issue to this week's topics for
discussion.

Best regards

Sebastian

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part
URL: 

From gschwind at gnu-log.net  Wed Mar 11 04:40:12 2020
From: gschwind at gnu-log.net (Benoit Gschwind)
Date: Wed, 11 Mar 2020 09:40:12 +0100
Subject: [Numpy-discussion] Handle type conversion in C API
In-Reply-To: 
References: <0b8144d564328ae7a3d83ed6c70cbd079f0e9456.camel@gnu-log.net>
Message-ID: <6d1a95fe8d0da50a6ab75bad1d6809588f6fee8d.camel@gnu-log.net>

Hello Eric,

Thank you for pointing out ufunc. I implemented my binding using it;
it's working and it's simpler than my previous implementation, but I
still do not have the flexibility to dynamically recast an input array,
i.e. to use an int64 array as input for an int32. For instance, for
testing my code I use numpy.random, which provides int64 arrays, so I
have to convert the array manually before calling my ufunc, which is
somewhat annoying.

For functions that have 1 or 2 parameters it's practical to have 4
variants of the function, but in the case of 6-8 parameters it becomes
more difficult.

Best regards

On Tuesday, March 10, 2020 at 13:13 -0400, Eric Moore wrote:
> Hi Benoit,
>
> Since you have a function that takes two scalars to one scalar, it
> sounds to me as though you would be best off creating a ufunc. This
> will then handle the conversion to and looping over the arrays, etc.
> for you. The documentation is available here:
> https://numpy.org/doc/1.18/user/c-info.ufunc-tutorial.html.
>
> Regards,
>
> Eric
>
> On Tue, Mar 10, 2020 at 12:28 PM Benoit Gschwind
> <gschwind at gnu-log.net> wrote:
> > Hello,
> >
> > I am writing a Python binding for one of our libraries. The binding
> > is intended to vectorize the function call. For example:
> >
> > double foo(double, double) will be bound to a Python:
> >
> > module.foo(<array0>, <array1>)
> >
> > and the function foo will be called like:
> >
> >     for (int i = 0; i < size; ++i)
> >         outarr[i] = foo(inarr0[i], inarr1[i]);
> >
> > My question is about how I can handle type conversion of the input
> > arrays, preferably in an efficient manner, given that each input
> > array may require a different input type.
> >
> > Currently I basically enforce the type and no type conversion is
> > performed. But I would like to relax that. I thought of several
> > possibilities, starting from the obvious solution consisting of
> > recasting in the inner loop, which would give:
> >
> >     for (int i = 0; i < size; ++i)
> >         if (inarr0[i] needs recast)
> >             in0 = recast(inarr0[i])
> >         else
> >             in0 = inarr0[i]
> >         [... same for all input parameters ...]
> >         outarr[i] = foo(in0, in1, ...);
> >
> > This solution is memory efficient, but not actually computationally
> > efficient.
> >
> > The second solution is to copy & recast the entire input arrays, but
> > in that case it's not memory efficient. And my final thought is to
> > mix the first and the second by chunking the second method, i.e.
> > converting N inputs in a row, then applying the function to them and
> > so on until all of the array is processed.
> >
> > Thus my questions are:
> > - is there another way to do what I want?
> > - is there an existing or recommended way to do it?
> >
> > And a side question: I use PyArray_FROM_OTF, but I do not understand
> > its semantics well. If I pass a Python list, it is converted to the
> > desired type and requirements; when I pass a non-contiguous array it
> > is converted to a contiguous one; but when I pass a NumPy array of
> > another type than the one specified I do not get the conversion. Is
> > there a function that does the conversion unconditionally? Did I
> > miss something?
> >
> > Thank you in advance for your help
> >
> > Best regards
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at python.org
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion

From ewm at redtetrahedron.org  Wed Mar 11 13:52:55 2020
From: ewm at redtetrahedron.org (Eric Moore)
Date: Wed, 11 Mar 2020 13:52:55 -0400
Subject: [Numpy-discussion] Handle type conversion in C API
In-Reply-To: <6d1a95fe8d0da50a6ab75bad1d6809588f6fee8d.camel@gnu-log.net>
References: <0b8144d564328ae7a3d83ed6c70cbd079f0e9456.camel@gnu-log.net>
 <6d1a95fe8d0da50a6ab75bad1d6809588f6fee8d.camel@gnu-log.net>
Message-ID: 

There are a variety of ways to resolve your issues. You can try using
the optional arguments casting, dtype or signature that work for all
ufuncs; see
https://numpy.org/doc/1.18/reference/ufuncs.html#optional-keyword-arguments
This will allow you to override the default type checks. What you'll
find for many ufuncs in, for instance, scipy.special, is that the
underlying function in Cython, C or Fortran is only defined for
doubles, but the function has a signature for f->f, which just casts
the input and output in the loop.

Generally, downcasting is not permitted without explicit opt-in, e.g.
int64 to int32, since this is not a safe cast as there are many values
that an int64 can hold that an int32 cannot. Generally speaking, you do
have to manage the types of your arrays when the defaults aren't what
you want. There really isn't any way around it.

Eric

On Wed, Mar 11, 2020 at 4:43 AM Benoit Gschwind wrote:

> Hello Eric,
>
> Thank you for pointing out ufunc. I implemented my binding using it;
> it's working and it's simpler than my previous implementation, but I
> still do not have the flexibility to dynamically recast an input
> array, i.e. to use an int64 array as input for an int32. For instance,
> for testing my code I use numpy.random, which provides int64 arrays,
> so I have to convert the array manually before calling my ufunc, which
> is somewhat annoying.
>
> For functions that have 1 or 2 parameters it's practical to have 4
> variants of the function, but in the case of 6-8 parameters it becomes
> more difficult.
>
> Best regards
>
> On Tuesday, March 10, 2020 at 13:13 -0400, Eric Moore wrote:
> > Hi Benoit,
> >
> > Since you have a function that takes two scalars to one scalar, it
> > sounds to me as though you would be best off creating a ufunc. This
> > will then handle the conversion to and looping over the arrays, etc.
> > for you. The documentation is available here:
> > https://numpy.org/doc/1.18/user/c-info.ufunc-tutorial.html.
> >
> > Regards,
> >
> > Eric
> >
> > On Tue, Mar 10, 2020 at 12:28 PM Benoit Gschwind
> > <gschwind at gnu-log.net> wrote:
> > > Hello,
> > >
> > > I am writing a Python binding for one of our libraries. The
> > > binding is intended to vectorize the function call. For example:
> > >
> > > double foo(double, double) will be bound to a Python:
> > >
> > > module.foo(<array0>, <array1>)
> > >
> > > and the function foo will be called like:
> > >
> > >     for (int i = 0; i < size; ++i)
> > >         outarr[i] = foo(inarr0[i], inarr1[i]);
> > >
> > > My question is about how I can handle type conversion of the input
> > > arrays, preferably in an efficient manner, given that each input
> > > array may require a different input type.
> > >
> > > Currently I basically enforce the type and no type conversion is
> > > performed. But I would like to relax that. I thought of several
> > > possibilities, starting from the obvious solution consisting of
> > > recasting in the inner loop, which would give:
> > >
> > >     for (int i = 0; i < size; ++i)
> > >         if (inarr0[i] needs recast)
> > >             in0 = recast(inarr0[i])
> > >         else
> > >             in0 = inarr0[i]
> > >         [... same for all input parameters ...]
> > >         outarr[i] = foo(in0, in1, ...);
> > >
> > > This solution is memory efficient, but not actually
> > > computationally efficient.
> > >
> > > The second solution is to copy & recast the entire input arrays,
> > > but in that case it's not memory efficient. And my final thought
> > > is to mix the first and the second by chunking the second method,
> > > i.e. converting N inputs in a row, then applying the function to
> > > them and so on until all of the array is processed.
> > >
> > > Thus my questions are:
> > > - is there another way to do what I want?
> > > - is there an existing or recommended way to do it?
> > >
> > > And a side question: I use PyArray_FROM_OTF, but I do not
> > > understand its semantics well. If I pass a Python list, it is
> > > converted to the desired type and requirements; when I pass a
> > > non-contiguous array it is converted to a contiguous one; but when
> > > I pass a NumPy array of another type than the one specified I do
> > > not get the conversion. Is there a function that does the
> > > conversion unconditionally? Did I miss something?
> > >
> > > Thank you in advance for your help
> > >
> > > Best regards
> > >
> > > _______________________________________________
> > > NumPy-Discussion mailing list
> > > NumPy-Discussion at python.org
> > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at python.org
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sebastian at sipsolutions.net  Wed Mar 11 20:02:14 2020
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Wed, 11 Mar 2020 17:02:14 -0700
Subject: [Numpy-discussion] Proposal: NEP 41 -- First step towards a new Datatype System
Message-ID: <3ee04012570e418a874e4c436b5b6f8ac5f56cdb.camel@sipsolutions.net>

Hi all,

I am pleased to propose NEP 41: First step towards a new Datatype
System

https://numpy.org/neps/nep-0041-improved-dtype-support.html

This NEP motivates the larger restructure of the datatype machinery in
NumPy and defines a few fundamental design aspects. The long term user
impact will be allowing easier creation of more richly featured
user-defined datatypes.
As this is a large restructure, the NEP represents only the first
steps, with some additional information in further NEPs being drafted
[1] (these may be helpful to look at depending on the level of detail
you are interested in).

The NEP itself does not propose to add significant new public API.
Instead it proposes to move forward with an incremental internal
refactor and lays the foundation for this process.

The main user facing change at this time is that datatypes will become
classes (e.g. ``type(np.dtype("float64"))`` will be a float64-specific
class). For most users, the main impact should be many new datatypes
in the long run (see the user impact section). However, for those
interested in API design within NumPy or with respect to implementing
new datatypes, this and the following NEPs represent important
decisions in the future roadmap for NumPy.

The current full text is reproduced below, although the above link is
probably a better way to read it.

Cheers

Sebastian

[1] NEP 40 gives some background information about the current system
and the issues with it:
https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac/doc/neps/nep-0040-legacy-datatype-impl.rst
and NEP 42 is a first draft of how the new API may look:
https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst
(links to current rendered versions, check
https://github.com/numpy/numpy/pull/15505 and
https://github.com/numpy/numpy/pull/15507 for updates)

----------------------------------------------------------------------

=================================================
NEP 41 -- First step towards a new Datatype System
=================================================

:title: Improved Datatype Support
:Author: Sebastian Berg
:Author: Stéfan van der Walt
:Author: Matti Picus
:Status: Draft
:Type: Standard Track
:Created: 2020-02-03


.. note::

    This NEP is part of a series of NEPs encompassing first information
    about the previous dtype implementation and issues with it in
    NEP 40. NEP 41 (this document) then provides an overview and
    generic design choices for the refactor. Further NEPs 42 and 43 go
    into the technical details of the datatype and universal function
    related internal and external API changes. In some cases it may be
    necessary to consult the other NEPs for a full picture of the
    desired changes and why these changes are necessary.


Abstract
--------

Datatypes in NumPy describe how to interpret each element in arrays.
NumPy provides ``int``, ``float``, and ``complex`` numerical types, as
well as string, datetime, and structured datatype capabilities. The
growing Python community, however, has a need for more diverse
datatypes. Examples are datatypes with unit information attached (such
as meters) or categorical datatypes (fixed set of possible values).
However, the current NumPy datatype API is too limited to allow the
creation of these.

This NEP is the first step to enable such growth; it will lead to a
simpler development path for new datatypes. In the long run the new
datatype system will also support the creation of datatypes directly
from Python rather than C. Refactoring the datatype API will improve
maintainability and facilitate development of both user-defined
external datatypes, as well as new features for existing datatypes
internal to NumPy.


Motivation and Scope
--------------------

.. seealso::

    The user impact section includes examples of what kind of new
    datatypes will be enabled by the proposed changes in the long run.
    It may thus help to read these sections out of order.

Motivation
^^^^^^^^^^

One of the main issues with the current API is the definition of
typical functions such as addition and multiplication for parametric
datatypes (see also NEP 40), which require additional steps to
determine the output type. For example when adding two strings of
length 4, the result is a string of length 8, which is different from
the input. Similarly, a datatype which embeds a physical unit must
calculate the new unit information: dividing a distance by a time
results in a speed. A related difficulty is that the :ref:`current
casting rules <ufuncs.casting>` -- the conversion between different
datatypes -- cannot describe casting for such parametric datatypes
implemented outside of NumPy.

This additional functionality for supporting parametric datatypes
introduces increased complexity within NumPy itself, and furthermore
is not available to external user-defined datatypes. In general the
concerns of different datatypes are not well encapsulated. This burden
is exacerbated by the exposure of internal C structures, limiting the
addition of new fields (for example to support new sorting methods
[new_sort]_).

Currently there are many factors which limit the creation of new
user-defined datatypes:

* Creating casting rules for parametric user-defined dtypes is either
  impossible or so complex that it has never been attempted.
* Type promotion, e.g. the operation deciding that adding float and
  integer values should return a float value, is very valuable for
  numeric datatypes but is limited in scope for user-defined and
  especially parametric datatypes.
* Much of the logic (e.g. promotion) is written in single functions
  instead of being split as methods on the datatype itself.
* In the current design datatypes cannot have methods that do not
  generalize to other datatypes. For example a unit datatype cannot
  have a ``.to_si()`` method to easily find the datatype which would
  represent the same values in SI units.

The pressing need to solve these issues has driven the scientific
community to create work-arounds in multiple projects implementing
physical units as an array-like class instead of a datatype, which
would generalize better across multiple array-likes (Dask, pandas,
etc.). Already, Pandas has made a push in the same direction with its
extension arrays [pandas_extension_arrays]_ and undoubtedly the
community would be best served if such new features could be common
between NumPy, Pandas, and other projects.

Scope
^^^^^

The proposed refactoring of the datatype system is a large undertaking
and thus is proposed to be split into various phases, roughly:

* Phase I: Restructure and extend the datatype infrastructure
  (This NEP 41)
* Phase II: Incrementally define or rework API
  (Detailed largely in NEPs 42/43)
* Phase III: Growth of NumPy and Scientific Python Ecosystem
  capabilities.

For a more detailed accounting of the various phases, see "Plan to
Approach the Full Refactor" in the Implementation section below. This
NEP proposes to move ahead with the necessary creation of new dtype
subclasses (Phase I), and start working on implementing current
functionality. Within the context of this NEP all development will be
fully private API or use preliminary underscored names which must be
changed in the future. Most of the internal and public API choices are
part of a second Phase and will be discussed in more detail in the
following NEPs 42 and 43.
The initial implementation of this NEP will have little or no effect
on users, but provides the necessary groundwork for incrementally
addressing the full rework.

The implementation of this NEP, and the following implied large rework
of how datatypes are defined in NumPy, is expected to create small
incompatibilities (see the backward compatibility section). However, a
transition requiring large code adaptation is not anticipated and not
within scope.

Specifically, this NEP makes the following design choices, which are
discussed in more detail in the detailed description section:

1. Each datatype will be an instance of a subclass of ``np.dtype``,
   with most of the datatype-specific logic being implemented as
   special methods on the class. In the C-API, these correspond to
   specific slots. In short, for ``f = np.dtype("f8")``,
   ``isinstance(f, np.dtype)`` will remain true, but ``type(f)`` will
   be a subclass of ``np.dtype`` rather than just ``np.dtype`` itself.
   The ``PyArray_ArrFuncs``, which are currently stored as a pointer
   on the instance (as ``PyArray_Descr->f``), should instead be stored
   on the class as typically done in Python. In the future these may
   correspond to Python-side dunder methods. Storage information such
   as itemsize and byteorder can differ between different dtype
   instances (e.g. "S3" vs. "S8") and will remain part of the
   instance. This means that in the long run the current lowlevel
   access to dtype methods will be removed (see ``PyArray_ArrFuncs``
   in NEP 40).

2. The current NumPy scalars will *not* change; they will not be
   instances of datatypes. This will also be true for new datatypes:
   scalars will not be instances of a dtype (although
   ``isinstance(scalar, dtype)`` may be made to return ``True`` when
   appropriate). Detailed technical decisions to follow in NEP 42.

Further, the public API will be designed in a way that is extensible
in the future:

3. All new C-API functions provided to the user will hide
   implementation details as much as possible. The public API should
   be an identical, but limited, version of the C-API used for the
   internal NumPy datatypes.

The changes to the datatype system in Phase II must include a large
refactor of the UFunc machinery, which will be further defined in
NEP 43:

4. To enable all of the desired functionality for new user-defined
   datatypes, the UFunc machinery will be changed to replace the
   current dispatching and type resolution system. The old system
   should be *mostly* supported as a legacy version for some time.

Additionally, as a general design principle, the addition of new
user-defined datatypes will *not* change the behaviour of programs.
For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or
``b`` know that ``c`` exists.


User Impact
-----------

The current ecosystem has very few user-defined datatypes using NumPy,
the two most prominent being: ``rational`` and ``quaternion``. These
represent fairly simple datatypes which are not strongly impacted by
the current limitations. However, we have identified a need for
datatypes such as:

* bfloat16, used in deep learning
* categorical types
* physical units (such as meters)
* datatypes for tracing/automatic differentiation
* high, fixed precision math
* specialized integer types such as int2, int24
* new, better datetime representations
* extending e.g. integer dtypes to have a sentinel NA value
* geometrical objects [pygeos]_

Some of these are partially solved; for example unit capability is
provided in ``astropy.units``, ``unyt``, or ``pint``, as
`numpy.ndarray` subclasses.
Most of these datatypes, however, simply cannot be reasonably defined
right now. An advantage of having such datatypes in NumPy is that they
should integrate seamlessly with other array or array-like packages
such as Pandas, ``xarray`` [xarray_dtype_issue]_, or ``Dask``.

The long term user impact of implementing this NEP will be to allow
both the growth of the whole ecosystem by having such new datatypes,
as well as consolidating implementation of such datatypes within NumPy
to achieve better interoperability.


Examples
^^^^^^^^

The following examples represent future user-defined datatypes we wish
to enable. These datatypes are not part of the NEP, and the choices
made (e.g. the choice of casting rules) are possibilities we wish to
enable and do not represent recommendations.

Simple Numerical Types
""""""""""""""""""""""

Mainly used where memory is a consideration, lower-precision numeric
types such as ``bfloat16`` are common in other computational
frameworks. For these types the definitions of things such as
``np.common_type`` and ``np.can_cast`` are some of the most important
interfaces. Once they support ``np.common_type``, it is (for the most
part) possible to find the correct ufunc loop to call, since most
ufuncs -- such as add -- effectively only require
``np.result_type``::

    >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)

and `~numpy.result_type` is largely identical to `~numpy.common_type`.

Fixed, high precision math
""""""""""""""""""""""""""

Allowing arbitrary precision or higher precision math is important in
simulations. For instance ``mpmath`` defines a precision::

    >>> import mpmath as mp
    >>> print(mp.dps)  # the current (default) precision
    15

NumPy should be able to construct a native, memory-efficient array
from a list of ``mpmath.mpf`` floating point objects::

    >>> arr_15_dps = np.array(mp.arange(3))  # (mp.arange returns a list)
    >>> print(arr_15_dps)  # Must find the correct precision from the objects:
    array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15])

We should also be able to specify the desired precision when creating
the datatype for an array. Here, we use ``np.dtype[mp.mpf]`` to find
the DType class (the notation is not part of this NEP), which is then
instantiated with the desired parameter. This could also be written as
a ``MpfDType`` class::

    >>> arr_100_dps = np.array([1, 2, 3], dtype=np.dtype[mp.mpf](dps=100))
    >>> print(arr_15_dps + arr_100_dps)
    array(['0.0', '2.0', '4.0'], dtype=mpf[dps=100])

The ``mpf`` datatype can decide that the result of the operation
should be the higher precision one of the two, so it uses a precision
of 100. Furthermore, we should be able to define casting, for example
as in::

    >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype, casting="safe")
    True
    >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype, casting="safe")
    False  # loses precision
    >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype, casting="same_kind")
    True

Casting from float is probably always at least a ``same_kind`` cast,
but in general, it is not safe::

    >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4), casting="safe")
    False

since a float64 has a higher precision than the ``mpf`` datatype with
``dps=4``. Alternatively, we can say that::

    >>> np.common_type(np.dtype[mp.mpf](dps=5), np.dtype[mp.mpf](dps=10))
    np.dtype[mp.mpf](dps=10)

And possibly even::

    >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64)
    np.dtype[mp.mpf](dps=16)  # equivalent precision to float64 (I believe)

since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)``
safely.
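For readers unfamiliar with these functions, the behavior being
generalized here can already be seen with the builtin types (this
snippet is runnable with current NumPy)::

    >>> import numpy as np
    >>> np.can_cast(np.float32, np.float64, casting="safe")
    True
    >>> np.can_cast(np.float64, np.float32, casting="safe")
    False
    >>> np.can_cast(np.float64, np.float32, casting="same_kind")
    True
    >>> np.result_type(np.float32, np.float64)
    dtype('float64')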
Categoricals
""""""""""""

Categoricals are interesting in that they can have fixed, predefined
values, or can be dynamic with the ability to modify categories when
necessary. Fixed categories (defined ahead of time) are the most
straightforward categorical definition. Categoricals are *hard*, since
there are many strategies to implement them, suggesting NumPy should
only provide the scaffolding for user-defined categorical types. For
instance::

    >>> cat = Categorical(["eggs", "spam", "toast"])
    >>> breakfast = array(["eggs", "spam", "eggs", "toast"], dtype=cat)

could store the array very efficiently, since it knows that there are
only 3 categories. Since a categorical in this sense knows almost
nothing about the data stored in it, few operations make sense,
although equality does::

    >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"], dtype=cat)
    >>> breakfast == breakfast2
    array([True, False, True, False])

The categorical datatype could work like a dictionary: no two item
names can be equal (checked on dtype creation), so that the equality
operation above can be performed very efficiently. If the values
define an order, the category labels (internally integers) could be
ordered the same way to allow efficient sorting and comparison.
Whether or not casting is defined from one categorical with fewer
values to one with strictly more values defined is something that the
Categorical datatype would need to decide. Both options should be
available.

Unit on the Datatype
""""""""""""""""""""

There are different ways to define units, depending on how the
internal machinery would be organized. One way is to have a single
Unit datatype for every existing numerical type. This will be written
as ``Unit[float64]``; the unit itself is part of the DType instance,
i.e. ``Unit[float64]("m")`` is a ``float64`` with meters attached::

    >>> from astropy import units
    >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m  # meters
    >>> print(meters)
    array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))

Note that units are a bit tricky. It is debatable whether::

    >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))

should be valid syntax (coercing the float scalars without a unit to
meters). Once the array is created, math will work without any
issue::

    >>> meters / (2 * units.s)
    array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s"))

Casting is not valid from one unit to the other, but can be valid
between different scales of the same dimensionality (although this may
be "unsafe")::

    >>> meters.astype(Unit[float64]("s"))
    TypeError: Cannot cast meters to seconds.
    >>> meters.astype(Unit[float64]("km"))
    >>> # Convert to centimeter-gram-second (cgs) units:
    >>> meters.astype(meters.dtype.to_cgs())

The above notation is somewhat clumsy. Functions could be used instead
to convert between units. There may be ways to make these more
convenient, but those must be left for future discussions::

    >>> units.convert(meters, "km")
    >>> units.to_cgs(meters)

There are some open questions. For example, whether additional methods
on the array object could exist to simplify some of the notions, and
how these would percolate from the datatype to the ``ndarray``.
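For comparison, a sketch of the subclass-based workaround available
today (this assumes ``astropy`` is installed; the exact output format
is illustrative)::

    >>> from astropy import units
    >>> meters = units.Quantity([1.0, 2.0, 3.0], unit=units.m)
    >>> meters / (2 * units.s)
    <Quantity [0.5, 1. , 1.5] m / s>

Here the unit lives on a `numpy.ndarray` subclass rather than on the
datatype, which is the situation the ``Unit[float64]`` datatype above
would improve on.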
The interaction with other scalars would likely be defined through::

    >>> np.common_type(np.float64, Unit)
    Unit[np.float64](dimensionless)

Ufunc output datatype determination can be more involved than for
simple numerical dtypes since there is no "universal" output type::

    >>> np.multiply(meters, seconds).dtype != np.result_type(meters, seconds)

In fact ``np.result_type(meters, seconds)`` must error without context
of the operation being done. This example highlights how the specific
ufunc loop (the loop with known, specific DTypes as inputs) has to be
able to make certain decisions before the actual calculation can
start.


Implementation
--------------

Plan to Approach the Full Refactor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To address these issues in NumPy and enable new datatypes, multiple
development stages are required:

* Phase I: Restructure and extend the datatype infrastructure
  (This NEP)

  * Organize Datatypes like normal Python classes [`PR 15508`]_

* Phase II: Incrementally define or rework API

  * Create a new and easily extensible API for defining new datatypes
    and related functionality. (NEP 42)
  * Incrementally define all necessary functionality through the new
    API (NEP 42):

    * Defining operations such as ``np.common_type``.
    * Allowing casting between datatypes to be defined.
    * Add functionality necessary to create a numpy array from Python
      scalars (i.e. ``np.array(...)``).
    * ...

  * Restructure how universal functions work (NEP 43), in order to:

    * make it possible for a `~numpy.ufunc` such as ``np.add`` to be
      extended by user-defined datatypes such as Units.
    * allow efficient lookup of the correct implementation for
      user-defined datatypes.
    * enable reuse of existing code. Units should be able to use the
      normal math loops and add additional logic to determine the
      output type.

* Phase III: Growth of NumPy and Scientific Python Ecosystem
  capabilities:

  * Cleanup of legacy behaviour where it is considered buggy or
    undesirable.
  * Provide a path to define new datatypes from Python.
  * Assist the community in creating types such as Units or
    Categoricals
  * Allow strings to be used in functions such as ``np.equal`` or
    ``np.add``.
  * Remove legacy code paths within NumPy to improve long term
    maintainability

This document serves as a basis for phase I and provides the vision
and motivation for the full project. Phase I does not introduce any
new user-facing features, but is concerned with the necessary
conceptual cleanup of the current datatype system. It provides a more
"pythonic" datatype Python type object, with a clear class hierarchy.

The second phase is the incremental creation of all APIs necessary to
define fully featured datatypes and the reorganization of the NumPy
datatype system. This phase will thus be primarily concerned with
defining an, initially preliminary, stable public API.

Some of the benefits of a large refactor may only become evident after
the full deprecation of the current legacy implementation (i.e. larger
code removals). However, these steps are necessary for improvements to
many parts of the core NumPy API, and are expected to make the
implementation generally easier to understand.

The following figure illustrates the proposed design at a high level,
and roughly delineates the components of the overall design. Note that
this NEP only regards Phase I (shaded area); the rest encompasses
Phase II, and the design choices are up for discussion. However, it
highlights that the DType datatype class is the central, necessary
concept:
Implementation
--------------

Plan to Approach the Full Refactor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To address these issues in NumPy and enable new datatypes,
multiple development stages are required:

* Phase I: Restructure and extend the datatype infrastructure (This NEP)

  * Organize Datatypes like normal Python classes [`PR 15508`]_

* Phase II: Incrementally define or rework API

  * Create a new and easily extensible API for defining new datatypes
    and related functionality. (NEP 42)

  * Incrementally define all necessary functionality through the new
    API (NEP 42):

    * Defining operations such as ``np.common_type``.
    * Defining casting between datatypes.
    * Adding the functionality necessary to create a numpy array from
      Python scalars (i.e. ``np.array(...)``).
    * ?

  * Restructure how universal functions work (NEP 43), in order to:

    * make it possible for a `~numpy.ufunc` such as ``np.add`` to be
      extended by user-defined datatypes such as Units.

    * allow efficient lookup of the correct implementation for
      user-defined datatypes.

    * enable reuse of existing code. Units should be able to use the
      normal math loops and add additional logic to determine the
      output type.

* Phase III: Growth of NumPy and Scientific Python Ecosystem
  capabilities:

  * Cleanup of legacy behaviour where it is considered buggy or
    undesirable.
  * Provide a path to define new datatypes from Python.
  * Assist the community in creating types such as Units or
    Categoricals.
  * Allow strings to be used in functions such as ``np.equal`` or
    ``np.add``.
  * Remove legacy code paths within NumPy to improve long-term
    maintainability.

This document serves as a basis for Phase I and provides the vision and
motivation for the full project.
Phase I does not introduce any new user-facing features,
but is concerned with the necessary conceptual cleanup of the current
datatype system.
It provides a more "pythonic" datatype: a Python type object with a
clear class hierarchy.

The second phase is the incremental creation of all APIs necessary to
define fully featured datatypes and reorganization of the NumPy
datatype system.
This phase will thus be primarily concerned with defining an (initially
preliminary) stable public API.

Some of the benefits of a large refactor may only become evident after
the full deprecation of the current legacy implementation (i.e. larger
code removals).
However, these steps are necessary for improvements to many parts of
the core NumPy API, and are expected to make the implementation
generally easier to understand.

The following figure illustrates the proposed design at a high level,
and roughly delineates the components of the overall design.
Note that this NEP only covers Phase I (shaded area); the rest
encompasses Phase II, for which the design choices are still up for
discussion. It highlights, however, that the DType class is the
central, necessary concept:

.. image:: _static/nep-0041-mindmap.svg

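The class relationships at the center of the figure can be sketched in
a few lines of plain Python. ``DTypeMeta``, ``DType`` and
``Float64DType`` below are stand-ins for the C-level
``PyArray_DTypeMeta``, ``np.dtype``, and the future float64-specific
class; the sketch is purely illustrative, not the implementation::

    >>> class DTypeMeta(type):
    ...     pass  # stand-in for the C-defined PyArray_DTypeMeta
    ...
    >>> class DType(metaclass=DTypeMeta):
    ...     pass  # stand-in for np.dtype itself
    ...
    >>> class Float64DType(DType):
    ...     pass  # stand-in for the float64-specific DType class
    ...
    >>> f8 = Float64DType()     # plays the role of np.dtype("f8")
    >>> isinstance(f8, DType)   # existing isinstance checks keep working
    True
    >>> type(f8) is DType       # but the exact type becomes a subclass
    False
    >>> issubclass(Float64DType, DType) and isinstance(Float64DType, DTypeMeta)
    True
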
First steps directly related to this NEP
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The required changes to NumPy are large and touch many areas of the
code base, but many of these changes can be addressed incrementally.

To enable an incremental approach we will start by creating a C-defined
``PyArray_DTypeMeta`` class with its instances being the ``DType``
classes, subclasses of ``np.dtype``.
This is necessary to add the ability to store custom slots on the DType
in C.
This ``DTypeMeta`` will be implemented first to then enable incremental
restructuring of current code.

The addition of ``DType`` will then enable addressing other changes
incrementally, some of which may begin before settling the full
internal API:

1. New machinery for array coercion, with the goal of enabling user
   DTypes with appropriate class methods.
2. The replacement or wrapping of the current casting machinery.
3. Incremental redefinition of the current ``PyArray_ArrFuncs`` slots
   into DType method slots.

At this point, no or only very limited new public API will be added and
the internal API is considered to be in flux.
Any new public API may be set up to give warnings and will have leading
underscores to indicate that it is not finalized and can be changed
without warning.


Backward compatibility
----------------------

While the actual backward compatibility impact of implementing Phase I
and II is not yet fully clear, we anticipate, and accept, the following
changes:

* **Python API**:

  * ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``, while
    right now ``type(np.dtype("f8")) is np.dtype``.
    Code should use ``isinstance`` checks, and in very rare cases may
    have to be adapted to do so.

* **C-API**:

  * In old versions of NumPy ``PyArray_DescrCheck`` is a macro which
    uses ``type(dtype) is np.dtype``. When compiling against an old
    NumPy version, the macro may have to be replaced with the
    corresponding ``PyObject_IsInstance`` call. (If this is a problem,
    we could backport fixing the macro.)

  * The UFunc machinery changes will break *limited* parts of the
    current implementation. Replacing e.g. the default ``TypeResolver``
    is expected to remain supported for a time, although optimized
    masked inner-loop iteration (which is not even used *within* NumPy)
    will no longer be supported.

  * All functions currently defined on the dtypes, such as
    ``PyArray_Descr->f->nonzero``, will be defined and accessed
    differently. This means that in the long run low-level access code
    will have to be changed to use the new API. Such changes are
    expected to be necessary in very few projects.

* **dtype implementors (C-API)**:

  * The array which is currently provided to some functions (such as
    cast functions) will no longer be provided.
    For example ``PyArray_Descr->f->nonzero`` or
    ``PyArray_Descr->f->copyswapn`` may instead receive a dummy array
    object with only some fields (mainly the dtype) being valid.
    At least in some code paths, a similar mechanism is already used.

  * The ``scalarkind`` slot and registration of scalar casting will be
    removed/ignored without replacement.
    It currently allows partial value-based casting.
    The ``PyArray_ScalarKind`` function will continue to work for
    builtin types, but will not be used internally and will be
    deprecated.

  * Currently user dtypes are defined as instances of ``np.dtype``.
    The creation works by the user providing a prototype instance.
    NumPy will need to modify at least the type during registration.
    This has no effect for either ``rational`` or ``quaternion``, and
    mutation of the structure seems unlikely after registration.

Since there is a fairly large API surface concerning datatypes, further
changes, or the limitation of certain functions to currently existing
datatypes, are likely to occur.
For example, functions which use the type number as input should be
replaced with functions taking DType classes instead.
Although public, large parts of this C-API seem to be used rarely,
possibly never, by downstream projects.


Detailed Description
--------------------

This section details the design decisions covered by this NEP.
The subsections correspond to the list of design choices presented in
the Scope section.

Datatypes as Python Classes (1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The current NumPy datatypes are not full-scale Python classes.
They are instead (prototype) instances of a single ``np.dtype`` class.
Changing this means that any special handling, e.g. for ``datetime``,
can be moved to the Datetime DType class instead, away from monolithic
general code (e.g. the current ``PyArray_AdjustFlexibleDType``).

The main consequence of this change with respect to the API is that
special methods move from the dtype instances to methods on the new
DType class. This is the typical design pattern used in Python.
Organizing these methods and information in a more Pythonic way
provides a solid foundation for refining and extending the API in the
future.
The current API cannot be extended due to how it is exposed publicly.
This means for example that the methods currently stored in
``PyArray_ArrFuncs`` on each datatype (see NEP 40) will be defined
differently in the future and deprecated in the long run.

The most prominent visible side effect of this will be that
``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore.
Instead it will be a subclass of ``np.dtype``, meaning that
``isinstance(np.dtype(np.float64), np.dtype)`` will remain true.
This will also add the ability to use
``isinstance(dtype, np.dtype[float64])``, thus removing the need to use
``dtype.kind``, ``dtype.char``, or ``dtype.type`` to do this check
(see the sketch below).

With the design decision of DTypes as full-scale Python classes, the
question of subclassing arises.
Inheritance, however, appears problematic and a complexity best avoided
(at least initially) for container datatypes.
Further, subclasses may be more interesting for interoperability, for
example with GPU backends (CuPy) storing additional GPU-related
methods, rather than as a mechanism to define new datatypes.
A class hierarchy does provide value; this may be achieved by allowing
the creation of *abstract* datatypes.
An example of an abstract datatype would be the datatype equivalent of
``np.floating``, representing any floating point number.
These can serve the same purpose as Python's abstract base classes.

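A short sketch of the check this enables, contrasted with today's
workarounds. The first lines use only current, public NumPy behaviour;
the subscripted ``np.dtype[float64]`` class does not exist yet, so the
planned replacement is shown as a comment only::

    >>> import numpy as np
    >>> dt = np.dtype("float64")
    >>> # Checks needed today to identify a float64 dtype:
    >>> dt.type is np.float64
    True
    >>> dt.kind == "f" and dt.itemsize == 8
    True
    >>> # Planned replacement once DType classes exist:
    >>> # isinstance(dt, np.dtype[float64])
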
Scalars should not be instances of the datatypes (2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For simple datatypes such as ``float64`` (see also below), it seems
tempting that the instance of ``np.dtype("float64")`` could be the
scalar.
This idea may be even more appealing due to the fact that scalars,
rather than datatypes, currently define a useful type hierarchy.

However, we have specifically decided against this for a number of
reasons.
First, the new datatypes described herein would be instances of DType
classes. Making these instances themselves classes, while possible,
adds additional complexity that users need to understand.
It would also mean that scalars must have storage information (such as
byteorder) which is generally unnecessary and currently is not used.
Second, while the simple NumPy scalars such as ``float64`` may be such
instances, it should be possible to create datatypes for Python objects
without enforcing NumPy as a dependency.
However, Python objects that do not depend on NumPy cannot be instances
of a NumPy DType.
Third, there is a mismatch between the methods and attributes which are
useful for scalars and datatypes. For instance, ``to_float()`` makes
sense for a scalar but not for a datatype, and ``newbyteorder`` is not
useful on a scalar (or has a different meaning).

Overall, it seems that rather than reducing complexity by merging the
two distinct type hierarchies, making scalars instances of DTypes would
increase the complexity of both the design and the implementation.

A possible future path may be to instead simplify the current NumPy
scalars to be much simpler objects which largely derive their behaviour
from the datatypes.

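The distinction being kept here can be illustrated entirely with
current, public NumPy behaviour, which this NEP leaves untouched::

    >>> import numpy as np
    >>> x = np.float64(1.0)        # a scalar
    >>> dt = np.dtype(np.float64)  # a datatype instance
    >>> isinstance(x, np.dtype)    # the scalar is not a dtype instance
    False
    >>> x.dtype == dt              # but it carries the matching dtype
    True
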
C-API for creating new Datatypes (3)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The current C-API with which users can create new datatypes is limited
in scope, and requires use of "private" structures. This means the API
is not extensible: no new members can be added to the structure without
losing binary compatibility.
This has already limited the inclusion of new sorting methods into
NumPy [new_sort]_.

The new version shall thus replace the current ``PyArray_ArrFuncs``
structure used to define new datatypes.
Datatypes that currently exist and are defined using these slots will
be supported during a deprecation period.

The most likely solution, which hides the implementation from the user
and thus makes it extensible in the future, is to model the API after
Python's stable API [PEP-384]_:

.. code-block:: C

    static struct PyArrayMethodDef slots[] = {
        {NPY_dt_method, method_implementation},
        ...,
        {0, NULL}
    };

    typedef struct {
        PyTypeObject *typeobj;  /* type of python scalar */
        ...;
        PyType_Slot *slots;
    } PyArrayDTypeMeta_Spec;

    PyObject* PyArray_InitDTypeMetaFromSpec(
            PyArray_DTypeMeta *user_dtype,
            PyArrayDTypeMeta_Spec *dtype_spec);

The C-side slots should be designed to mirror Python-side methods such
as ``dtype.__dtype_method__``, although exposing them to Python is a
later step, to reduce the complexity of the initial implementation.


C-API Changes to the UFunc Machinery (4)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Proposed changes to the UFunc machinery will be part of NEP 43.
However, the following changes will be necessary (see NEP 40 for a
detailed description of the current implementation and its issues):

* The current UFunc type resolution must be adapted to allow better
  control for user-defined dtypes as well as resolve current
  inconsistencies.
* The inner loop used in UFuncs must be expanded to include a return
  value. Further, error reporting must be improved, and passing in
  dtype-specific information enabled. This requires the modification of
  the inner-loop function signature and the addition of new hooks
  called before and after the inner loop is used.

An important goal for any changes to the universal functions will be to
allow the reuse of existing loops.
It should be easy for a new units datatype to fall back to existing
math functions after handling the unit-related computations.


Discussion
----------

See NEP 40 for a list of previous meetings and discussions.


References
----------

.. [pandas_extension_arrays] https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types

.. [xarray_dtype_issue] https://github.com/pydata/xarray/issues/1262

.. [pygeos] https://github.com/caspervdw/pygeos

.. [new_sort] https://github.com/numpy/numpy/pull/12945

.. [PEP-384] https://www.python.org/dev/peps/pep-0384/

.. [PR 15508] https://github.com/numpy/numpy/pull/15508


Copyright
---------

This document has been placed in the public domain.


Acknowledgments
---------------

The effort to create new datatypes for NumPy has been discussed for
several years in many different contexts and settings, making it
impossible to list everyone involved.
We would like to thank especially Stephan Hoyer, Nathaniel Smith, and
Eric Wieser for repeated in-depth discussion about datatype design.
We are very grateful for the community input in reviewing and revising
this NEP and would like to thank especially Ross Barnowski and Ralf
Gommers.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part
URL: 

From cournape at gmail.com Wed Mar 11 22:42:28 2020
From: cournape at gmail.com (David Cournapeau)
Date: Thu, 12 Mar 2020 11:42:28 +0900
Subject: [Numpy-discussion] Opportunities for a NumPy sprint in Japan
Message-ID: 

Hi there,

My current employer is organizing regular "hack weeks" where engineers can work on the projects of their choice. My team is using a lot of pydata tools, including numpy, so contributing to NumPy would be natural. Before I decide to add NumPy as a project, I want to make sure it does not create more burden for the maintainers; the goal is to help the project, and to help my team's members contribute to OSS in the "proper" way.

I would have time to prepare the hack week, making sure we prepare good PRs, and even use this time to actually review existing PRs myself. While I have been away from the project for a long time, I should still be able to review code, help my team building / testing / debugging, etc.

Does that sound useful?

The first hack week would be held from 23rd to 27th March, in Tokyo. Everybody would be remote because of the coronavirus, so except for the time difference, if it makes sense, people outside of my employer could join.

regards,
David
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From charlesr.harris at gmail.com Thu Mar 12 00:16:07 2020
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Wed, 11 Mar 2020 22:16:07 -0600
Subject: [Numpy-discussion] Opportunities for a NumPy sprint in Japan
In-Reply-To: 
References: 
Message-ID: 

Hi David,

On Wed, Mar 11, 2020 at 8:43 PM David Cournapeau wrote:

> Hi there,
>
> My current employer is organizing regular "hack weeks" where engineers can work on the projects of their choice. My team is using a lot of pydata tools, including numpy, so contributing to NumPy would be natural. Before I decide to add NumPy as a project, I want to make sure it does not create more burden for the maintainers; the goal is to help the project, and to help my team's members contribute to OSS in the "proper" way.
>
> I would have time to prepare the hack week, making sure we prepare good PRs, and even use this time to actually review existing PRs myself. While I have been away from the project for a long time, I should still be able to review code, help my team building / testing / debugging, etc.
>
> Does that sound useful?
>
> The first hack week would be held from 23rd to 27th March, in Tokyo. Everybody would be remote because of the coronavirus, so except for the time difference, if it makes sense, people outside of my employer could join.
>

Good to hear from you. We always appreciate help, especially if it comes with some experience :) It would also be helpful if you reviewed the NEPs dealing with dtypes. We don't get as much feedback from users as we would like, and if you are using a lot of the open source tools, your perspective could be very helpful. Keep us informed, and maybe toss around some ideas for discussion. We have community meetings on Zoom every Wednesday at 12:00 noon Pacific time, so you may want to drop in if the time works for you.

Chuck
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From cournape at gmail.com Thu Mar 12 02:14:21 2020
From: cournape at gmail.com (David Cournapeau)
Date: Thu, 12 Mar 2020 15:14:21 +0900
Subject: [Numpy-discussion] Opportunities for a NumPy sprint in Japan
In-Reply-To: 
References: 
Message-ID: 

Hi Chuck,

On Thu, Mar 12, 2020 at 13:17, Charles R Harris wrote:

> Hi David,
>
> On Wed, Mar 11, 2020 at 8:43 PM David Cournapeau wrote:
>
>> Hi there,
>>
>> My current employer is organizing regular "hack weeks" where engineers can work on the projects of their choice. My team is using a lot of pydata tools, including numpy, so contributing to NumPy would be natural. Before I decide to add NumPy as a project, I want to make sure it does not create more burden for the maintainers; the goal is to help the project, and to help my team's members contribute to OSS in the "proper" way.
>>
>> I would have time to prepare the hack week, making sure we prepare good PRs, and even use this time to actually review existing PRs myself. While I have been away from the project for a long time, I should still be able to review code, help my team building / testing / debugging, etc.
>>
>> Does that sound useful?
URL: 

From matti.picus at gmail.com Thu Mar 12 02:40:57 2020
From: matti.picus at gmail.com (Matti Picus)
Date: Thu, 12 Mar 2020 08:40:57 +0200
Subject: [Numpy-discussion] Opportunities for a NumPy sprint in Japan
In-Reply-To: 
References: 
Message-ID: <0fd8356c-570c-f65e-0a6f-31a71b30f21a@gmail.com>

On 12/3/20 8:14 am, David Cournapeau wrote:
>
> I will try to join the meeting next week though (where is the info to join?),
>
> David
>

The next one will be a community meeting on Wed Mar 18, 11am California time on this Zoom channel: https://berkeley.zoom.us/j/762261535. We will be updating this hackmd document https://hackmd.io/76o-IxCjQX2mOXO_wwkcpg with topics for discussion; I will add you to the agenda. If you miss that one, we hold triage meetings on alternate weeks at the same time on the same Zoom channel; the document for that is https://hackmd.io/68i_JvOYQfy9ERiHgXMPvg.

Matti

From ralf.gommers at gmail.com Thu Mar 12 05:35:43 2020
From: ralf.gommers at gmail.com (Ralf Gommers)
Date: Thu, 12 Mar 2020 10:35:43 +0100
Subject: [Numpy-discussion] Opportunities for a NumPy sprint in Japan
In-Reply-To: 
References: 
Message-ID: 

On Thu, Mar 12, 2020 at 7:14 AM David Cournapeau wrote:

> Hi Chuck,
>
> On Thu, Mar 12, 2020 at 13:17, Charles R Harris wrote:
>
>> Hi David,
>>
>> On Wed, Mar 11, 2020 at 8:43 PM David Cournapeau wrote:
>>
>>> Hi there,
>>>
>>> My current employer is organizing regular "hack weeks" where engineers can work on the projects of their choice. My team is using a lot of pydata tools, including numpy, so contributing to NumPy would be natural. Before I decide to add NumPy as a project, I want to make sure it does not create more burden for the maintainers; the goal is to help the project, and to help my team's members contribute to OSS in the "proper" way.
>>>
>>> I would have time to prepare the hack week, making sure we prepare good PRs, and even use this time to actually review existing PRs myself. While I have been away from the project for a long time, I should still be able to review code, help my team building / testing / debugging, etc.
>>>
>>> Does that sound useful?
>>>
>>> The first hack week would be held from 23rd to 27th March, in Tokyo. Everybody would be remote because of the coronavirus, so except for the time difference, if it makes sense, people outside of my employer could join.
>>>
>> Good to hear from you. We always appreciate help, especially if it comes with some experience :) It would also be helpful if you reviewed the NEPs dealing with dtypes. We don't get as much feedback from users as we would like, and if you are using a lot of the open source tools, your perspective could be very helpful. Keep us informed, and maybe toss around some ideas for discussion. We have community meetings on Zoom every Wednesday at 12:00 noon Pacific time, so you may want to drop in if the time works for you.
>
> So to clarify, I am expecting to have people from my team making most of the contributions. In this case, since people are not so familiar with the internals of NumPy, I was thinking more about bugs / docs / etc. than the hard stuff like dtype.
>

Hey David, great to hear from you! This sounds great. I assume most people are experienced NumPy users, and/or experienced software devs? So they could write longer docs, or tackle bugs that require some digging? Or are you looking for good-first-issue type things mainly?
Cheers,
Ralf

> I will try to join the meeting next week though (where is the info to join?),
>
> David
>

>> Chuck
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From melissawm at gmail.com Thu Mar 12 15:21:12 2020
From: melissawm at gmail.com (=?UTF-8?Q?Melissa_Mendon=C3=A7a?=)
Date: Thu, 12 Mar 2020 16:21:12 -0300
Subject: [Numpy-discussion] Documentation Team Meetings - all are welcome!
Message-ID: 

Hello all,

As part of NumPy's recent focus on documentation writing and improvement (see NEP 44), we are proposing a regular meeting for the Documentation Team to discuss possible content, ideas and general planning. A small group had been participating in these meetings (which started as part of the Google Season of Docs project), but the idea is to recruit more people who might be interested in documentation writing and the creation of educational materials using NumPy.

For now, the meetings will take place every three weeks, on Mondays at 3PM UTC. If you wish to join on Zoom, you need to use this link:

https://zoom.us/j/420005230

Here's the permanent hackmd document with the meeting notes:

https://hackmd.io/oB_boakvRqKR-_2jRV-Qjg

Hope to see you around!

- Melissa
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ralf.gommers at gmail.com Sun Mar 15 16:24:40 2020
From: ralf.gommers at gmail.com (Ralf Gommers)
Date: Sun, 15 Mar 2020 21:24:40 +0100
Subject: [Numpy-discussion] Documentation Team Meetings - all are welcome!
In-Reply-To: 
References: 
Message-ID: 

On Thu, Mar 12, 2020 at 8:21 PM Melissa Mendonça wrote:

> Hello all,
>
> As part of NumPy's recent focus on documentation writing and improvement (see NEP 44), we are proposing a regular meeting for the Documentation Team to discuss possible content, ideas and general planning. A small group had been participating in these meetings (which started as part of the Google Season of Docs project), but the idea is to recruit more people who might be interested in documentation writing and the creation of educational materials using NumPy.
>

Thanks Melissa! A couple of things to add:

1. A plug for Melissa's nice blog post "Documentation as a way to build Community": https://labs.quansight.org/blog/2020/03/documentation-as-a-way-to-build-community/
2. There are some bigger picture topics on the agenda, such as the docs refactor, and starting a new Jupyter notebooks tutorial - see https://hackmd.io/oB_boakvRqKR-_2jRV-Qjg?edit.
3. Please feel free to add more topics.

Please join tomorrow if you're interested in higher quality educational materials!

> For now, the meetings will take place every three weeks, on Mondays at 3PM UTC. If you wish to join on Zoom, you need to use this link:
>

This is 8am Pacific Time, 11am Eastern Time, 4pm Europe, 8pm India, 11pm China - maximizing for regions of the world who can join (sorry Aus-Pacific, let us know if you'd join some other time).

Cheers,
Ralf

> https://zoom.us/j/420005230
>
> Here's the permanent hackmd document with the meeting notes:
>
> https://hackmd.io/oB_boakvRqKR-_2jRV-Qjg
>
> Hope to see you around!
>
> - Melissa
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From cournape at gmail.com Tue Mar 17 07:46:05 2020
From: cournape at gmail.com (David Cournapeau)
Date: Tue, 17 Mar 2020 20:46:05 +0900
Subject: [Numpy-discussion] Opportunities for a NumPy sprint in Japan
In-Reply-To: <0fd8356c-570c-f65e-0a6f-31a71b30f21a@gmail.com>
References: <0fd8356c-570c-f65e-0a6f-31a71b30f21a@gmail.com>
Message-ID: 

On Thu, Mar 12, 2020 at 3:41 PM Matti Picus wrote:

> On 12/3/20 8:14 am, David Cournapeau wrote:
> >
> > I will try to join the meeting next week though (where is the info to join?),
> >
> > David
> >
>
> The next one will be a community meeting on Wed Mar 18, 11am California time on this Zoom channel: https://berkeley.zoom.us/j/762261535.

Unfortunately, 11 a.m. Pacific time is particularly difficult to make from Japan (3 a.m. here).

Would it be possible to instead prepare questions beforehand, and manage those asynchronously?

I am looking for the following:

1. Issues that are good for first timers. Pretty much all of my team has experience in Python, and knows core NumPy, and has easy access to both Mac and Linux. Not many people have experience with C or Cython.
2. Docs and issues that require digging: this is good too, my team needs to learn this.

In that context, could I try to do some triaging myself before the March 18 meeting, and get confirmation about the list?

I can myself review some PRs as well, but as this is review time in the quarter and I have a large team, I may not be able to do deep PRs.

David

> We will be updating this hackmd document https://hackmd.io/76o-IxCjQX2mOXO_wwkcpg with topics for discussion; I will add you to the agenda. If you miss that one, we hold triage meetings on alternate weeks at the same time on the same Zoom channel; the document for that is https://hackmd.io/68i_JvOYQfy9ERiHgXMPvg.
>
> Matti
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ralf.gommers at gmail.com Tue Mar 17 09:15:31 2020
From: ralf.gommers at gmail.com (Ralf Gommers)
Date: Tue, 17 Mar 2020 14:15:31 +0100
Subject: [Numpy-discussion] Opportunities for a NumPy sprint in Japan
In-Reply-To: 
References: <0fd8356c-570c-f65e-0a6f-31a71b30f21a@gmail.com>
Message-ID: 

On Tue, Mar 17, 2020 at 12:46 PM David Cournapeau wrote:

> On Thu, Mar 12, 2020 at 3:41 PM Matti Picus wrote:
>
>> On 12/3/20 8:14 am, David Cournapeau wrote:
>> >
>> > I will try to join the meeting next week though (where is the info to join?),
>> >
>> > David
>> >
>>
>> The next one will be a community meeting on Wed Mar 18, 11am California time on this Zoom channel: https://berkeley.zoom.us/j/762261535.
>
> Unfortunately, 11 a.m. Pacific time is particularly difficult to make from Japan (3 a.m. here).
>
> Would it be possible to instead prepare questions beforehand, and manage those asynchronously?
>

Definitely. If you want to make a temporary dedicated label for this sprint, please feel free to do that - it's easy to merge back into good-first-issue later (the older good-first-issue ones tend to be a bit polluted).
> I am looking for the following:
>
> 1. Issues that are good for first timers. Pretty much all of my team has experience in Python, and knows core NumPy, and has easy access to both Mac and Linux. Not many people have experience with C or Cython.
>

There are few clear-cut actionable Python-only issues in core I think - looking at the issues for various modules, I especially recommend `numpy.ma` issues, quite a bit of low-hanging fruit there.

> 2. Docs and issues that require digging: this is good too, my team needs to learn this.
>
> In that context, could I try to do some triaging myself before the March 18 meeting, and get confirmation about the list?
>

Sounds good.

Cheers,
Ralf

> I can myself review some PRs as well, but as this is review time in the quarter and I have a large team, I may not be able to do deep PRs.
>
> David
>
>> We will be updating this hackmd document https://hackmd.io/76o-IxCjQX2mOXO_wwkcpg with topics for discussion; I will add you to the agenda. If you miss that one, we hold triage meetings on alternate weeks at the same time on the same Zoom channel; the document for that is https://hackmd.io/68i_JvOYQfy9ERiHgXMPvg.
>>
>> Matti
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From madicken.munk at gmail.com Tue Mar 17 11:46:30 2020
From: madicken.munk at gmail.com (Madicken Munk)
Date: Tue, 17 Mar 2020 10:46:30 -0500
Subject: [Numpy-discussion] Announcing the 2020 John Hunter Excellence in Plotting Contest
Message-ID: 

Dear all,

In memory of John Hunter, we are pleased to announce the John Hunter Excellence in Plotting Contest for 2020. This open competition aims to highlight the importance of data visualization to scientific progress and showcase the capabilities of open source software. Participants are invited to submit scientific plots to be judged by a panel. The winning entries will be announced and displayed at SciPy 2020, or announced on the John Hunter Excellence in Plotting Contest website and YouTube channel.

John Hunter's family are graciously sponsoring cash prizes for the winners in the following amounts:

- 1st prize: $1000
- 2nd prize: $750
- 3rd prize: $500

- Entries must be submitted by June 1st to the form at https://forms.gle/SrexmkDwiAmDc7ej7
- Winners will be announced at SciPy 2020 in Austin, TX, or publicly on the John Hunter Excellence in Plotting Contest website and YouTube channel.
- Participants do not need to attend the SciPy conference.
- Entries may take the definition of "visualization" rather broadly. Entries may be, for example, a traditional printed plot, an interactive visualization for the web, a dashboard, or an animation.
- Source code for the plot must be provided, in the form of Python code and/or a Jupyter notebook, along with a rendering of the plot in a widely used format. The rendering may be, for example, PDF for print, standalone HTML and JavaScript for an interactive plot, or MPEG-4 for a video. If the original data cannot be shared for reasons of size or licensing, "fake" data may be substituted, along with an image of the plot using real data.
- Each entry must include a 300-500 word abstract describing the plot and its importance for a general scientific audience.
- Entries will be judged on their clarity, innovation and aesthetics, but most importantly for their effectiveness in communicating a real-world problem. Entrants are encouraged to submit plots that were used during the course of research or work, rather than merely being hypothetical.
- SciPy and the John Hunter Excellence in Plotting Contest organizers reserve the right to display any and all entries, whether prize-winning or not, at the conference, and to use them in any materials or on their website, with attribution to the original author(s).
- Past entries can be found at https://jhepc.github.io/
- Questions regarding the contest can be sent to jhepc.organizers at gmail.com

John Hunter Excellence in Plotting Contest Co-Chairs
Madicken Munk
Nelle Varoquaux
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From fiorellazampetti at gmail.com Tue Mar 17 12:20:51 2020
From: fiorellazampetti at gmail.com (Fiorella Zampetti)
Date: Tue, 17 Mar 2020 17:20:51 +0100
Subject: [Numpy-discussion] Study on annotating implementation and design choices, and of technical debt
Message-ID: 

Dear all,

As software engineering research teams at the University of Sannio (Italy) and Eindhoven University of Technology (The Netherlands), we are interested in investigating the protocol used by developers when they annotate implementation and design choices during their normal development activities. More specifically, we are looking at whether, where and what kind of annotations developers usually use, focusing especially on those annotations aimed at highlighting that the code is not in the right shape (e.g., comments for annotating delayed or intended work activities such as TODO, FIXME, hack, workaround, etc.). In the latter case, we are looking at what the content of these annotations is, as well as how developers usually behave while evolving code that has been previously annotated.

When answering the survey, in case your annotation practices differ across the open source projects you contribute to, please refer to how you behave in the project through which you have been contacted. Filling out the survey will take about 5 minutes. Please note that your identity and personal data will not be disclosed, while we plan to use the aggregated results and anonymized responses as part of a scientific publication.

If you have any questions about the questionnaire or our research, please do not hesitate to contact us. You can find the survey link here:

https://forms.gle/NQULdWRVvXYeMc1r6

Thanks and regards,

Fiorella Zampetti (fzampetti at unisannio.it)
Gianmarco Fucci (gianmarcofucci94 at gmail.com)
Alexander Serebrenik (a.serebrenik at tue.nl)
Massimiliano Di Penta (dipenta at unisannio.it)

From charlesr.harris at gmail.com Tue Mar 17 13:32:35 2020
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Tue, 17 Mar 2020 11:32:35 -0600
Subject: [Numpy-discussion] NumPy 1.18.2 released.
Message-ID: 

Hi All,

On behalf of the NumPy team I am pleased to announce that NumPy 1.18.2 has been released. This small release contains a fix for a performance regression in numpy/random and several bug/maintenance updates. The Python versions supported in this release are 3.5-3.8. Downstream developers should use Cython >= 0.29.15 for Python 3.8 support and OpenBLAS >= 3.7 to avoid errors on the Skylake architecture.
Wheels for this release can be downloaded from PyPI; source archives and release notes are available from GitHub.

*Contributors*

A total of 5 people contributed to this release. People with a "+" by their names contributed a patch for the first time.

- Charles Harris
- Ganesh Kathiresan +
- Matti Picus
- Sebastian Berg
- przemb +

*Pull requests merged*

A total of 7 pull requests were merged for this release.

- https://github.com/numpy/numpy/pull/15675: TST: move _no_tracing to testing._private
- https://github.com/numpy/numpy/pull/15676: MAINT: Large overhead in some random functions
- https://github.com/numpy/numpy/pull/15677: TST: Do not create gfortran link in azure Mac testing.
- https://github.com/numpy/numpy/pull/15679: BUG: Added missing error check in ndarray.__contains__
- https://github.com/numpy/numpy/pull/15722: MAINT: use list-based APIs to call subprocesses
- https://github.com/numpy/numpy/pull/15729: REL: Prepare for 1.18.2 release.
- https://github.com/numpy/numpy/pull/15734: BUG: fix logic error when nm fails on 32-bit

Cheers,

Charles Harris
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sebastian at sipsolutions.net Tue Mar 17 13:36:07 2020
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Tue, 17 Mar 2020 12:36:07 -0500
Subject: [Numpy-discussion] NumPy Community Meeting Wednesday, March 18
Message-ID: <2ff5df1c57e2bafaac09a7fd6eb67ee05ce08dc0.camel@sipsolutions.net>

Hi all,

There will be a NumPy Community meeting Wednesday March 18 at 11 am Pacific Time. Everyone is invited and encouraged to join in and edit the work-in-progress meeting topics and notes: https://hackmd.io/76o-IxCjQX2mOXO_wwkcpg?both

Best wishes

Sebastian

PS: The US has already had its daylight saving time switch, so for some countries the offset differs by one hour this week.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part
URL: 

From sebastian at sipsolutions.net Tue Mar 17 16:02:43 2020
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Tue, 17 Mar 2020 15:02:43 -0500
Subject: [Numpy-discussion] Proposal: NEP 41 -- First step towards a new Datatype System
In-Reply-To: <3ee04012570e418a874e4c436b5b6f8ac5f56cdb.camel@sipsolutions.net>
References: <3ee04012570e418a874e4c436b5b6f8ac5f56cdb.camel@sipsolutions.net>
Message-ID: <909ccfa50c89620b6440585e7dac4d2c521d62fc.camel@sipsolutions.net>

Hi all,

in the spirit of trying to keep this moving, can I assume that the main reason for little discussion is that the actual changes proposed are not very far-reaching as of now? Or is the reason that this is a fairly complex topic that you need more time to think about it? If it is the latter, is there some way I can help with it? I tried to minimize how much is part of this initial NEP.

If there is not much need for discussion, I would like to officially accept the NEP very soon, sending out an official one-week notice in the next few days.

To summarize one more time, the main point is that: type(np.dtype(np.float64)) will be `np.dtype[float64]`, a subclass of dtype, so that: issubclass(np.dtype[float64], np.dtype) is true. This means that we will have one class for every current type number: `dtype.num`. The implementation of these subclasses will be a C-written (extension) metaclass; all details of this class are supposed to remain experimental and in flux at this time.
Cheers Sebastian On Wed, 2020-03-11 at 17:02 -0700, Sebastian Berg wrote: > Hi all, > > I am pleased to propose NEP 41: First step towards a new Datatype > System https://numpy.org/neps/nep-0041-improved-dtype-support.html > > This NEP motivates the larger restructure of the datatype machinery > in > NumPy and defines a few fundamental design aspects. The long term > user > impact will be allowing easier and more rich featured user defined > datatypes. > > As this is a large restructure, the NEP represents only the first > steps > with some additional information in further NEPs being drafted [1] > (this may be helpful to look at depending on the level of detail you > are interested in). > The NEP itself does not propose to add significant new public API. > Instead it proposes to move forward with an incremental internal > refactor and lays the foundation for this process. > > The main user facing change at this time is that datatypes will > become > classes (e.g. ``type(np.dtype("float64"))`` will be a float64 > specific > class. > For most users, the main impact should be many new datatypes in the > long run (see the user impact section). However, for those interested > in API design within NumPy or with respect to implementing new > datatypes, this and the following NEPs are important decisions in the > future roadmap for NumPy. > > The current full text is reproduced below, although the above link is > probably a better way to read it. > > Cheers > > Sebastian > > > [1] NEP 40 gives some background information about the current > systems > and issues with it: > https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac/doc/neps/nep-0040-legacy-datatype-impl.rst > and NEP 42 being a first draft of how the new API may look like: > > https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst > (links to current rendered versions, check > https://github.com/numpy/numpy/pull/15505 and > https://github.com/numpy/numpy/pull/15507 for updates) > > > ------------------------------------------------------------------- > --- > > > ================================================= > NEP 41 ? First step towards a new Datatype System > ================================================= > > :title: Improved Datatype Support > :Author: Sebastian Berg > :Author: St?fan van der Walt > :Author: Matti Picus > :Status: Draft > :Type: Standard Track > :Created: 2020-02-03 > > > .. note:: > > This NEP is part of a series of NEPs encompassing first > information > about the previous dtype implementation and issues with it in NEP > 40. > NEP 41 (this document) then provides an overview and generic > design > choices for the refactor. > Further NEPs 42 and 43 go into the technical details of the > datatype > and universal function related internal and external API changes. > In some cases it may be necessary to consult the other NEPs for a > full > picture of the desired changes and why these changes are > necessary. > > > Abstract > -------- > > `Datatypes ` in NumPy describe how to > interpret each > element in arrays. NumPy provides ``int``, ``float``, and ``complex`` > numerical > types, as well as string, datetime, and structured datatype > capabilities. > The growing Python community, however, has need for more diverse > datatypes. > Examples are datatypes with unit information attached (such as > meters) or > categorical datatypes (fixed set of possible values). 
> However, the current NumPy datatype API is too limited to allow the > creation > of these. > > This NEP is the first step to enable such growth; it will lead to > a simpler development path for new datatypes. > In the long run the new datatype system will also support the > creation > of datatypes directly from Python rather than C. > Refactoring the datatype API will improve maintainability and > facilitate > development of both user-defined external datatypes, > as well as new features for existing datatypes internal to NumPy. > > > Motivation and Scope > -------------------- > > .. seealso:: > > The user impact section includes examples of what kind of new > datatypes > will be enabled by the proposed changes in the long run. > It may thus help to read these section out of order. > > Motivation > ^^^^^^^^^^ > > One of the main issues with the current API is the definition of > typical > functions such as addition and multiplication for parametric > datatypes > (see also NEP 40) which require additional steps to determine the > output type. > For example when adding two strings of length 4, the result is a > string > of length 8, which is different from the input. > Similarly, a datatype which embeds a physical unit must calculate the > new unit > information: dividing a distance by a time results in a speed. > A related difficulty is that the :ref:`current casting rules > <_ufuncs.casting>` > -- the conversion between different datatypes -- > cannot describe casting for such parametric datatypes implemented > outside of NumPy. > > This additional functionality for supporting parametric datatypes > introduces > increased complexity within NumPy itself, > and furthermore is not available to external user-defined datatypes. > In general the concerns of different datatypes are not well well- > encapsulated. > This burden is exacerbated by the exposure of internal C structures, > limiting the addition of new fields > (for example to support new sorting methods [new_sort]_). > > Currently there are many factors which limit the creation of new > user-defined > datatypes: > > * Creating casting rules for parametric user-defined dtypes is either > impossible > or so complex that it has never been attempted. > * Type promotion, e.g. the operation deciding that adding float and > integer > values should return a float value, is very valuable for numeric > datatypes > but is limited in scope for user-defined and especially parametric > datatypes. > * Much of the logic (e.g. promotion) is written in single functions > instead of being split as methods on the datatype itself. > * In the current design datatypes cannot have methods that do not > generalize > to other datatypes. For example a unit datatype cannot have a > ``.to_si()`` method to > easily find the datatype which would represent the same values in > SI units. > > The large need to solve these issues has driven the scientific > community > to create work-arounds in multiple projects implementing physical > units as an > array-like class instead of a datatype, which would generalize better > across > multiple array-likes (Dask, pandas, etc.). > Already, Pandas has made a push into the same direction with its > extension arrays [pandas_extension_arrays]_ and undoubtedly > the community would be best served if such new features could be > common > between NumPy, Pandas, and other projects. 
> > Scope > ^^^^^ > > The proposed refactoring of the datatype system is a large > undertaking and > thus is proposed to be split into various phases, roughly: > > * Phase I: Restructure and extend the datatype infrastructure (This > NEP 41) > * Phase II: Incrementally define or rework API (Detailed largely in > NEPs 42/43) > * Phase III: Growth of NumPy and Scientific Python Ecosystem > capabilities. > > For a more detailed accounting of the various phases, see > "Plan to Approach the Full Refactor" in the Implementation section > below. > This NEP proposes to move ahead with the necessary creation of new > dtype > subclasses (Phase I), > and start working on implementing current functionality. > Within the context of this NEP all development will be fully private > API or > use preliminary underscored names which must be changed in the > future. > Most of the internal and public API choices are part of a second > Phase > and will be discussed in more detail in the following NEPs 42 and 43. > The initial implementation of this NEP will have little or no effect > on users, > but provides the necessary ground work for incrementally addressing > the > full rework. > > The implementation of this NEP and the following, implied large > rework of how > datatypes are defined in NumPy is expected to create small > incompatibilities > (see backward compatibility section). > However, a transition requiring large code adaption is not > anticipated and not > within scope. > > Specifically, this NEP makes the following design choices which are > discussed > in more details in the detailed description section: > > 1. Each datatype will be an instance of a subclass of ``np.dtype``, > with most of the > datatype-specific logic being implemented > as special methods on the class. In the C-API, these correspond to > specific > slots. In short, for ``f = np.dtype("f8")``, ``isinstance(f, > np.dtype)`` will remain true, > but ``type(f)`` will be a subclass of ``np.dtype`` rather than > just ``np.dtype`` itself. > The ``PyArray_ArrFuncs`` which are currently stored as a pointer > on the instance (as ``PyArray_Descr->f``), > should instead be stored on the class as typically done in Python. > In the future these may correspond to python side dunder methods. > Storage information such as itemsize and byteorder can differ > between > different dtype instances (e.g. "S3" vs. "S8") and will remain > part of the instance. > This means that in the long run the current lowlevel access to > dtype methods > will be removed (see ``PyArray_ArrFuncs`` in NEP 40). > > 2. The current NumPy scalars will *not* change, they will not be > instances of > datatypes. This will also be true for new datatypes, scalars will > not be > instances of a dtype (although ``isinstance(scalar, dtype)`` may > be made > to return ``True`` when appropriate). > > Detailed technical decisions to follow in NEP 42. > > Further, the public API will be designed in a way that is extensible > in the future: > > 3. All new C-API functions provided to the user will hide > implementation details > as much as possible. The public API should be an identical, but > limited, > version of the C-API used for the internal NumPy datatypes. > > The changes to the datatype system in Phase II must include a large > refactor of the > UFunc machinery, which will be further defined in NEP 43: > > 4. 
To enable all of the desired functionality for new user-defined > datatypes, > the UFunc machinery will be changed to replace the current > dispatching > and type resolution system. > The old system should be *mostly* supported as a legacy version > for some time. > > Additionally, as a general design principle, the addition of new > user-defined > datatypes will *not* change the behaviour of programs. > For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or > ``b`` know > that ``c`` exists. > > > User Impact > ----------- > > The current ecosystem has very few user-defined datatypes using > NumPy, the > two most prominent being: ``rational`` and ``quaternion``. > These represent fairly simple datatypes which are not strongly > impacted > by the current limitations. > However, we have identified a need for datatypes such as: > > * bfloat16, used in deep learning > * categorical types > * physical units (such as meters) > * datatypes for tracing/automatic differentiation > * high, fixed precision math > * specialized integer types such as int2, int24 > * new, better datetime representations > * extending e.g. integer dtypes to have a sentinel NA value > * geometrical objects [pygeos]_ > > Some of these are partially solved; for example unit capability is > provided > in ``astropy.units``, ``unyt``, or ``pint``, as `numpy.ndarray` > subclasses. > Most of these datatypes, however, simply cannot be reasonably defined > right now. > An advantage of having such datatypes in NumPy is that they should > integrate > seamlessly with other array or array-like packages such as Pandas, > ``xarray`` [xarray_dtype_issue]_, or ``Dask``. > > The long term user impact of implementing this NEP will be to allow > both > the growth of the whole ecosystem by having such new datatypes, as > well as > consolidating implementation of such datatypes within NumPy to > achieve > better interoperability. > > > Examples > ^^^^^^^^ > > The following examples represent future user-defined datatypes we > wish to enable. > These datatypes are not part the NEP and choices (e.g. choice of > casting rules) > are possibilities we wish to enable and do not represent > recommendations. > > Simple Numerical Types > """""""""""""""""""""" > > Mainly used where memory is a consideration, lower-precision numeric > types > such as :ref:```bfloat16`` < > https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>` > are common in other computational frameworks. > For these types the definitions of things such as ``np.common_type`` > and > ``np.can_cast`` are some of the most important interfaces. Once they > support ``np.common_type``, it is (for the most part) possible to > find > the correct ufunc loop to call, since most ufuncs -- such as add -- > effectively > only require ``np.result_type``:: > > >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2) > > and `~numpy.result_type` is largely identical to > `~numpy.common_type`. > > > Fixed, high precision math > """""""""""""""""""""""""" > > Allowing arbitrary precision or higher precision math is important in > simulations. 
For instance ``mpmath`` defines a precision:: > > >>> import mpmath as mp > >>> print(mp.dps) # the current (default) precision > 15 > > NumPy should be able to construct a native, memory-efficient array > from > a list of ``mpmath.mpf`` floating point objects:: > > >>> arr_15_dps = np.array(mp.arange(3)) # (mp.arange returns a > list) > >>> print(arr_15_dps) # Must find the correct precision from the > objects: > array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15]) > > We should also be able to specify the desired precision when > creating the datatype for an array. Here, we use ``np.dtype[mp.mpf]`` > to find the DType class (the notation is not part of this NEP), > which is then instantiated with the desired parameter. > This could also be written as ``MpfDType`` class:: > > >>> arr_100_dps = np.array([1, 2, 3], > dtype=np.dtype[mp.mpf](dps=100)) > >>> print(arr_15_dps + arr_100_dps) > array(['0.0', '2.0', '4.0'], dtype=mpf[dps=100]) > > The ``mpf`` datatype can decide that the result of the operation > should be the > higher precision one of the two, so uses a precision of 100. > Furthermore, we should be able to define casting, for example as in:: > > >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype, > casting="safe") > True > >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype, > casting="safe") > False # loses precision > >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype, > casting="same_kind") > True > > Casting from float is a probably always at least a ``same_kind`` > cast, but > in general, it is not safe:: > > >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4), > casting="safe") > False > > since a float64 has a higer precision than the ``mpf`` datatype with > ``dps=4``. > > Alternatively, we can say that:: > > >>> np.common_type(np.dtype[mp.mpf](dps=5), > np.dtype[mp.mpf](dps=10)) > np.dtype[mp.mpf](dps=10) > > And possibly even:: > > >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64) > np.dtype[mp.mpf](dps=16) # equivalent precision to float64 (I > believe) > > since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)`` > safely. > > > Categoricals > """""""""""" > > Categoricals are interesting in that they can have fixed, predefined > values, > or can be dynamic with the ability to modify categories when > necessary. > The fixed categories (defined ahead of time) is the most straight > forward > categorical definition. > Categoricals are *hard*, since there are many strategies to implement > them, > suggesting NumPy should only provide the scaffolding for user-defined > categorical types. For instance:: > > >>> cat = Categorical(["eggs", "spam", "toast"]) > >>> breakfast = array(["eggs", "spam", "eggs", "toast"], > dtype=cat) > > could store the array very efficiently, since it knows that there are > only 3 > categories. > Since a categorical in this sense knows almost nothing about the data > stored > in it, few operations makes, sense, although equality does: > > >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"], > dtype=cat) > >>> breakfast == breakfast2 > array[True, False, True, False]) > > The categorical datatype could work like a dictionary: no two > items names can be equal (checked on dtype creation), so that the > equality > operation above can be performed very efficiently. > If the values define an order, the category labels (internally > integers) could > be ordered the same way to allow efficient sorting and comparison. 
> > Whether or not casting is defined from one categorical with less to > one with > strictly more values defined, is something that the Categorical > datatype would > need to decide. Both options should be available. > > > Unit on the Datatype > """""""""""""""""""" > > There are different ways to define Units, depending on how the > internal > machinery would be organized, one way is to have a single Unit > datatype > for every existing numerical type. > This will be written as ``Unit[float64]``, the unit itself is part of > the > DType instance ``Unit[float64]("m")`` is a ``float64`` with meters > attached:: > > >>> from astropy import units > >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m # > meters > >>> print(meters) > array([1.0, 2.0, 3.0], dtype=Unit[float64]("m")) > > Note that units are a bit tricky. It is debatable, whether:: > > >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m")) > > should be valid syntax (coercing the float scalars without a unit to > meters). > Once the array is created, math will work without any issue:: > > >>> meters / (2 * unit.seconds) > array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s")) > > Casting is not valid from one unit to the other, but can be valid > between > different scales of the same dimensionality (although this may be > "unsafe"):: > > >>> meters.astype(Unit[float64]("s")) > TypeError: Cannot cast meters to seconds. > >>> meters.astype(Unit[float64]("km")) > >>> # Convert to centimeter-gram-second (cgs) units: > >>> meters.astype(meters.dtype.to_cgs()) > > The above notation is somewhat clumsy. Functions > could be used instead to convert between units. > There may be ways to make these more convenient, but those must be > left > for future discussions:: > > >>> units.convert(meters, "km") > >>> units.to_cgs(meters) > > There are some open questions. For example, whether additional > methods > on the array object could exist to simplify some of the notions, and > how these > would percolate from the datatype to the ``ndarray``. > > The interaction with other scalars would likely be defined through:: > > >>> np.common_type(np.float64, Unit) > Unit[np.float64](dimensionless) > > Ufunc output datatype determination can be more involved than for > simple > numerical dtypes since there is no "universal" output type:: > > >>> np.multiply(meters, seconds).dtype != np.result_type(meters, > seconds) > > In fact ``np.result_type(meters, seconds)`` must error without > context > of the operation being done. > This example highlights how the specific ufunc loop > (loop with known, specific DTypes as inputs), has to be able to to > make > certain decisions before the actual calculation can start. > > > > Implementation > -------------- > > Plan to Approach the Full Refactor > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > To address these issues in NumPy and enable new datatypes, > multiple development stages are required: > > * Phase I: Restructure and extend the datatype infrastructure (This > NEP) > > * Organize Datatypes like normal Python classes [`PR 15508`]_ > > * Phase II: Incrementally define or rework API > > * Create a new and easily extensible API for defining new datatypes > and related functionality. (NEP 42) > > * Incrementally define all necessary functionality through the new > API (NEP 42): > > * Defining operations such as ``np.common_type``. > * Allowing to define casting between datatypes. > * Add functionality necessary to create a numpy array from Python > scalars > (i.e. ``np.array(...)``). > * ? 
>
> * Restructure how universal functions work (NEP 43), in order to:
>
>   * make it possible to allow a `~numpy.ufunc` such as ``np.add`` to be
>     extended by user-defined datatypes such as Units.
>
>   * allow efficient lookup of the correct implementation for
>     user-defined datatypes.
>
>   * enable reuse of existing code. Units should be able to use the
>     normal math loops and add additional logic to determine the output
>     type.
>
> * Phase III: Growth of NumPy and Scientific Python Ecosystem
>   capabilities:
>
>   * Clean up legacy behaviour where it is considered buggy or
>     undesirable.
>   * Provide a path to define new datatypes from Python.
>   * Assist the community in creating types such as Units or
>     Categoricals.
>   * Allow strings to be used in functions such as ``np.equal`` or
>     ``np.add``.
>   * Remove legacy code paths within NumPy to improve long-term
>     maintainability.
>
> This document serves as a basis for Phase I and provides the vision and
> motivation for the full project.
> Phase I does not introduce any new user-facing features,
> but is concerned with the necessary conceptual cleanup of the current
> datatype system.
> It provides a more "Pythonic" datatype type object, with a clear class
> hierarchy.
>
> The second phase is the incremental creation of all APIs necessary to
> define fully featured datatypes and the reorganization of the NumPy
> datatype system.
> This phase will thus be primarily concerned with defining an, initially
> preliminary, stable public API.
>
> Some of the benefits of a large refactor may only become evident after
> the full deprecation of the current legacy implementation (i.e. larger
> code removals).
> However, these steps are necessary for improvements to many parts of the
> core NumPy API, and are expected to make the implementation generally
> easier to understand.
>
> The following figure illustrates the proposed design at a high level,
> and roughly delineates the components of the overall design.
> Note that this NEP only regards Phase I (shaded area);
> the rest encompasses Phase II, and those design choices are up for
> discussion. The figure nevertheless highlights that the DType datatype
> class is the central, necessary concept:
>
> .. image:: _static/nep-0041-mindmap.svg
>
>
> First steps directly related to this NEP
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The changes required to NumPy are large and touch many areas of the
> code base, but many of these changes can be addressed incrementally.
>
> To enable an incremental approach we will start by creating a C-defined
> ``PyArray_DTypeMeta`` class with its instances being the ``DType``
> classes, subclasses of ``np.dtype``.
> This is necessary to add the ability to store custom slots on the DType
> in C.
> This ``DTypeMeta`` will be implemented first to then enable incremental
> restructuring of current code.
>
> The addition of ``DType`` will then enable addressing other changes
> incrementally, some of which may begin before settling the full internal
> API:
>
> 1. New machinery for array coercion, with the goal of enabling user
>    DTypes with appropriate class methods.
> 2. The replacement or wrapping of the current casting machinery.
> 3. Incremental redefinition of the current ``PyArray_ArrFuncs`` slots
>    into DType method slots.
>
> At this point, no or only very limited new public API will be added and
> the internal API is considered to be in flux.
> Any new public API may be set up to give warnings and will have leading
> underscores to indicate that it is not finalized and can be changed
> without warning.
>
>
> Backward compatibility
> ----------------------
>
> While the actual backward compatibility impact of implementing Phases I
> and II is not yet fully clear, we anticipate, and accept, the following
> changes:
>
> * **Python API**:
>
>   * ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``, while
>     right now ``type(np.dtype("f8")) is np.dtype``.
>     Code should use ``isinstance`` checks, and in very rare cases may
>     have to be adapted to use them.
>
> * **C-API**:
>
>   * In old versions of NumPy ``PyArray_DescrCheck`` is a macro which
>     uses ``type(dtype) is np.dtype``. When compiling against an old
>     NumPy version, the macro may have to be replaced with the
>     corresponding ``PyObject_IsInstance`` call. (If this is a problem,
>     we could backport fixing the macro.)
>
>   * The UFunc machinery changes will break *limited* parts of the
>     current implementation. Replacing e.g. the default ``TypeResolver``
>     is expected to remain supported for a time, although optimized
>     masked inner-loop iteration (which is not even used *within* NumPy)
>     will no longer be supported.
>
>   * All functions currently defined on the dtypes, such as
>     ``PyArray_Descr->f->nonzero``, will be defined and accessed
>     differently. This means that in the long run low-level access code
>     will have to be changed to use the new API. Such changes are
>     expected to be necessary in very few projects.
>
> * **dtype implementors (C-API)**:
>
>   * The array which is currently provided to some functions (such as
>     cast functions) will no longer be provided.
>     For example ``PyArray_Descr->f->nonzero`` or
>     ``PyArray_Descr->f->copyswapn`` may instead receive a dummy array
>     object with only some fields (mainly the dtype) being valid.
>     At least in some code paths, a similar mechanism is already used.
>
>   * The ``scalarkind`` slot and registration of scalar casting will be
>     removed/ignored without replacement.
>     It currently allows partial value-based casting.
>     The ``PyArray_ScalarKind`` function will continue to work for
>     builtin types, but will not be used internally and will be
>     deprecated.
>
>   * Currently user dtypes are defined as instances of ``np.dtype``.
>     The creation works by the user providing a prototype instance.
>     NumPy will need to modify at least the type during registration.
>     This has no effect for either ``rational`` or ``quaternion``, and
>     mutation of the structure seems unlikely after registration.
>
> Since there is a fairly large API surface concerning datatypes, further
> changes, or the limitation of certain functions to currently existing
> datatypes, are likely to occur.
> For example, functions which use the type number as input should be
> replaced with functions taking DType classes instead.
> Although public, large parts of this C-API seem to be used rarely,
> possibly never, by downstream projects.
>
>
>
> Detailed Description
> --------------------
>
> This section details the design decisions covered by this NEP.
> The subsections correspond to the list of design choices presented
> in the Scope section.
>
> Datatypes as Python Classes (1)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The current NumPy datatypes are not full-scale Python classes.
> They are instead (prototype) instances of a single ``np.dtype`` class.
> Changing this means that any special handling, e.g.
for ``datetime``
> can be moved to the Datetime DType class instead, away from monolithic
> general code (e.g. the current ``PyArray_AdjustFlexibleDType``).
>
> The main consequence of this change with respect to the API is that
> special methods move from the dtype instances to methods on the new
> DType class.
> This is the typical design pattern used in Python.
> Organizing these methods and information in a more Pythonic way provides
> a solid foundation for refining and extending the API in the future.
> The current API cannot be extended due to how it is exposed publicly.
> This means for example that the methods currently stored in
> ``PyArray_ArrFuncs`` on each datatype (see NEP 40) will be defined
> differently in the future and deprecated in the long run.
>
> The most prominent visible side effect of this will be that
> ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore.
> Instead it will be a subclass of ``np.dtype``, meaning that
> ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true.
> This will also add the ability to use
> ``isinstance(dtype, np.dtype[float64])``, thus removing the need to use
> ``dtype.kind``, ``dtype.char``, or ``dtype.type`` to do this check.
>
> With the design decision of DTypes as full-scale Python classes,
> the question of subclassing arises.
> Inheritance, however, appears problematic and a complexity best avoided
> (at least initially) for container datatypes.
> Further, subclasses may be more interesting for interoperability (for
> example with GPU backends such as CuPy storing additional GPU-related
> methods) than as a mechanism to define new datatypes.
> Since a class hierarchy does provide value, this may be achieved by
> allowing the creation of *abstract* datatypes.
> An example of an abstract datatype would be the datatype equivalent of
> ``np.floating``, representing any floating point number.
> These can serve the same purpose as Python's abstract base classes.
>
>
> Scalars should not be instances of the datatypes (2)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> For simple datatypes such as ``float64`` (see also below), it seems
> tempting that an instance of ``np.dtype("float64")`` could be the
> scalar.
> This idea may be even more appealing due to the fact that scalars,
> rather than datatypes, currently define a useful type hierarchy.
>
> However, we have specifically decided against this for a number of
> reasons.
> First, the new datatypes described herein would be instances of DType
> classes.
> Making these instances themselves classes, while possible, adds
> additional complexity that users need to understand.
> It would also mean that scalars must have storage information (such as
> byteorder) which is generally unnecessary and currently is not used.
> Second, while the simple NumPy scalars such as ``float64`` may be such
> instances, it should be possible to create datatypes for Python objects
> without enforcing NumPy as a dependency.
> However, Python objects that do not depend on NumPy cannot be instances
> of a NumPy DType.
> Third, there is a mismatch between the methods and attributes which are
> useful for scalars and datatypes. For instance ``to_float()`` makes
> sense for a scalar but not for a datatype, and ``newbyteorder`` is not
> useful on a scalar (or has a different meaning).
>
> Overall, it seems that, rather than reducing the complexity
by merging the two distinct type hierarchies, making scalars instances
> of DTypes would increase the complexity of both the design and the
> implementation.
>
> A possible future path may be to instead simplify the current NumPy
> scalars to be much simpler objects which largely derive their behaviour
> from the datatypes.
>
> C-API for creating new Datatypes (3)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The current C-API with which users can create new datatypes is limited
> in scope, and requires the use of "private" structures. This means the
> API is not extensible: no new members can be added to the structure
> without losing binary compatibility.
> This has already limited the inclusion of new sorting methods into
> NumPy [new_sort]_.
>
> The new version shall thus replace the current ``PyArray_ArrFuncs``
> structure used to define new datatypes.
> Datatypes that currently exist and are defined using these slots will
> be supported during a deprecation period.
>
> The most likely solution to hide the implementation from the user, and
> thus make it extensible in the future, is to model the API after
> Python's stable API [PEP-384]_:
>
> .. code-block:: C
>
>     static struct PyArrayMethodDef slots[] = {
>         {NPY_dt_method, method_implementation},
>         ...,
>         {0, NULL}
>     };
>
>     typedef struct {
>         PyTypeObject *typeobj;  /* type of python scalar */
>         ...;
>         PyType_Slot *slots;
>     } PyArrayDTypeMeta_Spec;
>
>     PyObject* PyArray_InitDTypeMetaFromSpec(
>             PyArray_DTypeMeta *user_dtype,
>             PyArrayDTypeMeta_Spec *dtype_spec);
>
> The C-side slots should be designed to mirror Python-side methods such
> as ``dtype.__dtype_method__``, although the exposure to Python is a
> later step in the implementation to reduce the complexity of the
> initial implementation.
>
>
> C-API Changes to the UFunc Machinery (4)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Proposed changes to the UFunc machinery will be part of NEP 43.
> However, the following changes will be necessary (see NEP 40 for a
> detailed description of the current implementation and its issues):
>
> * The current UFunc type resolution must be adapted to allow better
>   control for user-defined dtypes as well as to resolve current
>   inconsistencies.
> * The inner loop used in UFuncs must be expanded to include a return
>   value. Further, error reporting must be improved, and passing in
>   dtype-specific information enabled.
>   This requires the modification of the inner-loop function signature
>   and the addition of new hooks called before and after the inner loop
>   is used.
>
> An important goal for any changes to the universal functions will be to
> allow the reuse of existing loops.
> It should be easy for a new units datatype to fall back to existing
> math functions after handling the unit-related computations.
>
>
> Discussion
> ----------
>
> See NEP 40 for a list of previous meetings and discussions.
>
>
> References
> ----------
>
> .. [pandas_extension_arrays]
>    https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types
>
> .. _xarray_dtype_issue: https://github.com/pydata/xarray/issues/1262
>
> .. [pygeos] https://github.com/caspervdw/pygeos
>
> .. [new_sort] https://github.com/numpy/numpy/pull/12945
>
> .. [PEP-384] https://www.python.org/dev/peps/pep-0384/
>
> .. [PR 15508] https://github.com/numpy/numpy/pull/15508
>
>
> Copyright
> ---------
>
> This document has been placed in the public domain.
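To make the ``isinstance`` point from the backward-compatibility section
above concrete, a minimal sketch (assuming a NumPy version in which
Phase I has landed; released versions at the time of writing still have
``type(np.dtype("f8")) is np.dtype``)::

    import numpy as np

    f8 = np.dtype("f8")

    # Identity checks against the class will break once dtypes become
    # instances of per-datatype subclasses:
    print(type(f8) is np.dtype)      # True before the change, False after

    # isinstance checks keep working either way, which is why the NEP
    # recommends them:
    print(isinstance(f8, np.dtype))  # True before and after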
>
> Acknowledgments
> ---------------
>
> The effort to create new datatypes for NumPy has been discussed for
> several years in many different contexts and settings, making it
> impossible to list everyone involved.
> We would like to thank especially Stephan Hoyer, Nathaniel Smith, and
> Eric Wieser for repeated in-depth discussion about datatype design.
> We are very grateful for the community input in reviewing and revising
> this NEP and would like to thank especially Ross Barnowski and Ralf
> Gommers.
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part
URL: 

From cmeyer1969 at gmail.com Tue Mar 17 18:34:10 2020
From: cmeyer1969 at gmail.com (Chris Meyer)
Date: Tue, 17 Mar 2020 15:34:10 -0700
Subject: [Numpy-discussion] Proposal: NEP 41 -- First step towards a new
 Datatype System
In-Reply-To: <909ccfa50c89620b6440585e7dac4d2c521d62fc.camel@sipsolutions.net>
References: <3ee04012570e418a874e4c436b5b6f8ac5f56cdb.camel@sipsolutions.net>
 <909ccfa50c89620b6440585e7dac4d2c521d62fc.camel@sipsolutions.net>
Message-ID: <053AD87E-4024-4211-9BCB-7B60A4F7F92D@gmail.com>

> On Mar 17, 2020, at 1:02 PM, Sebastian Berg wrote:
>
> in the spirit of trying to keep this moving, can I assume that the main
> reason for little discussion is that the actual changes proposed are
> not very far reaching as of now? Or is the reason that this is a
> fairly complex topic that you need more time to think about it?
> If it is the latter, is there some way I can help with it? I tried to
> minimize how much is part of this initial NEP.

One reason for not responding is that it seems a lot of discussion of
this has already taken place and this NEP is presented more as a
concluding summary rather than a discussion point.

I implement scientific imaging software and overall this NEP looks
useful.

My only caveat is that I don't think tracking physical units should be
a primary use case. Units are fundamentally different from data types,
even though there are libraries out there that treat them more like
data types.

For instance, it makes sense to have the same physical unit but with
different storage types: data with nanometer physical units can be
stored as a float32 or as an int16 and be equally useful.

In addition, a unit is something that is mutated by the operation. For
instance, reducing a 2D image with physical units by a factor of two in
each dimension produces a different unit scaling (1km/pixel goes to
2km/pixel); whereas cropping the center half does not (1km/pixel stays
as 1km/pixel).

Finally, units may be different for each axis in multidimensional data.
For instance, we may want a float32 array with two dimensions where the
units on one dimension are temporal and on the other spatial
(3 seconds x 50 nm).

I'm not sure these comments take away from this NEP, but maybe there is
another approach for units: metadata about the shape of the data rather
than a new datatype for physical units. We do this in our software
already, but it would be helpful if NumPy had a built-in mechanism for
that.
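A minimal sketch of the per-axis metadata Chris describes, kept next to
the array rather than in the dtype (the ``AxisCalibration`` helper is
hypothetical, not a NumPy or NEP API)::

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class AxisCalibration:   # hypothetical helper, not a NumPy API
        scale: float         # e.g. kilometres per pixel
        unit: str            # e.g. "km"

    # A float32 image with per-axis metadata kept alongside the array.
    image = np.zeros((512, 512), dtype=np.float32)
    cal = [AxisCalibration(1.0, "km"), AxisCalibration(1.0, "km")]

    # Downsampling by two mutates the calibration
    # (1 km/pixel becomes 2 km/pixel) ...
    small = image[::2, ::2]
    small_cal = [AxisCalibration(c.scale * 2, c.unit) for c in cal]

    # ... while cropping the centre leaves it untouched.
    crop = image[128:384, 128:384]
    crop_cal = list(cal)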
From rmay31 at gmail.com Tue Mar 17 19:34:07 2020
From: rmay31 at gmail.com (Ryan May)
Date: Tue, 17 Mar 2020 17:34:07 -0600
Subject: [Numpy-discussion] Proposal: NEP 41 -- First step towards a new
 Datatype System
In-Reply-To: <053AD87E-4024-4211-9BCB-7B60A4F7F92D@gmail.com>
References: <3ee04012570e418a874e4c436b5b6f8ac5f56cdb.camel@sipsolutions.net>
 <909ccfa50c89620b6440585e7dac4d2c521d62fc.camel@sipsolutions.net>
 <053AD87E-4024-4211-9BCB-7B60A4F7F92D@gmail.com>
Message-ID: 

On Tue, Mar 17, 2020 at 4:35 PM Chris Meyer wrote:

> > On Mar 17, 2020, at 1:02 PM, Sebastian Berg
> wrote:
> >
> > in the spirit of trying to keep this moving, can I assume that the main
> > reason for little discussion is that the actual changes proposed are
> > not very far reaching as of now? Or is the reason that this is a
> > fairly complex topic that you need more time to think about it?
> > If it is the latter, is there some way I can help with it? I tried to
> > minimize how much is part of this initial NEP.
>
> One reason for not responding is that it seems a lot of discussion of
> this has already taken place and this NEP is presented more as a
> concluding summary rather than a discussion point.
>
> I implement scientific imaging software and overall this NEP looks
> useful.
>
> My only caveat is that I don't think tracking physical units should be
> a primary use case. Units are fundamentally different from data types,
> even though there are libraries out there that treat them more like
> data types.

I strongly disagree. Right now, you need to implement a custom container
to handle units, which makes it exceedingly difficult to then properly
interact with other array_like objects, like dask, pandas, and xarray;
handling units is completely orthogonal to handling slicing operations,
data access, etc., so having to implement a container is overkill. Unit
information describes the type of each of the elements within an array,
including describing how operations between individual elements work.
This sounds exactly like a dtype to me.

> For instance, it makes sense to have the same physical unit but with
> different storage types: data with nanometer physical units can be
> stored as a float32 or as an int16 and be equally useful.

Yes, you would have the unit tracking as a mixin that would allow
different storage types, absolutely.

> In addition, a unit is something that is mutated by the operation. For
> instance, reducing a 2D image with physical units by a factor of two in
> each dimension produces a different unit scaling (1km/pixel goes to
> 2km/pixel); whereas cropping the center half does not (1km/pixel stays
> as 1km/pixel).

I'm not sure what your point is. Dtypes can change for some operations
(np.sqrt(np.arange(5)) produces a float) while staying the same for
others (e.g. addition).

> Finally, units may be different for each axis in multidimensional data.
> For instance, we may want a float32 array with two dimensions where the
> units on one dimension are temporal and on the other spatial
> (3 seconds x 50 nm).

The units for an array describe the elements *within* the array; they
have nothing to do with the dimensions. So for an array of image data,
e.g. brightness temperatures, you would have physical units (e.g.
Kelvin). You would have separate arrays of coordinates describing the
spatial extent of the data along the relevant dimensions; each of these
arrays of coordinates would have its own physical quantity information.

Ryan

-- 
Ryan May
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
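The dtype behaviour referred to here is easy to verify with plain NumPy
(the integer dtype is ``int64`` on typical 64-bit platforms)::

    import numpy as np

    a = np.arange(5)           # integer input
    print(np.sqrt(a).dtype)    # float64: sqrt changes the dtype
    print((a + a).dtype)       # int64:   addition keeps the dtype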
From cmeyer1969 at gmail.com Tue Mar 17 19:50:02 2020
From: cmeyer1969 at gmail.com (Chris Meyer)
Date: Tue, 17 Mar 2020 16:50:02 -0700
Subject: [Numpy-discussion] Proposal: NEP 41 -- First step towards a new
 Datatype System
In-Reply-To: 
References: <3ee04012570e418a874e4c436b5b6f8ac5f56cdb.camel@sipsolutions.net>
 <909ccfa50c89620b6440585e7dac4d2c521d62fc.camel@sipsolutions.net>
 <053AD87E-4024-4211-9BCB-7B60A4F7F92D@gmail.com>
Message-ID: <2E876537-A3A3-43D3-9B39-0BD79A6C0118@gmail.com>

> My only caveat is that I don't think tracking physical units should be
> a primary use case. Units are fundamentally different from data types,
> even though there are libraries out there that treat them more like
> data types.
>
> I strongly disagree. Right now, you need to implement a custom
> container to handle units, which makes it exceedingly difficult to then
> properly interact with other array_like objects, like dask, pandas, and
> xarray; handling units is completely orthogonal to handling slicing
> operations, data access, etc., so having to implement a container is
> overkill. Unit information describes the type of each of the elements
> within an array, including describing how operations between individual
> elements work. This sounds exactly like a dtype to me.

Yes, I see your point.

> Finally, units may be different for each axis in multidimensional data.
> For instance, we may want a float32 array with two dimensions where the
> units on one dimension are temporal and on the other spatial
> (3 seconds x 50 nm).
>
> The units for an array describe the elements *within* the array; they
> have nothing to do with the dimensions. So for an array of image data,
> e.g. brightness temperatures, you would have physical units (e.g.
> Kelvin). You would have separate arrays of coordinates describing the
> spatial extent of the data along the relevant dimensions; each of these
> arrays of coordinates would have its own physical quantity information.

Again, you are correct. I'm asking for similar metadata attached to the
data shape rather than the data type.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From cmeyer1969 at gmail.com Tue Mar 17 20:04:45 2020
From: cmeyer1969 at gmail.com (Chris Meyer)
Date: Tue, 17 Mar 2020 17:04:45 -0700
Subject: [Numpy-discussion] Proposal: NEP 41 -- First step towards a new
 Datatype System
In-Reply-To: <2E876537-A3A3-43D3-9B39-0BD79A6C0118@gmail.com>
References: <3ee04012570e418a874e4c436b5b6f8ac5f56cdb.camel@sipsolutions.net>
 <909ccfa50c89620b6440585e7dac4d2c521d62fc.camel@sipsolutions.net>
 <053AD87E-4024-4211-9BCB-7B60A4F7F92D@gmail.com>
 <2E876537-A3A3-43D3-9B39-0BD79A6C0118@gmail.com>
Message-ID: <03EB8EF5-8C7F-4F9F-A073-B31CE82F46FA@gmail.com>

> On Mar 17, 2020, at 4:50 PM, Chris Meyer wrote:
>
>> The units for an array describe the elements *within* the array; they
>> have nothing to do with the dimensions. So for an array of image data,
>> e.g. brightness temperatures, you would have physical units (e.g.
>> Kelvin). You would have separate arrays of coordinates describing the
>> spatial extent of the data along the relevant dimensions; each of
>> these arrays of coordinates would have its own physical quantity
>> information.
>
> Again, you are correct. I'm asking for similar metadata attached to the
> data shape rather than the data type.

This would be better worded as "similar metadata attached to the data
shape _in addition to_ the data type".
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ralf.gommers at gmail.com Wed Mar 18 11:16:21 2020
From: ralf.gommers at gmail.com (Ralf Gommers)
Date: Wed, 18 Mar 2020 16:16:21 +0100
Subject: [Numpy-discussion] Proposal: NEP 41 -- First step towards a new
 Datatype System
In-Reply-To: <909ccfa50c89620b6440585e7dac4d2c521d62fc.camel@sipsolutions.net>
References: <3ee04012570e418a874e4c436b5b6f8ac5f56cdb.camel@sipsolutions.net>
 <909ccfa50c89620b6440585e7dac4d2c521d62fc.camel@sipsolutions.net>
Message-ID: 

On Tue, Mar 17, 2020 at 9:03 PM Sebastian Berg wrote:

> Hi all,
>
> in the spirit of trying to keep this moving, can I assume that the main
> reason for little discussion is that the actual changes proposed are
> not very far reaching as of now? Or is the reason that this is a
> fairly complex topic that you need more time to think about it?
>

Probably (a) it's a long NEP on a complex topic, (b) the past week has
been a very weird week for everyone (in the extra news-reading time I
could easily have re-reviewed the NEP), and (c) the amount of feedback
one expects to get on a NEP is roughly inversely proportional to the
scope and complexity of the NEP contents.

Today I re-read the parts I commented on before. This version is a big
improvement over the previous ones. Thanks in particular for adding
clear examples and the diagram, it helps a lot.

> If it is the latter, is there some way I can help with it? I tried to
> minimize how much is part of this initial NEP.
>
> If there is not much need for discussion, I would like to officially
> accept the NEP very soon, sending out an official one-week notice in
> the next days.
>

I agree. I think I would like to keep the option open, though, to come
back to the NEP later to improve the clarity of the text about
motivation/plan/examples/scope, given that this will be the reference
for a major amount of work for a long time to come.

> To summarize one more time, the main point is that:

This point seems fine, and I'm +1 for going ahead with the described
parts of the technical design.

Cheers,
Ralf

> type(np.dtype(np.float64))
>
> will be `np.dtype[float64]`, a subclass of dtype, so that:
>
> issubclass(np.dtype[float64], np.dtype)
>
> is true. This means that we will have one class for every current type
> number: `dtype.num`. The implementation of these subclasses will be a
> C-written (extension) MetaClass; all details of this class are supposed
> to remain experimental and in flux at this time.
>
> Cheers
>
> Sebastian
>
> On Wed, 2020-03-11 at 17:02 -0700, Sebastian Berg wrote:
> > Hi all,
> >
> > I am pleased to propose NEP 41: First step towards a new Datatype
> > System https://numpy.org/neps/nep-0041-improved-dtype-support.html
> >
> > [The remainder of the quoted proposal, the full NEP 41 text, is
> > snipped here; it duplicates verbatim the version quoted earlier in
> > this thread.]
[pandas_extension_arrays] > > > https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types > > > > .. _xarray_dtype_issue: https://github.com/pydata/xarray/issues/1262 > > > > .. [pygeos] https://github.com/caspervdw/pygeos > > > > .. [new_sort] https://github.com/numpy/numpy/pull/12945 > > > > .. [PEP-384] https://www.python.org/dev/peps/pep-0384/ > > > > .. [PR 15508] https://github.com/numpy/numpy/pull/15508 > > > > > > Copyright > > --------- > > > > This document has been placed in the public domain. > > > > > > Acknowledgments > > --------------- > > > > The effort to create new datatypes for NumPy has been discussed for > > several > > years in many different contexts and settings, making it impossible > > to list everyone involved. > > We would like to thank especially Stephan Hoyer, Nathaniel Smith, and > > Eric Wieser > > for repeated in-depth discussion about datatype design. > > We are very grateful for the community input in reviewing and > > revising this > > NEP and would like to thank especially Ross Barnowski and Ralf > > Gommers. > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at python.org > > https://mail.python.org/mailman/listinfo/numpy-discussion > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pierre.debuyl at kuleuven.be Wed Mar 18 18:17:35 2020 From: pierre.debuyl at kuleuven.be (Pierre de Buyl) Date: Wed, 18 Mar 2020 23:17:35 +0100 Subject: [Numpy-discussion] Cancellation of EuroSciPy 2020 Message-ID: <20200318221735.hjohcs2wzvxkt6qt@pierre-hp.fys.kuleuven.be> Dear NumPy and SciPy communities, It is with regret that we announce the cancellation of #euroscipy 2020 due to the coronavirus outbreak, see https://euroscipy.org/2020/program.html Stay safe, hasta pronto and remember social distancing and to wash your hands for 20 seconds. The EuroSciPy team. From meissner at hawaii.edu Fri Mar 20 14:05:21 2020 From: meissner at hawaii.edu (Gunter Meissner) Date: Fri, 20 Mar 2020 08:05:21 -1000 Subject: [Numpy-discussion] Error in Covariance and Variance calculation Message-ID: <007001d5fee2$22062630$66127290$@hawaii.edu> Dear Programmers, This is Gunter Meissner. I am currently writing a book on Forecasting and derived the regression coefficient with Numpy: import numpy as np X=[1,2,3,4] Y=[10000,8000,5000,1000] print(np.cov(X,Y)) print(np.var(X)) Beta1 = np.cov(X,Y)/np.var(X) print(Beta1) However, Numpy is using the SAMPLE covariance , (which divides by n-1) and the POPULATION variance VarX = (which divides by n). Therefore the regression coefficient BETA1 is not correct. The solution is easy: Please use the population approach (dividing by n) for BOTH covariance and variance or use the sample approach (dividing by n-1) for BOTH covariance and variance. You may also allow the user to use both as in EXCEL, where the user can choose between Var.S and Var.P and Cov.P and Var.P. Thanks!!! 
Gunter Gunter Meissner, PhD University of Hawaii Adjunct Professor of MathFinance at Columbia University and NYU President of Derivatives Software www.dersoft.com CEO Cassandra Capital Management www.cassandracm.com CV: www.dersoft.com/cv.pdf Email: meissner at hawaii.edu Tel: USA (808) 779 3660 From: NumPy-Discussion On Behalf Of Ralf Gommers Sent: Wednesday, March 18, 2020 5:16 AM To: Discussion of Numerical Python Subject: Re: [Numpy-discussion] Proposal: NEP 41 -- First step towards a new Datatype System On Tue, Mar 17, 2020 at 9:03 PM Sebastian Berg > wrote: Hi all, in the spirit of trying to keep this moving, can I assume that the main reason for little discussion is that the actual changes proposed are not very far reaching as of now? Or is the reason that this is a fairly complex topic that you need more time to think about it? Probably (a) it's a long NEP on a complex topic, (b) the past week has been a very weird week for everyone (in the extra-news-reading-time I could easily have re-reviewed the NEP), and (c) the amount of feedback one expects to get on a NEP is roughly inversely proportional to the scope and complexity of the NEP contents. Today I re-read the parts I commented on before. This version is a big improvement over the previous ones. Thanks in particular for adding clear examples and the diagram, it helps a lot. If it is the latter, is there some way I can help with it? I tried to minimize how much is part of this initial NEP. If there is not much need for discussion, I would like to officially accept the NEP very soon, sending out an official one week notice in the next days. I agree. I think I would like to keep the option open though to come back to the NEP later to improve the clarity of the text about motivation/plan/examples/scope, given that this will be the reference for a major amount of work for a long time to come. To summarize one more time, the main point is that: This point seems fine, and I'm +1 for going ahead with the described parts of the technical design. Cheers, Ralf type(np.dtype(np.float64)) will be `np.dtype[float64]`, a subclass of dtype, so that: issubclass(np.dtype[float64], np.dtype) is true. This means that we will have one class for every current type number: `dtype.num`. The implementation of these subclasses will be a C-written (extension) MetaClass, all details of this class are supposed to remain experimental in flux at this time. Cheers Sebastian On Wed, 2020-03-11 at 17:02 -0700, Sebastian Berg wrote: > Hi all, > > I am pleased to propose NEP 41: First step towards a new Datatype > System https://numpy.org/neps/nep-0041-improved-dtype-support.html > > This NEP motivates the larger restructure of the datatype machinery > in > NumPy and defines a few fundamental design aspects. The long term > user > impact will be allowing easier and more rich featured user defined > datatypes. > > As this is a large restructure, the NEP represents only the first > steps > with some additional information in further NEPs being drafted [1] > (this may be helpful to look at depending on the level of detail you > are interested in). > The NEP itself does not propose to add significant new public API. > Instead it proposes to move forward with an incremental internal > refactor and lays the foundation for this process. > > The main user facing change at this time is that datatypes will > become > classes (e.g. ``type(np.dtype("float64"))`` will be a float64 > specific > class. 
> For most users, the main impact should be many new datatypes in the > long run (see the user impact section). However, for those interested > in API design within NumPy or with respect to implementing new > datatypes, this and the following NEPs are important decisions in the > future roadmap for NumPy. > > The current full text is reproduced below, although the above link is > probably a better way to read it. > > Cheers > > Sebastian > > > [1] NEP 40 gives some background information about the current > systems > and issues with it: > https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac/doc/neps/nep-0040-legacy-datatype-impl.rst > and NEP 42 being a first draft of how the new API may look like: > > https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst > (links to current rendered versions, check > https://github.com/numpy/numpy/pull/15505 and > https://github.com/numpy/numpy/pull/15507 for updates) > > > ------------------------------------------------------------------- > --- > > > ================================================= > NEP 41 ? First step towards a new Datatype System > ================================================= > > :title: Improved Datatype Support > :Author: Sebastian Berg > :Author: St?fan van der Walt > :Author: Matti Picus > :Status: Draft > :Type: Standard Track > :Created: 2020-02-03 > > > .. note:: > > This NEP is part of a series of NEPs encompassing first > information > about the previous dtype implementation and issues with it in NEP > 40. > NEP 41 (this document) then provides an overview and generic > design > choices for the refactor. > Further NEPs 42 and 43 go into the technical details of the > datatype > and universal function related internal and external API changes. > In some cases it may be necessary to consult the other NEPs for a > full > picture of the desired changes and why these changes are > necessary. > > > Abstract > -------- > > `Datatypes ` in NumPy describe how to > interpret each > element in arrays. NumPy provides ``int``, ``float``, and ``complex`` > numerical > types, as well as string, datetime, and structured datatype > capabilities. > The growing Python community, however, has need for more diverse > datatypes. > Examples are datatypes with unit information attached (such as > meters) or > categorical datatypes (fixed set of possible values). > However, the current NumPy datatype API is too limited to allow the > creation > of these. > > This NEP is the first step to enable such growth; it will lead to > a simpler development path for new datatypes. > In the long run the new datatype system will also support the > creation > of datatypes directly from Python rather than C. > Refactoring the datatype API will improve maintainability and > facilitate > development of both user-defined external datatypes, > as well as new features for existing datatypes internal to NumPy. > > > Motivation and Scope > -------------------- > > .. seealso:: > > The user impact section includes examples of what kind of new > datatypes > will be enabled by the proposed changes in the long run. > It may thus help to read these section out of order. > > Motivation > ^^^^^^^^^^ > > One of the main issues with the current API is the definition of > typical > functions such as addition and multiplication for parametric > datatypes > (see also NEP 40) which require additional steps to determine the > output type. 
> For example when adding two strings of length 4, the result is a > string > of length 8, which is different from the input. > Similarly, a datatype which embeds a physical unit must calculate the > new unit > information: dividing a distance by a time results in a speed. > A related difficulty is that the :ref:`current casting rules > <_ufuncs.casting>` > -- the conversion between different datatypes -- > cannot describe casting for such parametric datatypes implemented > outside of NumPy. > > This additional functionality for supporting parametric datatypes > introduces > increased complexity within NumPy itself, > and furthermore is not available to external user-defined datatypes. > In general the concerns of different datatypes are not well well- > encapsulated. > This burden is exacerbated by the exposure of internal C structures, > limiting the addition of new fields > (for example to support new sorting methods [new_sort]_). > > Currently there are many factors which limit the creation of new > user-defined > datatypes: > > * Creating casting rules for parametric user-defined dtypes is either > impossible > or so complex that it has never been attempted. > * Type promotion, e.g. the operation deciding that adding float and > integer > values should return a float value, is very valuable for numeric > datatypes > but is limited in scope for user-defined and especially parametric > datatypes. > * Much of the logic (e.g. promotion) is written in single functions > instead of being split as methods on the datatype itself. > * In the current design datatypes cannot have methods that do not > generalize > to other datatypes. For example a unit datatype cannot have a > ``.to_si()`` method to > easily find the datatype which would represent the same values in > SI units. > > The large need to solve these issues has driven the scientific > community > to create work-arounds in multiple projects implementing physical > units as an > array-like class instead of a datatype, which would generalize better > across > multiple array-likes (Dask, pandas, etc.). > Already, Pandas has made a push into the same direction with its > extension arrays [pandas_extension_arrays]_ and undoubtedly > the community would be best served if such new features could be > common > between NumPy, Pandas, and other projects. > > Scope > ^^^^^ > > The proposed refactoring of the datatype system is a large > undertaking and > thus is proposed to be split into various phases, roughly: > > * Phase I: Restructure and extend the datatype infrastructure (This > NEP 41) > * Phase II: Incrementally define or rework API (Detailed largely in > NEPs 42/43) > * Phase III: Growth of NumPy and Scientific Python Ecosystem > capabilities. > > For a more detailed accounting of the various phases, see > "Plan to Approach the Full Refactor" in the Implementation section > below. > This NEP proposes to move ahead with the necessary creation of new > dtype > subclasses (Phase I), > and start working on implementing current functionality. > Within the context of this NEP all development will be fully private > API or > use preliminary underscored names which must be changed in the > future. > Most of the internal and public API choices are part of a second > Phase > and will be discussed in more detail in the following NEPs 42 and 43. > The initial implementation of this NEP will have little or no effect > on users, > but provides the necessary ground work for incrementally addressing > the > full rework. 
> > The implementation of this NEP and the following, implied large > rework of how > datatypes are defined in NumPy is expected to create small > incompatibilities > (see backward compatibility section). > However, a transition requiring large code adaption is not > anticipated and not > within scope. > > Specifically, this NEP makes the following design choices which are > discussed > in more details in the detailed description section: > > 1. Each datatype will be an instance of a subclass of ``np.dtype``, > with most of the > datatype-specific logic being implemented > as special methods on the class. In the C-API, these correspond to > specific > slots. In short, for ``f = np.dtype("f8")``, ``isinstance(f, > np.dtype)`` will remain true, > but ``type(f)`` will be a subclass of ``np.dtype`` rather than > just ``np.dtype`` itself. > The ``PyArray_ArrFuncs`` which are currently stored as a pointer > on the instance (as ``PyArray_Descr->f``), > should instead be stored on the class as typically done in Python. > In the future these may correspond to python side dunder methods. > Storage information such as itemsize and byteorder can differ > between > different dtype instances (e.g. "S3" vs. "S8") and will remain > part of the instance. > This means that in the long run the current lowlevel access to > dtype methods > will be removed (see ``PyArray_ArrFuncs`` in NEP 40). > > 2. The current NumPy scalars will *not* change, they will not be > instances of > datatypes. This will also be true for new datatypes, scalars will > not be > instances of a dtype (although ``isinstance(scalar, dtype)`` may > be made > to return ``True`` when appropriate). > > Detailed technical decisions to follow in NEP 42. > > Further, the public API will be designed in a way that is extensible > in the future: > > 3. All new C-API functions provided to the user will hide > implementation details > as much as possible. The public API should be an identical, but > limited, > version of the C-API used for the internal NumPy datatypes. > > The changes to the datatype system in Phase II must include a large > refactor of the > UFunc machinery, which will be further defined in NEP 43: > > 4. To enable all of the desired functionality for new user-defined > datatypes, > the UFunc machinery will be changed to replace the current > dispatching > and type resolution system. > The old system should be *mostly* supported as a legacy version > for some time. > > Additionally, as a general design principle, the addition of new > user-defined > datatypes will *not* change the behaviour of programs. > For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or > ``b`` know > that ``c`` exists. > > > User Impact > ----------- > > The current ecosystem has very few user-defined datatypes using > NumPy, the > two most prominent being: ``rational`` and ``quaternion``. > These represent fairly simple datatypes which are not strongly > impacted > by the current limitations. > However, we have identified a need for datatypes such as: > > * bfloat16, used in deep learning > * categorical types > * physical units (such as meters) > * datatypes for tracing/automatic differentiation > * high, fixed precision math > * specialized integer types such as int2, int24 > * new, better datetime representations > * extending e.g. 
integer dtypes to have a sentinel NA value > * geometrical objects [pygeos]_ > > Some of these are partially solved; for example unit capability is > provided > in ``astropy.units``, ``unyt``, or ``pint``, as `numpy.ndarray` > subclasses. > Most of these datatypes, however, simply cannot be reasonably defined > right now. > An advantage of having such datatypes in NumPy is that they should > integrate > seamlessly with other array or array-like packages such as Pandas, > ``xarray`` [xarray_dtype_issue]_, or ``Dask``. > > The long term user impact of implementing this NEP will be to allow > both > the growth of the whole ecosystem by having such new datatypes, as > well as > consolidating implementation of such datatypes within NumPy to > achieve > better interoperability. > > > Examples > ^^^^^^^^ > > The following examples represent future user-defined datatypes we > wish to enable. > These datatypes are not part the NEP and choices (e.g. choice of > casting rules) > are possibilities we wish to enable and do not represent > recommendations. > > Simple Numerical Types > """""""""""""""""""""" > > Mainly used where memory is a consideration, lower-precision numeric > types > such as :ref:```bfloat16`` < > https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>` > are common in other computational frameworks. > For these types the definitions of things such as ``np.common_type`` > and > ``np.can_cast`` are some of the most important interfaces. Once they > support ``np.common_type``, it is (for the most part) possible to > find > the correct ufunc loop to call, since most ufuncs -- such as add -- > effectively > only require ``np.result_type``:: > > >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2) > > and `~numpy.result_type` is largely identical to > `~numpy.common_type`. > > > Fixed, high precision math > """""""""""""""""""""""""" > > Allowing arbitrary precision or higher precision math is important in > simulations. For instance ``mpmath`` defines a precision:: > > >>> import mpmath as mp > >>> print(mp.dps) # the current (default) precision > 15 > > NumPy should be able to construct a native, memory-efficient array > from > a list of ``mpmath.mpf`` floating point objects:: > > >>> arr_15_dps = np.array(mp.arange(3)) # (mp.arange returns a > list) > >>> print(arr_15_dps) # Must find the correct precision from the > objects: > array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15]) > > We should also be able to specify the desired precision when > creating the datatype for an array. Here, we use ``np.dtype[mp.mpf]`` > to find the DType class (the notation is not part of this NEP), > which is then instantiated with the desired parameter. > This could also be written as ``MpfDType`` class:: > > >>> arr_100_dps = np.array([1, 2, 3], > dtype=np.dtype[mp.mpf](dps=100)) > >>> print(arr_15_dps + arr_100_dps) > array(['0.0', '2.0', '4.0'], dtype=mpf[dps=100]) > > The ``mpf`` datatype can decide that the result of the operation > should be the > higher precision one of the two, so uses a precision of 100. 
> Furthermore, we should be able to define casting, for example as in:: > > >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype, > casting="safe") > True > >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype, > casting="safe") > False # loses precision > >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype, > casting="same_kind") > True > > Casting from float is a probably always at least a ``same_kind`` > cast, but > in general, it is not safe:: > > >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4), > casting="safe") > False > > since a float64 has a higer precision than the ``mpf`` datatype with > ``dps=4``. > > Alternatively, we can say that:: > > >>> np.common_type(np.dtype[mp.mpf](dps=5), > np.dtype[mp.mpf](dps=10)) > np.dtype[mp.mpf](dps=10) > > And possibly even:: > > >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64) > np.dtype[mp.mpf](dps=16) # equivalent precision to float64 (I > believe) > > since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)`` > safely. > > > Categoricals > """""""""""" > > Categoricals are interesting in that they can have fixed, predefined > values, > or can be dynamic with the ability to modify categories when > necessary. > The fixed categories (defined ahead of time) is the most straight > forward > categorical definition. > Categoricals are *hard*, since there are many strategies to implement > them, > suggesting NumPy should only provide the scaffolding for user-defined > categorical types. For instance:: > > >>> cat = Categorical(["eggs", "spam", "toast"]) > >>> breakfast = array(["eggs", "spam", "eggs", "toast"], > dtype=cat) > > could store the array very efficiently, since it knows that there are > only 3 > categories. > Since a categorical in this sense knows almost nothing about the data > stored > in it, few operations makes, sense, although equality does: > > >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"], > dtype=cat) > >>> breakfast == breakfast2 > array[True, False, True, False]) > > The categorical datatype could work like a dictionary: no two > items names can be equal (checked on dtype creation), so that the > equality > operation above can be performed very efficiently. > If the values define an order, the category labels (internally > integers) could > be ordered the same way to allow efficient sorting and comparison. > > Whether or not casting is defined from one categorical with less to > one with > strictly more values defined, is something that the Categorical > datatype would > need to decide. Both options should be available. > > > Unit on the Datatype > """""""""""""""""""" > > There are different ways to define Units, depending on how the > internal > machinery would be organized, one way is to have a single Unit > datatype > for every existing numerical type. > This will be written as ``Unit[float64]``, the unit itself is part of > the > DType instance ``Unit[float64]("m")`` is a ``float64`` with meters > attached:: > > >>> from astropy import units > >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m # > meters > >>> print(meters) > array([1.0, 2.0, 3.0], dtype=Unit[float64]("m")) > > Note that units are a bit tricky. It is debatable, whether:: > > >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m")) > > should be valid syntax (coercing the float scalars without a unit to > meters). 
> Once the array is created, math will work without any issue:: > > >>> meters / (2 * unit.seconds) > array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s")) > > Casting is not valid from one unit to the other, but can be valid > between > different scales of the same dimensionality (although this may be > "unsafe"):: > > >>> meters.astype(Unit[float64]("s")) > TypeError: Cannot cast meters to seconds. > >>> meters.astype(Unit[float64]("km")) > >>> # Convert to centimeter-gram-second (cgs) units: > >>> meters.astype(meters.dtype.to_cgs()) > > The above notation is somewhat clumsy. Functions > could be used instead to convert between units. > There may be ways to make these more convenient, but those must be > left > for future discussions:: > > >>> units.convert(meters, "km") > >>> units.to_cgs(meters) > > There are some open questions. For example, whether additional > methods > on the array object could exist to simplify some of the notions, and > how these > would percolate from the datatype to the ``ndarray``. > > The interaction with other scalars would likely be defined through:: > > >>> np.common_type(np.float64, Unit) > Unit[np.float64](dimensionless) > > Ufunc output datatype determination can be more involved than for > simple > numerical dtypes since there is no "universal" output type:: > > >>> np.multiply(meters, seconds).dtype != np.result_type(meters, > seconds) > > In fact ``np.result_type(meters, seconds)`` must error without > context > of the operation being done. > This example highlights how the specific ufunc loop > (loop with known, specific DTypes as inputs), has to be able to to > make > certain decisions before the actual calculation can start. > > > > Implementation > -------------- > > Plan to Approach the Full Refactor > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > To address these issues in NumPy and enable new datatypes, > multiple development stages are required: > > * Phase I: Restructure and extend the datatype infrastructure (This > NEP) > > * Organize Datatypes like normal Python classes [`PR 15508`]_ > > * Phase II: Incrementally define or rework API > > * Create a new and easily extensible API for defining new datatypes > and related functionality. (NEP 42) > > * Incrementally define all necessary functionality through the new > API (NEP 42): > > * Defining operations such as ``np.common_type``. > * Allowing to define casting between datatypes. > * Add functionality necessary to create a numpy array from Python > scalars > (i.e. ``np.array(...)``). > * ? > > * Restructure how universal functions work (NEP 43), in order to: > > * make it possible to allow a `~numpy.ufunc` such as ``np.add`` > to be > extended by user-defined datatypes such as Units. > > * allow efficient lookup for the correct implementation for user- > defined > datatypes. > > * enable reuse of existing code. Units should be able to use the > normal math loops and add additional logic to determine output > type. > > * Phase III: Growth of NumPy and Scientific Python Ecosystem > capabilities: > > * Cleanup of legacy behaviour where it is considered buggy or > undesirable. > * Provide a path to define new datatypes from Python. > * Assist the community in creating types such as Units or > Categoricals > * Allow strings to be used in functions such as ``np.equal`` or > ``np.add``. > * Remove legacy code paths within NumPy to improve long term > maintainability > > This document serves as a basis for phase I and provides the vision > and > motivation for the full project. 
> Phase I does not introduce any new user-facing features, > but is concerned with the necessary conceptual cleanup of the current > datatype system. > It provides a more "pythonic" datatype Python type object, with a > clear class hierarchy. > > The second phase is the incremental creation of all APIs necessary to > define > fully featured datatypes and reorganization of the NumPy datatype > system. > This phase will thus be primarily concerned with defining an, > initially preliminary, stable public API. > > Some of the benefits of a large refactor may only become evident > after the full > deprecation of the current legacy implementation (i.e. larger code > removals). > However, these steps are necessary for improvements to many parts of > the > core NumPy API, and are expected to make the implementation generally > easier to understand. > > The following figure illustrates the proposed design at a high level, > and roughly delineates the components of the overall design. > Note that this NEP only regards Phase I (shaded area), > the rest encompasses Phase II and the design choices are up for > discussion, > however, it highlights that the DType datatype class is the central, > necessary > concept: > > .. image:: _static/nep-0041-mindmap.svg > > > First steps directly related to this NEP > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > The required changes necessary to NumPy are large and touch many > areas > of the code base > but many of these changes can be addressed incrementally. > > To enable an incremental approach we will start by creating a C > defined > ``PyArray_DTypeMeta`` class with its instances being the ``DType`` > classes, > subclasses of ``np.dtype``. > This is necessary to add the ability of storing custom slots on the > DType in C. > This ``DTypeMeta`` will be implemented first to then enable > incremental > restructuring of current code. > > The addition of ``DType`` will then enable addressing other changes > incrementally, some of which may begin before the settling the full > internal > API: > > 1. New machinery for array coercion, with the goal of enabling user > DTypes > with appropriate class methods. > 2. The replacement or wrapping of the current casting machinery. > 3. Incremental redefinition of the current ``PyArray_ArrFuncs`` slots > into > DType method slots. > > At this point, no or only very limited new public API will be added > and > the internal API is considered to be in flux. > Any new public API may be set up give warnings and will have leading > underscores > to indicate that it is not finalized and can be changed without > warning. > > > Backward compatibility > ---------------------- > > While the actual backward compatibility impact of implementing Phase > I and II > are not yet fully clear, we anticipate, and accept the following > changes: > > * **Python API**: > > * ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``, > while right > now ``type(np.dtype("f8")) is np.dtype``. > Code should use ``isinstance`` checks, and in very rare cases may > have to > be adapted to use it. > > * **C-API**: > > * In old versions of NumPy ``PyArray_DescrCheck`` is a macro > which uses > ``type(dtype) is np.dtype``. When compiling against an old > NumPy version, > the macro may have to be replaced with the corresponding > ``PyObject_IsInstance`` call. (If this is a problem, we could > backport > fixing the macro) > > * The UFunc machinery changes will break *limited* parts of the > current > implementation. Replacing e.g. 
the default ``TypeResolver`` is > expected > to remain supported for a time, although optimized masked inner > loop iteration > (which is not even used *within* NumPy) will no longer be > supported. > > * All functions currently defined on the dtypes, such as > ``PyArray_Descr->f->nonzero``, will be defined and accessed > differently. > This means that in the long run lowlevel access code will > have to be changed to use the new API. Such changes are expected > to be > necessary in very few project. > > * **dtype implementors (C-API)**: > > * The array which is currently provided to some functions (such as > cast functions), > will no longer be provided. > For example ``PyArray_Descr->f->nonzero`` or ``PyArray_Descr->f- > >copyswapn``, > may instead receive a dummy array object with only some fields > (mainly the > dtype), being valid. > At least in some code paths, a similar mechanism is already used. > > * The ``scalarkind`` slot and registration of scalar casting will > be > removed/ignored without replacement. > It currently allows partial value-based casting. > The ``PyArray_ScalarKind`` function will continue to work for > builtin types, > but will not be used internally and be deprecated. > > * Currently user dtypes are defined as instances of ``np.dtype``. > The creation works by the user providing a prototype instance. > NumPy will need to modify at least the type during registration. > This has no effect for either ``rational`` or ``quaternion`` and > mutation > of the structure seems unlikely after registration. > > Since there is a fairly large API surface concerning datatypes, > further changes > or the limitation certain function to currently existing datatypes is > likely to occur. > For example functions which use the type number as input > should be replaced with functions taking DType classes instead. > Although public, large parts of this C-API seem to be used rarely, > possibly never, by downstream projects. > > > > Detailed Description > -------------------- > > This section details the design decisions covered by this NEP. > The subsections correspond to the list of design choices presented > in the Scope section. > > Datatypes as Python Classes (1) > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > The current NumPy datatypes are not full scale python classes. > They are instead (prototype) instances of a single ``np.dtype`` > class. > Changing this means that any special handling, e.g. for ``datetime`` > can be moved to the Datetime DType class instead, away from > monolithic general > code (e.g. current ``PyArray_AdjustFlexibleDType``). > > The main consequence of this change with respect to the API is that > special methods move from the dtype instances to methods on the new > DType class. > This is the typical design pattern used in Python. > Organizing these methods and information in a more Pythonic way > provides a > solid foundation for refining and extending the API in the future. > The current API cannot be extended due to how it is exposed > publically. > This means for example that the methods currently stored in > ``PyArray_ArrFuncs`` > on each datatype (see NEP 40) will be defined differently in the > future and > deprecated in the long run. > > The most prominent visible side effect of this will be that > ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore. > Instead it will be a subclass of ``np.dtype`` meaning that > ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true. 
> This will also add the ability to use ``isinstance(dtype, > np.dtype[float64])`` > thus removing the need to use ``dtype.kind``, ``dtype.char``, or > ``dtype.type`` > to do this check. > > With the design decision of DTypes as full-scale Python classes, > the question of subclassing arises. > Inheritance, however, appears problematic and a complexity best > avoided > (at least initially) for container datatypes. > Further, subclasses may be more interesting for interoperability for > example with GPU backends (CuPy) storing additional methods related > to the > GPU rather than as a mechanism to define new datatypes. > A class hierarchy does provides value, this may be achieved by > allowing the creation of *abstract* datatypes. > An example for an abstract datatype would be the datatype equivalent > of > ``np.floating``, representing any floating point number. > These can serve the same purpose as Python's abstract base classes. > > > Scalars should not be instances of the datatypes (2) > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > For simple datatypes such as ``float64`` (see also below), it seems > tempting that the instance of a ``np.dtype("float64")`` can be the > scalar. > This idea may be even more appealing due to the fact that scalars, > rather than datatypes, currently define a useful type hierarchy. > > However, we have specifically decided against this for a number of > reasons. > First, the new datatypes described herein would be instances of DType > classes. > Making these instances themselves classes, while possible, adds > additional > complexity that users need to understand. > It would also mean that scalars must have storage information (such > as byteorder) > which is generally unnecessary and currently is not used. > Second, while the simple NumPy scalars such as ``float64`` may be > such instances, > it should be possible to create datatypes for Python objects without > enforcing > NumPy as a dependency. > However, Python objects that do not depend on NumPy cannot be > instances of a NumPy DType. > Third, there is a mismatch between the methods and attributes which > are useful > for scalars and datatypes. For instance ``to_float()`` makes sense > for a scalar > but not for a datatype and ``newbyteorder`` is not useful on a scalar > (or has > a different meaning). > > Overall, it seem rather than reducing the complexity, i.e. by merging > the two distinct type hierarchies, making scalars instances of DTypes > would > increase the complexity of both the design and implementation. > > A possible future path may be to instead simplify the current NumPy > scalars to > be much simpler objects which largely derive their behaviour from the > datatypes. > > C-API for creating new Datatypes (3) > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > The current C-API with which users can create new datatypes > is limited in scope, and requires use of "private" structures. This > means > the API is not extensible: no new members can be added to the > structure > without losing binary compatibility. > This has already limited the inclusion of new sorting methods into > NumPy [new_sort]_. > > The new version shall thus replace the current ``PyArray_ArrFuncs`` > structure used > to define new datatypes. > Datatypes that currently exist and are defined using these slots will > be > supported during a deprecation period. 
> > The most likely solution is to hide the implementation from the user > and thus make > it extensible in the future is to model the API after Python's stable > API [PEP-384]_: > > .. code-block:: C > > static struct PyArrayMethodDef slots[] = { > {NPY_dt_method, method_implementation}, > ..., > {0, NULL} > } > > typedef struct{ > PyTypeObject *typeobj; /* type of python scalar */ > ...; > PyType_Slot *slots; > } PyArrayDTypeMeta_Spec; > > PyObject* PyArray_InitDTypeMetaFromSpec( > PyArray_DTypeMeta *user_dtype, PyArrayDTypeMeta_Spec > *dtype_spec); > > The C-side slots should be designed to mirror Python side methods > such as ``dtype.__dtype_method__``, although the exposure to Python > is > a later step in the implementation to reduce the complexity of the > initial > implementation. > > > C-API Changes to the UFunc Machinery (4) > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > Proposed changes to the UFunc machinery will be part of NEP 43. > However, the following changes will be necessary (see NEP 40 for a > detailed > description of the current implementation and its issues): > > * The current UFunc type resolution must be adapted to allow better > control > for user-defined dtypes as well as resolve current inconsistencies. > * The inner-loop used in UFuncs must be expanded to include a return > value. > Further, error reporting must be improved, and passing in dtype- > specific > information enabled. > This requires the modification of the inner-loop function signature > and > addition of new hooks called before and after the inner-loop is > used. > > An important goal for any changes to the universal functions will be > to > allow the reuse of existing loops. > It should be easy for a new units datatype to fall back to existing > math > functions after handling the unit related computations. > > > Discussion > ---------- > > See NEP 40 for a list of previous meetings and discussions. > > > References > ---------- > > .. [pandas_extension_arrays] > https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types > > .. _xarray_dtype_issue: https://github.com/pydata/xarray/issues/1262 > > .. [pygeos] https://github.com/caspervdw/pygeos > > .. [new_sort] https://github.com/numpy/numpy/pull/12945 > > .. [PEP-384] https://www.python.org/dev/peps/pep-0384/ > > .. [PR 15508] https://github.com/numpy/numpy/pull/15508 > > > Copyright > --------- > > This document has been placed in the public domain. > > > Acknowledgments > --------------- > > The effort to create new datatypes for NumPy has been discussed for > several > years in many different contexts and settings, making it impossible > to list everyone involved. > We would like to thank especially Stephan Hoyer, Nathaniel Smith, and > Eric Wieser > for repeated in-depth discussion about datatype design. > We are very grateful for the community input in reviewing and > revising this > NEP and would like to thank especially Ross Barnowski and Ralf > Gommers. > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion at python.org https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image005.png Type: image/png Size: 1725 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image006.png Type: image/png Size: 1182 bytes Desc: not available URL: From warren.weckesser at gmail.com Fri Mar 20 14:45:15 2020 From: warren.weckesser at gmail.com (Warren Weckesser) Date: Fri, 20 Mar 2020 14:45:15 -0400 Subject: [Numpy-discussion] Error in Covariance and Variance calculation In-Reply-To: <007001d5fee2$22062630$66127290$@hawaii.edu> References: <007001d5fee2$22062630$66127290$@hawaii.edu> Message-ID: On 3/20/20, Gunter Meissner wrote: > Dear Programmers, > > > > This is Gunter Meissner. I am currently writing a book on Forecasting and > derived the regression coefficient with Numpy: > > > > import numpy as np > X=[1,2,3,4] > Y=[10000,8000,5000,1000] > print(np.cov(X,Y)) > print(np.var(X)) > Beta1 = np.cov(X,Y)/np.var(X) > print(Beta1) > > > > However, Numpy is using the SAMPLE covariance , (which divides by n-1) and > the POPULATION variance > > VarX = (which divides by n). Therefore the regression coefficient BETA1 is > not correct. > > The solution is easy: Please use the population approach (dividing by n) for > BOTH covariance and variance or use the sample approach (dividing by n-1) > > for BOTH covariance and variance. You may also allow the user to use both as > in EXCEL, where the user can choose between Var.S and Var.P > > and Cov.P and Var.P. > > > > Thanks!!! > > Gunter > Gunter, This is an unfortunate discrepancy in the API: `var` uses the default `ddof=0`, while `cov` uses, in effect, `ddof=1` by default. You can get the consistent behavior you want by using `ddof=1` in both functions. E.g. Beta1 = np.cov(X,Y, ddof=1) / np.var(X, ddof=1) Using `ddof=1` in `np.cov` is redundant, but in this context, it is probably useful to make explicit to the reader of the code that both functions are using the same convention. Changing the default in either function breaks backwards compatibility. That would require a long and potentially painful deprecation process. Warren > > > > > Gunter Meissner, PhD > > University of Hawaii > > Adjunct Professor of MathFinance at Columbia University and NYU > > President of Derivatives Software www.dersoft.com > > > CEO Cassandra Capital Management > www.cassandracm.com > > CV: www.dersoft.com/cv.pdf > > Email: meissner at hawaii.edu > > Tel: USA (808) 779 3660 > > > > > > > > > > From: NumPy-Discussion > On Behalf Of Ralf > Gommers > Sent: Wednesday, March 18, 2020 5:16 AM > To: Discussion of Numerical Python > Subject: Re: [Numpy-discussion] Proposal: NEP 41 -- First step towards a new > Datatype System > > > > > > > > On Tue, Mar 17, 2020 at 9:03 PM Sebastian Berg > wrote: > > Hi all, > > in the spirit of trying to keep this moving, can I assume that the main > reason for little discussion is that the actual changes proposed are > not very far reaching as of now? Or is the reason that this is a > fairly complex topic that you need more time to think about it? > > > > Probably (a) it's a long NEP on a complex topic, (b) the past week has been > a very weird week for everyone (in the extra-news-reading-time I could > easily have re-reviewed the NEP), and (c) the amount of feedback one expects > to get on a NEP is roughly inversely proportional to the scope and > complexity of the NEP contents. > > > > Today I re-read the parts I commented on before. This version is a big > improvement over the previous ones. 
Thanks in particular for adding clear > examples and the diagram, it helps a lot. > > > > If it is the latter, is there some way I can help with it? I tried to > minimize how much is part of this initial NEP. > > If there is not much need for discussion, I would like to officially > accept the NEP very soon, sending out an official one week notice in > the next days. > > > > I agree. I think I would like to keep the option open though to come back to > the NEP later to improve the clarity of the text about > motivation/plan/examples/scope, given that this will be the reference for a > major amount of work for a long time to come. > > > > To summarize one more time, the main point is that: > > > > This point seems fine, and I'm +1 for going ahead with the described parts > of the technical design. > > > > Cheers, > > Ralf > > > > > type(np.dtype(np.float64)) > > will be `np.dtype[float64]`, a subclass of dtype, so that: > > issubclass(np.dtype[float64], np.dtype) > > is true. This means that we will have one class for every current type > number: `dtype.num`. The implementation of these subclasses will be a > C-written (extension) MetaClass, all details of this class are supposed > to remain experimental in flux at this time. > > Cheers > > Sebastian > > > On Wed, 2020-03-11 at 17:02 -0700, Sebastian Berg wrote: >> Hi all, >> >> I am pleased to propose NEP 41: First step towards a new Datatype >> System https://numpy.org/neps/nep-0041-improved-dtype-support.html >> >> This NEP motivates the larger restructure of the datatype machinery >> in >> NumPy and defines a few fundamental design aspects. The long term >> user >> impact will be allowing easier and more rich featured user defined >> datatypes. >> >> As this is a large restructure, the NEP represents only the first >> steps >> with some additional information in further NEPs being drafted [1] >> (this may be helpful to look at depending on the level of detail you >> are interested in). >> The NEP itself does not propose to add significant new public API. >> Instead it proposes to move forward with an incremental internal >> refactor and lays the foundation for this process. >> >> The main user facing change at this time is that datatypes will >> become >> classes (e.g. ``type(np.dtype("float64"))`` will be a float64 >> specific >> class. >> For most users, the main impact should be many new datatypes in the >> long run (see the user impact section). However, for those interested >> in API design within NumPy or with respect to implementing new >> datatypes, this and the following NEPs are important decisions in the >> future roadmap for NumPy. >> >> The current full text is reproduced below, although the above link is >> probably a better way to read it. >> >> Cheers >> >> Sebastian >> >> >> [1] NEP 40 gives some background information about the current >> systems >> and issues with it: >> https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac/doc/neps/nep-0040-legacy-datatype-impl.rst >> and NEP 42 being a first draft of how the new API may look like: >> >> https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst >> (links to current rendered versions, check >> https://github.com/numpy/numpy/pull/15505 and >> https://github.com/numpy/numpy/pull/15507 for updates) >> >> >> ------------------------------------------------------------------- >> --- >> >> >> ================================================= >> NEP 41 ? 
First step towards a new Datatype System >> ================================================= >> >> :title: Improved Datatype Support >> :Author: Sebastian Berg >> :Author: St?fan van der Walt >> :Author: Matti Picus >> :Status: Draft >> :Type: Standard Track >> :Created: 2020-02-03 >> >> >> .. note:: >> >> This NEP is part of a series of NEPs encompassing first >> information >> about the previous dtype implementation and issues with it in NEP >> 40. >> NEP 41 (this document) then provides an overview and generic >> design >> choices for the refactor. >> Further NEPs 42 and 43 go into the technical details of the >> datatype >> and universal function related internal and external API changes. >> In some cases it may be necessary to consult the other NEPs for a >> full >> picture of the desired changes and why these changes are >> necessary. >> >> >> Abstract >> -------- >> >> `Datatypes ` in NumPy describe how to >> interpret each >> element in arrays. NumPy provides ``int``, ``float``, and ``complex`` >> numerical >> types, as well as string, datetime, and structured datatype >> capabilities. >> The growing Python community, however, has need for more diverse >> datatypes. >> Examples are datatypes with unit information attached (such as >> meters) or >> categorical datatypes (fixed set of possible values). >> However, the current NumPy datatype API is too limited to allow the >> creation >> of these. >> >> This NEP is the first step to enable such growth; it will lead to >> a simpler development path for new datatypes. >> In the long run the new datatype system will also support the >> creation >> of datatypes directly from Python rather than C. >> Refactoring the datatype API will improve maintainability and >> facilitate >> development of both user-defined external datatypes, >> as well as new features for existing datatypes internal to NumPy. >> >> >> Motivation and Scope >> -------------------- >> >> .. seealso:: >> >> The user impact section includes examples of what kind of new >> datatypes >> will be enabled by the proposed changes in the long run. >> It may thus help to read these section out of order. >> >> Motivation >> ^^^^^^^^^^ >> >> One of the main issues with the current API is the definition of >> typical >> functions such as addition and multiplication for parametric >> datatypes >> (see also NEP 40) which require additional steps to determine the >> output type. >> For example when adding two strings of length 4, the result is a >> string >> of length 8, which is different from the input. >> Similarly, a datatype which embeds a physical unit must calculate the >> new unit >> information: dividing a distance by a time results in a speed. >> A related difficulty is that the :ref:`current casting rules >> <_ufuncs.casting>` >> -- the conversion between different datatypes -- >> cannot describe casting for such parametric datatypes implemented >> outside of NumPy. >> >> This additional functionality for supporting parametric datatypes >> introduces >> increased complexity within NumPy itself, >> and furthermore is not available to external user-defined datatypes. >> In general the concerns of different datatypes are not well well- >> encapsulated. >> This burden is exacerbated by the exposure of internal C structures, >> limiting the addition of new fields >> (for example to support new sorting methods [new_sort]_). 
>> >> Currently there are many factors which limit the creation of new >> user-defined >> datatypes: >> >> * Creating casting rules for parametric user-defined dtypes is either >> impossible >> or so complex that it has never been attempted. >> * Type promotion, e.g. the operation deciding that adding float and >> integer >> values should return a float value, is very valuable for numeric >> datatypes >> but is limited in scope for user-defined and especially parametric >> datatypes. >> * Much of the logic (e.g. promotion) is written in single functions >> instead of being split as methods on the datatype itself. >> * In the current design datatypes cannot have methods that do not >> generalize >> to other datatypes. For example a unit datatype cannot have a >> ``.to_si()`` method to >> easily find the datatype which would represent the same values in >> SI units. >> >> The large need to solve these issues has driven the scientific >> community >> to create work-arounds in multiple projects implementing physical >> units as an >> array-like class instead of a datatype, which would generalize better >> across >> multiple array-likes (Dask, pandas, etc.). >> Already, Pandas has made a push into the same direction with its >> extension arrays [pandas_extension_arrays]_ and undoubtedly >> the community would be best served if such new features could be >> common >> between NumPy, Pandas, and other projects. >> >> Scope >> ^^^^^ >> >> The proposed refactoring of the datatype system is a large >> undertaking and >> thus is proposed to be split into various phases, roughly: >> >> * Phase I: Restructure and extend the datatype infrastructure (This >> NEP 41) >> * Phase II: Incrementally define or rework API (Detailed largely in >> NEPs 42/43) >> * Phase III: Growth of NumPy and Scientific Python Ecosystem >> capabilities. >> >> For a more detailed accounting of the various phases, see >> "Plan to Approach the Full Refactor" in the Implementation section >> below. >> This NEP proposes to move ahead with the necessary creation of new >> dtype >> subclasses (Phase I), >> and start working on implementing current functionality. >> Within the context of this NEP all development will be fully private >> API or >> use preliminary underscored names which must be changed in the >> future. >> Most of the internal and public API choices are part of a second >> Phase >> and will be discussed in more detail in the following NEPs 42 and 43. >> The initial implementation of this NEP will have little or no effect >> on users, >> but provides the necessary ground work for incrementally addressing >> the >> full rework. >> >> The implementation of this NEP and the following, implied large >> rework of how >> datatypes are defined in NumPy is expected to create small >> incompatibilities >> (see backward compatibility section). >> However, a transition requiring large code adaption is not >> anticipated and not >> within scope. >> >> Specifically, this NEP makes the following design choices which are >> discussed >> in more details in the detailed description section: >> >> 1. Each datatype will be an instance of a subclass of ``np.dtype``, >> with most of the >> datatype-specific logic being implemented >> as special methods on the class. In the C-API, these correspond to >> specific >> slots. In short, for ``f = np.dtype("f8")``, ``isinstance(f, >> np.dtype)`` will remain true, >> but ``type(f)`` will be a subclass of ``np.dtype`` rather than >> just ``np.dtype`` itself. 
>>    The ``PyArray_ArrFuncs``, which are currently stored as a pointer
>>    on the instance (as ``PyArray_Descr->f``), should instead be stored
>>    on the class, as is typically done in Python.
>>    In the future these may correspond to Python-side dunder methods.
>>    Storage information such as itemsize and byteorder can differ
>>    between different dtype instances (e.g. "S3" vs. "S8") and will
>>    remain part of the instance.
>>    This means that in the long run the current low-level access to
>>    dtype methods will be removed (see ``PyArray_ArrFuncs`` in NEP 40).
>>
>> 2. The current NumPy scalars will *not* change; they will not be
>>    instances of datatypes. This will also be true for new datatypes:
>>    scalars will not be instances of a dtype (although
>>    ``isinstance(scalar, dtype)`` may be made to return ``True`` when
>>    appropriate).
>>
>> Detailed technical decisions are to follow in NEP 42.
>>
>> Further, the public API will be designed in a way that is extensible in
>> the future:
>>
>> 3. All new C-API functions provided to the user will hide
>>    implementation details as much as possible. The public API should
>>    be an identical, but limited, version of the C-API used for the
>>    internal NumPy datatypes.
>>
>> The changes to the datatype system in Phase II must include a large
>> refactor of the UFunc machinery, which will be further defined in
>> NEP 43:
>>
>> 4. To enable all of the desired functionality for new user-defined
>>    datatypes, the UFunc machinery will be changed to replace the
>>    current dispatching and type resolution system.
>>    The old system should be *mostly* supported as a legacy version for
>>    some time.
>>
>> Additionally, as a general design principle, the addition of new
>> user-defined datatypes will *not* change the behaviour of programs.
>> For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or
>> ``b`` know that ``c`` exists.
>>
>>
>> User Impact
>> -----------
>>
>> The current ecosystem has very few user-defined datatypes using NumPy,
>> the two most prominent being ``rational`` and ``quaternion``.
>> These represent fairly simple datatypes which are not strongly impacted
>> by the current limitations.
>> However, we have identified a need for datatypes such as:
>>
>> * bfloat16, used in deep learning
>> * categorical types
>> * physical units (such as meters)
>> * datatypes for tracing/automatic differentiation
>> * high, fixed precision math
>> * specialized integer types such as int2, int24
>> * new, better datetime representations
>> * extending e.g. integer dtypes to have a sentinel NA value
>> * geometrical objects [pygeos]_
>>
>> Some of these are partially solved; for example unit capability is
>> provided in ``astropy.units``, ``unyt``, or ``pint``, as
>> `numpy.ndarray` subclasses.
>> Most of these datatypes, however, simply cannot be reasonably defined
>> right now.
>> An advantage of having such datatypes in NumPy is that they should
>> integrate seamlessly with other array or array-like packages such as
>> Pandas, ``xarray`` [xarray_dtype_issue]_, or ``Dask``.
>>
>> The long term user impact of implementing this NEP will be to allow
>> both the growth of the whole ecosystem by having such new datatypes, as
>> well as consolidating implementation of such datatypes within NumPy to
>> achieve better interoperability.
>>
>>
>> Examples
>> ^^^^^^^^
>>
>> The following examples represent future user-defined datatypes we wish
>> to enable. These datatypes are not part of the NEP, and the choices
>> shown (e.g.
the choice of casting rules) are possibilities we wish to enable and do
>> not represent recommendations.
>>
>> Simple Numerical Types
>> """"""""""""""""""""""
>>
>> Mainly used where memory is a consideration, lower-precision numeric
>> types such as `bfloat16
>> <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_
>> are common in other computational frameworks.
>> For these types the definitions of things such as ``np.common_type``
>> and ``np.can_cast`` are some of the most important interfaces. Once
>> they support ``np.common_type``, it is (for the most part) possible to
>> find the correct ufunc loop to call, since most ufuncs -- such as add
>> -- effectively only require ``np.result_type``::
>>
>>     >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)
>>
>> and `~numpy.result_type` is largely identical to `~numpy.common_type`.
>>
>>
>> Fixed, high precision math
>> """"""""""""""""""""""""""
>>
>> Allowing arbitrary precision or higher precision math is important in
>> simulations. For instance ``mpmath`` defines a precision::
>>
>>     >>> import mpmath as mp
>>     >>> print(mp.dps)  # the current (default) precision
>>     15
>>
>> NumPy should be able to construct a native, memory-efficient array from
>> a list of ``mpmath.mpf`` floating point objects::
>>
>>     >>> arr_15_dps = np.array(mp.arange(3))  # (mp.arange returns a list)
>>     >>> print(arr_15_dps)  # Must find the correct precision from the objects:
>>     array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15])
>>
>> We should also be able to specify the desired precision when creating
>> the datatype for an array. Here, we use ``np.dtype[mp.mpf]`` to find
>> the DType class (the notation is not part of this NEP), which is then
>> instantiated with the desired parameter.
>> This could also be written as an ``MpfDType`` class::
>>
>>     >>> arr_100_dps = np.array([1, 2, 3], dtype=np.dtype[mp.mpf](dps=100))
>>     >>> print(arr_15_dps + arr_100_dps)
>>     array(['0.0', '2.0', '4.0'], dtype=mpf[dps=100])
>>
>> The ``mpf`` datatype can decide that the result of the operation should
>> be the higher precision one of the two, so it uses a precision of 100.
>> Furthermore, we should be able to define casting, for example as in::
>>
>>     >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype, casting="safe")
>>     True
>>     >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype, casting="safe")
>>     False  # loses precision
>>     >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype, casting="same_kind")
>>     True
>>
>> Casting from float is probably always at least a ``same_kind`` cast,
>> but in general, it is not safe::
>>
>>     >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4), casting="safe")
>>     False
>>
>> since a float64 has a higher precision than the ``mpf`` datatype with
>> ``dps=4``.
>>
>> Alternatively, we can say that::
>>
>>     >>> np.common_type(np.dtype[mp.mpf](dps=5), np.dtype[mp.mpf](dps=10))
>>     np.dtype[mp.mpf](dps=10)
>>
>> And possibly even::
>>
>>     >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64)
>>     np.dtype[mp.mpf](dps=16)  # equivalent precision to float64 (I believe)
>>
>> since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)``
>> safely.
>>
>>
>> Categoricals
>> """"""""""""
>>
>> Categoricals are interesting in that they can have fixed, predefined
>> values, or can be dynamic with the ability to modify categories when
>> necessary.
>> Fixed categories (defined ahead of time) are the most straightforward
>> categorical definition.
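>>
>> For comparison, pandas already provides a fixed-category extension
>> dtype today (see [pandas_extension_arrays]_); a small runnable example
>> of the kind of behaviour we would like to make expressible within
>> NumPy itself::
>>
>>     >>> import pandas as pd
>>     >>> s = pd.Series(["eggs", "spam", "eggs"], dtype="category")
>>     >>> s.dtype
>>     CategoricalDtype(categories=['eggs', 'spam'], ordered=False)
>>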
>> Categoricals are *hard*, since there are many strategies to implement
>> them, suggesting NumPy should only provide the scaffolding for
>> user-defined categorical types. For instance::
>>
>>     >>> cat = Categorical(["eggs", "spam", "toast"])
>>     >>> breakfast = array(["eggs", "spam", "eggs", "toast"], dtype=cat)
>>
>> could store the array very efficiently, since it knows that there are
>> only 3 categories.
>> Since a categorical in this sense knows almost nothing about the data
>> stored in it, few operations make sense, although equality does::
>>
>>     >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"], dtype=cat)
>>     >>> breakfast == breakfast2
>>     array([True, False, True, False])
>>
>> The categorical datatype could work like a dictionary: no two item
>> names can be equal (checked on dtype creation), so that the equality
>> operation above can be performed very efficiently.
>> If the values define an order, the category labels (internally
>> integers) could be ordered the same way to allow efficient sorting and
>> comparison.
>>
>> Whether casting is defined from a categorical with fewer values to one
>> with strictly more values defined is something that the Categorical
>> datatype would need to decide. Both options should be available.
>>
>>
>> Unit on the Datatype
>> """"""""""""""""""""
>>
>> There are different ways to define units, depending on how the internal
>> machinery would be organized. One way is to have a single Unit datatype
>> for every existing numerical type.
>> This will be written as ``Unit[float64]``; the unit itself is part of
>> the DType instance, so that ``Unit[float64]("m")`` is a ``float64``
>> with meters attached::
>>
>>     >>> from astropy import units
>>     >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m  # meters
>>     >>> print(meters)
>>     array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
>>
>> Note that units are a bit tricky. It is debatable whether::
>>
>>     >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
>>
>> should be valid syntax (coercing the float scalars without a unit to
>> meters).
>> Once the array is created, math will work without any issue::
>>
>>     >>> meters / (2 * units.s)
>>     array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s"))
>>
>> Casting is not valid from one unit to another, but can be valid between
>> different scales of the same dimensionality (although this may be
>> "unsafe")::
>>
>>     >>> meters.astype(Unit[float64]("s"))
>>     TypeError: Cannot cast meters to seconds.
>>     >>> meters.astype(Unit[float64]("km"))
>>     >>> # Convert to centimeter-gram-second (cgs) units:
>>     >>> meters.astype(meters.dtype.to_cgs())
>>
>> The above notation is somewhat clumsy. Functions could be used instead
>> to convert between units.
>> There may be ways to make these more convenient, but those must be left
>> for future discussions::
>>
>>     >>> units.convert(meters, "km")
>>     >>> units.to_cgs(meters)
>>
>> There are some open questions. For example, whether additional methods
>> on the array object could exist to simplify some of the notions, and
>> how these would percolate from the datatype to the ``ndarray``.
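>>
>> For reference, the array-like workaround in use today routes the same
>> behaviour through the ``Quantity`` `numpy.ndarray` subclass of
>> ``astropy.units`` (mentioned above) rather than through a dtype; this
>> snippet is runnable with astropy installed and is shown only to
>> contrast with the dtype-based design::
>>
>>     >>> import numpy as np
>>     >>> from astropy import units as u
>>     >>> meters = np.array([1.0, 2.0, 3.0]) * u.m
>>     >>> meters / (2 * u.s)  # a Quantity subclass, not a plain ndarray
>>     <Quantity [0.5, 1. , 1.5] m / s>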
>>
>> The interaction with other scalars would likely be defined through::
>>
>>     >>> np.common_type(np.float64, Unit)
>>     Unit[np.float64](dimensionless)
>>
>> Ufunc output datatype determination can be more involved than for
>> simple numerical dtypes since there is no "universal" output type::
>>
>>     >>> np.multiply(meters, seconds).dtype != np.result_type(meters, seconds)
>>
>> In fact ``np.result_type(meters, seconds)`` must error without context
>> of the operation being done.
>> This example highlights how the specific ufunc loop (the loop with
>> known, specific DTypes as inputs) has to be able to make certain
>> decisions before the actual calculation can start.
>>
>>
>>
>> Implementation
>> --------------
>>
>> Plan to Approach the Full Refactor
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> To address these issues in NumPy and enable new datatypes, multiple
>> development stages are required:
>>
>> * Phase I: Restructure and extend the datatype infrastructure (This
>>   NEP)
>>
>>   * Organize Datatypes like normal Python classes [`PR 15508`]_
>>
>> * Phase II: Incrementally define or rework API
>>
>>   * Create a new and easily extensible API for defining new datatypes
>>     and related functionality. (NEP 42)
>>
>>   * Incrementally define all necessary functionality through the new
>>     API (NEP 42):
>>
>>     * Defining operations such as ``np.common_type``.
>>     * Allowing the definition of casting between datatypes.
>>     * Adding the functionality necessary to create a NumPy array from
>>       Python scalars (i.e. ``np.array(...)``).
>>     * ...
>>
>>   * Restructure how universal functions work (NEP 43), in order to:
>>
>>     * make it possible to allow a `~numpy.ufunc` such as ``np.add`` to
>>       be extended by user-defined datatypes such as Units.
>>
>>     * allow efficient lookup of the correct implementation for
>>       user-defined datatypes.
>>
>>     * enable reuse of existing code. Units should be able to use the
>>       normal math loops and add additional logic to determine the
>>       output type.
>>
>> * Phase III: Growth of NumPy and Scientific Python Ecosystem
>>   capabilities:
>>
>>   * Cleanup of legacy behaviour where it is considered buggy or
>>     undesirable.
>>   * Provide a path to define new datatypes from Python.
>>   * Assist the community in creating types such as Units or
>>     Categoricals.
>>   * Allow strings to be used in functions such as ``np.equal`` or
>>     ``np.add``.
>>   * Remove legacy code paths within NumPy to improve long term
>>     maintainability.
>>
>> This document serves as a basis for Phase I and provides the vision and
>> motivation for the full project.
>> Phase I does not introduce any new user-facing features, but is
>> concerned with the necessary conceptual cleanup of the current datatype
>> system.
>> It provides a more "pythonic" datatype system, with a proper Python
>> type object and a clear class hierarchy.
>>
>> The second phase is the incremental creation of all APIs necessary to
>> define fully featured datatypes and the reorganization of the NumPy
>> datatype system.
>> This phase will thus be primarily concerned with defining an (initially
>> preliminary) stable public API.
>>
>> Some of the benefits of a large refactor may only become evident after
>> the full deprecation of the current legacy implementation (i.e. larger
>> code removals).
>> However, these steps are necessary for improvements to many parts of
>> the core NumPy API, and are expected to make the implementation
>> generally easier to understand.
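>>
>> In terms of today's Python API, the visible Phase I change can be
>> sketched as follows (the comments describe the behaviour before and
>> after the refactor; this snippet is an illustration, runnable with
>> current NumPy)::
>>
>>     >>> import numpy as np
>>     >>> f8 = np.dtype("f8")
>>     >>> isinstance(f8, np.dtype)  # remains True after the refactor
>>     True
>>     >>> type(f8) is np.dtype      # True today; False (a subclass) after Phase I
>>     True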
>>
>> The following figure illustrates the proposed design at a high level,
>> and roughly delineates the components of the overall design.
>> Note that this NEP only covers Phase I (shaded area); the rest
>> encompasses Phase II, whose design choices are still up for discussion.
>> However, the figure highlights that the DType class is the central,
>> necessary concept:
>>
>> .. image:: _static/nep-0041-mindmap.svg
>>
>>
>> First steps directly related to this NEP
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> The changes required to NumPy are large and touch many areas of the
>> code base, but many of these changes can be addressed incrementally.
>>
>> To enable an incremental approach we will start by creating a C-defined
>> ``PyArray_DTypeMeta`` class with its instances being the ``DType``
>> classes, subclasses of ``np.dtype``.
>> This is necessary to add the ability to store custom slots on the DType
>> in C. This ``DTypeMeta`` will be implemented first, to then enable
>> incremental restructuring of current code.
>>
>> The addition of ``DType`` will then enable addressing other changes
>> incrementally, some of which may begin before settling the full
>> internal API:
>>
>> 1. New machinery for array coercion, with the goal of enabling user
>>    DTypes with appropriate class methods.
>> 2. The replacement or wrapping of the current casting machinery.
>> 3. Incremental redefinition of the current ``PyArray_ArrFuncs`` slots
>>    into DType method slots.
>>
>> At this point, no or only very limited new public API will be added and
>> the internal API is considered to be in flux.
>> Any new public API may be set up to give warnings and will have leading
>> underscores to indicate that it is not finalized and can be changed
>> without warning.
>>
>>
>> Backward compatibility
>> ----------------------
>>
>> While the actual backward compatibility impact of implementing Phases I
>> and II is not yet fully clear, we anticipate and accept the following
>> changes:
>>
>> * **Python API**:
>>
>>   * ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``, while
>>     right now ``type(np.dtype("f8")) is np.dtype``.
>>     Code should use ``isinstance`` checks, and in very rare cases may
>>     have to be adapted to do so.
>>
>> * **C-API**:
>>
>>   * In old versions of NumPy ``PyArray_DescrCheck`` is a macro which
>>     uses ``type(dtype) is np.dtype``. When compiling against an old
>>     NumPy version, the macro may have to be replaced with the
>>     corresponding ``PyObject_IsInstance`` call. (If this is a problem,
>>     we could backport a fix for the macro.)
>>
>>   * The UFunc machinery changes will break *limited* parts of the
>>     current implementation. Replacing e.g. the default ``TypeResolver``
>>     is expected to remain supported for a time, although optimized
>>     masked inner-loop iteration (which is not even used *within* NumPy)
>>     will no longer be supported.
>>
>>   * All functions currently defined on the dtypes, such as
>>     ``PyArray_Descr->f->nonzero``, will be defined and accessed
>>     differently. This means that in the long run low-level access code
>>     will have to be changed to use the new API. Such changes are
>>     expected to be necessary in very few projects.
>>
>> * **dtype implementors (C-API)**:
>>
>>   * The array which is currently provided to some functions (such as
>>     cast functions) will no longer be provided.
>>     For example ``PyArray_Descr->f->nonzero`` or
>>     ``PyArray_Descr->f->copyswapn`` may instead receive a dummy array
>>     object with only some fields (mainly the dtype) being valid.
>>     At least in some code paths, a similar mechanism is already used.
>>
>>   * The ``scalarkind`` slot and registration of scalar casting will be
>>     removed/ignored without replacement.
>>     It currently allows partial value-based casting.
>>     The ``PyArray_ScalarKind`` function will continue to work for
>>     builtin types, but will not be used internally and will be
>>     deprecated.
>>
>>   * Currently user dtypes are defined as instances of ``np.dtype``.
>>     The creation works by the user providing a prototype instance.
>>     NumPy will need to modify at least the type during registration.
>>     This has no effect for either ``rational`` or ``quaternion``, and
>>     mutation of the structure seems unlikely after registration.
>>
>> Since there is a fairly large API surface concerning datatypes, further
>> changes, or the limitation of certain functions to currently existing
>> datatypes, are likely to occur.
>> For example functions which use the type number as input should be
>> replaced with functions taking DType classes instead.
>> Although public, large parts of this C-API seem to be used rarely,
>> possibly never, by downstream projects.
>>
>>
>>
>> Detailed Description
>> --------------------
>>
>> This section details the design decisions covered by this NEP.
>> The subsections correspond to the list of design choices presented in
>> the Scope section.
>>
>> Datatypes as Python Classes (1)
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> The current NumPy datatypes are not full-scale Python classes.
>> They are instead (prototype) instances of a single ``np.dtype`` class.
>> Changing this means that any special handling, e.g. for ``datetime``,
>> can be moved to the Datetime DType class instead, away from monolithic
>> general code (e.g. the current ``PyArray_AdjustFlexibleDType``).
>>
>> The main consequence of this change with respect to the API is that
>> special methods move from the dtype instances to methods on the new
>> DType class.
>> This is the typical design pattern used in Python.
>> Organizing these methods and information in a more Pythonic way
>> provides a solid foundation for refining and extending the API in the
>> future.
>> The current API cannot be extended due to how it is exposed publicly.
>> This means for example that the methods currently stored in
>> ``PyArray_ArrFuncs`` on each datatype (see NEP 40) will be defined
>> differently in the future and deprecated in the long run.
>>
>> The most prominent visible side effect of this will be that
>> ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore.
>> Instead it will be a subclass of ``np.dtype``, meaning that
>> ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true.
>> This will also add the ability to use
>> ``isinstance(dtype, np.dtype[float64])``, thus removing the need to use
>> ``dtype.kind``, ``dtype.char``, or ``dtype.type`` to do this check.
>>
>> With the design decision of DTypes as full-scale Python classes, the
>> question of subclassing arises.
>> Inheritance, however, appears problematic and a complexity best avoided
>> (at least initially) for container datatypes.
>> Further, subclasses may be more interesting for interoperability (for
>> example with GPU backends such as CuPy storing additional GPU-related
>> methods) rather than as a mechanism to define new datatypes.
>> While a class hierarchy does provide value, this may be achieved by
>> allowing the creation of *abstract* datatypes.
>> An example of an abstract datatype would be the datatype equivalent of
>> ``np.floating``, representing any floating point number.
>> These can serve the same purpose as Python's abstract base classes.
>>
>>
>> Scalars should not be instances of the datatypes (2)
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> For simple datatypes such as ``float64`` (see also below), it seems
>> tempting that an instance of ``np.dtype("float64")`` could be the
>> scalar.
>> This idea may be even more appealing due to the fact that scalars,
>> rather than datatypes, currently define a useful type hierarchy.
>>
>> However, we have specifically decided against this for a number of
>> reasons.
>> First, the new datatypes described herein would be instances of DType
>> classes. Making these instances themselves classes, while possible,
>> adds additional complexity that users need to understand.
>> It would also mean that scalars must have storage information (such as
>> byteorder), which is generally unnecessary and currently is not used.
>> Second, while the simple NumPy scalars such as ``float64`` may be such
>> instances, it should be possible to create datatypes for Python objects
>> without enforcing NumPy as a dependency.
>> However, Python objects that do not depend on NumPy cannot be instances
>> of a NumPy DType.
>> Third, there is a mismatch between the methods and attributes which are
>> useful for scalars and datatypes. For instance ``to_float()`` makes
>> sense for a scalar but not for a datatype, and ``newbyteorder`` is not
>> useful on a scalar (or has a different meaning).
>>
>> Overall, it seems that rather than reducing the complexity by merging
>> the two distinct type hierarchies, making scalars instances of DTypes
>> would increase the complexity of both the design and the
>> implementation.
>>
>> A possible future path may be to instead simplify the current NumPy
>> scalars to be much simpler objects which largely derive their behaviour
>> from the datatypes.
>>
>> C-API for creating new Datatypes (3)
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> The current C-API with which users can create new datatypes is limited
>> in scope, and requires the use of "private" structures. This means the
>> API is not extensible: no new members can be added to the structure
>> without losing binary compatibility.
>> This has already limited the inclusion of new sorting methods into
>> NumPy [new_sort]_.
>>
>> The new version shall thus replace the current ``PyArray_ArrFuncs``
>> structure used to define new datatypes.
>> Datatypes that currently exist and are defined using these slots will
>> be supported during a deprecation period.
>>
>> The most likely solution to hide the implementation from the user, and
>> thus make it extensible in the future, is to model the API after
>> Python's stable API [PEP-384]_:
>>
>> .. code-block:: C
>>
>>     static struct PyArrayMethodDef slots[] = {
>>         {NPY_dt_method, method_implementation},
>>         ...,
>>         {0, NULL}
>>     };
>>
>>     typedef struct {
>>         PyTypeObject *typeobj;  /* type of python scalar */
>>         ...;
>>         PyType_Slot *slots;
>>     } PyArrayDTypeMeta_Spec;
>>
>>     PyObject* PyArray_InitDTypeMetaFromSpec(
>>         PyArray_DTypeMeta *user_dtype, PyArrayDTypeMeta_Spec *dtype_spec);
>>
>> The C-side slots should be designed to mirror Python-side methods such
>> as ``dtype.__dtype_method__``, although exposing them to Python is a
>> later step in the implementation, to reduce the complexity of the
>> initial work.
>>
>>
>> C-API Changes to the UFunc Machinery (4)
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> Proposed changes to the UFunc machinery will be part of NEP 43.
>> However, the following changes will be necessary (see NEP 40 for a
>> detailed description of the current implementation and its issues):
>>
>> * The current UFunc type resolution must be adapted to allow better
>>   control for user-defined dtypes as well as to resolve current
>>   inconsistencies.
>> * The inner loop used in UFuncs must be expanded to include a return
>>   value. Further, error reporting must be improved and the passing in
>>   of dtype-specific information enabled.
>>   This requires the modification of the inner-loop function signature
>>   and the addition of new hooks called before and after the inner loop
>>   is used.
>>
>> An important goal for any changes to the universal functions will be to
>> allow the reuse of existing loops.
>> It should be easy for a new units datatype to fall back to existing
>> math functions after handling the unit-related computations.
>>
>>
>> Discussion
>> ----------
>>
>> See NEP 40 for a list of previous meetings and discussions.
>>
>>
>> References
>> ----------
>>
>> .. [pandas_extension_arrays] https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types
>>
>> .. [xarray_dtype_issue] https://github.com/pydata/xarray/issues/1262
>>
>> .. [pygeos] https://github.com/caspervdw/pygeos
>>
>> .. [new_sort] https://github.com/numpy/numpy/pull/12945
>>
>> .. [PEP-384] https://www.python.org/dev/peps/pep-0384/
>>
>> .. [PR 15508] https://github.com/numpy/numpy/pull/15508
>>
>>
>> Copyright
>> ---------
>>
>> This document has been placed in the public domain.
>>
>>
>> Acknowledgments
>> ---------------
>>
>> The effort to create new datatypes for NumPy has been discussed for
>> several years in many different contexts and settings, making it
>> impossible to list everyone involved.
>> We would like to thank especially Stephan Hoyer, Nathaniel Smith, and
>> Eric Wieser for repeated in-depth discussion about datatype design.
>> We are very grateful for the community input in reviewing and revising
>> this NEP and would like to thank especially Ross Barnowski and Ralf
>> Gommers.
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion


From sebastian at sipsolutions.net Fri Mar 20 15:02:52 2020
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Fri, 20 Mar 2020 14:02:52 -0500
Subject: [Numpy-discussion] New Wednesday Community Meeting Time slot
	(Current 11:00am California time)
Message-ID: 

Hi all,

Since we are spread out all over the world, we are considering moving
the Community meeting time. If you are interested in joining in
occasionally, please fill out the Doodle:

https://doodle.com/poll/p3gik4xxdra93cwt

I have currently limited the times to hourly slots within the
California work day. The most likely change will be a slight shift to
make it later than the current 11:00am California time, to still be
reasonable for those from Europe. But if you are generally interested
and never joined due to the time, now is a good time to bring it up :).

Note that this meeting is bi-weekly on Wednesdays; I do not anticipate
moving the day itself for now. This is primarily for the Community and
not the Triage meeting.

All the best,

Sebastian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part
URL: 

From meissner at hawaii.edu Fri Mar 20 18:20:18 2020
From: meissner at hawaii.edu (Gunter Meissner)
Date: Fri, 20 Mar 2020 12:20:18 -1000
Subject: [Numpy-discussion] Error in Covariance and Variance calculation
In-Reply-To: 
References: <007001d5fee2$22062630$66127290$@hawaii.edu>
Message-ID: <00fc01d5ff05$bd920380$38b60a80$@hawaii.edu>

Thanks Warren! Worked like a charm :-) Will mention you in the book...

Gunter Meissner, PhD
University of Hawaii
Adjunct Professor of MathFinance at Columbia University and NYU
President of Derivatives Software www.dersoft.com
CEO Cassandra Capital Management www.cassandracm.com
CV: www.dersoft.com/cv.pdf
Email: meissner at hawaii.edu
Tel: USA (808) 779 3660

-----Original Message-----
From: NumPy-Discussion On Behalf Of Warren Weckesser
Sent: Friday, March 20, 2020 8:45 AM
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] Error in Covariance and Variance calculation

On 3/20/20, Gunter Meissner wrote:
> Dear Programmers,
>
> This is Gunter Meissner. I am currently writing a book on Forecasting
> and derived the regression coefficient with NumPy:
>
> import numpy as np
> X = [1, 2, 3, 4]
> Y = [10000, 8000, 5000, 1000]
> print(np.cov(X, Y))
> print(np.var(X))
> Beta1 = np.cov(X, Y) / np.var(X)
> print(Beta1)
>
> However, NumPy is using the SAMPLE covariance
> Cov(X, Y) = sum((x_i - xbar) * (y_i - ybar)) / (n - 1), which divides
> by n-1, and the POPULATION variance
> Var(X) = sum((x_i - xbar)**2) / n, which divides by n.
> Therefore the regression coefficient BETA1 is not correct.
>
> The solution is easy: Please use the population approach (dividing by
> n) for BOTH covariance and variance, or use the sample approach
> (dividing by n-1) for BOTH covariance and variance. You may also
> allow the user to use both as in EXCEL, where the user can choose
> between Var.S and Var.P and between Cov.S and Cov.P.
>
> Thanks!!!
> Gunter

Gunter,

This is an unfortunate discrepancy in the API: `var` uses the default
`ddof=0`, while `cov` uses, in effect, `ddof=1` by default.
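
For example, with the X above, the two defaults give different values
(the sum of squared deviations is 5.0, divided by 4 versus by 3):

    >>> import numpy as np
    >>> X = [1, 2, 3, 4]
    >>> np.var(X)           # population variance, ddof=0
    1.25
    >>> np.cov(X, X)[0, 0]  # cov's default is effectively ddof=1
    1.6666666666666667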
You can get the consistent behavior you want by using `ddof=1` in both
functions. E.g.

    Beta1 = np.cov(X, Y, ddof=1) / np.var(X, ddof=1)

Using `ddof=1` in `np.cov` is redundant, but in this context it is
probably useful to make it explicit to the reader of the code that both
functions are using the same convention.

Changing the default in either function breaks backwards compatibility.
That would require a long and potentially painful deprecation process.

Warren

> Gunter Meissner, PhD
> University of Hawaii
> Adjunct Professor of MathFinance at Columbia University and NYU
> President of Derivatives Software www.dersoft.com
> CEO Cassandra Capital Management www.cassandracm.com
> CV: www.dersoft.com/cv.pdf
> Email: meissner at hawaii.edu
> Tel: USA (808) 779 3660
>
> From: NumPy-Discussion On Behalf Of Ralf Gommers
> Sent: Wednesday, March 18, 2020 5:16 AM
> To: Discussion of Numerical Python
> Subject: Re: [Numpy-discussion] Proposal: NEP 41 -- First step towards
> a new Datatype System
>
> On Tue, Mar 17, 2020 at 9:03 PM Sebastian Berg wrote:
>
> Hi all,
>
> in the spirit of trying to keep this moving, can I assume that the
> main reason for little discussion is that the actual changes proposed
> are not very far reaching as of now? Or is the reason that this is a
> fairly complex topic that you need more time to think about it?
>
> Probably (a) it's a long NEP on a complex topic, (b) the past week has
> been a very weird week for everyone (in the extra-news-reading-time I
> could easily have re-reviewed the NEP), and (c) the amount of feedback
> one expects to get on a NEP is roughly inversely proportional to the
> scope and complexity of the NEP contents.
>
> Today I re-read the parts I commented on before. This version is a big
> improvement over the previous ones. Thanks in particular for adding
> clear examples and the diagram, it helps a lot.
>
> If it is the latter, is there some way I can help with it? I tried to
> minimize how much is part of this initial NEP.
> If there is not much need for discussion, I would like to officially
> accept the NEP very soon, sending out an official one week notice in
> the next days.
>
> I agree. I think I would like to keep the option open though to come
> back to the NEP later to improve the clarity of the text about
> motivation/plan/examples/scope, given that this will be the reference
> for a major amount of work for a long time to come.
>
> To summarize one more time, the main point is that:
>
>     type(np.dtype(np.float64))
>
> will be `np.dtype[float64]`, a subclass of dtype, so that:
>
>     issubclass(np.dtype[float64], np.dtype)
>
> is true. This means that we will have one class for every current type
> number: `dtype.num`. The implementation of these subclasses will be a
> C-written (extension) MetaClass; all details of this class are supposed
> to remain experimental and in flux at this time.
>
> This point seems fine, and I'm +1 for going ahead with the described
> parts of the technical design.
>
> Cheers,
> Ralf
>
> Cheers
>
> Sebastian
>
> On Wed, 2020-03-11 at 17:02 -0700, Sebastian Berg wrote:
>> Hi all,
>>
>> I am pleased to propose NEP 41: First step towards a new Datatype
>> System:
>> https://numpy.org/neps/nep-0041-improved-dtype-support.html
>>
>> This NEP motivates the larger restructure of the datatype machinery
>> in NumPy and defines a few fundamental design aspects.
>> The long term user impact will be allowing easier and more richly
>> featured user-defined datatypes.
>>
>> As this is a large restructure, the NEP represents only the first
>> steps, with some additional information in further NEPs being drafted
>> [1] (this may be helpful to look at depending on the level of detail
>> you are interested in).
>> The NEP itself does not propose to add significant new public API.
>> Instead it proposes to move forward with an incremental internal
>> refactor and lays the foundation for this process.
>>
>> The main user facing change at this time is that datatypes will become
>> classes (e.g. ``type(np.dtype("float64"))`` will be a float64-specific
>> class).
>> For most users, the main impact should be many new datatypes in the
>> long run (see the user impact section). However, for those interested
>> in API design within NumPy or with respect to implementing new
>> datatypes, this and the following NEPs are important decisions in the
>> future roadmap for NumPy.
>>
>> Cheers
>>
>> Sebastian
>>
>>
>> [1] NEP 40 gives some background information about the current system
>> and issues with it:
>> https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac/doc/neps/nep-0040-legacy-datatype-impl.rst
>> and NEP 42 is a first draft of how the new API may look:
>> https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst
>> (links to current rendered versions; check
>> https://github.com/numpy/numpy/pull/15505 and
>> https://github.com/numpy/numpy/pull/15507 for updates)
>> Refactoring the datatype API will improve maintainability and >> facilitate development of both user-defined external datatypes, as >> well as new features for existing datatypes internal to NumPy. >> >> >> Motivation and Scope >> -------------------- >> >> .. seealso:: >> >> The user impact section includes examples of what kind of new >> datatypes >> will be enabled by the proposed changes in the long run. >> It may thus help to read these section out of order. >> >> Motivation >> ^^^^^^^^^^ >> >> One of the main issues with the current API is the definition of >> typical functions such as addition and multiplication for parametric >> datatypes (see also NEP 40) which require additional steps to >> determine the output type. >> For example when adding two strings of length 4, the result is a >> string of length 8, which is different from the input. >> Similarly, a datatype which embeds a physical unit must calculate the >> new unit >> information: dividing a distance by a time results in a speed. >> A related difficulty is that the :ref:`current casting rules >> <_ufuncs.casting>` >> -- the conversion between different datatypes -- cannot describe >> casting for such parametric datatypes implemented outside of NumPy. >> >> This additional functionality for supporting parametric datatypes >> introduces increased complexity within NumPy itself, and furthermore >> is not available to external user-defined datatypes. >> In general the concerns of different datatypes are not well well- >> encapsulated. >> This burden is exacerbated by the exposure of internal C structures, >> limiting the addition of new fields (for example to support new >> sorting methods [new_sort]_). >> >> Currently there are many factors which limit the creation of new >> user-defined >> datatypes: >> >> * Creating casting rules for parametric user-defined dtypes is either >> impossible >> or so complex that it has never been attempted. >> * Type promotion, e.g. the operation deciding that adding float and >> integer >> values should return a float value, is very valuable for numeric >> datatypes >> but is limited in scope for user-defined and especially parametric >> datatypes. >> * Much of the logic (e.g. promotion) is written in single functions >> instead of being split as methods on the datatype itself. >> * In the current design datatypes cannot have methods that do not >> generalize >> to other datatypes. For example a unit datatype cannot have a >> ``.to_si()`` method to >> easily find the datatype which would represent the same values in >> SI units. >> >> The large need to solve these issues has driven the scientific >> community to create work-arounds in multiple projects implementing >> physical units as an array-like class instead of a datatype, which >> would generalize better across multiple array-likes (Dask, pandas, >> etc.). >> Already, Pandas has made a push into the same direction with its >> extension arrays [pandas_extension_arrays]_ and undoubtedly the >> community would be best served if such new features could be common >> between NumPy, Pandas, and other projects. >> >> Scope >> ^^^^^ >> >> The proposed refactoring of the datatype system is a large >> undertaking and thus is proposed to be split into various phases, >> roughly: >> >> * Phase I: Restructure and extend the datatype infrastructure (This >> NEP 41) >> * Phase II: Incrementally define or rework API (Detailed largely in >> NEPs 42/43) >> * Phase III: Growth of NumPy and Scientific Python Ecosystem >> capabilities. 
>> >> For a more detailed accounting of the various phases, see "Plan to >> Approach the Full Refactor" in the Implementation section below. >> This NEP proposes to move ahead with the necessary creation of new >> dtype subclasses (Phase I), and start working on implementing current >> functionality. >> Within the context of this NEP all development will be fully private >> API or use preliminary underscored names which must be changed in the >> future. >> Most of the internal and public API choices are part of a second >> Phase and will be discussed in more detail in the following NEPs 42 >> and 43. >> The initial implementation of this NEP will have little or no effect >> on users, but provides the necessary ground work for incrementally >> addressing the full rework. >> >> The implementation of this NEP and the following, implied large >> rework of how datatypes are defined in NumPy is expected to create >> small incompatibilities (see backward compatibility section). >> However, a transition requiring large code adaption is not >> anticipated and not within scope. >> >> Specifically, this NEP makes the following design choices which are >> discussed in more details in the detailed description section: >> >> 1. Each datatype will be an instance of a subclass of ``np.dtype``, >> with most of the >> datatype-specific logic being implemented >> as special methods on the class. In the C-API, these correspond to >> specific >> slots. In short, for ``f = np.dtype("f8")``, ``isinstance(f, >> np.dtype)`` will remain true, >> but ``type(f)`` will be a subclass of ``np.dtype`` rather than >> just ``np.dtype`` itself. >> The ``PyArray_ArrFuncs`` which are currently stored as a pointer >> on the instance (as ``PyArray_Descr->f``), >> should instead be stored on the class as typically done in Python. >> In the future these may correspond to python side dunder methods. >> Storage information such as itemsize and byteorder can differ >> between >> different dtype instances (e.g. "S3" vs. "S8") and will remain >> part of the instance. >> This means that in the long run the current lowlevel access to >> dtype methods >> will be removed (see ``PyArray_ArrFuncs`` in NEP 40). >> >> 2. The current NumPy scalars will *not* change, they will not be >> instances of >> datatypes. This will also be true for new datatypes, scalars will >> not be >> instances of a dtype (although ``isinstance(scalar, dtype)`` may >> be made >> to return ``True`` when appropriate). >> >> Detailed technical decisions to follow in NEP 42. >> >> Further, the public API will be designed in a way that is extensible >> in the future: >> >> 3. All new C-API functions provided to the user will hide >> implementation details >> as much as possible. The public API should be an identical, but >> limited, >> version of the C-API used for the internal NumPy datatypes. >> >> The changes to the datatype system in Phase II must include a large >> refactor of the UFunc machinery, which will be further defined in NEP >> 43: >> >> 4. To enable all of the desired functionality for new user-defined >> datatypes, >> the UFunc machinery will be changed to replace the current >> dispatching >> and type resolution system. >> The old system should be *mostly* supported as a legacy version >> for some time. >> >> Additionally, as a general design principle, the addition of new >> user-defined datatypes will *not* change the behaviour of programs. >> For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or >> ``b`` know that ``c`` exists. 
>> >> >> User Impact >> ----------- >> >> The current ecosystem has very few user-defined datatypes using >> NumPy, the two most prominent being: ``rational`` and ``quaternion``. >> These represent fairly simple datatypes which are not strongly >> impacted by the current limitations. >> However, we have identified a need for datatypes such as: >> >> * bfloat16, used in deep learning >> * categorical types >> * physical units (such as meters) >> * datatypes for tracing/automatic differentiation >> * high, fixed precision math >> * specialized integer types such as int2, int24 >> * new, better datetime representations >> * extending e.g. integer dtypes to have a sentinel NA value >> * geometrical objects [pygeos]_ >> >> Some of these are partially solved; for example unit capability is >> provided in ``astropy.units``, ``unyt``, or ``pint``, as >> `numpy.ndarray` subclasses. >> Most of these datatypes, however, simply cannot be reasonably defined >> right now. >> An advantage of having such datatypes in NumPy is that they should >> integrate seamlessly with other array or array-like packages such as >> Pandas, ``xarray`` [xarray_dtype_issue]_, or ``Dask``. >> >> The long term user impact of implementing this NEP will be to allow >> both the growth of the whole ecosystem by having such new datatypes, >> as well as consolidating implementation of such datatypes within >> NumPy to achieve better interoperability. >> >> >> Examples >> ^^^^^^^^ >> >> The following examples represent future user-defined datatypes we >> wish to enable. >> These datatypes are not part the NEP and choices (e.g. choice of >> casting rules) are possibilities we wish to enable and do not >> represent recommendations. >> >> Simple Numerical Types >> """""""""""""""""""""" >> >> Mainly used where memory is a consideration, lower-precision numeric >> types such as :ref:```bfloat16`` < >> https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>` >> are common in other computational frameworks. >> For these types the definitions of things such as ``np.common_type`` >> and ``np.can_cast`` are some of the most important interfaces. Once >> they support ``np.common_type``, it is (for the most part) possible >> to find the correct ufunc loop to call, since most ufuncs -- such as >> add -- effectively only require ``np.result_type``:: >> >> >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2) >> >> and `~numpy.result_type` is largely identical to >> `~numpy.common_type`. >> >> >> Fixed, high precision math >> """""""""""""""""""""""""" >> >> Allowing arbitrary precision or higher precision math is important in >> simulations. For instance ``mpmath`` defines a precision:: >> >> >>> import mpmath as mp >> >>> print(mp.dps) # the current (default) precision >> 15 >> >> NumPy should be able to construct a native, memory-efficient array >> from a list of ``mpmath.mpf`` floating point objects:: >> >> >>> arr_15_dps = np.array(mp.arange(3)) # (mp.arange returns a >> list) >> >>> print(arr_15_dps) # Must find the correct precision from the >> objects: >> array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15]) >> >> We should also be able to specify the desired precision when creating >> the datatype for an array. Here, we use ``np.dtype[mp.mpf]`` to find >> the DType class (the notation is not part of this NEP), which is then >> instantiated with the desired parameter. 
>> This could also be written as ``MpfDType`` class:: >> >> >>> arr_100_dps = np.array([1, 2, 3], >> dtype=np.dtype[mp.mpf](dps=100)) >> >>> print(arr_15_dps + arr_100_dps) >> array(['0.0', '2.0', '4.0'], dtype=mpf[dps=100]) >> >> The ``mpf`` datatype can decide that the result of the operation >> should be the higher precision one of the two, so uses a precision of >> 100. >> Furthermore, we should be able to define casting, for example as in:: >> >> >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype, >> casting="safe") >> True >> >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype, >> casting="safe") >> False # loses precision >> >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype, >> casting="same_kind") >> True >> >> Casting from float is a probably always at least a ``same_kind`` >> cast, but in general, it is not safe:: >> >> >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4), >> casting="safe") >> False >> >> since a float64 has a higer precision than the ``mpf`` datatype with >> ``dps=4``. >> >> Alternatively, we can say that:: >> >> >>> np.common_type(np.dtype[mp.mpf](dps=5), >> np.dtype[mp.mpf](dps=10)) >> np.dtype[mp.mpf](dps=10) >> >> And possibly even:: >> >> >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64) >> np.dtype[mp.mpf](dps=16) # equivalent precision to float64 (I >> believe) >> >> since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)`` >> safely. >> >> >> Categoricals >> """""""""""" >> >> Categoricals are interesting in that they can have fixed, predefined >> values, or can be dynamic with the ability to modify categories when >> necessary. >> The fixed categories (defined ahead of time) is the most straight >> forward categorical definition. >> Categoricals are *hard*, since there are many strategies to implement >> them, suggesting NumPy should only provide the scaffolding for >> user-defined categorical types. For instance:: >> >> >>> cat = Categorical(["eggs", "spam", "toast"]) >> >>> breakfast = array(["eggs", "spam", "eggs", "toast"], >> dtype=cat) >> >> could store the array very efficiently, since it knows that there are >> only 3 categories. >> Since a categorical in this sense knows almost nothing about the data >> stored in it, few operations makes, sense, although equality does: >> >> >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"], >> dtype=cat) >> >>> breakfast == breakfast2 >> array[True, False, True, False]) >> >> The categorical datatype could work like a dictionary: no two items >> names can be equal (checked on dtype creation), so that the equality >> operation above can be performed very efficiently. >> If the values define an order, the category labels (internally >> integers) could >> be ordered the same way to allow efficient sorting and comparison. >> >> Whether or not casting is defined from one categorical with less to >> one with strictly more values defined, is something that the >> Categorical datatype would need to decide. Both options should be >> available. >> >> >> Unit on the Datatype >> """""""""""""""""""" >> >> There are different ways to define Units, depending on how the >> internal machinery would be organized, one way is to have a single >> Unit datatype for every existing numerical type. 
>> This will be written as ``Unit[float64]``, the unit itself is part of >> the DType instance ``Unit[float64]("m")`` is a ``float64`` with >> meters >> attached:: >> >> >>> from astropy import units >> >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m # >> meters >> >>> print(meters) >> array([1.0, 2.0, 3.0], dtype=Unit[float64]("m")) >> >> Note that units are a bit tricky. It is debatable, whether:: >> >> >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m")) >> >> should be valid syntax (coercing the float scalars without a unit to >> meters). >> Once the array is created, math will work without any issue:: >> >> >>> meters / (2 * unit.seconds) >> array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s")) >> >> Casting is not valid from one unit to the other, but can be valid >> between different scales of the same dimensionality (although this >> may be >> "unsafe"):: >> >> >>> meters.astype(Unit[float64]("s")) >> TypeError: Cannot cast meters to seconds. >> >>> meters.astype(Unit[float64]("km")) >> >>> # Convert to centimeter-gram-second (cgs) units: >> >>> meters.astype(meters.dtype.to_cgs()) >> >> The above notation is somewhat clumsy. Functions could be used >> instead to convert between units. >> There may be ways to make these more convenient, but those must be >> left for future discussions:: >> >> >>> units.convert(meters, "km") >> >>> units.to_cgs(meters) >> >> There are some open questions. For example, whether additional >> methods on the array object could exist to simplify some of the >> notions, and how these would percolate from the datatype to the >> ``ndarray``. >> >> The interaction with other scalars would likely be defined through:: >> >> >>> np.common_type(np.float64, Unit) >> Unit[np.float64](dimensionless) >> >> Ufunc output datatype determination can be more involved than for >> simple numerical dtypes since there is no "universal" output type:: >> >> >>> np.multiply(meters, seconds).dtype != np.result_type(meters, >> seconds) >> >> In fact ``np.result_type(meters, seconds)`` must error without >> context of the operation being done. >> This example highlights how the specific ufunc loop (loop with known, >> specific DTypes as inputs), has to be able to to make certain >> decisions before the actual calculation can start. >> >> >> >> Implementation >> -------------- >> >> Plan to Approach the Full Refactor >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> >> To address these issues in NumPy and enable new datatypes, >> multiple development stages are required: >> >> * Phase I: Restructure and extend the datatype infrastructure (This >> NEP) >> >> * Organize Datatypes like normal Python classes [`PR 15508`]_ >> >> * Phase II: Incrementally define or rework API >> >> * Create a new and easily extensible API for defining new datatypes >> and related functionality. (NEP 42) >> >> * Incrementally define all necessary functionality through the new >> API (NEP 42): >> >> * Defining operations such as ``np.common_type``. >> * Allowing to define casting between datatypes. >> * Add functionality necessary to create a numpy array from Python >> scalars >> (i.e. ``np.array(...)``). >> * ? >> >> * Restructure how universal functions work (NEP 43), in order to: >> >> * make it possible to allow a `~numpy.ufunc` such as ``np.add`` >> to be >> extended by user-defined datatypes such as Units. >> >> * allow efficient lookup for the correct implementation for user- >> defined >> datatypes. >> >> * enable reuse of existing code. 
Units should be able to use the >> normal math loops and add additional logic to determine output >> type. >> >> * Phase III: Growth of NumPy and Scientific Python Ecosystem >> capabilities: >> >> * Cleanup of legacy behaviour where it is considered buggy or >> undesirable. >> * Provide a path to define new datatypes from Python. >> * Assist the community in creating types such as Units or >> Categoricals >> * Allow strings to be used in functions such as ``np.equal`` or >> ``np.add``. >> * Remove legacy code paths within NumPy to improve long term >> maintainability >> >> This document serves as a basis for phase I and provides the vision >> and >> motivation for the full project. >> Phase I does not introduce any new user-facing features, >> but is concerned with the necessary conceptual cleanup of the current >> datatype system. >> It provides a more "pythonic" datatype Python type object, with a >> clear class hierarchy. >> >> The second phase is the incremental creation of all APIs necessary to >> define >> fully featured datatypes and reorganization of the NumPy datatype >> system. >> This phase will thus be primarily concerned with defining an, >> initially preliminary, stable public API. >> >> Some of the benefits of a large refactor may only become evident >> after the full >> deprecation of the current legacy implementation (i.e. larger code >> removals). >> However, these steps are necessary for improvements to many parts of >> the >> core NumPy API, and are expected to make the implementation generally >> easier to understand. >> >> The following figure illustrates the proposed design at a high level, >> and roughly delineates the components of the overall design. >> Note that this NEP only regards Phase I (shaded area), >> the rest encompasses Phase II and the design choices are up for >> discussion, >> however, it highlights that the DType datatype class is the central, >> necessary >> concept: >> >> .. image:: _static/nep-0041-mindmap.svg >> >> >> First steps directly related to this NEP >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> >> The required changes necessary to NumPy are large and touch many >> areas >> of the code base >> but many of these changes can be addressed incrementally. >> >> To enable an incremental approach we will start by creating a C >> defined >> ``PyArray_DTypeMeta`` class with its instances being the ``DType`` >> classes, >> subclasses of ``np.dtype``. >> This is necessary to add the ability of storing custom slots on the >> DType in C. >> This ``DTypeMeta`` will be implemented first to then enable >> incremental >> restructuring of current code. >> >> The addition of ``DType`` will then enable addressing other changes >> incrementally, some of which may begin before the settling the full >> internal >> API: >> >> 1. New machinery for array coercion, with the goal of enabling user >> DTypes >> with appropriate class methods. >> 2. The replacement or wrapping of the current casting machinery. >> 3. Incremental redefinition of the current ``PyArray_ArrFuncs`` slots >> into >> DType method slots. >> >> At this point, no or only very limited new public API will be added >> and >> the internal API is considered to be in flux. >> Any new public API may be set up give warnings and will have leading >> underscores >> to indicate that it is not finalized and can be changed without >> warning. 
>>
>>
>> Backward compatibility
>> ----------------------
>>
>> While the actual backward compatibility impact of implementing
>> Phase I and II is not yet fully clear, we anticipate and accept the
>> following changes:
>>
>> * **Python API**:
>>
>> * ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``,
>> while right now ``type(np.dtype("f8")) is np.dtype``.
>> Code should use ``isinstance`` checks, and in very rare cases may
>> have to be adapted to use it.
>>
>> * **C-API**:
>>
>> * In old versions of NumPy ``PyArray_DescrCheck`` is a macro
>> which uses ``type(dtype) is np.dtype``. When compiling against an
>> old NumPy version, the macro may have to be replaced with the
>> corresponding ``PyObject_IsInstance`` call. (If this is a
>> problem, we could backport fixing the macro.)
>>
>> * The UFunc machinery changes will break *limited* parts of the
>> current implementation. Replacing e.g. the default
>> ``TypeResolver`` is expected to remain supported for a time,
>> although optimized masked inner loop iteration (which is not even
>> used *within* NumPy) will no longer be supported.
>>
>> * All functions currently defined on the dtypes, such as
>> ``PyArray_Descr->f->nonzero``, will be defined and accessed
>> differently. This means that in the long run lowlevel access code
>> will have to be changed to use the new API. Such changes are
>> expected to be necessary in very few projects.
>>
>> * **dtype implementors (C-API)**:
>>
>> * The array which is currently provided to some functions (such
>> as cast functions) will no longer be provided.
>> For example ``PyArray_Descr->f->nonzero`` or
>> ``PyArray_Descr->f->copyswapn`` may instead receive a dummy array
>> object with only some fields (mainly the dtype) being valid.
>> At least in some code paths, a similar mechanism is already used.
>>
>> * The ``scalarkind`` slot and registration of scalar casting will
>> be removed/ignored without replacement.
>> It currently allows partial value-based casting.
>> The ``PyArray_ScalarKind`` function will continue to work for
>> builtin types, but will not be used internally and will be
>> deprecated.
>>
>> * Currently user dtypes are defined as instances of ``np.dtype``.
>> The creation works by the user providing a prototype instance.
>> NumPy will need to modify at least the type during registration.
>> This has no effect for either ``rational`` or ``quaternion`` and
>> mutation of the structure seems unlikely after registration.
>>
>> Since there is a fairly large API surface concerning datatypes,
>> further changes, or the limitation of certain functions to
>> currently existing datatypes, are likely to occur.
>> For example functions which use the type number as input
>> should be replaced with functions taking DType classes instead.
>> Although public, large parts of this C-API seem to be used rarely,
>> possibly never, by downstream projects.
>>
>>
>>
>> Detailed Description
>> --------------------
>>
>> This section details the design decisions covered by this NEP.
>> The subsections correspond to the list of design choices presented
>> in the Scope section.
>>
>> Datatypes as Python Classes (1)
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> The current NumPy datatypes are not full-scale Python classes.
>> They are instead (prototype) instances of a single ``np.dtype``
>> class.
>> Changing this means that any special handling, e.g. for
>> ``datetime``, can be moved to the Datetime DType class instead,
>> away from monolithic general code (e.g. the current
>> ``PyArray_AdjustFlexibleDType``).
>>
>> The main consequence of this change with respect to the API is that
>> special methods move from the dtype instances to methods on the new
>> DType class.
>> This is the typical design pattern used in Python.
>> Organizing these methods and information in a more Pythonic way
>> provides a solid foundation for refining and extending the API in
>> the future.
>> The current API cannot be extended due to how it is exposed
>> publicly.
>> This means for example that the methods currently stored in
>> ``PyArray_ArrFuncs`` on each datatype (see NEP 40) will be defined
>> differently in the future and deprecated in the long run.
>>
>> The most prominent visible side effect of this will be that
>> ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore.
>> Instead it will be a subclass of ``np.dtype``, meaning that
>> ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true.
>> This will also add the ability to use ``isinstance(dtype,
>> np.dtype[float64])``, thus removing the need to use ``dtype.kind``,
>> ``dtype.char``, or ``dtype.type`` to do this check.
>>
>> With the design decision of DTypes as full-scale Python classes,
>> the question of subclassing arises.
>> Inheritance, however, appears problematic and a complexity best
>> avoided (at least initially) for container datatypes.
>> Further, subclasses may be more interesting for interoperability
>> (for example with GPU backends such as CuPy storing additional
>> GPU-related methods) than as a mechanism to define new datatypes.
>> A class hierarchy does provide value; this may be achieved by
>> allowing the creation of *abstract* datatypes.
>> An example of an abstract datatype would be the datatype equivalent
>> of ``np.floating``, representing any floating point number.
>> These can serve the same purpose as Python's abstract base classes.
>>
>>
>> Scalars should not be instances of the datatypes (2)
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> For simple datatypes such as ``float64`` (see also below), it seems
>> tempting that the instance of a ``np.dtype("float64")`` can be the
>> scalar.
>> This idea may be even more appealing due to the fact that scalars,
>> rather than datatypes, currently define a useful type hierarchy.
>>
>> However, we have specifically decided against this for a number of
>> reasons.
>> First, the new datatypes described herein would be instances of
>> DType classes.
>> Making these instances themselves classes, while possible, adds
>> additional complexity that users need to understand.
>> It would also mean that scalars must have storage information (such
>> as byteorder) which is generally unnecessary and currently is not
>> used.
>> Second, while the simple NumPy scalars such as ``float64`` may be
>> such instances, it should be possible to create datatypes for
>> Python objects without enforcing NumPy as a dependency.
>> However, Python objects that do not depend on NumPy cannot be
>> instances of a NumPy DType.
>> Third, there is a mismatch between the methods and attributes which
>> are useful for scalars and datatypes. For instance ``to_float()``
>> makes sense for a scalar but not for a datatype, and
>> ``newbyteorder`` is not useful on a scalar (or has a different
>> meaning).
>>
>> Overall, it seems that rather than reducing the complexity, i.e.
>> by merging the two distinct type hierarchies, making scalars
>> instances of DTypes would increase the complexity of both the
>> design and implementation.
>>
>> A possible future path may be to instead simplify the current NumPy
>> scalars to be much simpler objects which largely derive their
>> behaviour from the datatypes.
>>
>> C-API for creating new Datatypes (3)
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> The current C-API with which users can create new datatypes
>> is limited in scope, and requires use of "private" structures. This
>> means the API is not extensible: no new members can be added to the
>> structure without losing binary compatibility.
>> This has already limited the inclusion of new sorting methods into
>> NumPy [new_sort]_.
>>
>> The new version shall thus replace the current ``PyArray_ArrFuncs``
>> structure used to define new datatypes.
>> Datatypes that currently exist and are defined using these slots
>> will be supported during a deprecation period.
>>
>> The most likely solution to hide the implementation from the user,
>> and thus make it extensible in the future, is to model the API
>> after Python's stable API [PEP-384]_:
>>
>> .. code-block:: C
>>
>>     static struct PyArrayMethodDef slots[] = {
>>         {NPY_dt_method, method_implementation},
>>         ...,
>>         {0, NULL}
>>     };
>>
>>     typedef struct {
>>         PyTypeObject *typeobj;  /* type of python scalar */
>>         ...;
>>         PyType_Slot *slots;
>>     } PyArrayDTypeMeta_Spec;
>>
>>     PyObject* PyArray_InitDTypeMetaFromSpec(
>>             PyArray_DTypeMeta *user_dtype,
>>             PyArrayDTypeMeta_Spec *dtype_spec);
>>
>> The C-side slots should be designed to mirror Python-side methods
>> such as ``dtype.__dtype_method__``, although the exposure to Python
>> is a later step in the implementation, to reduce the complexity of
>> the initial implementation.
>>
>>
>> C-API Changes to the UFunc Machinery (4)
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> Proposed changes to the UFunc machinery will be part of NEP 43.
>> However, the following changes will be necessary (see NEP 40 for a
>> detailed description of the current implementation and its issues):
>>
>> * The current UFunc type resolution must be adapted to allow better
>> control for user-defined dtypes as well as resolve current
>> inconsistencies.
>> * The inner-loop used in UFuncs must be expanded to include a
>> return value. Further, error reporting must be improved, and
>> passing in dtype-specific information enabled.
>> This requires the modification of the inner-loop function
>> signature and addition of new hooks called before and after the
>> inner-loop is used.
>>
>> An important goal for any changes to the universal functions will
>> be to allow the reuse of existing loops.
>> It should be easy for a new units datatype to fall back to existing
>> math functions after handling the unit-related computations.
>>
>>
>> Discussion
>> ----------
>>
>> See NEP 40 for a list of previous meetings and discussions.
>>
>>
>> References
>> ----------
>>
>> .. [pandas_extension_arrays]
>> https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types
>>
>> .. _xarray_dtype_issue: https://github.com/pydata/xarray/issues/1262
>>
>> .. [pygeos] https://github.com/caspervdw/pygeos
>>
>> .. [new_sort] https://github.com/numpy/numpy/pull/12945
>>
>> .. [PEP-384] https://www.python.org/dev/peps/pep-0384/
>>
>> .. [PR 15508] https://github.com/numpy/numpy/pull/15508
>>
>>
>> Copyright
>> ---------
>>
>> This document has been placed in the public domain.
>>
>>
>> Acknowledgments
>> ---------------
>>
>> The effort to create new datatypes for NumPy has been discussed for
>> several years in many different contexts and settings, making it
>> impossible to list everyone involved.
>> We would like to thank especially Stephan Hoyer, Nathaniel Smith,
>> and Eric Wieser for repeated in-depth discussion about datatype
>> design.
>> We are very grateful for the community input in reviewing and
>> revising this NEP and would like to thank especially Ross Barnowski
>> and Ralf Gommers.
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion at python.org
https://mail.python.org/mailman/listinfo/numpy-discussion

From teoliphant at gmail.com  Sat Mar 21 16:58:05 2020
From: teoliphant at gmail.com (Travis Oliphant)
Date: Sat, 21 Mar 2020 15:58:05 -0500
Subject: [Numpy-discussion] Proposal: NEP 41 -- First step towards a new
 Datatype System
In-Reply-To: <3ee04012570e418a874e4c436b5b6f8ac5f56cdb.camel@sipsolutions.net>
References: <3ee04012570e418a874e4c436b5b6f8ac5f56cdb.camel@sipsolutions.net>
Message-ID: 

Thanks for publicizing this and all the work that has gone into getting
this far.

I'm extremely supportive of the foundational DType meta-type and making
dtypes classes. This was the epiphany I had in 2015 that led me to
experiment with xnd and later mtypes. I have not had the funding to work
on it much since that time directly.

But, this is the right way to connect the data type system with the rest
of Python typing. NumPy's dtypes are currently analogous to Python 1's
user-defined classes. In Python 1 *all* user-defined classes were
instances of a single Class Type at the C-level, just like currently all
NumPy dtypes are instances of a single Dtype "Type" in Python.

Shifting Dtypes to be true types (by making them instances of a single
low-level MetaType) is (IMHO) exactly the right approach. Doing this
first while trying to minimize other changes will help a lot. I'm very
excited by the work being done in this direction.

I can appreciate the desire to be cautious on some of the other issues
(like removing numpy array scalars). I do still think that eventually
removing numpy array scalars in lieu of instances of dtype objects will
be a less complex approach, and am not generally sold on the reasons
listed in the NEP (though I can appreciate that it's not something to do
as part of *this* NEP), as getting there might take more effort than
desired at this point.

What I would *strongly* recommend right now, however, is to make the new
NumPy dtype system a separately-installable module (kept in the NumPy
GitHub organization). In that way, people can depend on the NumPy type
system without depending on NumPy itself. I think this will become more
and more important in the future. It will help the design as you see
NumPy as one of many *consumers* of the type system instead of the only
one.
It would also help projects like arrow and xnd and others in the future
that might only want to depend on NumPy's type system but otherwise
implement their own computations.

This might require a little more work to provide an adaptor layer in
NumPy itself to use the new system instead of its current dtypes, but I
think it will also help ensure that the datatype API is cleaner and more
useful to the Python ecosystem as a whole.

Thanks,

-Travis


On Wed, Mar 11, 2020 at 7:08 PM Sebastian Berg 
wrote:

> Hi all,
>
> I am pleased to propose NEP 41: First step towards a new Datatype
> System https://numpy.org/neps/nep-0041-improved-dtype-support.html
>
> This NEP motivates the larger restructure of the datatype machinery in
> NumPy and defines a few fundamental design aspects. The long term user
> impact will be allowing easier and more fully featured user-defined
> datatypes.
>
> As this is a large restructure, the NEP represents only the first
> steps, with some additional information in further NEPs being drafted
> [1] (this may be helpful to look at depending on the level of detail
> you are interested in).
> The NEP itself does not propose to add significant new public API.
> Instead it proposes to move forward with an incremental internal
> refactor and lays the foundation for this process.
>
> The main user-facing change at this time is that datatypes will become
> classes (e.g. ``type(np.dtype("float64"))`` will be a float64-specific
> class).
> For most users, the main impact should be many new datatypes in the
> long run (see the user impact section). However, for those interested
> in API design within NumPy or with respect to implementing new
> datatypes, this and the following NEPs are important decisions in the
> future roadmap for NumPy.
>
> The current full text is reproduced below, although the above link is
> probably a better way to read it.
>
> Cheers
>
> Sebastian
>
>
> [1] NEP 40 gives some background information about the current system
> and the issues with it:
> https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac/doc/neps/nep-0040-legacy-datatype-impl.rst
> and NEP 42 gives a first draft of how the new API may look:
> https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst
> (links to current rendered versions, check
> https://github.com/numpy/numpy/pull/15505 and
> https://github.com/numpy/numpy/pull/15507 for updates)
>
>
> ----------------------------------------------------------------------
>
>
> =================================================
> NEP 41 — First step towards a new Datatype System
> =================================================
>
> :title: Improved Datatype Support
> :Author: Sebastian Berg
> :Author: Stéfan van der Walt
> :Author: Matti Picus
> :Status: Draft
> :Type: Standard Track
> :Created: 2020-02-03
>
>
> .. note::
>
>     This NEP is part of a series of NEPs; NEP 40 gives first
>     information about the previous dtype implementation and the issues
>     with it.
>     NEP 41 (this document) then provides an overview and generic
>     design choices for the refactor.
>     Further NEPs 42 and 43 go into the technical details of the
>     datatype and universal function related internal and external API
>     changes.
>     In some cases it may be necessary to consult the other NEPs for a
>     full picture of the desired changes and why these changes are
>     necessary.
>
>
> Abstract
> --------
>
> `Datatypes ` in NumPy describe how to interpret each
> element in arrays. NumPy provides ``int``, ``float``, and ``complex``
> numerical types, as well as string, datetime, and structured datatype
> capabilities.
> The growing Python community, however, has need for more diverse
> datatypes.
> Examples are datatypes with unit information attached (such as meters)
> or categorical datatypes (fixed set of possible values).
> However, the current NumPy datatype API is too limited to allow the
> creation of these.
>
> This NEP is the first step to enable such growth; it will lead to
> a simpler development path for new datatypes.
> In the long run the new datatype system will also support the creation
> of datatypes directly from Python rather than C.
> Refactoring the datatype API will improve maintainability and
> facilitate development of both user-defined external datatypes and new
> features for existing datatypes internal to NumPy.
>
>
> Motivation and Scope
> --------------------
>
> .. seealso::
>
>     The user impact section includes examples of what kind of new
>     datatypes will be enabled by the proposed changes in the long run.
>     It may thus help to read these sections out of order.
>
> Motivation
> ^^^^^^^^^^
>
> One of the main issues with the current API is the definition of
> typical functions such as addition and multiplication for parametric
> datatypes (see also NEP 40), which require additional steps to
> determine the output type.
> For example, when adding two strings of length 4, the result is a
> string of length 8, which is different from the input.
> Similarly, a datatype which embeds a physical unit must calculate the
> new unit information: dividing a distance by a time results in a
> speed.
> A related difficulty is that the :ref:`current casting rules
> <_ufuncs.casting>` -- the conversion between different datatypes --
> cannot describe casting for such parametric datatypes implemented
> outside of NumPy.
>
> This additional functionality for supporting parametric datatypes
> introduces increased complexity within NumPy itself,
> and furthermore is not available to external user-defined datatypes.
> In general the concerns of different datatypes are not
> well-encapsulated.
> This burden is exacerbated by the exposure of internal C structures,
> limiting the addition of new fields
> (for example to support new sorting methods [new_sort]_).
>
> Currently there are many factors which limit the creation of new
> user-defined datatypes:
>
> * Creating casting rules for parametric user-defined dtypes is either
> impossible or so complex that it has never been attempted.
> * Type promotion, e.g. the operation deciding that adding float and
> integer values should return a float value, is very valuable for
> numeric datatypes but is limited in scope for user-defined and
> especially parametric datatypes.
> * Much of the logic (e.g. promotion) is written in single functions
> instead of being split as methods on the datatype itself.
> * In the current design datatypes cannot have methods that do not
> generalize to other datatypes. For example a unit datatype cannot have
> a ``.to_si()`` method to easily find the datatype which would
> represent the same values in SI units.
>
> The pressing need to solve these issues has driven the scientific
> community to create work-arounds in multiple projects implementing
> physical units as an array-like class instead of a datatype, which
> would generalize better across multiple array-likes (Dask, pandas,
> etc.).
> Already, Pandas has made a push in the same direction with its
> extension arrays [pandas_extension_arrays]_ and undoubtedly
> the community would be best served if such new features could be
> common between NumPy, Pandas, and other projects.
>
> Scope
> ^^^^^
>
> The proposed refactoring of the datatype system is a large undertaking
> and thus is proposed to be split into various phases, roughly:
>
> * Phase I: Restructure and extend the datatype infrastructure (This
> NEP 41)
> * Phase II: Incrementally define or rework API (Detailed largely in
> NEPs 42/43)
> * Phase III: Growth of NumPy and Scientific Python Ecosystem
> capabilities.
>
> For a more detailed accounting of the various phases, see
> "Plan to Approach the Full Refactor" in the Implementation section
> below.
> This NEP proposes to move ahead with the necessary creation of new
> dtype subclasses (Phase I), and start working on implementing current
> functionality.
> Within the context of this NEP all development will use fully private
> API or preliminary underscored names which must be changed in the
> future.
> Most of the internal and public API choices are part of a second Phase
> and will be discussed in more detail in the following NEPs 42 and 43.
> The initial implementation of this NEP will have little or no effect
> on users, but provides the necessary groundwork for incrementally
> addressing the full rework.
>
> The implementation of this NEP, and the implied large rework of how
> datatypes are defined in NumPy which follows, is expected to create
> small incompatibilities (see backward compatibility section).
> However, a transition requiring large code adaptation is not
> anticipated and not within scope.
>
> Specifically, this NEP makes the following design choices, which are
> discussed in more detail in the detailed description section:
>
> 1. Each datatype will be an instance of a subclass of ``np.dtype``,
> with most of the datatype-specific logic being implemented
> as special methods on the class. In the C-API, these correspond to
> specific slots. In short, for ``f = np.dtype("f8")``,
> ``isinstance(f, np.dtype)`` will remain true, but ``type(f)`` will be
> a subclass of ``np.dtype`` rather than just ``np.dtype`` itself.
> The ``PyArray_ArrFuncs``, which are currently stored as a pointer on
> the instance (as ``PyArray_Descr->f``), should instead be stored on
> the class, as typically done in Python.
> In the future these may correspond to Python-side dunder methods.
> Storage information such as itemsize and byteorder can differ between
> different dtype instances (e.g. "S3" vs. "S8") and will remain part of
> the instance.
> This means that in the long run the current lowlevel access to dtype
> methods will be removed (see ``PyArray_ArrFuncs`` in NEP 40).
>
> 2. The current NumPy scalars will *not* change: they will not be
> instances of datatypes. This will also be true for new datatypes;
> scalars will not be instances of a dtype (although
> ``isinstance(scalar, dtype)`` may be made to return ``True`` when
> appropriate).
>
> Detailed technical decisions to follow in NEP 42.
>
> Further, the public API will be designed in a way that is extensible
> in the future:
>
> 3. All new C-API functions provided to the user will hide
> implementation details as much as possible. The public API should be
> an identical, but limited, version of the C-API used for the internal
> NumPy datatypes.
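>
> To illustrate point 2 above, the scalar/datatype distinction can
> already be seen in current NumPy; the following is plain, existing
> behaviour, not new API::
>
> >>> import numpy as np
> >>> scalar = np.float64(1.0)    # a scalar: carries a value
> >>> dt = np.dtype(np.float64)   # a dtype: carries storage information
> >>> dt.byteorder  # a storage detail, meaningless for the scalar's value
> '='
> >>> isinstance(scalar, np.dtype)  # scalars are not dtype instances
> False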
>
> The changes to the datatype system in Phase II must include a large
> refactor of the UFunc machinery, which will be further defined in
> NEP 43:
>
> 4. To enable all of the desired functionality for new user-defined
> datatypes, the UFunc machinery will be changed to replace the current
> dispatching and type resolution system.
> The old system should be *mostly* supported as a legacy version for
> some time.
>
> Additionally, as a general design principle, the addition of new
> user-defined datatypes will *not* change the behaviour of programs.
> For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or
> ``b`` know that ``c`` exists.
>
>
> User Impact
> -----------
>
> The current ecosystem has very few user-defined datatypes using NumPy,
> the two most prominent being: ``rational`` and ``quaternion``.
> These represent fairly simple datatypes which are not strongly
> impacted by the current limitations.
> However, we have identified a need for datatypes such as:
>
> * bfloat16, used in deep learning
> * categorical types
> * physical units (such as meters)
> * datatypes for tracing/automatic differentiation
> * high, fixed precision math
> * specialized integer types such as int2, int24
> * new, better datetime representations
> * extending e.g. integer dtypes to have a sentinel NA value
> * geometrical objects [pygeos]_
>
> Some of these are partially solved; for example unit capability is
> provided in ``astropy.units``, ``unyt``, or ``pint``, as
> `numpy.ndarray` subclasses.
> Most of these datatypes, however, simply cannot be reasonably defined
> right now.
> An advantage of having such datatypes in NumPy is that they should
> integrate seamlessly with other array or array-like packages such as
> Pandas, ``xarray`` [xarray_dtype_issue]_, or ``Dask``.
>
> The long term user impact of implementing this NEP will be to allow
> both the growth of the whole ecosystem by having such new datatypes,
> as well as consolidating implementation of such datatypes within NumPy
> to achieve better interoperability.
>
>
> Examples
> ^^^^^^^^
>
> The following examples represent future user-defined datatypes we wish
> to enable.
> These datatypes are not part of the NEP, and the choices made (e.g. of
> casting rules) are possibilities we wish to enable; they do not
> represent recommendations.
>
> Simple Numerical Types
> """"""""""""""""""""""
>
> Mainly used where memory is a consideration, lower-precision numeric
> types such as `bfloat16
> <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_
> are common in other computational frameworks.
> For these types the definitions of things such as ``np.common_type``
> and ``np.can_cast`` are some of the most important interfaces. Once
> they support ``np.common_type``, it is (for the most part) possible to
> find the correct ufunc loop to call, since most ufuncs -- such as add
> -- effectively only require ``np.result_type``::
>
> >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)
>
> and `~numpy.result_type` is largely identical to
> `~numpy.common_type`.
>
>
> Fixed, high precision math
> """"""""""""""""""""""""""
>
> Allowing arbitrary precision or higher precision math is important in
> simulations. For instance ``mpmath`` defines a precision::
>
> >>> import mpmath as mp
> >>> print(mp.dps)  # the current (default) precision
> 15
>
> NumPy should be able to construct a native, memory-efficient array
> from a list of ``mpmath.mpf`` floating point objects::
>
> >>> arr_15_dps = np.array(mp.arange(3))  # (mp.arange returns a list)
> >>> print(arr_15_dps)  # Must find the correct precision from the objects:
> array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15])
>
> We should also be able to specify the desired precision when
> creating the datatype for an array. Here, we use ``np.dtype[mp.mpf]``
> to find the DType class (the notation is not part of this NEP),
> which is then instantiated with the desired parameter.
> This could also be written as an ``MpfDType`` class::
>
> >>> arr_100_dps = np.array([1, 2, 3], dtype=np.dtype[mp.mpf](dps=100))
> >>> print(arr_15_dps + arr_100_dps)
> array(['0.0', '2.0', '4.0'], dtype=mpf[dps=100])
>
> The ``mpf`` datatype can decide that the result of the operation
> should be the higher precision one of the two, so it uses a precision
> of 100.
> Furthermore, we should be able to define casting, for example as in::
>
> >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype, casting="safe")
> True
> >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype, casting="safe")
> False  # loses precision
> >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype, casting="same_kind")
> True
>
> Casting from float is probably always at least a ``same_kind`` cast,
> but in general, it is not safe::
>
> >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4), casting="safe")
> False
>
> since a float64 has a higher precision than the ``mpf`` datatype with
> ``dps=4``.
>
> Alternatively, we can say that::
>
> >>> np.common_type(np.dtype[mp.mpf](dps=5), np.dtype[mp.mpf](dps=10))
> np.dtype[mp.mpf](dps=10)
>
> And possibly even::
>
> >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64)
> np.dtype[mp.mpf](dps=16)  # equivalent precision to float64 (I believe)
>
> since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)``
> safely.
>
>
> Categoricals
> """"""""""""
>
> Categoricals are interesting in that they can have fixed, predefined
> values, or can be dynamic with the ability to modify categories when
> necessary.
> Fixed categories (defined ahead of time) are the most straightforward
> categorical definition.
> Categoricals are *hard*, since there are many strategies to implement
> them, suggesting NumPy should only provide the scaffolding for
> user-defined categorical types. For instance::
>
> >>> cat = Categorical(["eggs", "spam", "toast"])
> >>> breakfast = array(["eggs", "spam", "eggs", "toast"], dtype=cat)
>
> could store the array very efficiently, since it knows that there are
> only 3 categories.
> Since a categorical in this sense knows almost nothing about the data
> stored in it, few operations make sense, although equality does::
>
> >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"], dtype=cat)
> >>> breakfast == breakfast2
> array([True, False, True, False])
>
> The categorical datatype could work like a dictionary: no two item
> names can be equal (checked on dtype creation), so that the equality
> operation above can be performed very efficiently.
> If the values define an order, the category labels (internally
> integers) could be ordered the same way to allow efficient sorting and
> comparison.
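>
> To make the storage idea concrete, here is a minimal pure-Python
> sketch of the integer-encoding strategy such a categorical datatype
> could use internally (all names are hypothetical; nothing here is
> proposed API)::
>
> >>> import numpy as np
> >>> categories = ["eggs", "spam", "toast"]
> >>> codes = {c: i for i, c in enumerate(categories)}
> >>> breakfast = np.array(
> ...     [codes[c] for c in ["eggs", "spam", "eggs", "toast"]],
> ...     dtype=np.int8)
> >>> breakfast2 = np.array([codes["eggs"]] * 4, dtype=np.int8)
> >>> breakfast == breakfast2  # equality compares small integers only
> array([ True, False,  True, False])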
>
> Whether or not casting is defined from one categorical with less to
> one with strictly more values defined is something that the
> Categorical datatype would need to decide. Both options should be
> available.
>
>
> Unit on the Datatype
> """"""""""""""""""""
>
> There are different ways to define units, depending on how the
> internal machinery would be organized; one way is to have a single
> Unit datatype for every existing numerical type.
> This will be written as ``Unit[float64]``; the unit itself is part of
> the DType instance: ``Unit[float64]("m")`` is a ``float64`` with
> meters attached::
>
> >>> from astropy import units
> >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m  # meters
> >>> print(meters)
> array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
>
> Note that units are a bit tricky. It is debatable whether::
>
> >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
>
> should be valid syntax (coercing the float scalars without a unit to
> meters).
> Once the array is created, math will work without any issue::
>
> >>> meters / (2 * units.s)
> array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s"))
>
> Casting is not valid from one unit to the other, but can be valid
> between different scales of the same dimensionality (although this may
> be "unsafe")::
>
> >>> meters.astype(Unit[float64]("s"))
> TypeError: Cannot cast meters to seconds.
> >>> meters.astype(Unit[float64]("km"))
> >>> # Convert to centimeter-gram-second (cgs) units:
> >>> meters.astype(meters.dtype.to_cgs())
>
> The above notation is somewhat clumsy. Functions
> could be used instead to convert between units.
> There may be ways to make these more convenient, but those must be
> left for future discussions::
>
> >>> units.convert(meters, "km")
> >>> units.to_cgs(meters)
>
> There are some open questions. For example, whether additional methods
> on the array object could exist to simplify some of the notions, and
> how these would percolate from the datatype to the ``ndarray``.
>
> The interaction with other scalars would likely be defined through::
>
> >>> np.common_type(np.float64, Unit)
> Unit[np.float64](dimensionless)
>
> Ufunc output datatype determination can be more involved than for
> simple numerical dtypes since there is no "universal" output type::
>
> >>> np.multiply(meters, seconds).dtype != np.result_type(meters, seconds)
>
> In fact ``np.result_type(meters, seconds)`` must error without context
> of the operation being done.
> This example highlights how the specific ufunc loop
> (a loop with known, specific DTypes as inputs) has to be able to make
> certain decisions before the actual calculation can start.
>
>
>
> Implementation
> --------------
>
> Plan to Approach the Full Refactor
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> To address these issues in NumPy and enable new datatypes,
> multiple development stages are required:
>
> * Phase I: Restructure and extend the datatype infrastructure (This
> NEP)
>
> * Organize Datatypes like normal Python classes [`PR 15508`]_
>
> * Phase II: Incrementally define or rework API
>
> * Create a new and easily extensible API for defining new datatypes
> and related functionality. (NEP 42)
>
> * Incrementally define all necessary functionality through the new API
> (NEP 42):
>
> * Defining operations such as ``np.common_type``.
> * Defining casting between datatypes.
> * Adding the functionality necessary to create a numpy array from
> Python scalars (i.e. ``np.array(...)``).
> * ?
>
> * Restructure how universal functions work (NEP 43), in order to:
>
> * make it possible to allow a `~numpy.ufunc` such as ``np.add`` to be
> extended by user-defined datatypes such as Units.
>
> * allow efficient lookup for the correct implementation for
> user-defined datatypes.
>
> * enable reuse of existing code. Units should be able to use the
> normal math loops and add additional logic to determine output type.
>
> * Phase III: Growth of NumPy and Scientific Python Ecosystem
> capabilities:
>
> * Cleanup of legacy behaviour where it is considered buggy or
> undesirable.
> * Provide a path to define new datatypes from Python.
> * Assist the community in creating types such as Units or
> Categoricals.
> * Allow strings to be used in functions such as ``np.equal`` or
> ``np.add``.
> * Remove legacy code paths within NumPy to improve long term
> maintainability.
>
> This document serves as a basis for Phase I and provides the vision
> and motivation for the full project.
> Phase I does not introduce any new user-facing features,
> but is concerned with the necessary conceptual cleanup of the current
> datatype system.
> It provides a more "pythonic" datatype system, with datatypes as
> Python type objects in a clear class hierarchy.
>
> The second phase is the incremental creation of all APIs necessary to
> define fully featured datatypes and the reorganization of the NumPy
> datatype system.
> This phase will thus be primarily concerned with defining an,
> initially preliminary, stable public API.
>
> Some of the benefits of a large refactor may only become evident after
> the full deprecation of the current legacy implementation (i.e. larger
> code removals).
> However, these steps are necessary for improvements to many parts of
> the core NumPy API, and are expected to make the implementation
> generally easier to understand.
>
> The following figure illustrates the proposed design at a high level,
> and roughly delineates the components of the overall design.
> Note that this NEP only covers Phase I (shaded area); the rest
> encompasses Phase II, whose design choices are still up for
> discussion. The figure nevertheless highlights that the DType class is
> the central, necessary concept:
>
> .. image:: _static/nep-0041-mindmap.svg
>
>
> First steps directly related to this NEP
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The changes required in NumPy are large and touch many areas of the
> code base, but many of these changes can be addressed incrementally.
>
> To enable an incremental approach we will start by creating a C
> defined ``PyArray_DTypeMeta`` class with its instances being the
> ``DType`` classes, subclasses of ``np.dtype``.
> This is necessary to add the ability to store custom slots on the
> DType in C.
> This ``DTypeMeta`` will be implemented first, to then enable the
> incremental restructuring of current code.
>
> The addition of ``DType`` will then enable addressing other changes
> incrementally, some of which may begin before settling the full
> internal API:
>
> 1. New machinery for array coercion, with the goal of enabling user
> DTypes with appropriate class methods.
> 2. The replacement or wrapping of the current casting machinery.
> 3. Incremental redefinition of the current ``PyArray_ArrFuncs`` slots
> into DType method slots.
>
> At this point, no or only very limited new public API will be added
> and the internal API is considered to be in flux.
> Any new public API may be set up to give warnings and will have
> leading underscores to indicate that it is not finalized and can be
> changed without warning.
>
>
> Backward compatibility
> ----------------------
>
> While the actual backward compatibility impact of implementing Phase I
> and II is not yet fully clear, we anticipate and accept the following
> changes:
>
> * **Python API**:
>
> * ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``, while
> right now ``type(np.dtype("f8")) is np.dtype``.
> Code should use ``isinstance`` checks, and in very rare cases may have
> to be adapted to use it.
>
> * **C-API**:
>
> * In old versions of NumPy ``PyArray_DescrCheck`` is a macro which
> uses ``type(dtype) is np.dtype``. When compiling against an old NumPy
> version, the macro may have to be replaced with the corresponding
> ``PyObject_IsInstance`` call. (If this is a problem, we could backport
> fixing the macro.)
>
> * The UFunc machinery changes will break *limited* parts of the
> current implementation. Replacing e.g. the default ``TypeResolver`` is
> expected to remain supported for a time, although optimized masked
> inner loop iteration (which is not even used *within* NumPy) will no
> longer be supported.
>
> * All functions currently defined on the dtypes, such as
> ``PyArray_Descr->f->nonzero``, will be defined and accessed
> differently. This means that in the long run lowlevel access code will
> have to be changed to use the new API. Such changes are expected to be
> necessary in very few projects.
>
> * **dtype implementors (C-API)**:
>
> * The array which is currently provided to some functions (such as
> cast functions) will no longer be provided.
> For example ``PyArray_Descr->f->nonzero`` or
> ``PyArray_Descr->f->copyswapn`` may instead receive a dummy array
> object with only some fields (mainly the dtype) being valid.
> At least in some code paths, a similar mechanism is already used.
>
> * The ``scalarkind`` slot and registration of scalar casting will be
> removed/ignored without replacement.
> It currently allows partial value-based casting.
> The ``PyArray_ScalarKind`` function will continue to work for builtin
> types, but will not be used internally and will be deprecated.
>
> * Currently user dtypes are defined as instances of ``np.dtype``.
> The creation works by the user providing a prototype instance.
> NumPy will need to modify at least the type during registration.
> This has no effect for either ``rational`` or ``quaternion`` and
> mutation of the structure seems unlikely after registration.
>
> Since there is a fairly large API surface concerning datatypes,
> further changes, or the limitation of certain functions to currently
> existing datatypes, are likely to occur.
> For example functions which use the type number as input
> should be replaced with functions taking DType classes instead.
> Although public, large parts of this C-API seem to be used rarely,
> possibly never, by downstream projects.
>
>
>
> Detailed Description
> --------------------
>
> This section details the design decisions covered by this NEP.
> The subsections correspond to the list of design choices presented
> in the Scope section.
>
> Datatypes as Python Classes (1)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The current NumPy datatypes are not full-scale Python classes.
> They are instead (prototype) instances of a single ``np.dtype`` class.
> Changing this means that any special handling, e.g.
> for ``datetime``, can be moved to the Datetime DType class instead,
> away from monolithic general code (e.g. the current
> ``PyArray_AdjustFlexibleDType``).
>
> The main consequence of this change with respect to the API is that
> special methods move from the dtype instances to methods on the new
> DType class.
> This is the typical design pattern used in Python.
> Organizing these methods and information in a more Pythonic way
> provides a solid foundation for refining and extending the API in the
> future.
> The current API cannot be extended due to how it is exposed publicly.
> This means for example that the methods currently stored in
> ``PyArray_ArrFuncs`` on each datatype (see NEP 40) will be defined
> differently in the future and deprecated in the long run.
>
> The most prominent visible side effect of this will be that
> ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore.
> Instead it will be a subclass of ``np.dtype``, meaning that
> ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true.
> This will also add the ability to use ``isinstance(dtype,
> np.dtype[float64])``, thus removing the need to use ``dtype.kind``,
> ``dtype.char``, or ``dtype.type`` to do this check.
>
> With the design decision of DTypes as full-scale Python classes,
> the question of subclassing arises.
> Inheritance, however, appears problematic and a complexity best
> avoided (at least initially) for container datatypes.
> Further, subclasses may be more interesting for interoperability (for
> example with GPU backends such as CuPy storing additional GPU-related
> methods) than as a mechanism to define new datatypes.
> A class hierarchy does provide value; this may be achieved by allowing
> the creation of *abstract* datatypes.
> An example of an abstract datatype would be the datatype equivalent of
> ``np.floating``, representing any floating point number.
> These can serve the same purpose as Python's abstract base classes.
>
>
> Scalars should not be instances of the datatypes (2)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> For simple datatypes such as ``float64`` (see also below), it seems
> tempting that the instance of a ``np.dtype("float64")`` can be the
> scalar.
> This idea may be even more appealing due to the fact that scalars,
> rather than datatypes, currently define a useful type hierarchy.
>
> However, we have specifically decided against this for a number of
> reasons.
> First, the new datatypes described herein would be instances of DType
> classes.
> Making these instances themselves classes, while possible, adds
> additional complexity that users need to understand.
> It would also mean that scalars must have storage information (such as
> byteorder) which is generally unnecessary and currently is not used.
> Second, while the simple NumPy scalars such as ``float64`` may be such
> instances, it should be possible to create datatypes for Python
> objects without enforcing NumPy as a dependency.
> However, Python objects that do not depend on NumPy cannot be
> instances of a NumPy DType.
> Third, there is a mismatch between the methods and attributes which
> are useful for scalars and datatypes. For instance ``to_float()``
> makes sense for a scalar but not for a datatype, and ``newbyteorder``
> is not useful on a scalar (or has a different meaning).
>
> Overall, it seems that rather than reducing the complexity, i.e. by
> merging the two distinct type hierarchies, making scalars instances of
> DTypes would increase the complexity of both the design and
> implementation.
>
> A possible future path may be to instead simplify the current NumPy
> scalars to be much simpler objects which largely derive their
> behaviour from the datatypes.
>
> C-API for creating new Datatypes (3)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The current C-API with which users can create new datatypes
> is limited in scope, and requires use of "private" structures. This
> means the API is not extensible: no new members can be added to the
> structure without losing binary compatibility.
> This has already limited the inclusion of new sorting methods into
> NumPy [new_sort]_.
>
> The new version shall thus replace the current ``PyArray_ArrFuncs``
> structure used to define new datatypes.
> Datatypes that currently exist and are defined using these slots will
> be supported during a deprecation period.
>
> The most likely solution to hide the implementation from the user, and
> thus make it extensible in the future, is to model the API after
> Python's stable API [PEP-384]_:
>
> .. code-block:: C
>
>     static struct PyArrayMethodDef slots[] = {
>         {NPY_dt_method, method_implementation},
>         ...,
>         {0, NULL}
>     };
>
>     typedef struct {
>         PyTypeObject *typeobj;  /* type of python scalar */
>         ...;
>         PyType_Slot *slots;
>     } PyArrayDTypeMeta_Spec;
>
>     PyObject* PyArray_InitDTypeMetaFromSpec(
>             PyArray_DTypeMeta *user_dtype,
>             PyArrayDTypeMeta_Spec *dtype_spec);
>
> The C-side slots should be designed to mirror Python-side methods
> such as ``dtype.__dtype_method__``, although the exposure to Python is
> a later step in the implementation, to reduce the complexity of the
> initial implementation.
>
>
> C-API Changes to the UFunc Machinery (4)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Proposed changes to the UFunc machinery will be part of NEP 43.
> However, the following changes will be necessary (see NEP 40 for a
> detailed description of the current implementation and its issues):
>
> * The current UFunc type resolution must be adapted to allow better
> control for user-defined dtypes as well as resolve current
> inconsistencies.
> * The inner-loop used in UFuncs must be expanded to include a return
> value. Further, error reporting must be improved, and passing in
> dtype-specific information enabled.
> This requires the modification of the inner-loop function signature
> and addition of new hooks called before and after the inner-loop is
> used.
>
> An important goal for any changes to the universal functions will be
> to allow the reuse of existing loops.
> It should be easy for a new units datatype to fall back to existing
> math functions after handling the unit-related computations.
>
>
> Discussion
> ----------
>
> See NEP 40 for a list of previous meetings and discussions.
>
>
> References
> ----------
>
> .. [pandas_extension_arrays]
> https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types
>
> .. _xarray_dtype_issue: https://github.com/pydata/xarray/issues/1262
>
> .. [pygeos] https://github.com/caspervdw/pygeos
>
> .. [new_sort] https://github.com/numpy/numpy/pull/12945
>
> .. [PEP-384] https://www.python.org/dev/peps/pep-0384/
>
> .. [PR 15508] https://github.com/numpy/numpy/pull/15508
>
>
> Copyright
> ---------
>
> This document has been placed in the public domain.
>
>
> Acknowledgments
> ---------------
>
> The effort to create new datatypes for NumPy has been discussed for
> several years in many different contexts and settings, making it
> impossible to list everyone involved.
> We would like to thank especially Stephan Hoyer, Nathaniel Smith, and
> Eric Wieser for repeated in-depth discussion about datatype design.
> We are very grateful for the community input in reviewing and revising
> this NEP and would like to thank especially Ross Barnowski and Ralf
> Gommers.
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sebastian at sipsolutions.net  Sun Mar 22 14:27:25 2020
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Sun, 22 Mar 2020 13:27:25 -0500
Subject: [Numpy-discussion] Proposal: NEP 41 -- First step towards a new
 Datatype System
In-Reply-To: 
References: <3ee04012570e418a874e4c436b5b6f8ac5f56cdb.camel@sipsolutions.net>
Message-ID: <162390781577dd2c2916f4dc600b65cf7f09abb1.camel@sipsolutions.net>

Hi,

thanks for the feedback!

On Sat, 2020-03-21 at 15:58 -0500, Travis Oliphant wrote:
> Thanks for publicizing this and all the work that has gone into
> getting this far.
>
> I'm extremely supportive of the foundational DType meta-type and
> making dtypes classes. This was the epiphany I had in 2015 that led
> me to experiment with xnd and later mtypes. I have not had the
> funding to work on it much since that time directly.

Right, I realize it is an old idea; if you have any references I am
missing (I am sure there are many), I am happy to add them.

> But, this is the right way to connect the data type system with the
> rest of Python typing. NumPy's dtypes are currently analogous to
> Python 1's user-defined classes. In Python 1 *all* user-defined
> classes were instances of a single Class Type at the C-level, just
> like currently all NumPy dtypes are instances of a single Dtype
> "Type" in Python.
>
> Shifting Dtypes to be true types (by making them instances of a
> single low-level MetaType) is (IMHO) exactly the right approach.
> Doing this first while trying to minimize other changes will help a
> lot. I'm very excited by the work being done in this direction.
>
> I can appreciate the desire to be cautious on some of the other
> issues (like removing numpy array scalars). I do still think that
> eventually removing numpy array scalars in lieu of instances of
> dtype objects will be a less complex approach and am not generally
> sold on the reasons listed in the NEP (though I can appreciate that
> it's not something to do as part of *this* NEP) as getting there
> might take more effort than desired at this point.

Well, I do think it is a pretty strong design decision here though. If
instances of DType classes are the actual dtypes (and not themselves
classes), then it seems strange if scalars are also (direct) instances
of the same DType class? Of course we can and probably will allow
`isinstance(scalar, DType)` to work in either case. I do not see a
problem with that, although I do not feel like making that decision
right now.

If we can agree on still going this direction for now I am happy of
course. Nothing stops us from amending or finding new solutions in the
future after all.

I used to love the idea, but to be honest, I currently do not see:

1. How to approach it. It would have to be within Python itself, or we
would need more shims for Python builtin types?
2. That it is actually helpful for users.

If we were designing a new programming language around array computing
principles, I do think that would be the approach I would want to
take/consider. But I simply lack the vision of how marrying the idea
with the scalar language Python would work out well...

> What I would *strongly* recommend right now, however, is to make the
> new NumPy dtype system a separately-installable module (kept in the
> NumPy GitHub organization). In that way, people can depend on the
> NumPy type system without depending on NumPy itself. I think this
> will become more and more important in the future. It will help the
> design as you see NumPy as one of many *consumers* of the type
> system instead of the only one. It would also help projects like
> arrow and xnd and others in the future that might only want to
> depend on NumPy's type system but otherwise implement their own
> computations.

Right, I agree that is the correct long term direction to see the
DTypes as distinct from the NumPy array, and maybe I should add that to
the NEP. What I am unsure about is the feasibility? If we develop it
outside of NumPy, it is harder to:

1. Use the new system without actually exposing it as public API in
order to incrementally replace the old with newer machinery.
2. It may require either exposing subclassing capabilities to NumPy to
add shims for legacy DTypes right from the start, or adding a bunch of
public API to that project which is only meant to be used within
NumPy?

I suppose I am also not sure that having it in NumPy (at least for now)
is actually all that bad? For array-likes it is probably not the
heaviest dependency (and it could be slimmed down into a core). Since
the intention is to dogfood the API as much as possible and to limit
the public API, it should be plausible to rip it out later of course. I
am sure that will be more overall effort, but I suppose I feel it is a
much more approachable effort.

One thing I would like is for projects such as CuPy to be able to
subclass DTypes at some point to tag on the GPU-aware things they need.
But in some sense the basic DTypes seem to require being tied in with
NumPy? They must be associated with the NumPy scalars, and the basic
methods defined for all DTypes (also user DTypes) will probably be
strided inner loops on the CPU (a toy sketch of what I mean by that
follows below).

> This might require a little more work to provide an adaptor layer in
> NumPy itself to use the new system instead of its current dtypes,
> but I think it will also help ensure that the datatype API is
> cleaner and more useful to the Python ecosystem as a whole.

While I fully agree with the sentiment, I suppose I am scared that the
"little more work" will end up being too much :(. We have pretty
limited resources and the most difficult work will not be writing the
DType API itself. It will be wrangling it into NumPy and the associated
huge review effort to get it right. Only by actually wrangling it into
NumPy, I think, can we also get the API fully right to begin with.

So, I am scared that moving development outside and trying to add the
more global scope at this time will make the NumPy side much more
difficult :(. Maybe not even because it is actually much trickier, but
again because it seems less tangible/approachable.
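As an aside, for the "strided inner loops" mentioned above, here is a
toy Python sketch of the concept for readers unfamiliar with the term
(this is not any real or proposed NumPy signature; the actual loops
are C functions operating on raw memory with byte strides):

    >>> def strided_unary_loop(n, src, src_stride, dst, dst_stride):
    ...     # Apply ``x + 1`` to n elements, stepping through the
    ...     # buffers by the given (element) strides.
    ...     for i in range(n):
    ...         dst[i * dst_stride] = src[i * src_stride] + 1.0
    >>> src = [1.0, 99.0, 2.0, 99.0, 3.0]  # stride 2 skips the 99s
    >>> dst = [0.0, 0.0, 0.0]
    >>> strided_unary_loop(3, src, 2, dst, 1)
    >>> dst
    [2.0, 3.0, 4.0]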
So, my main point here is that we have to make this large refactor as approachable as possible, and if that means that at some point someone has to spend a huge, but hopefully straight forward effort, to rip DTypes out of NumPy, I think that might be a worthy trade-off. Unless we can activate significantly larger resources very quickly. Best, Sebastian > > Thanks, > > -Travis > > > > > On Wed, Mar 11, 2020 at 7:08 PM Sebastian Berg < > sebastian at sipsolutions.net> > wrote: > > > Hi all, > > > > I am pleased to propose NEP 41: First step towards a new Datatype > > System https://numpy.org/neps/nep-0041-improved-dtype-support.html > > > > This NEP motivates the larger restructure of the datatype machinery > > in > > NumPy and defines a few fundamental design aspects. The long term > > user > > impact will be allowing easier and more rich featured user defined > > datatypes. > > > > As this is a large restructure, the NEP represents only the first > > steps > > with some additional information in further NEPs being drafted [1] > > (this may be helpful to look at depending on the level of detail > > you are > > interested in). > > The NEP itself does not propose to add significant new public API. > > Instead it proposes to move forward with an incremental internal > > refactor and lays the foundation for this process. > > > > The main user facing change at this time is that datatypes will > > become > > classes (e.g. ``type(np.dtype("float64"))`` will be a float64 > > specific > > class. > > For most users, the main impact should be many new datatypes in the > > long run (see the user impact section). However, for those > > interested > > in API design within NumPy or with respect to implementing new > > datatypes, this and the following NEPs are important decisions in > > the > > future roadmap for NumPy. > > > > The current full text is reproduced below, although the above link > > is > > probably a better way to read it. > > > > Cheers > > > > Sebastian > > > > > > [1] NEP 40 gives some background information about the current > > systems > > and issues with it: > > > > https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac/doc/neps/nep-0040-legacy-datatype-impl.rst > > and NEP 42 being a first draft of how the new API may look like: > > > > > > https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst > > (links to current rendered versions, check > > https://github.com/numpy/numpy/pull/15505 and > > https://github.com/numpy/numpy/pull/15507 for updates) > > > > > > ----------------------------------------------------------------- > > ----- > > > > > > ================================================= > > NEP 41 ? First step towards a new Datatype System > > ================================================= > > > > :title: Improved Datatype Support > > :Author: Sebastian Berg > > :Author: St?fan van der Walt > > :Author: Matti Picus > > :Status: Draft > > :Type: Standard Track > > :Created: 2020-02-03 > > > > > > .. note:: > > > > This NEP is part of a series of NEPs encompassing first > > information > > about the previous dtype implementation and issues with it in > > NEP 40. > > NEP 41 (this document) then provides an overview and generic > > design > > choices for the refactor. > > Further NEPs 42 and 43 go into the technical details of the > > datatype > > and universal function related internal and external API > > changes. 
> >     In some cases it may be necessary to consult the other NEPs for a
> >     full picture of the desired changes and why these changes are
> >     necessary.
> >
> >
> > Abstract
> > --------
> >
> > Datatypes in NumPy describe how to interpret each element in arrays.
> > NumPy provides ``int``, ``float``, and ``complex`` numerical types, as
> > well as string, datetime, and structured datatype capabilities.
> > The growing Python community, however, has need for more diverse
> > datatypes.
> > Examples are datatypes with unit information attached (such as meters)
> > or categorical datatypes (fixed set of possible values).
> > However, the current NumPy datatype API is too limited to allow the
> > creation of these.
> >
> > This NEP is the first step to enable such growth; it will lead to a
> > simpler development path for new datatypes.
> > In the long run the new datatype system will also support the creation
> > of datatypes directly from Python rather than C.
> > Refactoring the datatype API will improve maintainability and
> > facilitate development of both user-defined external datatypes, as
> > well as new features for existing datatypes internal to NumPy.
> >
> >
> > Motivation and Scope
> > --------------------
> >
> > .. seealso::
> >
> >     The user impact section includes examples of what kind of new
> >     datatypes will be enabled by the proposed changes in the long run.
> >     It may thus help to read these sections out of order.
> >
> > Motivation
> > ^^^^^^^^^^
> >
> > One of the main issues with the current API is the definition of
> > typical functions such as addition and multiplication for parametric
> > datatypes (see also NEP 40) which require additional steps to
> > determine the output type.
> > For example when adding two strings of length 4, the result is a
> > string of length 8, which is different from the input.
> > Similarly, a datatype which embeds a physical unit must calculate the
> > new unit information: dividing a distance by a time results in a
> > speed.
> > A related difficulty is that the :ref:`current casting rules
> > <ufuncs.casting>` -- the conversion between different datatypes --
> > cannot describe casting for such parametric datatypes implemented
> > outside of NumPy.
> >
> > This additional functionality for supporting parametric datatypes
> > introduces increased complexity within NumPy itself, and furthermore
> > is not available to external user-defined datatypes.
> > In general the concerns of different datatypes are not well
> > encapsulated.
> > This burden is exacerbated by the exposure of internal C structures,
> > limiting the addition of new fields (for example to support new
> > sorting methods [new_sort]_).
> >
> > Currently there are many factors which limit the creation of new
> > user-defined datatypes:
> >
> > * Creating casting rules for parametric user-defined dtypes is either
> >   impossible or so complex that it has never been attempted.
> > * Type promotion, e.g. the operation deciding that adding float and
> >   integer values should return a float value, is very valuable for
> >   numeric datatypes but is limited in scope for user-defined and
> >   especially parametric datatypes.
> > * Much of the logic (e.g. promotion) is written in single functions
> >   instead of being split as methods on the datatype itself.
> > * In the current design datatypes cannot have methods that do not
> >   generalize to other datatypes.
> >   For example a unit datatype cannot have a ``.to_si()`` method to
> >   easily find the datatype which would represent the same values in SI
> >   units.
> >
> > The large need to solve these issues has driven the scientific
> > community to create work-arounds in multiple projects implementing
> > physical units as an array-like class instead of a datatype, which
> > would generalize better across multiple array-likes (Dask, pandas,
> > etc.).
> > Already, Pandas has made a push into the same direction with its
> > extension arrays [pandas_extension_arrays]_ and undoubtedly the
> > community would be best served if such new features could be common
> > between NumPy, Pandas, and other projects.
> >
> > Scope
> > ^^^^^
> >
> > The proposed refactoring of the datatype system is a large undertaking
> > and thus is proposed to be split into various phases, roughly:
> >
> > * Phase I: Restructure and extend the datatype infrastructure (This
> >   NEP 41)
> > * Phase II: Incrementally define or rework API (Detailed largely in
> >   NEPs 42/43)
> > * Phase III: Growth of NumPy and Scientific Python Ecosystem
> >   capabilities.
> >
> > For a more detailed accounting of the various phases, see
> > "Plan to Approach the Full Refactor" in the Implementation section
> > below.
> > This NEP proposes to move ahead with the necessary creation of new
> > dtype subclasses (Phase I), and start working on implementing current
> > functionality.
> > Within the context of this NEP all development will be fully private
> > API or use preliminary underscored names which must be changed in the
> > future.
> > Most of the internal and public API choices are part of a second Phase
> > and will be discussed in more detail in the following NEPs 42 and 43.
> > The initial implementation of this NEP will have little or no effect
> > on users, but provides the necessary ground work for incrementally
> > addressing the full rework.
> >
> > The implementation of this NEP and the following, implied large rework
> > of how datatypes are defined in NumPy is expected to create small
> > incompatibilities (see backward compatibility section).
> > However, a transition requiring large code adaption is not anticipated
> > and not within scope.
> >
> > Specifically, this NEP makes the following design choices which are
> > discussed in more detail in the detailed description section:
> >
> > 1. Each datatype will be an instance of a subclass of ``np.dtype``,
> >    with most of the datatype-specific logic being implemented as
> >    special methods on the class. In the C-API, these correspond to
> >    specific slots. In short, for ``f = np.dtype("f8")``,
> >    ``isinstance(f, np.dtype)`` will remain true, but ``type(f)`` will
> >    be a subclass of ``np.dtype`` rather than just ``np.dtype`` itself.
> >    The ``PyArray_ArrFuncs`` which are currently stored as a pointer on
> >    the instance (as ``PyArray_Descr->f``), should instead be stored on
> >    the class as typically done in Python.
> >    In the future these may correspond to Python-side dunder methods.
> >    Storage information such as itemsize and byteorder can differ
> >    between different dtype instances (e.g. "S3" vs. "S8") and will
> >    remain part of the instance.
> >    This means that in the long run the current low-level access to
> >    dtype methods will be removed (see ``PyArray_ArrFuncs`` in NEP 40).
> > 2. The current NumPy scalars will *not* change; they will not be
> >    instances of datatypes. This will also be true for new datatypes,
> >    scalars will not be instances of a dtype (although
> >    ``isinstance(scalar, dtype)`` may be made to return ``True`` when
> >    appropriate).
> >
> >    Detailed technical decisions to follow in NEP 42.
> >
> > Further, the public API will be designed in a way that is extensible
> > in the future:
> >
> > 3. All new C-API functions provided to the user will hide
> >    implementation details as much as possible. The public API should
> >    be an identical, but limited, version of the C-API used for the
> >    internal NumPy datatypes.
> >
> > The changes to the datatype system in Phase II must include a large
> > refactor of the UFunc machinery, which will be further defined in
> > NEP 43:
> >
> > 4. To enable all of the desired functionality for new user-defined
> >    datatypes, the UFunc machinery will be changed to replace the
> >    current dispatching and type resolution system.
> >    The old system should be *mostly* supported as a legacy version for
> >    some time.
> >
> > Additionally, as a general design principle, the addition of new
> > user-defined datatypes will *not* change the behaviour of programs.
> > For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or
> > ``b`` know that ``c`` exists.
> >
> >
> > User Impact
> > -----------
> >
> > The current ecosystem has very few user-defined datatypes using NumPy,
> > the two most prominent being: ``rational`` and ``quaternion``.
> > These represent fairly simple datatypes which are not strongly
> > impacted by the current limitations.
> > However, we have identified a need for datatypes such as:
> >
> > * bfloat16, used in deep learning
> > * categorical types
> > * physical units (such as meters)
> > * datatypes for tracing/automatic differentiation
> > * high, fixed precision math
> > * specialized integer types such as int2, int24
> > * new, better datetime representations
> > * extending e.g. integer dtypes to have a sentinel NA value
> > * geometrical objects [pygeos]_
> >
> > Some of these are partially solved; for example unit capability is
> > provided in ``astropy.units``, ``unyt``, or ``pint``, as
> > `numpy.ndarray` subclasses.
> > Most of these datatypes, however, simply cannot be reasonably defined
> > right now.
> > An advantage of having such datatypes in NumPy is that they should
> > integrate seamlessly with other array or array-like packages such as
> > Pandas, ``xarray`` [xarray_dtype_issue]_, or ``Dask``.
> >
> > The long term user impact of implementing this NEP will be to allow
> > both the growth of the whole ecosystem by having such new datatypes,
> > as well as consolidating implementation of such datatypes within NumPy
> > to achieve better interoperability.
> >
> >
> > Examples
> > ^^^^^^^^
> >
> > The following examples represent future user-defined datatypes we wish
> > to enable.
> > These datatypes are not part of the NEP and choices (e.g. choice of
> > casting rules) are possibilities we wish to enable and do not
> > represent recommendations.
> >
> > Simple Numerical Types
> > """"""""""""""""""""""
> >
> > Mainly used where memory is a consideration, lower-precision numeric
> > types such as `bfloat16
> > <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_
> > are common in other computational frameworks.
> > For these types the definitions of things such as ``np.common_type``
> > and ``np.can_cast`` are some of the most important interfaces. Once
> > they support ``np.common_type``, it is (for the most part) possible to
> > find the correct ufunc loop to call, since most ufuncs -- such as add
> > -- effectively only require ``np.result_type``::
> >
> >     >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)
> >
> > and `~numpy.result_type` is largely identical to `~numpy.common_type`.
> >
> >
> > Fixed, high precision math
> > """"""""""""""""""""""""""
> >
> > Allowing arbitrary precision or higher precision math is important in
> > simulations. For instance ``mpmath`` defines a precision::
> >
> >     >>> import mpmath as mp
> >     >>> print(mp.dps)  # the current (default) precision
> >     15
> >
> > NumPy should be able to construct a native, memory-efficient array
> > from a list of ``mpmath.mpf`` floating point objects::
> >
> >     >>> arr_15_dps = np.array(mp.arange(3))  # (mp.arange returns a list)
> >     >>> print(arr_15_dps)  # Must find the correct precision from the objects:
> >     array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15])
> >
> > We should also be able to specify the desired precision when creating
> > the datatype for an array. Here, we use ``np.dtype[mp.mpf]`` to find
> > the DType class (the notation is not part of this NEP), which is then
> > instantiated with the desired parameter.
> > This could also be written as ``MpfDType`` class::
> >
> >     >>> arr_100_dps = np.array([1, 2, 3], dtype=np.dtype[mp.mpf](dps=100))
> >     >>> print(arr_15_dps + arr_100_dps)
> >     array(['0.0', '2.0', '4.0'], dtype=mpf[dps=100])
> >
> > The ``mpf`` datatype can decide that the result of the operation
> > should be the higher precision one of the two, so uses a precision of
> > 100.
> > Furthermore, we should be able to define casting, for example as in::
> >
> >     >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype, casting="safe")
> >     True
> >     >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype, casting="safe")
> >     False  # loses precision
> >     >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype, casting="same_kind")
> >     True
> >
> > Casting from float is probably always at least a ``same_kind`` cast,
> > but in general, it is not safe::
> >
> >     >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4), casting="safe")
> >     False
> >
> > since a float64 has a higher precision than the ``mpf`` datatype with
> > ``dps=4``.
> >
> > Alternatively, we can say that::
> >
> >     >>> np.common_type(np.dtype[mp.mpf](dps=5), np.dtype[mp.mpf](dps=10))
> >     np.dtype[mp.mpf](dps=10)
> >
> > And possibly even::
> >
> >     >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64)
> >     np.dtype[mp.mpf](dps=16)  # equivalent precision to float64 (I believe)
> >
> > since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)``
> > safely.
> >
> >
> > Categoricals
> > """"""""""""
> >
> > Categoricals are interesting in that they can have fixed, predefined
> > values, or can be dynamic with the ability to modify categories when
> > necessary.
> > Fixed categories (defined ahead of time) are the most straightforward
> > categorical definition.
> > Categoricals are *hard*, since there are many strategies to implement
> > them, suggesting NumPy should only provide the scaffolding for
> > user-defined categorical types.
> > For instance::
> >
> >     >>> cat = Categorical(["eggs", "spam", "toast"])
> >     >>> breakfast = array(["eggs", "spam", "eggs", "toast"], dtype=cat)
> >
> > could store the array very efficiently, since it knows that there are
> > only 3 categories.
> > Since a categorical in this sense knows almost nothing about the data
> > stored in it, few operations make sense, although equality does::
> >
> >     >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"], dtype=cat)
> >     >>> breakfast == breakfast2
> >     array([True, False, True, False])
> >
> > The categorical datatype could work like a dictionary: no two item
> > names can be equal (checked on dtype creation), so that the equality
> > operation above can be performed very efficiently.
> > If the values define an order, the category labels (internally
> > integers) could be ordered the same way to allow efficient sorting and
> > comparison.
> >
> > Whether or not casting is defined from one categorical with less to
> > one with strictly more values defined, is something that the
> > Categorical datatype would need to decide. Both options should be
> > available.
> >
> >
> > Unit on the Datatype
> > """"""""""""""""""""
> >
> > There are different ways to define Units, depending on how the
> > internal machinery would be organized; one way is to have a single
> > Unit datatype for every existing numerical type.
> > This will be written as ``Unit[float64]``; the unit itself is part of
> > the DType instance: ``Unit[float64]("m")`` is a ``float64`` with
> > meters attached::
> >
> >     >>> from astropy import units
> >     >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m  # meters
> >     >>> print(meters)
> >     array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
> >
> > Note that units are a bit tricky. It is debatable, whether::
> >
> >     >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
> >
> > should be valid syntax (coercing the float scalars without a unit to
> > meters).
> > Once the array is created, math will work without any issue::
> >
> >     >>> meters / (2 * unit.seconds)
> >     array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s"))
> >
> > Casting is not valid from one unit to the other, but can be valid
> > between different scales of the same dimensionality (although this may
> > be "unsafe")::
> >
> >     >>> meters.astype(Unit[float64]("s"))
> >     TypeError: Cannot cast meters to seconds.
> >     >>> meters.astype(Unit[float64]("km"))
> >     >>> # Convert to centimeter-gram-second (cgs) units:
> >     >>> meters.astype(meters.dtype.to_cgs())
> >
> > The above notation is somewhat clumsy. Functions could be used instead
> > to convert between units.
> > There may be ways to make these more convenient, but those must be
> > left for future discussions::
> >
> >     >>> units.convert(meters, "km")
> >     >>> units.to_cgs(meters)
> >
> > There are some open questions. For example, whether additional methods
> > on the array object could exist to simplify some of the notions, and
> > how these would percolate from the datatype to the ``ndarray``.
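> > As a minimal pure-Python sketch of this idea (illustrative only, no
> > proposed API): the dtype instance carries the unit and computes result
> > units, while the storage type stays ``float64``::
> >
> >     class UnitFloat64:
> >         """Stands in for a Unit[float64] DType instance."""
> >         def __init__(self, unit):
> >             self.unit = unit
> >         def __repr__(self):
> >             return 'Unit[float64]("{}")'.format(self.unit)
> >         def divide(self, other):
> >             # The unit logic lives on the dtype, not on the array:
> >             return UnitFloat64("{}/{}".format(self.unit, other.unit))
> >
> >     print(UnitFloat64("m").divide(UnitFloat64("s")))
> >     # prints: Unit[float64]("m/s")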
> > The interaction with other scalars would likely be defined through::
> >
> >     >>> np.common_type(np.float64, Unit)
> >     Unit[np.float64](dimensionless)
> >
> > Ufunc output datatype determination can be more involved than for
> > simple numerical dtypes since there is no "universal" output type::
> >
> >     >>> np.multiply(meters, seconds).dtype != np.result_type(meters, seconds)
> >
> > In fact ``np.result_type(meters, seconds)`` must error without context
> > of the operation being done.
> > This example highlights how the specific ufunc loop
> > (loop with known, specific DTypes as inputs), has to be able to make
> > certain decisions before the actual calculation can start.
> >
> >
> > Implementation
> > --------------
> >
> > Plan to Approach the Full Refactor
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > To address these issues in NumPy and enable new datatypes,
> > multiple development stages are required:
> >
> > * Phase I: Restructure and extend the datatype infrastructure (This
> >   NEP)
> >
> >   * Organize Datatypes like normal Python classes [`PR 15508`]_
> >
> > * Phase II: Incrementally define or rework API
> >
> >   * Create a new and easily extensible API for defining new datatypes
> >     and related functionality. (NEP 42)
> >
> >   * Incrementally define all necessary functionality through the new
> >     API (NEP 42):
> >
> >     * Defining operations such as ``np.common_type``.
> >     * Allowing casting to be defined between datatypes.
> >     * Add functionality necessary to create a numpy array from Python
> >       scalars (i.e. ``np.array(...)``).
> >     * ?
> >
> >   * Restructure how universal functions work (NEP 43), in order to:
> >
> >     * make it possible to allow a `~numpy.ufunc` such as ``np.add`` to
> >       be extended by user-defined datatypes such as Units.
> >
> >     * allow efficient lookup for the correct implementation for
> >       user-defined datatypes.
> >
> >     * enable reuse of existing code. Units should be able to use the
> >       normal math loops and add additional logic to determine output
> >       type.
> >
> > * Phase III: Growth of NumPy and Scientific Python Ecosystem
> >   capabilities:
> >
> >   * Cleanup of legacy behaviour where it is considered buggy or
> >     undesirable.
> >   * Provide a path to define new datatypes from Python.
> >   * Assist the community in creating types such as Units or
> >     Categoricals.
> >   * Allow strings to be used in functions such as ``np.equal`` or
> >     ``np.add``.
> >   * Remove legacy code paths within NumPy to improve long term
> >     maintainability.
> >
> > This document serves as a basis for phase I and provides the vision
> > and motivation for the full project.
> > Phase I does not introduce any new user-facing features, but is
> > concerned with the necessary conceptual cleanup of the current
> > datatype system.
> > It provides a more "pythonic" datatype Python type object, with a
> > clear class hierarchy.
> >
> > The second phase is the incremental creation of all APIs necessary to
> > define fully featured datatypes and reorganization of the NumPy
> > datatype system.
> > This phase will thus be primarily concerned with defining an,
> > initially preliminary, stable public API.
> >
> > Some of the benefits of a large refactor may only become evident after
> > the full deprecation of the current legacy implementation (i.e. larger
> > code removals).
> > However, these steps are necessary for improvements to many parts of
> > the core NumPy API, and are expected to make the implementation
> > generally easier to understand.
> >
> > The following figure illustrates the proposed design at a high level,
> > and roughly delineates the components of the overall design.
> > Note that this NEP only regards Phase I (shaded area); the rest
> > encompasses Phase II and the design choices are up for discussion.
> > However, it highlights that the DType datatype class is the central,
> > necessary concept:
> >
> > .. image:: _static/nep-0041-mindmap.svg
> >
> >
> > First steps directly related to this NEP
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > The required changes necessary to NumPy are large and touch many areas
> > of the code base, but many of these changes can be addressed
> > incrementally.
> >
> > To enable an incremental approach we will start by creating a C
> > defined ``PyArray_DTypeMeta`` class with its instances being the
> > ``DType`` classes, subclasses of ``np.dtype``.
> > This is necessary to add the ability to store custom slots on the
> > DType in C.
> > This ``DTypeMeta`` will be implemented first to then enable
> > incremental restructuring of current code.
> >
> > The addition of ``DType`` will then enable addressing other changes
> > incrementally, some of which may begin before settling the full
> > internal API:
> >
> > 1. New machinery for array coercion, with the goal of enabling user
> >    DTypes with appropriate class methods.
> > 2. The replacement or wrapping of the current casting machinery.
> > 3. Incremental redefinition of the current ``PyArray_ArrFuncs`` slots
> >    into DType method slots.
> >
> > At this point, no or only very limited new public API will be added
> > and the internal API is considered to be in flux.
> > Any new public API may be set up to give warnings and will have
> > leading underscores to indicate that it is not finalized and can be
> > changed without warning.
> >
> >
> > Backward compatibility
> > ----------------------
> >
> > While the actual backward compatibility impact of implementing Phases
> > I and II is not yet fully clear, we anticipate, and accept the
> > following changes:
> >
> > * **Python API**:
> >
> >   * ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``, while
> >     right now ``type(np.dtype("f8")) is np.dtype``.
> >     Code should use ``isinstance`` checks, and in very rare cases may
> >     have to be adapted to use it.
> >
> > * **C-API**:
> >
> >   * In old versions of NumPy ``PyArray_DescrCheck`` is a macro which
> >     uses ``type(dtype) is np.dtype``. When compiling against an old
> >     NumPy version, the macro may have to be replaced with the
> >     corresponding ``PyObject_IsInstance`` call. (If this is a problem,
> >     we could backport fixing the macro)
> >
> >   * The UFunc machinery changes will break *limited* parts of the
> >     current implementation. Replacing e.g. the default
> >     ``TypeResolver`` is expected to remain supported for a time,
> >     although optimized masked inner loop iteration (which is not even
> >     used *within* NumPy) will no longer be supported.
> >
> >   * All functions currently defined on the dtypes, such as
> >     ``PyArray_Descr->f->nonzero``, will be defined and accessed
> >     differently.
> >     This means that in the long run low-level access code will have to
> >     be changed to use the new API.
> >     Such changes are expected to be necessary in very few projects.
> >
> > * **dtype implementors (C-API)**:
> >
> >   * The array which is currently provided to some functions (such as
> >     cast functions), will no longer be provided.
> >     For example ``PyArray_Descr->f->nonzero`` or
> >     ``PyArray_Descr->f->copyswapn``, may instead receive a dummy array
> >     object with only some fields (mainly the dtype), being valid.
> >     At least in some code paths, a similar mechanism is already used.
> >
> >   * The ``scalarkind`` slot and registration of scalar casting will be
> >     removed/ignored without replacement.
> >     It currently allows partial value-based casting.
> >     The ``PyArray_ScalarKind`` function will continue to work for
> >     builtin types, but will not be used internally and be deprecated.
> >
> >   * Currently user dtypes are defined as instances of ``np.dtype``.
> >     The creation works by the user providing a prototype instance.
> >     NumPy will need to modify at least the type during registration.
> >     This has no effect for either ``rational`` or ``quaternion`` and
> >     mutation of the structure seems unlikely after registration.
> >
> > Since there is a fairly large API surface concerning datatypes,
> > further changes, or the limitation of certain functions to currently
> > existing datatypes, are likely to occur.
> > For example functions which use the type number as input should be
> > replaced with functions taking DType classes instead.
> > Although public, large parts of this C-API seem to be used rarely,
> > possibly never, by downstream projects.
> >
> >
> > Detailed Description
> > --------------------
> >
> > This section details the design decisions covered by this NEP.
> > The subsections correspond to the list of design choices presented in
> > the Scope section.
> >
> > Datatypes as Python Classes (1)
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > The current NumPy datatypes are not full-scale Python classes.
> > They are instead (prototype) instances of a single ``np.dtype`` class.
> > Changing this means that any special handling, e.g. for ``datetime``,
> > can be moved to the Datetime DType class instead, away from monolithic
> > general code (e.g. current ``PyArray_AdjustFlexibleDType``).
> >
> > The main consequence of this change with respect to the API is that
> > special methods move from the dtype instances to methods on the new
> > DType class.
> > This is the typical design pattern used in Python.
> > Organizing these methods and information in a more Pythonic way
> > provides a solid foundation for refining and extending the API in the
> > future.
> > The current API cannot be extended due to how it is exposed publicly.
> > This means for example that the methods currently stored in
> > ``PyArray_ArrFuncs`` on each datatype (see NEP 40) will be defined
> > differently in the future and deprecated in the long run.
> >
> > The most prominent visible side effect of this will be that
> > ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore.
> > Instead it will be a subclass of ``np.dtype`` meaning that
> > ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true.
> > This will also add the ability to use ``isinstance(dtype,
> > np.dtype[float64])``, thus removing the need to use ``dtype.kind``,
> > ``dtype.char``, or ``dtype.type`` to do this check.
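> > The intended class layout can be sketched in a few lines of plain
> > Python (the names are illustrative only and not proposed API)::
> >
> >     class DTypeMeta(type):
> >         """Plays the role of the C-level PyArray_DTypeMeta."""
> >
> >     class dtype(metaclass=DTypeMeta):
> >         """Plays the role of the np.dtype base class."""
> >
> >     class Float64(dtype):
> >         """A float64-specific DType class; instances still hold
> >         per-instance storage information such as byteorder."""
> >         itemsize = 8
> >
> >     f = Float64()
> >     assert isinstance(f, dtype)   # isinstance checks keep working
> >     assert type(f) is Float64     # but type(f) is now a subclass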
> > With the design decision of DTypes as full-scale Python classes, the
> > question of subclassing arises.
> > Inheritance, however, appears problematic and a complexity best
> > avoided (at least initially) for container datatypes.
> > Further, subclasses may be more interesting for interoperability, for
> > example with GPU backends (CuPy) storing additional methods related to
> > the GPU, rather than as a mechanism to define new datatypes.
> > A class hierarchy does provide value; this may be achieved by allowing
> > the creation of *abstract* datatypes.
> > An example for an abstract datatype would be the datatype equivalent
> > of ``np.floating``, representing any floating point number.
> > These can serve the same purpose as Python's abstract base classes.
> >
> >
> > Scalars should not be instances of the datatypes (2)
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > For simple datatypes such as ``float64`` (see also below), it seems
> > tempting that the instance of a ``np.dtype("float64")`` can be the
> > scalar.
> > This idea may be even more appealing due to the fact that scalars,
> > rather than datatypes, currently define a useful type hierarchy.
> >
> > However, we have specifically decided against this for a number of
> > reasons.
> > First, the new datatypes described herein would be instances of DType
> > classes.
> > Making these instances themselves classes, while possible, adds
> > additional complexity that users need to understand.
> > It would also mean that scalars must have storage information (such as
> > byteorder) which is generally unnecessary and currently is not used.
> > Second, while the simple NumPy scalars such as ``float64`` may be such
> > instances, it should be possible to create datatypes for Python
> > objects without enforcing NumPy as a dependency.
> > However, Python objects that do not depend on NumPy cannot be
> > instances of a NumPy DType.
> > Third, there is a mismatch between the methods and attributes which
> > are useful for scalars and datatypes. For instance ``to_float()``
> > makes sense for a scalar but not for a datatype and ``newbyteorder``
> > is not useful on a scalar (or has a different meaning).
> >
> > Overall, it seems that rather than reducing the complexity, i.e. by
> > merging the two distinct type hierarchies, making scalars instances of
> > DTypes would increase the complexity of both the design and
> > implementation.
> >
> > A possible future path may be to instead simplify the current NumPy
> > scalars to be much simpler objects which largely derive their
> > behaviour from the datatypes.
> >
> > C-API for creating new Datatypes (3)
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > The current C-API with which users can create new datatypes is limited
> > in scope, and requires use of "private" structures. This means the API
> > is not extensible: no new members can be added to the structure
> > without losing binary compatibility.
> > This has already limited the inclusion of new sorting methods into
> > NumPy [new_sort]_.
> >
> > The new version shall thus replace the current ``PyArray_ArrFuncs``
> > structure used to define new datatypes.
> > Datatypes that currently exist and are defined using these slots will
> > be supported during a deprecation period.
> > The most likely solution, which hides the implementation from the user
> > and thus makes it extensible in the future, is to model the API after
> > Python's stable API [PEP-384]_:
> >
> > .. code-block:: C
> >
> >     static struct PyArrayMethodDef slots[] = {
> >         {NPY_dt_method, method_implementation},
> >         ...,
> >         {0, NULL}
> >     };
> >
> >     typedef struct {
> >         PyTypeObject *typeobj;  /* type of python scalar */
> >         ...;
> >         PyType_Slot *slots;
> >     } PyArrayDTypeMeta_Spec;
> >
> >     PyObject* PyArray_InitDTypeMetaFromSpec(
> >         PyArray_DTypeMeta *user_dtype, PyArrayDTypeMeta_Spec *dtype_spec);
> >
> > The C-side slots should be designed to mirror Python-side methods such
> > as ``dtype.__dtype_method__``, although the exposure to Python is a
> > later step in the implementation to reduce the complexity of the
> > initial implementation.
> >
> >
> > C-API Changes to the UFunc Machinery (4)
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > Proposed changes to the UFunc machinery will be part of NEP 43.
> > However, the following changes will be necessary (see NEP 40 for a
> > detailed description of the current implementation and its issues):
> >
> > * The current UFunc type resolution must be adapted to allow better
> >   control for user-defined dtypes as well as resolve current
> >   inconsistencies.
> > * The inner-loop used in UFuncs must be expanded to include a return
> >   value.
> >   Further, error reporting must be improved, and passing in
> >   dtype-specific information enabled.
> >   This requires the modification of the inner-loop function signature
> >   and addition of new hooks called before and after the inner-loop is
> >   used.
> >
> > An important goal for any changes to the universal functions will be
> > to allow the reuse of existing loops.
> > It should be easy for a new units datatype to fall back to existing
> > math functions after handling the unit related computations.
> >
> >
> > Discussion
> > ----------
> >
> > See NEP 40 for a list of previous meetings and discussions.
> >
> >
> > References
> > ----------
> >
> > .. [pandas_extension_arrays] https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types
> >
> > .. [xarray_dtype_issue] https://github.com/pydata/xarray/issues/1262
> >
> > .. [pygeos] https://github.com/caspervdw/pygeos
> >
> > .. [new_sort] https://github.com/numpy/numpy/pull/12945
> >
> > .. [PEP-384] https://www.python.org/dev/peps/pep-0384/
> >
> > .. [PR 15508] https://github.com/numpy/numpy/pull/15508
> >
> >
> > Copyright
> > ---------
> >
> > This document has been placed in the public domain.
> >
> >
> > Acknowledgments
> > ---------------
> >
> > The effort to create new datatypes for NumPy has been discussed for
> > several years in many different contexts and settings, making it
> > impossible to list everyone involved.
> > We would like to thank especially Stephan Hoyer, Nathaniel Smith, and
> > Eric Wieser for repeated in-depth discussion about datatype design.
> > We are very grateful for the community input in reviewing and revising
> > this NEP and would like to thank especially Ross Barnowski and Ralf
> > Gommers.
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at python.org
> > https://mail.python.org/mailman/listinfo/numpy-discussion
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part
URL:

From teoliphant at gmail.com Mon Mar 23 11:45:57 2020
From: teoliphant at gmail.com (Travis Oliphant)
Date: Mon, 23 Mar 2020 10:45:57 -0500
Subject: [Numpy-discussion] Proposal: NEP 41 -- First step towards a new Datatype System
In-Reply-To: <162390781577dd2c2916f4dc600b65cf7f09abb1.camel@sipsolutions.net>
References: <3ee04012570e418a874e4c436b5b6f8ac5f56cdb.camel@sipsolutions.net> <162390781577dd2c2916f4dc600b65cf7f09abb1.camel@sipsolutions.net>
Message-ID:

On Sun, Mar 22, 2020 at 1:33 PM Sebastian Berg wrote:

> Hi,
>
> thanks for the feedback!
>
> On Sat, 2020-03-21 at 15:58 -0500, Travis Oliphant wrote:
> > Thanks for publicizing this and all the work that has gone into getting
> > this far.
> >
> > I'm extremely supportive of the foundational DType meta-type and making
> > dtypes classes. This was the epiphany I had in 2015 that led me to
> > experiment with xnd and later mtypes. I have not had the funding to
> > work on it much since that time directly.
>
> Right, I realize it is an old idea; if you have any references I am
> missing (I am sure there are many), I am happy to add them.
>
> > But, this is the right way to connect the data type system with the
> > rest of Python typing. NumPy's current dtypes are analogous to Python
> > 1's user-defined classes. In Python 1 *all* user-defined classes were
> > instances of a single Class Type at the C-level, just like currently
> > all NumPy dtypes are instances of a single Dtype "Type" in Python.
> >
> > Shifting Dtypes to be true types (by making them instances of a single
> > low-level MetaType) is (IMHO) exactly the right approach. Doing this
> > first while trying to minimize other changes will help a lot. I'm very
> > excited by the work being done in this direction.
> >
> > I can appreciate the desire to be cautious on some of the other issues
> > (like removing numpy array scalars). I do still think that eventually
> > removing numpy array scalars in lieu of instances of dtype objects
> > will be a less complex approach and am not sold generally by the
> > reasons listed in the NEP (though I can appreciate that it's not
> > something to do as part of *this* NEP) as getting there might take
> > more effort than desired at this point.
>
> Well, I do think it is a pretty strong design decision here though. If
> instances of DType classes are the actual dtypes (and not themselves
> classes), then it seems strange if scalars are also (direct) instances
> of the same DType class?
>
> If we were designing a new programming language around array computing
> principles, I do think that would be the approach I would want to
> take/consider. But I simply lack the vision of how marrying the idea
> with the scalar language Python would work out well...
>

I agree that it makes sense to wait and do as the NEP says --- not require that to begin with.
The persuasive argument to me is that as far as NumPy is concerned it is too many changes at once, and all the repercussions are not fully understood. So, it's better to wait. But, if another container system were to take the dtypes and make scalars instances of those types, it could actually work. At any rate, I agree with the decision and current plan to *not* make that change too.

> > What I would *strongly* recommend right now, however, is to make the
> > new NumPy dtype system a separately-installable module (kept in the
> > NumPy GitHub organization). In that way, people can depend on the
> > NumPy type system without depending on NumPy itself. I think this
> > will become more and more important in the future. It will help the
> > design as you see NumPy as one of many *consumers* of the type system
> > instead of the only one. It would also help projects like arrow and
> > xnd and others in the future that might only want to depend on
> > NumPy's type system but otherwise implement their own computations.
>
> Right, I agree that is the correct long term direction to see the
> DTypes as distinct from the NumPy array, and maybe I should add that to
> the NEP.
> What I am unsure about is the feasibility? If we develop it outside of
> NumPy, it is harder to:
>
> 1. Use the new system without actually exposing it as public API in
> order to incrementally replace the old with a newer machinery.
> 2. It may require either exposing subclassing capabilities to NumPy to
> add shims for legacy DTypes right from the start, or adding a bunch of
> public API which is only meant to be used within NumPy to that project?
>
> I suppose, I am also not sure that having it in NumPy (at least for
> now) is actually all that bad? For array-likes it is probably not the
> heaviest dependency (and it could be slimmed down into a core).
>
> Since the intention is to dogfood the API as much as possible and to
> limit the public API, it should be plausible to rip it out later of
> course.
> I am sure that will be more overall effort, but I suppose I feel it is
> a much more approachable effort.
>

That makes sense. I'd love to see in the NEP some discussion of the possibility down the road of making a distinct module. It is more work initially to do that. I can see your point about not being sure about the public APIs at this point, but on the other hand, there is nothing like having two consumers to force that issue and make the overall design better. For example, if pyarrow, pytorch, and NumPy were to collaborate on this "dtype" module, I think the result would be stronger for all of them. But, it would require more funding and this NEP could allow for the possibility but not propose it directly.

> One thing I would like is for projects such as CuPy to be able to
> subclass DTypes at some point to tag on the GPU aware things they need.
> But in some sense the basic DTypes seem to require being tied in with
> NumPy? They must be associated with the NumPy scalars, and the basic
> methods defined for all DTypes (also user DTypes) will probably be
> strided-inner-loops on the CPU.
>
> > This might require a little more work to provide an adaptor layer in
> > NumPy itself to use the new system instead of its current dtypes, but
> > I think it will also help ensure that the datatype API is cleaner and
> > more useful to the Python ecosystem as a whole.
>
> While I fully agree with the sentiment, I suppose I am scared that the
> little more work will end up being too much :(.
> We have pretty limited resources and the most difficult work will not
> be writing the DType API itself. It will be wrangling it into NumPy and
> the associated huge review effort to get it right.
> Only by actually wrangling it into NumPy, I think we can also get the
> API fully right to begin with.
> So, I am scared that moving development outside and trying to add the
> more global scope at this time will make the NumPy side much more
> difficult :(. Maybe not even because it is actually much trickier, but
> again because it seems less tangible/approachable.
>
> So, my main point here is that we have to make this large refactor as
> approachable as possible, and if that means that at some point someone
> has to spend a huge, but hopefully straightforward effort, to rip
> DTypes out of NumPy, I think that might be a worthy trade-off.
> Unless we can activate significantly larger resources very quickly.
>

That is a very reasonable response. I really appreciate the hard work you and others have put into this and am extremely grateful for the funding that has been provided thus far. I am interested in trying to get more funding and a common data-type API is one of those things I'm interested in helping see emerge. What you are doing here in NumPy is the closest thing to the "right" approach I have seen that has a chance of seeing adoption.

Retrofitting NumPy is a lot of work on its own, and it is entirely possible that the kinds of attributes, methods and functions that are needed for NumPy mean that an independent typing module is "easy" but not really useful for NumPy. It could be a straightforward thing later to make what you produce either a formal subtype or simply an abstract duck-type of some other data-type system. As you make progress in this direction in NumPy, I'll keep following your work and maybe we can get more funding to work on something in this direction.

Thanks so much,

-Travis

> Best,
>
> Sebastian
>
> > Thanks,
> >
> > -Travis
> >
> >
> > On Wed, Mar 11, 2020 at 7:08 PM Sebastian Berg <sebastian at sipsolutions.net> wrote:
> >
> > > Hi all,
> > >
> > > I am pleased to propose NEP 41: First step towards a new Datatype
> > > System https://numpy.org/neps/nep-0041-improved-dtype-support.html
> > >
> > > This NEP motivates the larger restructure of the datatype machinery
> > > in NumPy and defines a few fundamental design aspects. The long
> > > term user impact will be allowing easier and more richly featured
> > > user-defined datatypes.
> > >
> > > As this is a large restructure, the NEP represents only the first
> > > steps, with some additional information in further NEPs being
> > > drafted [1] (this may be helpful to look at depending on the level
> > > of detail you are interested in).
> > > The NEP itself does not propose to add significant new public API.
> > > Instead it proposes to move forward with an incremental internal
> > > refactor and lays the foundation for this process.
> > >
> > > The main user facing change at this time is that datatypes will
> > > become classes (e.g. ``type(np.dtype("float64"))`` will be a
> > > float64-specific class).
> > > For most users, the main impact should be many new datatypes in the
> > > long run (see the user impact section). However, for those
> > > interested in API design within NumPy or with respect to
> > > implementing new datatypes, this and the following NEPs are
> > > important decisions in the future roadmap for NumPy.
> > >
> > > The current full text is reproduced below, although the above link
> > > is probably a better way to read it.
> > >
> > > Cheers
> > >
> > > Sebastian
> > >
> > >
> > > [1] NEP 40 gives some background information about the current
> > > systems and issues with it:
> > > https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac/doc/neps/nep-0040-legacy-datatype-impl.rst
> > > and NEP 42 is a first draft of how the new API may look:
> > > https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst
> > > (links to current rendered versions, check
> > > https://github.com/numpy/numpy/pull/15505 and
> > > https://github.com/numpy/numpy/pull/15507 for updates)
> > >
> > >
> > > ----------------------------------------------------------------------
> > >
> > >
> > > =================================================
> > > NEP 41 -- First step towards a new Datatype System
> > > =================================================
> > >
> > > :title: Improved Datatype Support
> > > :Author: Sebastian Berg
> > > :Author: Stéfan van der Walt
> > > :Author: Matti Picus
> > > :Status: Draft
> > > :Type: Standard Track
> > > :Created: 2020-02-03
> > >
> > >
> > > .. note::
> > >
> > >     This NEP is part of a series of NEPs encompassing first
> > >     information about the previous dtype implementation and issues
> > >     with it in NEP 40.
> > >     NEP 41 (this document) then provides an overview and generic
> > >     design choices for the refactor.
> > >     Further NEPs 42 and 43 go into the technical details of the
> > >     datatype and universal function related internal and external
> > >     API changes.
> > >     In some cases it may be necessary to consult the other NEPs for
> > >     a full picture of the desired changes and why these changes are
> > >     necessary.
> > >
> > >
> > > Abstract
> > > --------
> > >
> > > Datatypes in NumPy describe how to interpret each element in
> > > arrays. NumPy provides ``int``, ``float``, and ``complex`` numerical
> > > types, as well as string, datetime, and structured datatype
> > > capabilities.
> > > The growing Python community, however, has need for more diverse
> > > datatypes.
> > > Examples are datatypes with unit information attached (such as
> > > meters) or categorical datatypes (fixed set of possible values).
> > > However, the current NumPy datatype API is too limited to allow the
> > > creation of these.
> > >
> > > This NEP is the first step to enable such growth; it will lead to a
> > > simpler development path for new datatypes.
> > > In the long run the new datatype system will also support the
> > > creation of datatypes directly from Python rather than C.
> > > Refactoring the datatype API will improve maintainability and
> > > facilitate development of both user-defined external datatypes, as
> > > well as new features for existing datatypes internal to NumPy.
> > >
> > >
> > > Motivation and Scope
> > > --------------------
> > >
> > > .. seealso::
> > >
> > >     The user impact section includes examples of what kind of new
> > >     datatypes will be enabled by the proposed changes in the long
> > >     run. It may thus help to read these sections out of order.
> > >
> > > Motivation
> > > ^^^^^^^^^^
> > >
> > > One of the main issues with the current API is the definition of
> > > typical functions such as addition and multiplication for parametric
> > > datatypes (see also NEP 40) which require additional steps to
> > > determine the output type.
> > > For example when adding two strings of length 4, the result is a
> > > string of length 8, which is different from the input.
> > > Similarly, a datatype which embeds a physical unit must calculate
> > > the new unit information: dividing a distance by a time results in a
> > > speed.
> > > A related difficulty is that the :ref:`current casting rules
> > > <ufuncs.casting>` -- the conversion between different datatypes --
> > > cannot describe casting for such parametric datatypes implemented
> > > outside of NumPy.
> > >
> > > This additional functionality for supporting parametric datatypes
> > > introduces increased complexity within NumPy itself, and furthermore
> > > is not available to external user-defined datatypes.
> > > In general the concerns of different datatypes are not well
> > > encapsulated.
> > > This burden is exacerbated by the exposure of internal C structures,
> > > limiting the addition of new fields (for example to support new
> > > sorting methods [new_sort]_).
> > >
> > > Currently there are many factors which limit the creation of new
> > > user-defined datatypes:
> > >
> > > * Creating casting rules for parametric user-defined dtypes is
> > >   either impossible or so complex that it has never been attempted.
> > > * Type promotion, e.g. the operation deciding that adding float and
> > >   integer values should return a float value, is very valuable for
> > >   numeric datatypes but is limited in scope for user-defined and
> > >   especially parametric datatypes.
> > > * Much of the logic (e.g. promotion) is written in single functions
> > >   instead of being split as methods on the datatype itself.
> > > * In the current design datatypes cannot have methods that do not
> > >   generalize to other datatypes. For example a unit datatype cannot
> > >   have a ``.to_si()`` method to easily find the datatype which would
> > >   represent the same values in SI units.
> > >
> > > The large need to solve these issues has driven the scientific
> > > community to create work-arounds in multiple projects implementing
> > > physical units as an array-like class instead of a datatype, which
> > > would generalize better across multiple array-likes (Dask, pandas,
> > > etc.).
> > > Already, Pandas has made a push into the same direction with its
> > > extension arrays [pandas_extension_arrays]_ and undoubtedly the
> > > community would be best served if such new features could be common
> > > between NumPy, Pandas, and other projects.
> > >
> > > Scope
> > > ^^^^^
> > >
> > > The proposed refactoring of the datatype system is a large
> > > undertaking and thus is proposed to be split into various phases,
> > > roughly:
> > >
> > > * Phase I: Restructure and extend the datatype infrastructure (This
> > >   NEP 41)
> > > * Phase II: Incrementally define or rework API (Detailed largely in
> > >   NEPs 42/43)
> > > * Phase III: Growth of NumPy and Scientific Python Ecosystem
> > >   capabilities.
> > >
> > > For a more detailed accounting of the various phases, see
> > > "Plan to Approach the Full Refactor" in the Implementation section
> > > below.
> > > This NEP proposes to move ahead with the necessary creation of new
> > > dtype subclasses (Phase I), and start working on implementing
> > > current functionality.
> > > Within the context of this NEP all development will be fully private
> > > API or use preliminary underscored names which must be changed in
> > > the future.
> > > Most of the internal and public API choices are part of a second
> > > Phase and will be discussed in more detail in the following NEPs 42
> > > and 43.
> > > The initial implementation of this NEP will have little or no effect
> > > on users, but provides the necessary ground work for incrementally
> > > addressing the full rework.
> > >
> > > The implementation of this NEP and the following, implied large
> > > rework of how datatypes are defined in NumPy is expected to create
> > > small incompatibilities (see backward compatibility section).
> > > However, a transition requiring large code adaption is not
> > > anticipated and not within scope.
> > >
> > > Specifically, this NEP makes the following design choices which are
> > > discussed in more detail in the detailed description section:
> > >
> > > 1. Each datatype will be an instance of a subclass of ``np.dtype``,
> > >    with most of the datatype-specific logic being implemented as
> > >    special methods on the class. In the C-API, these correspond to
> > >    specific slots. In short, for ``f = np.dtype("f8")``,
> > >    ``isinstance(f, np.dtype)`` will remain true, but ``type(f)``
> > >    will be a subclass of ``np.dtype`` rather than just ``np.dtype``
> > >    itself.
> > >    The ``PyArray_ArrFuncs`` which are currently stored as a pointer
> > >    on the instance (as ``PyArray_Descr->f``), should instead be
> > >    stored on the class as typically done in Python.
> > >    In the future these may correspond to Python-side dunder methods.
> > >    Storage information such as itemsize and byteorder can differ
> > >    between different dtype instances (e.g. "S3" vs. "S8") and will
> > >    remain part of the instance.
> > >    This means that in the long run the current low-level access to
> > >    dtype methods will be removed (see ``PyArray_ArrFuncs`` in
> > >    NEP 40).
> > >
> > > 2. The current NumPy scalars will *not* change; they will not be
> > >    instances of datatypes. This will also be true for new datatypes,
> > >    scalars will not be instances of a dtype (although
> > >    ``isinstance(scalar, dtype)`` may be made to return ``True`` when
> > >    appropriate).
> > >
> > >    Detailed technical decisions to follow in NEP 42.
> > >
> > > Further, the public API will be designed in a way that is extensible
> > > in the future:
> > >
> > > 3. All new C-API functions provided to the user will hide
> > >    implementation details as much as possible. The public API should
> > >    be an identical, but limited, version of the C-API used for the
> > >    internal NumPy datatypes.
> > >
> > > The changes to the datatype system in Phase II must include a large
> > > refactor of the UFunc machinery, which will be further defined in
> > > NEP 43:
> > >
> > > 4. To enable all of the desired functionality for new user-defined
> > >    datatypes, the UFunc machinery will be changed to replace the
> > >    current dispatching and type resolution system.
> > >    The old system should be *mostly* supported as a legacy version
> > >    for some time.
> > >
> > > Additionally, as a general design principle, the addition of new
> > > user-defined datatypes will *not* change the behaviour of programs.
> > > For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or
> > > ``b`` know that ``c`` exists.
> > >
> > >
> > > User Impact
> > > -----------
> > >
> > > The current ecosystem has very few user-defined datatypes using
> > > NumPy, the two most prominent being: ``rational`` and
> > > ``quaternion``.
> > > These represent fairly simple datatypes which are not strongly
> > > impacted by the current limitations.
> > > However, we have identified a need for datatypes such as:
> > >
> > > * bfloat16, used in deep learning
> > > * categorical types
> > > * physical units (such as meters)
> > > * datatypes for tracing/automatic differentiation
> > > * high, fixed precision math
> > > * specialized integer types such as int2, int24
> > > * new, better datetime representations
> > > * extending e.g. integer dtypes to have a sentinel NA value
> > > * geometrical objects [pygeos]_
> > >
> > > Some of these are partially solved; for example unit capability is
> > > provided in ``astropy.units``, ``unyt``, or ``pint``, as
> > > `numpy.ndarray` subclasses.
> > > Most of these datatypes, however, simply cannot be reasonably
> > > defined right now.
> > > An advantage of having such datatypes in NumPy is that they should
> > > integrate seamlessly with other array or array-like packages such as
> > > Pandas, ``xarray`` [xarray_dtype_issue]_, or ``Dask``.
> > >
> > > The long term user impact of implementing this NEP will be to allow
> > > both the growth of the whole ecosystem by having such new datatypes,
> > > as well as consolidating implementation of such datatypes within
> > > NumPy to achieve better interoperability.
> > >
> > >
> > > Examples
> > > ^^^^^^^^
> > >
> > > The following examples represent future user-defined datatypes we
> > > wish to enable.
> > > These datatypes are not part of the NEP and choices (e.g. choice of
> > > casting rules) are possibilities we wish to enable and do not
> > > represent recommendations.
> > >
> > > Simple Numerical Types
> > > """"""""""""""""""""""
> > >
> > > Mainly used where memory is a consideration, lower-precision numeric
> > > types such as `bfloat16
> > > <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_
> > > are common in other computational frameworks.
> > > For these types the definitions of things such as
> > > ``np.common_type`` and ``np.can_cast`` are some of the most
> > > important interfaces. Once they support ``np.common_type``, it is
> > > (for the most part) possible to find the correct ufunc loop to call,
> > > since most ufuncs -- such as add -- effectively only require
> > > ``np.result_type``::
> > >
> > >     >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)
> > >
> > > and `~numpy.result_type` is largely identical to
> > > `~numpy.common_type`.
> > >
> > >
> > > Fixed, high precision math
> > > """"""""""""""""""""""""""
> > >
> > > Allowing arbitrary precision or higher precision math is important
> > > in simulations.
> > > For instance ``mpmath`` defines a precision::
> > >
> > >     >>> import mpmath as mp
> > >     >>> print(mp.dps)  # the current (default) precision
> > >     15
> > >
> > > NumPy should be able to construct a native, memory-efficient array
> > > from a list of ``mpmath.mpf`` floating point objects::
> > >
> > >     >>> arr_15_dps = np.array(mp.arange(3))  # (mp.arange returns a list)
> > >     >>> print(arr_15_dps)  # Must find the correct precision from the objects:
> > >     array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15])
> > >
> > > We should also be able to specify the desired precision when
> > > creating the datatype for an array. Here, we use ``np.dtype[mp.mpf]``
> > > to find the DType class (the notation is not part of this NEP),
> > > which is then instantiated with the desired parameter.
> > > This could also be written as the ``MpfDType`` class::
> > >
> > >     >>> arr_100_dps = np.array([1, 2, 3], dtype=np.dtype[mp.mpf](dps=100))
> > >     >>> print(arr_15_dps + arr_100_dps)
> > >     array(['0.0', '2.0', '4.0'], dtype=mpf[dps=100])
> > >
> > > The ``mpf`` datatype can decide that the result of the operation
> > > should be the higher precision one of the two, so it uses a precision
> > > of 100.
> > > Furthermore, we should be able to define casting, for example as in::
> > >
> > >     >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype, casting="safe")
> > >     True
> > >     >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype, casting="safe")
> > >     False  # loses precision
> > >     >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype, casting="same_kind")
> > >     True
> > >
> > > Casting from float is probably always at least a ``same_kind`` cast,
> > > but in general, it is not safe::
> > >
> > >     >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4), casting="safe")
> > >     False
> > >
> > > since a float64 has a higher precision than the ``mpf`` datatype with
> > > ``dps=4``.
> > >
> > > Alternatively, we can say that::
> > >
> > >     >>> np.common_type(np.dtype[mp.mpf](dps=5), np.dtype[mp.mpf](dps=10))
> > >     np.dtype[mp.mpf](dps=10)
> > >
> > > And possibly even::
> > >
> > >     >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64)
> > >     np.dtype[mp.mpf](dps=16)  # equivalent precision to float64 (I believe)
> > >
> > > since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)``
> > > safely.
> > >
> > >
> > > Categoricals
> > > """"""""""""
> > >
> > > Categoricals are interesting in that they can have fixed, predefined
> > > values, or can be dynamic with the ability to modify categories when
> > > necessary.
> > > Fixed categories (defined ahead of time) are the most straightforward
> > > categorical definition.
> > > Categoricals are *hard*, since there are many strategies to implement
> > > them, suggesting NumPy should only provide the scaffolding for
> > > user-defined categorical types. For instance::
> > >
> > >     >>> cat = Categorical(["eggs", "spam", "toast"])
> > >     >>> breakfast = array(["eggs", "spam", "eggs", "toast"], dtype=cat)
> > >
> > > could store the array very efficiently, since it knows that there are
> > > only 3 categories.
> > > Since a categorical in this sense knows almost nothing about the data
> > > stored in it, few operations make sense, although equality does::
> > >
> > >     >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"], dtype=cat)
> > >     >>> breakfast == breakfast2
> > >     array([True, False, True, False])
> > >
> > > The categorical datatype could work like a dictionary: no two
> > > item names can be equal (checked on dtype creation), so that the
> > > equality operation above can be performed very efficiently.
> > > If the values define an order, the category labels (internally
> > > integers) could be ordered the same way to allow efficient sorting and
> > > comparison.
> > >
> > > Whether or not casting is defined from one categorical with less to
> > > one with strictly more values defined is something that the
> > > Categorical datatype would need to decide. Both options should be
> > > available.
> > >
> > >
> > > Unit on the Datatype
> > > """"""""""""""""""""
> > >
> > > There are different ways to define Units; depending on how the
> > > internal machinery would be organized, one way is to have a single
> > > Unit datatype for every existing numerical type.
> > > This will be written as ``Unit[float64]``; the unit itself is part of
> > > the DType instance: ``Unit[float64]("m")`` is a ``float64`` with
> > > meters attached::
> > >
> > >     >>> from astropy import units
> > >     >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m  # meters
> > >     >>> print(meters)
> > >     array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
> > >
> > > Note that units are a bit tricky. It is debatable whether::
> > >
> > >     >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
> > >
> > > should be valid syntax (coercing the float scalars without a unit to
> > > meters).
> > > Once the array is created, math will work without any issue::
> > >
> > >     >>> meters / (2 * units.seconds)
> > >     array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s"))
> > >
> > > Casting is not valid from one unit to the other, but can be valid
> > > between different scales of the same dimensionality (although this may
> > > be "unsafe")::
> > >
> > >     >>> meters.astype(Unit[float64]("s"))
> > >     TypeError: Cannot cast meters to seconds.
> > >     >>> meters.astype(Unit[float64]("km"))
> > >     >>> # Convert to centimeter-gram-second (cgs) units:
> > >     >>> meters.astype(meters.dtype.to_cgs())
> > >
> > > The above notation is somewhat clumsy. Functions could be used instead
> > > to convert between units.
> > > There may be ways to make these more convenient, but those must be
> > > left for future discussions::
> > >
> > >     >>> units.convert(meters, "km")
> > >     >>> units.to_cgs(meters)
> > >
> > > There are some open questions. For example, whether additional methods
> > > on the array object could exist to simplify some of the notions, and
> > > how these would percolate from the datatype to the ``ndarray``.
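> > >
> > > To make the parametric behaviour above concrete, the unit logic of a
> > > multiplication loop could reduce to something like the following toy
> > > stand-in (plain Python for illustration only; ``Unit`` here is a
> > > hypothetical class, not proposed API)::
> > >
> > >     class Unit:
> > >         """Toy stand-in for a parametric Unit[float64] dtype."""
> > >         def __init__(self, unit):
> > >             self.unit = unit  # the parametric part of the instance
> > >         def __repr__(self):
> > >             return f'Unit[float64]("{self.unit}")'
> > >
> > >     def multiply_result_dtype(a, b):
> > >         # units combine, while the storage type (float64) stays fixed
> > >         return Unit(f"{a.unit} {b.unit}")
> > >
> > >     >>> multiply_result_dtype(Unit("m"), Unit("s"))
> > >     Unit[float64]("m s")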
> > >
> > > The interaction with other scalars would likely be defined through::
> > >
> > >     >>> np.common_type(np.float64, Unit)
> > >     Unit[np.float64](dimensionless)
> > >
> > > Ufunc output datatype determination can be more involved than for
> > > simple numerical dtypes since there is no "universal" output type::
> > >
> > >     >>> np.multiply(meters, seconds).dtype != np.result_type(meters, seconds)
> > >
> > > In fact ``np.result_type(meters, seconds)`` must error without context
> > > of the operation being done.
> > > This example highlights how the specific ufunc loop
> > > (loop with known, specific DTypes as inputs) has to be able to make
> > > certain decisions before the actual calculation can start.
> > >
> > >
> > >
> > > Implementation
> > > --------------
> > >
> > > Plan to Approach the Full Refactor
> > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > To address these issues in NumPy and enable new datatypes,
> > > multiple development stages are required:
> > >
> > > * Phase I: Restructure and extend the datatype infrastructure (This
> > >   NEP)
> > >
> > >   * Organize Datatypes like normal Python classes [`PR 15508`]_
> > >
> > > * Phase II: Incrementally define or rework API
> > >
> > >   * Create a new and easily extensible API for defining new datatypes
> > >     and related functionality. (NEP 42)
> > >
> > >   * Incrementally define all necessary functionality through the new
> > >     API (NEP 42):
> > >
> > >     * Defining operations such as ``np.common_type``.
> > >     * Allowing the definition of casting between datatypes.
> > >     * Adding the functionality necessary to create a numpy array from
> > >       Python scalars (i.e. ``np.array(...)``).
> > >     * ...
> > >
> > >   * Restructure how universal functions work (NEP 43), in order to:
> > >
> > >     * make it possible to allow a `~numpy.ufunc` such as ``np.add`` to
> > >       be extended by user-defined datatypes such as Units.
> > >
> > >     * allow efficient lookup for the correct implementation for
> > >       user-defined datatypes.
> > >
> > >     * enable reuse of existing code. Units should be able to use the
> > >       normal math loops and add additional logic to determine output
> > >       type.
> > >
> > > * Phase III: Growth of NumPy and Scientific Python Ecosystem
> > >   capabilities:
> > >
> > >   * Cleanup of legacy behaviour where it is considered buggy or
> > >     undesirable.
> > >   * Provide a path to define new datatypes from Python.
> > >   * Assist the community in creating types such as Units or
> > >     Categoricals.
> > >   * Allow strings to be used in functions such as ``np.equal`` or
> > >     ``np.add``.
> > >   * Remove legacy code paths within NumPy to improve long term
> > >     maintainability.
> > >
> > > This document serves as a basis for Phase I and provides the vision
> > > and motivation for the full project.
> > > Phase I does not introduce any new user-facing features,
> > > but is concerned with the necessary conceptual cleanup of the current
> > > datatype system.
> > > It provides a more "pythonic" datatype Python type object, with a
> > > clear class hierarchy.
> > >
> > > The second phase is the incremental creation of all APIs necessary to
> > > define fully featured datatypes and reorganization of the NumPy
> > > datatype system.
> > > This phase will thus be primarily concerned with defining an,
> > > initially preliminary, stable public API.
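> > >
> > > As a rough sketch of the kind of Python-level API Phase II could lead
> > > to, a parametric DType might define promotion via a dunder-style slot.
> > > The names below (``__common_dtype__`` in particular) are placeholders
> > > to illustrate the shape of the API, not settled decisions; the mock is
> > > pure Python and does not subclass the real ``np.dtype``::
> > >
> > >     class MpfDType:
> > >         """Pure-Python mock of a parametric user DType."""
> > >         def __init__(self, dps):
> > >             self.dps = dps  # parametric precision stored on the instance
> > >
> > >         def __common_dtype__(self, other):
> > >             # promotion keeps the higher precision of the two instances
> > >             return MpfDType(max(self.dps, other.dps))
> > >
> > >         def __repr__(self):
> > >             return f"np.dtype[mpf](dps={self.dps})"
> > >
> > >     >>> MpfDType(15).__common_dtype__(MpfDType(100))
> > >     np.dtype[mpf](dps=100)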
> > >
> > > Some of the benefits of a large refactor may only become evident after
> > > the full deprecation of the current legacy implementation (i.e. larger
> > > code removals).
> > > However, these steps are necessary for improvements to many parts of
> > > the core NumPy API, and are expected to make the implementation
> > > generally easier to understand.
> > >
> > > The following figure illustrates the proposed design at a high level,
> > > and roughly delineates the components of the overall design.
> > > Note that this NEP only concerns Phase I (shaded area);
> > > the rest encompasses Phase II, and those design choices are up for
> > > discussion. However, the figure highlights that the DType datatype
> > > class is the central, necessary concept:
> > >
> > > .. image:: _static/nep-0041-mindmap.svg
> > >
> > >
> > > First steps directly related to this NEP
> > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > The changes required to NumPy are large and touch many areas of the
> > > code base, but many of them can be addressed incrementally.
> > >
> > > To enable an incremental approach we will start by creating a C
> > > defined ``PyArray_DTypeMeta`` class with its instances being the
> > > ``DType`` classes, subclasses of ``np.dtype``.
> > > This is necessary to add the ability to store custom slots on the
> > > DType in C.
> > > This ``DTypeMeta`` will be implemented first to then enable
> > > incremental restructuring of current code.
> > >
> > > The addition of ``DType`` will then enable addressing other changes
> > > incrementally, some of which may begin before settling the full
> > > internal API:
> > >
> > > 1. New machinery for array coercion, with the goal of enabling user
> > >    DTypes with appropriate class methods.
> > > 2. The replacement or wrapping of the current casting machinery.
> > > 3. Incremental redefinition of the current ``PyArray_ArrFuncs`` slots
> > >    into DType method slots.
> > >
> > > At this point, no or only very limited new public API will be added
> > > and the internal API is considered to be in flux.
> > > Any new public API may be set up to give warnings and will have
> > > leading underscores to indicate that it is not finalized and can be
> > > changed without warning.
> > >
> > >
> > > Backward compatibility
> > > ----------------------
> > >
> > > While the actual backward compatibility impact of implementing Phase I
> > > and II is not yet fully clear, we anticipate and accept the following
> > > changes:
> > >
> > > * **Python API**:
> > >
> > >   * ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``, while
> > >     right now ``type(np.dtype("f8")) is np.dtype``.
> > >     Code should use ``isinstance`` checks, and in very rare cases may
> > >     have to be adapted to do so.
> > >
> > > * **C-API**:
> > >
> > >   * In old versions of NumPy ``PyArray_DescrCheck`` is a macro which
> > >     uses ``type(dtype) is np.dtype``. When compiling against an old
> > >     NumPy version, the macro may have to be replaced with the
> > >     corresponding ``PyObject_IsInstance`` call. (If this is a problem,
> > >     we could backport fixing the macro.)
> > >
> > >   * The UFunc machinery changes will break *limited* parts of the
> > >     current implementation.
> > >     Replacing e.g. the default ``TypeResolver`` is expected to remain
> > >     supported for a time, although optimized masked inner loop
> > >     iteration (which is not even used *within* NumPy) will no longer
> > >     be supported.
> > >
> > >   * All functions currently defined on the dtypes, such as
> > >     ``PyArray_Descr->f->nonzero``, will be defined and accessed
> > >     differently.
> > >     This means that in the long run low-level access code will
> > >     have to be changed to use the new API. Such changes are expected
> > >     to be necessary in very few projects.
> > >
> > > * **dtype implementors (C-API)**:
> > >
> > >   * The array which is currently provided to some functions (such as
> > >     cast functions) will no longer be provided.
> > >     For example ``PyArray_Descr->f->nonzero`` or
> > >     ``PyArray_Descr->f->copyswapn`` may instead receive a dummy array
> > >     object with only some fields (mainly the dtype) being valid.
> > >     At least in some code paths, a similar mechanism is already used.
> > >
> > >   * The ``scalarkind`` slot and registration of scalar casting will be
> > >     removed/ignored without replacement.
> > >     It currently allows partial value-based casting.
> > >     The ``PyArray_ScalarKind`` function will continue to work for
> > >     builtin types, but will not be used internally and will be
> > >     deprecated.
> > >
> > >   * Currently user dtypes are defined as instances of ``np.dtype``.
> > >     The creation works by the user providing a prototype instance.
> > >     NumPy will need to modify at least the type during registration.
> > >     This has no effect for either ``rational`` or ``quaternion``, and
> > >     mutation of the structure seems unlikely after registration.
> > >
> > > Since there is a fairly large API surface concerning datatypes,
> > > further changes, or the limitation of certain functions to currently
> > > existing datatypes, are likely to occur.
> > > For example functions which use the type number as input
> > > should be replaced with functions taking DType classes instead.
> > > Although public, large parts of this C-API seem to be used rarely,
> > > possibly never, by downstream projects.
> > >
> > >
> > >
> > > Detailed Description
> > > --------------------
> > >
> > > This section details the design decisions covered by this NEP.
> > > The subsections correspond to the list of design choices presented
> > > in the Scope section.
> > >
> > > Datatypes as Python Classes (1)
> > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > The current NumPy datatypes are not full-scale Python classes.
> > > They are instead (prototype) instances of a single ``np.dtype`` class.
> > > Changing this means that any special handling, e.g. for ``datetime``,
> > > can be moved to the Datetime DType class instead, away from monolithic
> > > general code (e.g. the current ``PyArray_AdjustFlexibleDType``).
> > >
> > > The main consequence of this change with respect to the API is that
> > > special methods move from the dtype instances to methods on the new
> > > DType class.
> > > This is the typical design pattern used in Python.
> > > Organizing these methods and information in a more Pythonic way
> > > provides a solid foundation for refining and extending the API in the
> > > future.
> > > The current API cannot be extended due to how it is exposed publicly.
> > > This means for example that the methods currently stored in
> > > ``PyArray_ArrFuncs`` on each datatype (see NEP 40) will be defined
> > > differently in the future and deprecated in the long run.
> > >
> > > The most prominent visible side effect of this will be that
> > > ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore.
> > > Instead it will be a subclass of ``np.dtype``, meaning that
> > > ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true.
> > > This will also add the ability to use ``isinstance(dtype,
> > > np.dtype[float64])``, thus removing the need to use ``dtype.kind``,
> > > ``dtype.char``, or ``dtype.type`` to do this check.
> > >
> > > With the design decision of DTypes as full-scale Python classes,
> > > the question of subclassing arises.
> > > Inheritance, however, appears problematic and a complexity best
> > > avoided (at least initially) for container datatypes.
> > > Further, subclasses may be more interesting for interoperability --
> > > for example with GPU backends (CuPy) storing additional methods
> > > related to the GPU -- rather than as a mechanism to define new
> > > datatypes.
> > > Since a class hierarchy does provide value, this may be achieved by
> > > allowing the creation of *abstract* datatypes.
> > > An example of an abstract datatype would be the datatype equivalent of
> > > ``np.floating``, representing any floating point number.
> > > These can serve the same purpose as Python's abstract base classes.
> > >
> > >
> > > Scalars should not be instances of the datatypes (2)
> > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > For simple datatypes such as ``float64`` (see also below), it seems
> > > tempting that an instance of ``np.dtype("float64")`` could itself be
> > > the scalar.
> > > This idea may be even more appealing due to the fact that scalars,
> > > rather than datatypes, currently define a useful type hierarchy.
> > >
> > > However, we have specifically decided against this for a number of
> > > reasons.
> > > First, the new datatypes described herein would be instances of DType
> > > classes.
> > > Making these instances themselves classes, while possible, adds
> > > additional complexity that users need to understand.
> > > It would also mean that scalars must have storage information (such as
> > > byteorder), which is generally unnecessary and currently is not used.
> > > Second, while the simple NumPy scalars such as ``float64`` may be such
> > > instances, it should be possible to create datatypes for Python
> > > objects without enforcing NumPy as a dependency.
> > > However, Python objects that do not depend on NumPy cannot be
> > > instances of a NumPy DType.
> > > Third, there is a mismatch between the methods and attributes which
> > > are useful for scalars and datatypes. For instance ``to_float()``
> > > makes sense for a scalar but not for a datatype, and ``newbyteorder``
> > > is not useful on a scalar (or has a different meaning).
> > >
> > > Overall, it seems that rather than reducing the complexity by merging
> > > the two distinct type hierarchies, making scalars instances of DTypes
> > > would increase the complexity of both the design and implementation.
> > >
> > > A possible future path may be to instead simplify the current NumPy
> > > scalars to be much simpler objects which largely derive their
> > > behaviour from the datatypes.
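> > >
> > > The option of making ``isinstance(scalar, dtype)`` return ``True``
> > > without scalars actually being instances can rely on Python's standard
> > > metaclass hook. A minimal stand-alone sketch (not NumPy code; names
> > > invented for illustration)::
> > >
> > >     class DTypeMeta(type):
> > >         def __instancecheck__(cls, obj):
> > >             # report matching scalars as "instances" although the
> > >             # scalar type remains a fully separate class
> > >             return isinstance(obj, getattr(cls, "scalar_type", ()))
> > >
> > >     class Float64DType(metaclass=DTypeMeta):
> > >         scalar_type = float  # stand-in for np.float64
> > >
> > >     >>> isinstance(3.14, Float64DType)
> > >     True
> > >     >>> type(3.14) is Float64DType
> > >     False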
> > >
> > > C-API for creating new Datatypes (3)
> > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > The current C-API with which users can create new datatypes
> > > is limited in scope, and requires use of "private" structures. This
> > > means the API is not extensible: no new members can be added to the
> > > structure without losing binary compatibility.
> > > This has already limited the inclusion of new sorting methods into
> > > NumPy [new_sort]_.
> > >
> > > The new version shall thus replace the current ``PyArray_ArrFuncs``
> > > structure used to define new datatypes.
> > > Datatypes that currently exist and are defined using these slots will
> > > be supported during a deprecation period.
> > >
> > > The most likely solution, which hides the implementation from the user
> > > and thus makes it extensible in the future, is to model the API after
> > > Python's stable API [PEP-384]_:
> > >
> > > .. code-block:: C
> > >
> > >     static struct PyArrayMethodDef slots[] = {
> > >         {NPY_dt_method, method_implementation},
> > >         ...,
> > >         {0, NULL}
> > >     };
> > >
> > >     typedef struct {
> > >       PyTypeObject *typeobj;  /* type of python scalar */
> > >       ...;
> > >       PyType_Slot *slots;
> > >     } PyArrayDTypeMeta_Spec;
> > >
> > >     PyObject* PyArray_InitDTypeMetaFromSpec(
> > >         PyArray_DTypeMeta *user_dtype, PyArrayDTypeMeta_Spec *dtype_spec);
> > >
> > > The C-side slots should be designed to mirror Python-side methods
> > > such as ``dtype.__dtype_method__``, although the exposure to Python is
> > > a later step in the implementation to reduce the complexity of the
> > > initial implementation.
> > >
> > >
> > > C-API Changes to the UFunc Machinery (4)
> > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > Proposed changes to the UFunc machinery will be part of NEP 43.
> > > However, the following changes will be necessary (see NEP 40 for a
> > > detailed description of the current implementation and its issues):
> > >
> > > * The current UFunc type resolution must be adapted to allow better
> > >   control for user-defined dtypes as well as resolve current
> > >   inconsistencies.
> > > * The inner-loop used in UFuncs must be expanded to include a return
> > >   value.
> > >   Further, error reporting must be improved, and passing in
> > >   dtype-specific information enabled.
> > >   This requires the modification of the inner-loop function signature
> > >   and addition of new hooks called before and after the inner-loop is
> > >   used.
> > >
> > > An important goal for any changes to the universal functions will be
> > > to allow the reuse of existing loops.
> > > It should be easy for a new units datatype to fall back to existing
> > > math functions after handling the unit-related computations.
> > >
> > >
> > > Discussion
> > > ----------
> > >
> > > See NEP 40 for a list of previous meetings and discussions.
> > >
> > >
> > > References
> > > ----------
> > >
> > > .. [pandas_extension_arrays]
> > >    https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types
> > >
> > > .. [xarray_dtype_issue] https://github.com/pydata/xarray/issues/1262
> > >
> > > .. [pygeos] https://github.com/caspervdw/pygeos
> > >
> > > .. [new_sort] https://github.com/numpy/numpy/pull/12945
> > >
> > > .. [PEP-384] https://www.python.org/dev/peps/pep-0384/
> > >
> > > ..
[PR 15508] https://github.com/numpy/numpy/pull/15508 > > > > > > > > > Copyright > > > --------- > > > > > > This document has been placed in the public domain. > > > > > > > > > Acknowledgments > > > --------------- > > > > > > The effort to create new datatypes for NumPy has been discussed for > > > several > > > years in many different contexts and settings, making it impossible > > > to > > > list everyone involved. > > > We would like to thank especially Stephan Hoyer, Nathaniel Smith, > > > and Eric > > > Wieser > > > for repeated in-depth discussion about datatype design. > > > We are very grateful for the community input in reviewing and > > > revising this > > > NEP and would like to thank especially Ross Barnowski and Ralf > > > Gommers. > > > > > > _______________________________________________ > > > NumPy-Discussion mailing list > > > NumPy-Discussion at python.org > > > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at python.org > > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From faltet at gmail.com Mon Mar 23 13:23:07 2020 From: faltet at gmail.com (Francesc Alted) Date: Mon, 23 Mar 2020 18:23:07 +0100 Subject: [Numpy-discussion] Proposal: NEP 41 -- First step towards a new Datatype System In-Reply-To: <162390781577dd2c2916f4dc600b65cf7f09abb1.camel@sipsolutions.net> References: <3ee04012570e418a874e4c436b5b6f8ac5f56cdb.camel@sipsolutions.net> <162390781577dd2c2916f4dc600b65cf7f09abb1.camel@sipsolutions.net> Message-ID: On Sun, Mar 22, 2020 at 7:33 PM Sebastian Berg wrote: > Hi, > > thanks for the feedback! > > On Sat, 2020-03-21 at 15:58 -0500, Travis Oliphant wrote: > > Thanks for publicizing this and all the work that has gone into > > getting > > this far. > > > > I'm extremely supportive of the foundational DType meta-type and > > making > > dtypes classes. This was the epiphany I had in 2015 that led me to > > experiment with xnd and later mtypes. I have not had the funding to > > work > > on it much since that time directly. > > Right, I realize it is an old idea, if you have any references I am > missing (I am sure there are many), I am happy to add them. > > > But, this is the right way to connect the data type system with the > > rest of > > Python typing. NumPy's current dtypes are currently analogous to > > Python > > 1's user-defined classes. In Python 1 *all* user-defined classes > > were > > instances of a single Class Type at the C-level, just like currently > > all > > NumPy dtypes are instances of a single Dtype "Type" in Python. > > > > Shifting Dtypes to be true types (by making them instances of a > > single > > low-level MetaType) is (IMHO) exactly the right approach. Doing > > this > > first while trying to minimize other changes will help a lot. I'm > > very > > excited by the work being done in this direction. > > > > I can appreciate the desire to be cautious on some of the other > > issues > > (like removing numpy array scalars). 
I do still think that
> > eventually removing numpy array scalars in lieu of instances of dtype
> > objects will be a less complex approach, and am not generally sold by
> > the reasons listed in the NEP (though I can appreciate that it's not
> > something to do as part of *this* NEP) as getting there might take more
> > effort than desired at this point.

> Well, I do think it is a pretty strong design decision here though. If
> instances of DType classes are the actual dtypes (and not themselves
> classes), then it seems strange if scalars are also (direct) instances
> of the same DType class?
>
> Of course we can and probably will allow `isinstance(scalar, DType)` to
> work in either case. I do not see a problem with that, although I do
> not feel like making that decision right now.
>
> If we can agree on still going this direction for now I am happy of
> course. Nothing stops us from amending or finding new solutions in the
> future after all.
>
> I used to love the idea, but to be honest, I currently do not see:
>
> 1. How to approach it. It would have to be within Python itself, or we
> would need more shims for Python builtin types?
> 2. That it is actually helpful for users.
>
> If we were designing a new programming language around array computing
> principles, I do think that would be the approach I would want to
> take/consider. But I simply lack the vision of how marrying the idea
> with the scalar language Python would work out well...

I have had a glance at what you are after, and it seems challenging
indeed. IMO, trying to cope with everybody's needs regarding data types
is extremely costly (or more simply, not even possible). I think that a
better approach would be to decouple the storage part of a container
from its data type system. Into the storage should go things needed to
cope with data retrieval, like the itemsize, the shape, or even other
sub-shapes for chunked datasets. Then, in the data type layer, one
should be able to add meaning to the raw data: is that an integer?
speed? temperature? a compound type?

Indeed the data storage layer should be able to provide a way to store
the data type representation so that a container can be serialized and
deserialized correctly. But the important thing here is that this
decoupling between storage and types allows for different data type
systems, so that anyone can come up with a specific type system
depending on her needs. One can even envision here a basic data type
system (e.g. a version of what's now supported in NumPy) that can be
extended with other layers, depending on the needs, so that every
community can interchange data at a basic level at least.

As an example, this is the basic spirit behind the under-construction
Caterva array container (https://github.com/Blosc/Caterva). Blosc2
(https://github.com/Blosc/C-Blosc2) will be providing the low-level
storage layer, with no information about dimensionality. Caterva will
be building the multidimensional layer, but with no information about
the types at all. On top of this scaffold, third-party layers will be
free to build their own data types, specific for every domain (the
concept is illustrated in slide 18 of this presentation:
https://blosc.org/docs/Caterva-HDF5-Workshop.pdf). There is nothing to
prevent adding more layers, or even a two-layer (and preferably no more
than two levels) data type system: one for simple data types (e.g.
NumPy ones) and one meant to be more domain-specific.
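A minimal sketch in Python of the decoupling I have in mind (all names
are invented for illustration; this is not the actual Caterva or Blosc2
API):

    import struct

    # Storage layer: only bytes, shape and itemsize -- no meaning attached.
    class Storage:
        def __init__(self, buf, shape, itemsize):
            self.buf, self.shape, self.itemsize = buf, shape, itemsize

    # Type layer: attaches meaning (here, a unit) to an existing storage.
    class TypedArray:
        def __init__(self, storage, kind, unit=None):
            self.storage, self.kind, self.unit = storage, kind, unit

    raw = struct.pack("<3d", 1.0, 2.0, 3.0)
    speeds = TypedArray(Storage(raw, (3,), 8), kind="float64", unit="m/s")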
Right now, I think that the limitation that is keeping the NumPy
community thinking in terms of blending storage and types in the same
layer is that NumPy is providing a computational engine too, and for
doing computations, one needs to provide both storage (including
dimensionality info) and type information indeed. By using the
multi-layer approach, there should be a computational layer that is
laid out on top of the storage and the type layers, and hence specific
to leveraging them. Sure, that is a big departure from what we are used
to, but as long as one can keep the architecture of the different
layers simple, one could see interesting results in not that long a
time.

Just my 2 cents,

Francesc

>
> > What I would *strongly* recommend right now, however, is to make the
> > new NumPy dtype system a separately-installable module (kept in the
> > NumPy GitHub organization). In that way, people can depend on the
> > NumPy type system without depending on NumPy itself. I think this
> > will become more and more important in the future. It will help the
> > design as you see NumPy as one of many *consumers* of the type system
> > instead of the only one. It would also help projects like arrow and
> > xnd and others in the future that might only want to depend on
> > NumPy's type system but otherwise implement their own computations.
>
> Right, I agree that is the correct long term direction to see the
> DTypes as distinct from the NumPy array, and maybe I should add that to
> the NEP.
> What I am unsure about is the feasibility? If we develop it outside of
> NumPy, it is harder to:
>
> 1. Use the new system without actually exposing it as public API in
> order to incrementally replace the old with a newer machinery.
> 2. It may require either exposing subclassing capabilities to NumPy to
> add shims for legacy DTypes right from the start, or adding to that
> project a bunch of public API which is only meant to be used within
> NumPy?
>
> I suppose, I am also not sure that having it in NumPy (at least for
> now) is actually all that bad? For array-likes it is probably not the
> heaviest dependency (and it could be slimmed down into a core).
>
> Since the intention is to dogfood the API as much as possible and to
> limit the public API, it should be plausible to rip it out later of
> course.
> I am sure that will be more overall effort, but I suppose I feel it is
> a much more approachable effort.
>
> One thing I would like is for projects such as CuPy to be able to
> subclass DTypes at some point to tag on the GPU-aware things they need.
> But in some sense the basic DTypes seem to require being tied in with
> NumPy? They must be associated with the NumPy scalars, and the basic
> methods defined for all DTypes (also user DTypes) will probably be
> strided inner-loops on the CPU.
>
> > This might require a little more work to provide an adaptor layer in
> > NumPy itself to use the new system instead of its current dtypes, but
> > I think it will also help ensure that the datatype API is cleaner and
> > more useful to the Python ecosystem as a whole.
>
> While I fully agree with the sentiment, I suppose I am scared that the
> little more work will end up being too much :(. We have pretty limited
> resources and the most difficult work will not be writing the DType API
> itself.
> It will be wrangling it into NumPy and the associated huge review
> effort to get it right.
> Only by actually wrangling it into NumPy, I think we can also get the
> API fully right to begin with.
> So, I am scared that moving development outside and trying to add the
> more global scope at this time will make the NumPy side much more
> difficult :(. Maybe not even because it is actually much trickier, but
> again because it seems less tangible/approachable.
>
> So, my main point here is that we have to make this large refactor as
> approachable as possible, and if that means that at some point someone
> has to spend a huge, but hopefully straightforward, effort to rip
> DTypes out of NumPy, I think that might be a worthy trade-off.
> Unless we can activate significantly larger resources very quickly.
>
> Best,
>
> Sebastian
>
>
> >
> > Thanks,
> >
> > -Travis
> >
> >
> >
> >
> > On Wed, Mar 11, 2020 at 7:08 PM Sebastian Berg <
> > sebastian at sipsolutions.net>
> > wrote:
> >
> > > Hi all,
> > >
> > > I am pleased to propose NEP 41: First step towards a new Datatype
> > > System https://numpy.org/neps/nep-0041-improved-dtype-support.html
> > >
> > > This NEP motivates the larger restructure of the datatype machinery
> > > in NumPy and defines a few fundamental design aspects. The long term
> > > user impact will be allowing easier and richer-featured user-defined
> > > datatypes.
> > >
> > > As this is a large restructure, the NEP represents only the first
> > > steps, with some additional information in further NEPs being
> > > drafted [1] (this may be helpful to look at depending on the level
> > > of detail you are interested in).
> > > The NEP itself does not propose to add significant new public API.
> > > Instead it proposes to move forward with an incremental internal
> > > refactor and lays the foundation for this process.
> > >
> > > The main user facing change at this time is that datatypes will
> > > become classes (e.g. ``type(np.dtype("float64"))`` will be a
> > > float64-specific class).
> > > For most users, the main impact should be many new datatypes in the
> > > long run (see the user impact section). However, for those
> > > interested in API design within NumPy or with respect to
> > > implementing new datatypes, this and the following NEPs are
> > > important decisions in the future roadmap for NumPy.
> > >
> > > The current full text is reproduced below, although the above link
> > > is probably a better way to read it.
> > >
> > > Cheers
> > >
> > > Sebastian
> > >
> > >
> > > [1] NEP 40 gives some background information about the current
> > > system and issues with it:
> > >
> > > https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac/doc/neps/nep-0040-legacy-datatype-impl.rst
> > > and NEP 42 being a first draft of how the new API may look:
> > >
> > > https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst
> > > (links to current rendered versions, check
> > > https://github.com/numpy/numpy/pull/15505 and
> > > https://github.com/numpy/numpy/pull/15507 for updates)
> > >
> > >
> > > ----------------------------------------------------------------------
> > >
> > >
> > > ==================================================
> > > NEP 41 -- First step towards a new Datatype System
> > > ==================================================
> > >
> > > :title: Improved Datatype Support
> > > :Author: Sebastian Berg
> > > :Author: Stéfan van der Walt
> > > :Author: Matti Picus
> > > :Status: Draft
> > > :Type: Standard Track
> > > :Created: 2020-02-03
> > >
> > >
> > > .. note::
> > >
> > >     This NEP is part of a series of NEPs encompassing first
> > >     information about the previous dtype implementation and issues
> > >     with it in NEP 40.
> > >     NEP 41 (this document) then provides an overview and generic
> > >     design choices for the refactor.
> > >     Further NEPs 42 and 43 go into the technical details of the
> > >     datatype and universal function related internal and external API
> > >     changes.
> > >     In some cases it may be necessary to consult the other NEPs for a
> > >     full picture of the desired changes and why these changes are
> > >     necessary.
> > >
> > >
> > > Abstract
> > > --------
> > >
> > > `Datatypes` in NumPy describe how to interpret each element in
> > > arrays. NumPy provides ``int``, ``float``, and ``complex`` numerical
> > > types, as well as string, datetime, and structured datatype
> > > capabilities.
> > > The growing Python community, however, has need for more diverse
> > > datatypes.
> > > Examples are datatypes with unit information attached (such as
> > > meters) or categorical datatypes (fixed set of possible values).
> > > However, the current NumPy datatype API is too limited to allow the
> > > creation of these.
> > >
> > > This NEP is the first step to enable such growth; it will lead to
> > > a simpler development path for new datatypes.
> > > In the long run the new datatype system will also support the
> > > creation of datatypes directly from Python rather than C.
> > > Refactoring the datatype API will improve maintainability and
> > > facilitate development of both user-defined external datatypes,
> > > as well as new features for existing datatypes internal to NumPy.
> > >
> > >
> > > Motivation and Scope
> > > --------------------
> > >
> > > .. seealso::
> > >
> > >     The user impact section includes examples of what kind of new
> > >     datatypes will be enabled by the proposed changes in the long
> > >     run. It may thus help to read these sections out of order.
> > >
> > > Motivation
> > > ^^^^^^^^^^
> > >
> > > One of the main issues with the current API is the definition of
> > > typical functions such as addition and multiplication for parametric
> > > datatypes (see also NEP 40), which require additional steps to
> > > determine the output type.
> > > For example when adding two strings of length 4, the result is a
> > > string of length 8, which is different from the input.
> > > Similarly, a datatype which embeds a physical unit must calculate the
> > > new unit information: dividing a distance by a time results in a
> > > speed.
> > > A related difficulty is that the :ref:`current casting rules
> > > <ufuncs.casting>` -- the conversion between different datatypes --
> > > cannot describe casting for such parametric datatypes implemented
> > > outside of NumPy.
> > >
> > > This additional functionality for supporting parametric datatypes
> > > introduces increased complexity within NumPy itself,
> > > and furthermore is not available to external user-defined datatypes.
> > > In general the concerns of different datatypes are not well
> > > encapsulated.
> > > This burden is exacerbated by the exposure of internal C structures,
> > > limiting the addition of new fields
> > > (for example to support new sorting methods [new_sort]_).
> > >
> > > Currently there are many factors which limit the creation of new
> > > user-defined datatypes:
> > >
> > > * Creating casting rules for parametric user-defined dtypes is either
> > >   impossible or so complex that it has never been attempted.
> > > * Type promotion, e.g. the operation deciding that adding float and
> > >   integer values should return a float value, is very valuable for
> > >   numeric datatypes but is limited in scope for user-defined and
> > >   especially parametric datatypes.
> > > * Much of the logic (e.g. promotion) is written in single functions
> > >   instead of being split as methods on the datatype itself.
> > > * In the current design datatypes cannot have methods that do not
> > >   generalize to other datatypes. For example a unit datatype cannot
> > >   have a ``.to_si()`` method to easily find the datatype which would
> > >   represent the same values in SI units.
> > >
> > > The pressing need to solve these issues has driven the scientific
> > > community to create work-arounds in multiple projects implementing
> > > physical units as an array-like class instead of a datatype, which
> > > would generalize better across multiple array-likes (Dask, pandas,
> > > etc.).
> > > Already, Pandas has made a push in the same direction with its
> > > extension arrays [pandas_extension_arrays]_ and undoubtedly
> > > the community would be best served if such new features could be
> > > common between NumPy, Pandas, and other projects.
> > >
> > > Scope
> > > ^^^^^
> > >
> > > The proposed refactoring of the datatype system is a large
> > > undertaking and thus is proposed to be split into various phases,
> > > roughly:
> > >
> > > * Phase I: Restructure and extend the datatype infrastructure (This
> > >   NEP 41)
> > > * Phase II: Incrementally define or rework API (Detailed largely in
> > >   NEPs 42/43)
> > > * Phase III: Growth of NumPy and Scientific Python Ecosystem
> > >   capabilities.
> > >
> > > For a more detailed accounting of the various phases, see
> > > "Plan to Approach the Full Refactor" in the Implementation section
> > > below.
> > > This NEP proposes to move ahead with the necessary creation of new
> > > dtype subclasses (Phase I), and start working on implementing current
> > > functionality.
> > > Within the context of this NEP all development will be fully private
> > > API or use preliminary underscored names which must be changed in the
> > > future.
> > > Most of the internal and public API choices are part of a second
> > > Phase and will be discussed in more detail in the following NEPs 42
> > > and 43.
> > > The initial implementation of this NEP will have little or no effect
> > > on users, but provides the necessary groundwork for incrementally
> > > addressing the full rework.
> > >
> > > The implementation of this NEP and the following, implied large
> > > rework of how datatypes are defined in NumPy is expected to create
> > > small incompatibilities (see backward compatibility section).
> > > However, a transition requiring large code adaptation is not
> > > anticipated and not within scope.
> > >
> > > Specifically, this NEP makes the following design choices, which are
> > > discussed in more detail in the detailed description section:
> > >
> > > 1. Each datatype will be an instance of a subclass of ``np.dtype``,
> > >    with most of the datatype-specific logic being implemented as
> > >    special methods on the class. In the C-API, these correspond to
> > >    specific slots. In short, for ``f = np.dtype("f8")``,
> > >    ``isinstance(f, np.dtype)`` will remain true, but ``type(f)`` will
> > >    be a subclass of ``np.dtype`` rather than just ``np.dtype``
> > >    itself.
> > >    The ``PyArray_ArrFuncs`` which are currently stored as a pointer
> > >    on the instance (as ``PyArray_Descr->f``) should instead be stored
> > >    on the class, as typically done in Python.
> > >    In the future these may correspond to Python-side dunder methods.
> > >    Storage information such as itemsize and byteorder can differ
> > >    between different dtype instances (e.g. "S3" vs. "S8") and will
> > >    remain part of the instance.
> > >    This means that in the long run the current low-level access to
> > >    dtype methods will be removed (see ``PyArray_ArrFuncs`` in NEP
> > >    40).
> > >
> > > 2. The current NumPy scalars will *not* change; they will not be
> > >    instances of datatypes. This will also be true for new datatypes:
> > >    scalars will not be instances of a dtype (although
> > >    ``isinstance(scalar, dtype)`` may be made to return ``True`` when
> > >    appropriate).
> > >
> > >    Detailed technical decisions to follow in NEP 42.
> > >
> > > Further, the public API will be designed in a way that is extensible
> > > in the future:
> > >
> > > 3. All new C-API functions provided to the user will hide
> > >    implementation details as much as possible. The public API should
> > >    be an identical, but limited, version of the C-API used for the
> > >    internal NumPy datatypes.
> > >
> > > The changes to the datatype system in Phase II must include a large
> > > refactor of the UFunc machinery, which will be further defined in
> > > NEP 43:
> > >
> > > 4. To enable all of the desired functionality for new user-defined
> > >    datatypes, the UFunc machinery will be changed to replace the
> > >    current dispatching and type resolution system.
> > >    The old system should be *mostly* supported as a legacy version
> > >    for some time.
> > >
> > > Additionally, as a general design principle, the addition of new
> > > user-defined datatypes will *not* change the behaviour of programs.
> > > For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or
> > > ``b`` know that ``c`` exists.
> > >
> > >
> > > User Impact
> > > -----------
> > >
> > > The current ecosystem has very few user-defined datatypes using
> > > NumPy, the two most prominent being ``rational`` and ``quaternion``.
> > > These represent fairly simple datatypes which are not strongly
> > > impacted by the current limitations.
> > > However, we have identified a need for datatypes such as:
> > >
> > > * bfloat16, used in deep learning
> > > * categorical types
> > > * physical units (such as meters)
> > > * datatypes for tracing/automatic differentiation
> > > * high, fixed precision math
> > > * specialized integer types such as int2, int24
> > > * new, better datetime representations
> > > * extending e.g. integer dtypes to have a sentinel NA value
> > > * geometrical objects [pygeos]_
> > >
> > > Some of these are partially solved; for example unit capability is
> > > provided in ``astropy.units``, ``unyt``, or ``pint``, as
> > > `numpy.ndarray` subclasses.
> > > Most of these datatypes, however, simply cannot be reasonably defined
> > > right now.
> > > An advantage of having such datatypes in NumPy is that they should
> > > integrate seamlessly with other array or array-like packages such as
> > > Pandas, ``xarray`` [xarray_dtype_issue]_, or ``Dask``.
> > >
> > > The long term user impact of implementing this NEP will be to allow
> > > both the growth of the whole ecosystem by having such new datatypes,
> > > as well as consolidating implementation of such datatypes within
> > > NumPy to achieve better interoperability.
> > >
> > >
> > > Examples
> > > ^^^^^^^^
> > >
> > > The following examples represent future user-defined datatypes we
> > > wish to enable.
> > > These datatypes are not part of the NEP, and the choices shown (e.g.
> > > the choice of casting rules) are possibilities we wish to enable;
> > > they do not represent recommendations.
> > >
> > > Simple Numerical Types
> > > """"""""""""""""""""""
> > >
> > > Mainly used where memory is a consideration, lower-precision numeric
> > > types such as `bfloat16
> > > <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_
> > > are common in other computational frameworks.
> > > For these types the definitions of things such as ``np.common_type``
> > > and ``np.can_cast`` are some of the most important interfaces. Once
> > > they support ``np.common_type``, it is (for the most part) possible
> > > to find the correct ufunc loop to call, since most ufuncs -- such as
> > > add -- effectively only require ``np.result_type``::
> > >
> > >     >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)
> > >
> > > and `~numpy.result_type` is largely identical to
> > > `~numpy.common_type`.
> > >
> > >
> > > Fixed, high precision math
> > > """"""""""""""""""""""""""
> > >
> > > Allowing arbitrary precision or higher precision math is important in
> > > simulations. For instance ``mpmath`` defines a precision::
> > >
> > >     >>> import mpmath as mp
> > >     >>> print(mp.dps)  # the current (default) precision
> > >     15
> > >
> > > NumPy should be able to construct a native, memory-efficient array
> > > from a list of ``mpmath.mpf`` floating point objects::
> > >
> > >     >>> arr_15_dps = np.array(mp.arange(3))  # (mp.arange returns a list)
> > >     >>> print(arr_15_dps)  # Must find the correct precision from the objects:
> > >     array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15])
> > >
> > > We should also be able to specify the desired precision when
> > > creating the datatype for an array. Here, we use ``np.dtype[mp.mpf]``
> > > to find the DType class (the notation is not part of this NEP),
> > > which is then instantiated with the desired parameter.
> > > This could also be written as the ``MpfDType`` class::
> > >
> > >     >>> arr_100_dps = np.array([1, 2, 3], dtype=np.dtype[mp.mpf](dps=100))
> > >     >>> print(arr_15_dps + arr_100_dps)
> > >     array(['0.0', '2.0', '4.0'], dtype=mpf[dps=100])
> > >
> > > The ``mpf`` datatype can decide that the result of the operation
> > > should be the higher precision one of the two, so it uses a precision
> > > of 100.
> > > Furthermore, we should be able to define casting, for example as in::
> > >
> > >     >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype, casting="safe")
> > >     True
> > >     >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype, casting="safe")
> > >     False  # loses precision
> > >     >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype, casting="same_kind")
> > >     True
> > >
> > > Casting from float is probably always at least a ``same_kind`` cast,
> > > but in general, it is not safe::
> > >
> > >     >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4), casting="safe")
> > >     False
> > >
> > > since a float64 has a higher precision than the ``mpf`` datatype with
> > > ``dps=4``.
> > >
> > > Alternatively, we can say that::
> > >
> > >     >>> np.common_type(np.dtype[mp.mpf](dps=5), np.dtype[mp.mpf](dps=10))
> > >     np.dtype[mp.mpf](dps=10)
> > >
> > > And possibly even::
> > >
> > >     >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64)
> > >     np.dtype[mp.mpf](dps=16)  # equivalent precision to float64 (I believe)
> > >
> > > since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)``
> > > safely.
> > >
> > >
> > > Categoricals
> > > """"""""""""
> > >
> > > Categoricals are interesting in that they can have fixed, predefined
> > > values, or can be dynamic with the ability to modify categories when
> > > necessary.
> > > Fixed categories (defined ahead of time) are the most straightforward
> > > categorical definition.
> > > Categoricals are *hard*, since there are many strategies to implement
> > > them, suggesting NumPy should only provide the scaffolding for
> > > user-defined categorical types. For instance::
> > >
> > >     >>> cat = Categorical(["eggs", "spam", "toast"])
> > >     >>> breakfast = array(["eggs", "spam", "eggs", "toast"], dtype=cat)
> > >
> > > could store the array very efficiently, since it knows that there are
> > > only 3 categories.
> > > Since a categorical in this sense knows almost nothing about the data
> > > stored in it, few operations make sense, although equality does::
> > >
> > >     >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"], dtype=cat)
> > >     >>> breakfast == breakfast2
> > >     array([True, False, True, False])
> > >
> > > The categorical datatype could work like a dictionary: no two
> > > item names can be equal (checked on dtype creation), so that the
> > > equality operation above can be performed very efficiently.
> > > If the values define an order, the category labels (internally
> > > integers) could be ordered the same way to allow efficient sorting
> > > and comparison.
> > >
> > > Whether or not casting is defined from one categorical with less to
> > > one with strictly more values defined is something that the
> > > Categorical datatype would need to decide. Both options should be
> > > available.
> > >
> > >
> > > Unit on the Datatype
> > > """"""""""""""""""""
> > >
> > > There are different ways to define Units; depending on how the
> > > internal machinery would be organized, one way is to have a single
> > > Unit datatype for every existing numerical type.
> > > This will be written as ``Unit[float64]``; the unit itself is part of
> > > the DType instance: ``Unit[float64]("m")`` is a ``float64`` with
> > > meters attached::
> > >
> > >     >>> from astropy import units
> > >     >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m  # meters
> > >     >>> print(meters)
> > >     array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
> > >
> > > Note that units are a bit tricky. It is debatable whether::
> > >
> > >     >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
> > >
> > > should be valid syntax (coercing the float scalars without a unit to
> > > meters).
> > > Once the array is created, math will work without any issue::
> > >
> > >     >>> meters / (2 * units.seconds)
> > >     array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s"))
> > >
> > > Casting is not valid from one unit to the other, but can be valid
> > > between different scales of the same dimensionality (although this
> > > may be "unsafe")::
> > >
> > >     >>> meters.astype(Unit[float64]("s"))
> > >     TypeError: Cannot cast meters to seconds.
> > >     >>> meters.astype(Unit[float64]("km"))
> > >     >>> # Convert to centimeter-gram-second (cgs) units:
> > >     >>> meters.astype(meters.dtype.to_cgs())
> > >
> > > The above notation is somewhat clumsy. Functions could be used
> > > instead to convert between units.
> > > There may be ways to make these more convenient, but those must be
> > > left for future discussions::
> > >
> > >     >>> units.convert(meters, "km")
> > >     >>> units.to_cgs(meters)
> > >
> > > There are some open questions. For example, whether additional
> > > methods on the array object could exist to simplify some of the
> > > notions, and how these would percolate from the datatype to the
> > > ``ndarray``.
> > >
> > > The interaction with other scalars would likely be defined through::
> > >
> > >     >>> np.common_type(np.float64, Unit)
> > >     Unit[np.float64](dimensionless)
> > >
> > > Ufunc output datatype determination can be more involved than for
> > > simple numerical dtypes since there is no "universal" output type::
> > >
> > >     >>> np.multiply(meters, seconds).dtype != np.result_type(meters, seconds)
> > >
> > > In fact ``np.result_type(meters, seconds)`` must error without
> > > context of the operation being done.
> > > This example highlights how the specific ufunc loop
> > > (loop with known, specific DTypes as inputs) has to be able to make
> > > certain decisions before the actual calculation can start.
> > >
> > >
> > >
> > > Implementation
> > > --------------
> > >
> > > Plan to Approach the Full Refactor
> > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > To address these issues in NumPy and enable new datatypes,
> > > multiple development stages are required:
> > >
> > > * Phase I: Restructure and extend the datatype infrastructure (This
> > >   NEP)
> > >
> > >   * Organize Datatypes like normal Python classes [`PR 15508`]_
> > >
> > > * Phase II: Incrementally define or rework API
> > >
> > >   * Create a new and easily extensible API for defining new datatypes
> > >     and related functionality. (NEP 42)
> > >
> > >   * Incrementally define all necessary functionality through the new
> > >     API (NEP 42):
> > >
> > >     * Defining operations such as ``np.common_type``.
> > >     * Allowing the definition of casting between datatypes.
> > >     * Adding the functionality necessary to create a numpy array from
> > >       Python scalars (i.e. ``np.array(...)``).
> > >     * ...
> > >
> > > * Restructure how universal functions work (NEP 43), in order to:
> > >
> > >   * make it possible to allow a `~numpy.ufunc` such as ``np.add`` to
> > >     be extended by user-defined datatypes such as Units.
> > >
> > >   * allow efficient lookup for the correct implementation for
> > >     user-defined datatypes.
> > >
> > >   * enable reuse of existing code. Units should be able to use the
> > >     normal math loops and add additional logic to determine the
> > >     output type.
> > >
> > > * Phase III: Growth of NumPy and Scientific Python Ecosystem
> > >   capabilities:
> > >
> > >   * Cleanup of legacy behaviour where it is considered buggy or
> > >     undesirable.
> > >   * Provide a path to define new datatypes from Python.
> > >   * Assist the community in creating types such as Units or
> > >     Categoricals.
> > >   * Allow strings to be used in functions such as ``np.equal`` or
> > >     ``np.add``.
> > >   * Remove legacy code paths within NumPy to improve long-term
> > >     maintainability.
> > >
> > > This document serves as a basis for Phase I and provides the vision
> > > and motivation for the full project.
> > > Phase I does not introduce any new user-facing features, but is
> > > concerned with the necessary conceptual cleanup of the current
> > > datatype system.
> > > It provides a more "pythonic" Python type object for datatypes, with
> > > a clear class hierarchy.
> > >
> > > The second phase is the incremental creation of all APIs necessary
> > > to define fully featured datatypes and the reorganization of the
> > > NumPy datatype system.
> > > This phase will thus be primarily concerned with defining a public
> > > API that is initially preliminary and later stabilized.
> > >
> > > Some of the benefits of a large refactor may only become evident
> > > after the full deprecation of the current legacy implementation
> > > (i.e. larger code removals).
> > > However, these steps are necessary for improvements to many parts of
> > > the core NumPy API, and are expected to make the implementation
> > > generally easier to understand.
> > >
> > > The following figure illustrates the proposed design at a high
> > > level, and roughly delineates the components of the overall design.
> > > Note that this NEP only regards Phase I (shaded area); the rest
> > > encompasses Phase II, whose design choices are still up for
> > > discussion. The figure nevertheless highlights that the DType class
> > > is the central, necessary concept:
> > >
> > > .. image:: _static/nep-0041-mindmap.svg
> > >
> > >
> > > First steps directly related to this NEP
> > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > The changes required to NumPy are large and touch many areas of the
> > > code base, but many of them can be addressed incrementally.
> > >
> > > To enable an incremental approach we will start by creating a
> > > C-defined ``PyArray_DTypeMeta`` class with its instances being the
> > > ``DType`` classes, subclasses of ``np.dtype``.
> > > This is necessary to add the ability to store custom slots on the
> > > DType in C.
> > > This ``DTypeMeta`` will be implemented first to then enable
> > > incremental restructuring of current code.
> > >
> > > The addition of ``DType`` will then enable addressing other changes
> > > incrementally, some of which may begin before settling the full
> > > internal API:
> > > 1. New machinery for array coercion, with the goal of enabling user
> > >    DTypes with appropriate class methods.
> > > 2. The replacement or wrapping of the current casting machinery.
> > > 3. Incremental redefinition of the current ``PyArray_ArrFuncs``
> > >    slots into DType method slots.
> > >
> > > At this point, no or only very limited new public API will be added
> > > and the internal API is considered to be in flux.
> > > Any new public API may be set up to give warnings and will have
> > > leading underscores to indicate that it is not finalized and can be
> > > changed without warning.
> > >
> > >
> > > Backward compatibility
> > > ----------------------
> > >
> > > While the actual backward compatibility impact of implementing
> > > Phases I and II is not yet fully clear, we anticipate and accept the
> > > following changes:
> > >
> > > * **Python API**:
> > >
> > >   * ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``,
> > >     while right now ``type(np.dtype("f8")) is np.dtype``.
> > >     Code should use ``isinstance`` checks, and in very rare cases
> > >     may have to be adapted to use them (see the short example after
> > >     this list).
> > >
> > > * **C-API**:
> > >
> > >   * In old versions of NumPy ``PyArray_DescrCheck`` is a macro which
> > >     uses ``type(dtype) is np.dtype``. When compiling against an old
> > >     NumPy version, the macro may have to be replaced with the
> > >     corresponding ``PyObject_IsInstance`` call. (If this is a
> > >     problem, we could backport fixing the macro.)
> > >
> > >   * The UFunc machinery changes will break *limited* parts of the
> > >     current implementation. Replacing e.g. the default
> > >     ``TypeResolver`` is expected to remain supported for a time,
> > >     although optimized masked inner-loop iteration (which is not
> > >     even used *within* NumPy) will no longer be supported.
> > >
> > >   * All functions currently defined on the dtypes, such as
> > >     ``PyArray_Descr->f->nonzero``, will be defined and accessed
> > >     differently. This means that in the long run low-level access
> > >     code will have to be changed to use the new API. Such changes
> > >     are expected to be necessary in very few projects.
> > >
> > > * **dtype implementors (C-API)**:
> > >
> > >   * The array which is currently provided to some functions (such as
> > >     cast functions) will no longer be provided.
> > >     For example ``PyArray_Descr->f->nonzero`` or
> > >     ``PyArray_Descr->f->copyswapn`` may instead receive a dummy
> > >     array object with only some fields (mainly the dtype) being
> > >     valid.
> > >     At least in some code paths, a similar mechanism is already
> > >     used.
> > >
> > >   * The ``scalarkind`` slot and registration of scalar casting will
> > >     be removed/ignored without replacement.
> > >     It currently allows partial value-based casting.
> > >     The ``PyArray_ScalarKind`` function will continue to work for
> > >     builtin types, but will not be used internally and will be
> > >     deprecated.
> > >
> > >   * Currently user dtypes are defined as instances of ``np.dtype``.
> > >     The creation works by the user providing a prototype instance.
> > >     NumPy will need to modify at least the type during registration.
> > >     This has no effect for either ``rational`` or ``quaternion``,
> > >     and mutation of the structure seems unlikely after registration.
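> > >
> > > For illustration, dtype-class checks can already be written in the
> > > form that will keep working after this change (a sketch of current
> > > versus future behaviour, not a new API)::
> > >
> > >     >>> import numpy as np
> > >     >>> dt = np.dtype("f8")
> > >     >>> isinstance(dt, np.dtype)  # robust; stays True after Phase I
> > >     True
> > >     >>> type(dt) is np.dtype      # brittle; will become False
> > >     True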
> > >
> > > Since there is a fairly large API surface concerning datatypes,
> > > further changes, or the limitation of certain functions to currently
> > > existing datatypes, are likely to occur.
> > > For example, functions which use the type number as input should be
> > > replaced with functions taking DType classes instead.
> > > Although public, large parts of this C-API seem to be used rarely,
> > > possibly never, by downstream projects.
> > >
> > >
> > >
> > > Detailed Description
> > > --------------------
> > >
> > > This section details the design decisions covered by this NEP.
> > > The subsections correspond to the list of design choices presented
> > > in the Scope section.
> > >
> > > Datatypes as Python Classes (1)
> > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > The current NumPy datatypes are not full-scale Python classes.
> > > They are instead (prototype) instances of a single ``np.dtype``
> > > class.
> > > Changing this means that any special handling, e.g. for
> > > ``datetime``, can be moved to the Datetime DType class instead, away
> > > from monolithic general code (e.g. the current
> > > ``PyArray_AdjustFlexibleDType``).
> > >
> > > The main consequence of this change with respect to the API is that
> > > special methods move from the dtype instances to methods on the new
> > > DType class.
> > > This is the typical design pattern used in Python.
> > > Organizing these methods and information in a more Pythonic way
> > > provides a solid foundation for refining and extending the API in
> > > the future.
> > > The current API cannot be extended due to how it is exposed
> > > publicly.
> > > This means for example that the methods currently stored in
> > > ``PyArray_ArrFuncs`` on each datatype (see NEP 40) will be defined
> > > differently in the future and deprecated in the long run.
> > >
> > > The most prominent visible side effect of this will be that
> > > ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore.
> > > Instead it will be a subclass of ``np.dtype``, meaning that
> > > ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true.
> > > This will also add the ability to use
> > > ``isinstance(dtype, np.dtype[np.float64])``, thus removing the need
> > > to use ``dtype.kind``, ``dtype.char``, or ``dtype.type`` to do this
> > > check.
> > >
> > > With the design decision of DTypes as full-scale Python classes, the
> > > question of subclassing arises.
> > > Inheritance, however, appears problematic and a complexity best
> > > avoided (at least initially) for container datatypes.
> > > Further, subclasses may be more interesting for interoperability,
> > > for example with GPU backends (CuPy) storing additional GPU-related
> > > methods, rather than as a mechanism to define new datatypes.
> > > Since a class hierarchy does provide value, this may be achieved by
> > > allowing the creation of *abstract* datatypes.
> > > An example of an abstract datatype would be the datatype equivalent
> > > of ``np.floating``, representing any floating point number.
> > > These can serve the same purpose as Python's abstract base classes.
> > >
> > >
> > > Scalars should not be instances of the datatypes (2)
> > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > For simple datatypes such as ``float64`` (see also below), it seems
> > > tempting that an instance of ``np.dtype("float64")`` could be the
> > > scalar.
> > > This idea may be even more appealing due to the fact that scalars,
> > > rather than datatypes, currently define a useful type hierarchy.
> > >
> > > However, we have specifically decided against this for a number of
> > > reasons.
> > > First, the new datatypes described herein would be instances of
> > > DType classes.
> > > Making these instances themselves classes, while possible, adds
> > > additional complexity that users need to understand.
> > > It would also mean that scalars must have storage information (such
> > > as byteorder), which is generally unnecessary and currently not
> > > used.
> > > Second, while the simple NumPy scalars such as ``float64`` may be
> > > such instances, it should be possible to create datatypes for Python
> > > objects without enforcing NumPy as a dependency.
> > > However, Python objects that do not depend on NumPy cannot be
> > > instances of a NumPy DType.
> > > Third, there is a mismatch between the methods and attributes which
> > > are useful for scalars and datatypes. For instance ``to_float()``
> > > makes sense for a scalar but not for a datatype, and
> > > ``newbyteorder`` is not useful on a scalar (or has a different
> > > meaning).
> > >
> > > Overall, it seems that rather than reducing complexity by merging
> > > the two distinct type hierarchies, making scalars instances of
> > > DTypes would increase the complexity of both the design and the
> > > implementation.
> > >
> > > A possible future path may be to instead simplify the current NumPy
> > > scalars to be much simpler objects which largely derive their
> > > behaviour from the datatypes.
> > >
> > > C-API for creating new Datatypes (3)
> > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > The current C-API with which users can create new datatypes is
> > > limited in scope, and requires use of "private" structures. This
> > > means the API is not extensible: no new members can be added to the
> > > structure without losing binary compatibility.
> > > This has already limited the inclusion of new sorting methods into
> > > NumPy [new_sort]_.
> > >
> > > The new version shall thus replace the current ``PyArray_ArrFuncs``
> > > structure used to define new datatypes.
> > > Datatypes that currently exist and are defined using these slots
> > > will be supported during a deprecation period.
> > >
> > > The most likely solution to hide the implementation from the user,
> > > and thus make it extensible in the future, is to model the API after
> > > Python's stable API [PEP-384]_:
> > >
> > > .. code-block:: C
> > >
> > >     static struct PyArrayMethodDef slots[] = {
> > >         {NPY_dt_method, method_implementation},
> > >         ...,
> > >         {0, NULL}
> > >     };
> > >
> > >     typedef struct {
> > >         PyTypeObject *typeobj;  /* type of python scalar */
> > >         ...;
> > >         PyType_Slot *slots;
> > >     } PyArrayDTypeMeta_Spec;
> > >
> > >     PyObject* PyArray_InitDTypeMetaFromSpec(
> > >         PyArray_DTypeMeta *user_dtype, PyArrayDTypeMeta_Spec *dtype_spec);
> > >
> > > The C-side slots should be designed to mirror Python-side methods
> > > such as ``dtype.__dtype_method__``, although the exposure to Python
> > > is a later step in the implementation, to reduce the complexity of
> > > the initial implementation.
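> > >
> > > For illustration only, a Python-side mirror of such a slot might
> > > eventually be spelled as follows (all names here are placeholders;
> > > the real API is deferred to NEP 42)::
> > >
> > >     # hypothetical user DType; ``__dtype_method__`` stands in for a
> > >     # concrete special method backed by a C slot registered through
> > >     # PyArray_InitDTypeMetaFromSpec
> > >     class UserDType(np.dtype):
> > >         def __dtype_method__(self, other):
> > >             ...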
> > >
> > >
> > > C-API Changes to the UFunc Machinery (4)
> > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > Proposed changes to the UFunc machinery will be part of NEP 43.
> > > However, the following changes will be necessary (see NEP 40 for a
> > > detailed description of the current implementation and its issues):
> > >
> > > * The current UFunc type resolution must be adapted to allow better
> > >   control for user-defined dtypes as well as resolve current
> > >   inconsistencies.
> > > * The inner-loop used in UFuncs must be expanded to include a return
> > >   value. Further, error reporting must be improved, and passing in
> > >   dtype-specific information enabled.
> > >   This requires the modification of the inner-loop function
> > >   signature and the addition of new hooks called before and after
> > >   the inner-loop is used.
> > >
> > > An important goal for any changes to the universal functions will be
> > > to allow the reuse of existing loops.
> > > It should be easy for a new units datatype to fall back to existing
> > > math functions after handling the unit-related computations.
> > >
> > >
> > > Discussion
> > > ----------
> > >
> > > See NEP 40 for a list of previous meetings and discussions.
> > >
> > >
> > > References
> > > ----------
> > >
> > > .. [pandas_extension_arrays]
> > >    https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types
> > >
> > > .. _xarray_dtype_issue: https://github.com/pydata/xarray/issues/1262
> > >
> > > .. [pygeos] https://github.com/caspervdw/pygeos
> > >
> > > .. [new_sort] https://github.com/numpy/numpy/pull/12945
> > >
> > > .. [PEP-384] https://www.python.org/dev/peps/pep-0384/
> > >
> > > .. [PR 15508] https://github.com/numpy/numpy/pull/15508
> > >
> > >
> > > Copyright
> > > ---------
> > >
> > > This document has been placed in the public domain.
> > >
> > >
> > > Acknowledgments
> > > ---------------
> > >
> > > The effort to create new datatypes for NumPy has been discussed for
> > > several years in many different contexts and settings, making it
> > > impossible to list everyone involved.
> > > We would like to thank especially Stephan Hoyer, Nathaniel Smith,
> > > and Eric Wieser for repeated in-depth discussion about datatype
> > > design.
> > > We are very grateful for the community input in reviewing and
> > > revising this NEP and would like to thank especially Ross Barnowski
> > > and Ralf Gommers.
> > >
> > > _______________________________________________
> > > NumPy-Discussion mailing list
> > > NumPy-Discussion at python.org
> > > https://mail.python.org/mailman/listinfo/numpy-discussion
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at python.org
> > https://mail.python.org/mailman/listinfo/numpy-discussion
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion

-- 
Francesc Alted

From chris.barker at noaa.gov Mon Mar 23 14:45:51 2020
From: chris.barker at noaa.gov (Chris Barker)
Date: Mon, 23 Mar 2020 11:45:51 -0700
Subject: [Numpy-discussion] New DTypes: Are scalars a central concept in NumPy or not?
In-Reply-To: <9716a1b7-871b-1776-11fe-59e095b89cf0@gmail.com>
References: <9716a1b7-871b-1776-11fe-59e095b89cf0@gmail.com>
Message-ID: 

I've always found the duality of zero-d arrays and scalars confusing,
and I'm sure I'm not alone.

Having both is just plain weird.

But, backward compatibility aside, could we have ONLY Scalars?

When we index into an array, the dimensionality is reduced by one, so
indexing into a 1D array has to get us something: but the zero-d array
is a really weird object -- do we really need it?

There is certainly a need for more numpy-like scalars: more than the
built-in data types, and some handy attributes and methods, like dtype,
.itemsize, etc. But could we make an enhanced scalar that had
everything we actually need from a zero-d array?

The key point would be mutability -- but do we really need mutable
scalars? I can't think of any time I've needed that, when I couldn't
have used a 1-d array of length 1.

Is there a use case for zero-d arrays that could not be met with an
enhanced scalar?

-CHB


On Mon, Feb 24, 2020 at 12:30 PM Allan Haldane wrote:

> I have some thoughts on scalars from playing with ndarray ducktypes
> (__array_function__), e.g. a MaskedArray ndarray-ducktype, for which I
> wanted an associated "MaskedScalar" type.
>
> In summary, the ways scalars currently work make ducktyping
> (duck-scalars) difficult:
>
>   * numpy scalar types are not subclassable, so my duck-scalars aren't
>     subclasses of numpy scalars and aren't in the type hierarchy
>   * even if scalars were subclassable, I would have to subclass each
>     scalar datatype individually to make masked versions
>   * lots of code checks `isinstance(var, np.float64)`, which breaks
>     for my duck-scalars
>   * it was difficult to distinguish between a duck-scalar and a duck-0d
>     array. The method I used in the end seems hacky.
>
> This has led to some daydreams about how scalars should work, and also
> led me at last to read through your NEPs 40/41 with specific focus on
> what you said about scalars, and I was about to post there until I saw
> this discussion. I agree with what you said in the NEPs about not
> making scalars be dtype instances.
>
> Here is what ducktypes led me to:
>
> If we are able to do something like define a `np.numpy_scalar` type
> covering all numpy scalars, which has a `.dtype` attribute like you
> describe in the NEPs, then that would seem to solve the ducktype
> problems above. Ducktype implementors would need to make a
> "duck-scalar" type in parallel to their "duck-ndarray" type, but I
> found that to be pretty easy using an abstract class in my MaskedArray
> ducktype, since the MaskedArray and MaskedScalar share a lot of
> behavior.
>
> A numpy_scalar type would also help solve some object-array problems
> if the object scalars are wrapped in the np_scalar type. A long time
> ago I started to try to fix up various funny/strange behaviors of
> object datatypes, but there are lots of special cases, and the main
> problem was that the returned objects (e.g. from indexing) were not
> numpy types and did not support numpy attributes or indexing. Wrapping
> the returned object in `np.numpy_scalar` might add an extra slight
> annoyance to people who want to unwrap the object, but I think it
> would make object arrays less buggy and make code using object arrays
> easier to reason about and debug.
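A minimal sketch of the object-array behaviour described above, using
current NumPy:

    import numpy as np

    a = np.empty(2, dtype=object)   # the usual way to build an object array
    a[0] = {"x": 1}
    a[1] = [1, 2]
    type(a[0])   # <class 'dict'>: a bare Python object comes back,
                 # with no .dtype or other NumPy attributes on it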
> Finally, a few random votes/comments based on the other emails on the
> list:
>
> I think scalars have a place in numpy (rather than just reusing 0d
> arrays), since there is a clear use in having hashable, immutable
> scalars. Structured scalars should probably be immutable.
>
> I agree with your suggestion that scalars should not be indexable.
> Thus, my duck-scalars (and proposed numpy_scalar) would not be
> indexable. However, I think they should encode their datatype through
> a .dtype attribute like ndarrays, rather than by inheritance.
>
> Also, something to think about is that currently numpy scalars satisfy
> the property `isinstance(np.float64(1), float)`, i.e. they are within
> the Python numerical type hierarchy. 0d arrays do not have this
> property. My proposal above would break this. I'm not sure what to
> think about whether this is a good property to maintain or not.
>
> Cheers,
> Allan
>
>
> On 2/21/20 8:37 PM, Sebastian Berg wrote:
> > Hi all,
> >
> > When we create new datatypes, we have the option to make new choices
> > for the new datatypes [0] (not the existing ones).
> >
> > The question is: Should every NumPy datatype have a scalar
> > associated and should operations like indexing return a scalar or a
> > 0-D array?
> >
> > This is in my opinion a complex, almost philosophical, question, and
> > we do not have to settle anything for a long time. But, if we do not
> > decide a direction before we have many new datatypes, the decision
> > will make itself...
> > So happy about any ideas, even if it's just a gut feeling :).
> >
> > There are various points. I would like to mostly ignore the
> > technical ones, but I am listing them anyway here:
> >
> >   * Scalars are faster (although that could likely be optimized)
> >
> >   * Scalars have a lower memory footprint
> >
> >   * The current implementation incurs a technical debt in NumPy.
> >     (I do not think that is a general issue, though. We could
> >     automatically create scalars for each new datatype probably.)
> >
> > Advantages of having no scalars:
> >
> >   * No need to keep track of scalars to preserve them in ufuncs, or
> >     libraries using `np.asarray`; do they need
> >     `np.asarray_or_scalar`? (or decide they always return arrays,
> >     although ufuncs may not)
> >
> >   * Seems simpler in many ways: you always know the output will be
> >     an array if it has to do with NumPy.
> >
> > Advantages of having scalars:
> >
> >   * Scalars are immutable and we are used to them from Python.
> >     A 0-D array cannot be used as a dictionary key consistently [1].
> >
> >     I.e. without scalars as first-class citizens `dict[arr1d[0]]`
> >     cannot work, `dict[arr1d[0].item()]` may (if `.item()` is
> >     defined), and e.g. `dict[arr1d[0].frozen()]` could make a copy
> >     to work. [2]
> >
> >   * Object arrays as we have them now make sense: `arr1d[0]` can
> >     reasonably return a Python object. I.e. arrays feel more like
> >     containers if you can take elements out easily.
> >
> > Could go both ways:
> >
> >   * Scalar math `scalar = arr1d[0]; scalar += 1` modifies the array
> >     without scalars. With scalars `arr1d[0, ...]` clarifies the
> >     meaning. (In principle it is good to never use `arr2d[0]` to get
> >     a 1D slice, probably more so if scalars exist.)
> >
> > Note: array-scalars (the current NumPy scalars) are not useful in my
> > opinion [3]. A scalar should not be indexed or have a shape. I do
> > not believe in scalars pretending to be arrays.
> >
> > I personally tend towards liking scalars.
> > If Python were a language where the array (array-programming)
> > concept was ingrained into the language itself, I would lean the
> > other way. But users are used to scalars, and they "put" scalars
> > into arrays. Array objects are in some ways strange in Python, and I
> > feel not having scalars detaches them further.
> >
> > Having scalars, however, also means we should preserve them. I feel
> > in principle that is actually fairly straightforward. E.g. for
> > ufuncs:
> >
> >   * np.add(scalar, scalar) -> scalar
> >   * np.add.reduce(arr, axis=None) -> scalar
> >   * np.add.reduce(arr, axis=1) -> array (even if arr is 1d)
> >   * np.add.reduce(scalar, axis=()) -> array
> >
> > Of course libraries that do `np.asarray` would/could basically
> > choose to not preserve scalars: their signature is defined as taking
> > strictly array input.
> >
> > Cheers,
> >
> > Sebastian
> >
> >
> > [0] At best this can be a vision to decide which way they may
> > evolve.
> >
> > [1] E.g. PyTorch uses `hash(tensor) == id(tensor)`, which is
> > arguably strange. E.g. Quantity defines hash correctly, but does not
> > fully ensure immutability for 0-D Quantities. Ensuring immutability
> > in a world where "views" are a central concept requires a write-only
> > copy.
> >
> > [2] Arguably `.item()` would always return a scalar, but it would be
> > a second-class citizen. (Although if it returns a scalar, at least
> > we already have a scalar implementation.)
> >
> > [3] They are necessary due to technical debt for NumPy datatypes
> > though.
> >
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at python.org
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> >
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

Chris.Barker at noaa.gov

From chris.barker at noaa.gov Mon Mar 23 14:55:50 2020
From: chris.barker at noaa.gov (Chris Barker)
Date: Mon, 23 Mar 2020 11:55:50 -0700
Subject: [Numpy-discussion] Good use of __dunder__ methods in numpy
In-Reply-To: 
References: <97371f612474f98cd85df3a427acbbe7f6f78746.camel@sipsolutions.net>
Message-ID: 

On Thu, Mar 5, 2020 at 2:15 PM Gregory Lee wrote:

>> If I can get a link to a file that shows how dunder methods help with
>> having cool coding APIs that would be great!
>>
> You may want to take a look at PEP 465 as an example, then. If I
> recall correctly, the __matmul__ method described in it was added to
> the standard library largely with NumPy in mind.
> https://www.python.org/dev/peps/pep-0465/

and so were "rich comparisons", and in-place operators (at least in
part).

numpy is VERY, VERY, heavily built on the concept of overloading
operators, i.e. using dunders or magic methods.

I'm going to venture a guess that numpy arrays custom-define every
single standard dunder -- and certainly most of them.
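A toy sketch of that pattern -- how any class, ndarray included, hooks
these operators (illustrative only, not NumPy's actual implementation):

    class Vec:
        def __init__(self, data):
            self.data = list(data)

        def __add__(self, other):      # called for ``a + b``
            return Vec(x + y for x, y in zip(self.data, other.data))

        def __matmul__(self, other):   # called for ``a @ b`` (PEP 465)
            return sum(x * y for x, y in zip(self.data, other.data))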
-CHB


> On Thu, Mar 5, 2020 at 10:32 PM Sebastian Berg
> wrote:
>
>>> Hi,
>>>
>>> On Thu, 2020-03-05 at 11:14 +0400, Abdur-Rahmaan Janhangeer wrote:
>>> > Greetings list,
>>> >
>>> > I have a talk about dunder methods in Python
>>> >
>>> > (
>>> > https://conference.mscc.mu/speaker/67604187-57c3-4be6-987c-ea4bef388ad3
>>> > )
>>> >
>>> > and it would be nice to include Numpy in the mix. Can someone point
>>> > me to one or two use cases / file link where dunder methods help
>>> > numpy?
>>> >
>>>
>>> I am not sure in what sense you are looking for. NumPy has its own
>>> set of dunder methods (some of which should not be used super much
>>> probably), like `__array__`, `__array_interface__`,
>>> `__array_ufunc__`, `__array_function__`, `__array_finalize__`, ...
>>> So we are using `__array_*__` for numpy related dunders.
>>>
>>> Of course we use most Python defined dunders, but I am not sure that
>>> you are looking for that?
>>>
>>> Best,
>>>
>>> Sebastian
>>>
>>> > Thanks
>>> >
>>> > fun info: i am a tiny numpy contributor with a one line merge.
>>> > _______________________________________________
>>> > NumPy-Discussion mailing list
>>> > NumPy-Discussion at python.org
>>> > https://mail.python.org/mailman/listinfo/numpy-discussion
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion at python.org
>>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

Chris.Barker at noaa.gov

From sebastian at sipsolutions.net Mon Mar 23 16:30:56 2020
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Mon, 23 Mar 2020 15:30:56 -0500
Subject: [Numpy-discussion] New DTypes: Are scalars a central concept in NumPy or not?
In-Reply-To: 
References: <9716a1b7-871b-1776-11fe-59e095b89cf0@gmail.com>
Message-ID: 

On Mon, 2020-03-23 at 11:45 -0700, Chris Barker wrote:
> I've always found the duality of zero-d arrays and scalars confusing,
> and I'm sure I'm not alone.
>
> Having both is just plain weird.

I guess so, it is a tricky situation, and I do not really have an
answer.

> But, backward compatibility aside, could we have ONLY Scalars?
>
> When we index into an array, the dimensionality is reduced by one, so
> indexing into a 1D array has to get us something: but the zero-d
> array is a really weird object -- do we really need it?

Well, it is hard to write functions that work on N dimensions (where N
can be 0) if the 0-D array does not exist. You can get away with
scalars most of the time, because they pretend to be arrays in most
cases (aside from mutability).

But I am pretty sure we have a bunch of cases that need
`res = np.asarray(res)` simply because `res` is N-D but could have
been silently converted to a scalar. E.g. see
https://github.com/numpy/numpy/issues/13105 for an issue about this
(although it does not actually list any specific problems).
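A minimal sketch of the silent conversion meant here (current NumPy
behaviour):

    import numpy as np

    x = np.array(3.0)           # a 0-D array
    type(x / 2)                 # numpy.float64 -- silently a scalar again
    type(np.asarray(x / 2))     # numpy.ndarray -- the usual guard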
- Sebastian > There is certainly a need for more numpy-like scalars: more than the > built > in data types, and some handy attributes and methods, like dtype, > .itemsize, etc. But could we make an enhanced scalar that had > everything we > actually need from a zero-d array? > > The key point would be mutability -- but do we really need mutable > scalars? > I can't think of any time I've needed that, when I couldn't have used > a 1-d > array of length 1. > > Is there a use case for zero-d arrays that could not be met with an > enhanced scalar? > > -CHB > > > > > > > > On Mon, Feb 24, 2020 at 12:30 PM Allan Haldane < > allanhaldane at gmail.com> > wrote: > > > I have some thoughts on scalars from playing with ndarray ducktypes > > (__array_function__), eg a MaskedArray ndarray-ducktype, for which > > I > > wanted an associated "MaskedScalar" type. > > > > In summary, the ways scalars currently work makes ducktyping > > (duck-scalars) difficult: > > > > * numpy scalar types are not subclassable, so my duck-scalars > > aren't > > subclasses of numpy scalars and aren't in the type hierarchy > > * even if scalars were subclassable, I would have to subclass > > each > > scalar datatype individually to make masked versions > > * lots of code checks `np.isinstance(var, np.float64)` which > > breaks > > for my duck-scalars > > * it was difficult to distinguish between a duck-scalar and a > > duck-0d > > array. The method I used in the end seems hacky. > > > > This has led to some daydreams about how scalars should work, and > > also > > led me last to read through your NEPs 40/41 with specific focus on > > what > > you said about scalars, and was about to post there until I saw > > this > > discussion. I agree with what you said in the NEPs about not making > > scalars be dtype instances. > > > > Here is what ducktypes led me to: > > > > If we are able to do something like define a `np.numpy_scalar` type > > covering all numpy scalars, which has a `.dtype` attribute like you > > describe in the NEPs, then that would seem to solve the ducktype > > problems above. Ducktype implementors would need to make a "duck- > > scalar" > > type in parallel to their "duck-ndarray" type, but I found that to > > be > > pretty easy using an abstract class in my MaskedArray ducktype, > > since > > the MaskedArray and MaskedScalar share a lot of behavior. > > > > A numpy_scalar type would also help solve some object-array > > problems if > > the object scalars are wrapped in the np_scalar type. A long time > > ago I > > started to try to fix up various funny/strange behaviors of object > > datatypes, but there are lots of special cases, and the main > > problem was > > that the returned objects (eg from indexing) were not numpy types > > and > > did not support numpy attributes or indexing. Wrapping the returned > > object in `np.numpy_scalar` might add an extra slight annoyance to > > people who want to unwrap the object, but I think it would make > > object > > arrays less buggy and make code using object arrays easier to > > reason > > about and debug. > > > > Finally, a few random votes/comments based on the other emails on > > the list: > > > > I think scalars have a place in numpy (rather than just reusing 0d > > arrays), since there is a clear use in having hashable, immutable > > scalars. Structured scalars should probably be immutable. > > > > I agree with your suggestion that scalars should not be indexable. > > Thus, > > my duck-scalars (and proposed numpy_scalar) would not be indexable. 
> > However, I think they should encode their datatype though a .dtype > > attribute like ndarrays, rather than by inheritance. > > > > Also, something to think about is that currently numpy scalars > > satisfy > > the property `isinstance(np.float64(1), float)`, i.e they are > > within the > > python numerical type hierarchy. 0d arrays do not have this > > property. My > > proposal above would break this. I'm not sure what to think about > > whether this is a good property to maintain or not. > > > > Cheers, > > Allan > > > > > > > > On 2/21/20 8:37 PM, Sebastian Berg wrote: > > > Hi all, > > > > > > When we create new datatypes, we have the option to make new > > > choices > > > for the new datatypes [0] (not the existing ones). > > > > > > The question is: Should every NumPy datatype have a scalar > > > associated > > > and should operations like indexing return a scalar or a 0-D > > > array? > > > > > > This is in my opinion a complex, almost philosophical, question, > > > and we > > > do not have to settle anything for a long time. But, if we do not > > > decide a direction before we have many new datatypes the decision > > > will > > > make itself... > > > So happy about any ideas, even if its just a gut feeling :). > > > > > > There are various points. I would like to mostly ignore the > > > technical > > > ones, but I am listing them anyway here: > > > > > > * Scalars are faster (although that can be optimized likely) > > > > > > * Scalars have a lower memory footprint > > > > > > * The current implementation incurs a technical debt in NumPy. > > > (I do not think that is a general issue, though. We could > > > automatically create scalars for each new datatype probably.) > > > > > > Advantages of having no scalars: > > > > > > * No need to keep track of scalars to preserve them in ufuncs, > > > or > > > libraries using `np.asarray`, do they need > > > `np.asarray_or_scalar`? > > > (or decide they return always arrays, although ufuncs may > > > not) > > > > > > * Seems simpler in many ways, you always know the output will > > > be an > > > array if it has to do with NumPy. > > > > > > Advantages of having scalars: > > > > > > * Scalars are immutable and we are used to them from Python. > > > A 0-D array cannot be used as a dictionary key consistently > > > [1]. > > > > > > I.e. without scalars as first class citizen `dict[arr1d[0]]` > > > cannot work, `dict[arr1d[0].item()]` may (if `.item()` is > > > defined, > > > and e.g. `dict[arr1d[0].frozen()]` could make a copy to work. > > > [2] > > > > > > * Object arrays as we have them now make sense, `arr1d[0]` can > > > reasonably return a Python object. I.e. arrays feel more like > > > container if you can take elements out easily. > > > > > > Could go both ways: > > > > > > * Scalar math `scalar = arr1d[0]; scalar += 1` modifies the > > > array > > > without scalars. With scalars `arr1d[0, ...]` clarifies the > > > meaning. (In principle it is good to never use `arr2d[0]` to > > > get a 1D slice, probably more-so if scalars exist.) > > > > > > Note: array-scalars (the current NumPy scalars) are not useful in > > > my > > > opinion [3]. A scalar should not be indexed or have a shape. I do > > > not > > > believe in scalars pretending to be arrays. > > > > > > I personally tend towards liking scalars. If Python was a > > > language > > > where the array (array-programming) concept was ingrained into > > > the > > > language itself, I would lean the other way. 
But users are used > > > to > > > scalars, and they "put" scalars into arrays. Array objects are in > > > some > > > ways strange in Python, and I feel not having scalars detaches > > > them > > > further. > > > > > > Having scalars, however also means we should preserve them. I > > > feel in > > > principle that is actually fairly straight forward. E.g. for > > > ufuncs: > > > > > > * np.add(scalar, scalar) -> scalar > > > * np.add.reduce(arr, axis=None) -> scalar > > > * np.add.reduce(arr, axis=1) -> array (even if arr is 1d) > > > * np.add.reduce(scalar, axis=()) -> array > > > > > > Of course libraries that do `np.asarray` would/could basically > > > chose to > > > not preserve scalars: Their signature is defined as taking > > > strictly > > > array input. > > > > > > Cheers, > > > > > > Sebastian > > > > > > > > > [0] At best this can be a vision to decide which way they may > > > evolve. > > > > > > [1] E.g. PyTorch uses `hash(tensor) == id(tensor)` which is > > > arguably > > > strange. E.g. Quantity defines hash correctly, but does not fully > > > ensure immutability for 0-D Quantities. Ensuring immutability in > > > a > > > world where "views" are a central concept requires a write-only > > > copy. > > > > > > [2] Arguably `.item()` would always return a scalar, but it would > > > be a > > > second class citizen. (Although if it returns a scalar, at least > > > we > > > already have a scalar implementation.) > > > > > > [3] They are necessary due to technical debt for NumPy datatypes > > > though. > > > > > > > > > _______________________________________________ > > > NumPy-Discussion mailing list > > > NumPy-Discussion at python.org > > > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at python.org > > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From sebastian at sipsolutions.net Mon Mar 23 16:47:39 2020 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Mon, 23 Mar 2020 15:47:39 -0500 Subject: [Numpy-discussion] Proposal: NEP 41 -- First step towards a new Datatype System In-Reply-To: References: <3ee04012570e418a874e4c436b5b6f8ac5f56cdb.camel@sipsolutions.net> <162390781577dd2c2916f4dc600b65cf7f09abb1.camel@sipsolutions.net> Message-ID: On Mon, 2020-03-23 at 18:23 +0100, Francesc Alted wrote: > > If we were designing a new programming language around array > > computing > > principles, I do think that would be the approach I would want to > > take/consider. But I simply lack the vision of how marrying the > > idea > > with the scalar language Python would work out well... > > > > I have had a glance at what you are after, and it seems challenging > indeed. IMO, trying to cope with everybody's need regarding data > types is > extremely costly (or more simply, not even possible). I think that a > better approach would be to decouple the storage part of a container > from > its data type system. In the storage there should go things needed > to cope > with the data retrieval, like the itemsize, the shape, or even other > sub-shapes for chunked datasets. 
> Then, in the data type layer, one should be able to add meaning to the
> raw data: is that an integer? speed? temperature? a compound type?

I am struggling a bit to fully understand the lessons to learn.

There seems to be some overlap of storage and DTypes? That is mainly
`itemsize` and, more tricky, `is/has_object`, which is about how the
data is stored but depends on which data is stored.
In my current view these are part of the `dtype` instance, e.g. the
class `np.dtype[np.string]` (a DTypeMeta instance) will have instances:
`np.dtype[np.string](length=5, byteorder="=")` (which is identical to
`np.dtype("U5")`).
Or is it that `np.ndarray` would actually use an `np.naivearray`
internally, which is told the itemsize at construction time?

In principle, the DType class could also be much more basic, and NumPy
could subclass it (or something similar) to tag on the things it needs
to efficiently use the DTypes (outside the computational engine/UFuncs,
which cover a lot, but unfortunately I do not think everything).

- Sebastian

> Indeed the data storage layer should be able to provide a way to store
> the data type representation so that a container can be serialized and
> deserialized correctly. But the important thing here is that this
> decoupling between storage and types allows for different data type
> systems, so that anyone can come up with a specific type system
> depending on her needs. One can envision here even a basic data type
> system (e.g. a version of what's now supported in NumPy) that can be
> extended with other layers, depending on the needs, so that every
> community can interchange data at a basic level at least.
>
> As an example, this is the basic spirit behind the under-construction
> Caterva array container (https://github.com/Blosc/Caterva). Blosc2
> (https://github.com/Blosc/C-Blosc2) will be providing the low-level
> storage layer, with no information about dimensionality. Caterva will
> be building the multidimensional layer, but with no information about
> the types at all. On top of this scaffold, third-party layers will be
> free to build their own data dtypes, specific for every domain (the
> concept is imaged in slide 18 of this presentation:
> https://blosc.org/docs/Caterva-HDF5-Workshop.pdf). There is nothing to
> prevent adding more layers, or even a two-layer (and preferably no
> more than two-level) data type system: one for simple data types
> (e.g. NumPy ones) and one meant to be more domain-specific.
>
> Right now, I think that the limitation that is keeping the NumPy
> community thinking in terms of blending storage and types in the same
> layer is that NumPy is providing a computational engine too, and for
> doing computations, one needs to provide both storage (including
> dimensionality info) and type information indeed. By using the
> multi-layer approach, there should be a computational layer that is
> laid out on top of the storage and the type layers, and hence specific
> for leveraging them. Sure, that is a big departure from what we are
> used to, but as long as one can keep the architecture of the different
> layers simple, one could see interesting results in not that long a
> time.
>
> Just my 2 cents,
> Francesc
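A small sketch of the storage/DType split discussed above, using
today's NumPy plus the NEP's illustrative spelling:

    import numpy as np

    dt = np.dtype("U5")
    dt.itemsize   # 20 -- storage information (4 bytes per character)
    dt.kind       # 'U' -- which kind of data is stored
    # Under the proposal both facets would live on an instance of a
    # DType class, e.g. (illustrative spelling only):
    #   np.dtype[np.string](length=5, byteorder="=")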
From arj.python at gmail.com Tue Mar 24 04:37:24 2020
From: arj.python at gmail.com (Abdur-Rahmaan Janhangeer)
Date: Tue, 24 Mar 2020 12:37:24 +0400
Subject: [Numpy-discussion] Good use of __dunder__ methods in numpy
In-Reply-To: 
References: <97371f612474f98cd85df3a427acbbe7f6f78746.camel@sipsolutions.net>
Message-ID: 

Thanks for info!

Kind Regards,

Abdur-Rahmaan Janhangeer
compileralchemy.com | github
Mauritius

On Mon, Mar 23, 2020 at 10:56 PM Chris Barker wrote:

> On Thu, Mar 5, 2020 at 2:15 PM Gregory Lee wrote:
>
>>> If i can get a link to a file that shows how dunder methods help with
>>> having cool coding APIs that would be great!
>>>
>> You may want to take a look at PEP 465 as an example, then. If I
>> recall correctly, the __matmul__ method described in it was added to
>> the standard library largely with NumPy in mind.
>> https://www.python.org/dev/peps/pep-0465/
>
> and so were "rich comparisons", and in-place operators (at least in
> part).
>
> numpy is VERY, VERY, heavily built on the concept of overloading
> operators, i.e. using dunders or magic methods.
>
> I'm going to venture a guess that numpy arrays custom define every
> single standard dunder -- and certainly most of them.
>
> -CHB
>
>
>>> On Thu, Mar 5, 2020 at 10:32 PM Sebastian Berg
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> On Thu, 2020-03-05 at 11:14 +0400, Abdur-Rahmaan Janhangeer wrote:
>>>> > Greetings list,
>>>> >
>>>> > I have a talk about dunder methods in Python
>>>> >
>>>> > (
>>>> > https://conference.mscc.mu/speaker/67604187-57c3-4be6-987c-ea4bef388ad3
>>>> > )
>>>> >
>>>> > and it would be nice to include Numpy in the mix. Can someone point
>>>> > me to one or two use cases / file link where dunder methods help
>>>> > numpy?
>>>> >
>>>>
>>>> I am not sure in what sense you are looking for. NumPy has its own
>>>> set of dunder methods (some of which should not be used super much
>>>> probably), like `__array__`, `__array_interface__`,
>>>> `__array_ufunc__`, `__array_function__`, `__array_finalize__`, ...
>>>> So we are using `__array_*__` for numpy related dunders.
>>>>
>>>> Of course we use most Python defined dunders, but I am not sure that
>>>> you are looking for that?
>>>>
>>>> Best,
>>>>
>>>> Sebastian
>>>>
>>>> > Thanks
>>>> >
>>>> > fun info: i am a tiny numpy contributor with a one line merge.
>>>> > _______________________________________________
>>>> > NumPy-Discussion mailing list
>>>> > NumPy-Discussion at python.org
>>>> > https://mail.python.org/mailman/listinfo/numpy-discussion
>>>> _______________________________________________
>>>> NumPy-Discussion mailing list
>>>> NumPy-Discussion at python.org
>>>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>>
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion at python.org
>>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From faltet at gmail.com Tue Mar 24 05:48:25 2020 From: faltet at gmail.com (Francesc Alted) Date: Tue, 24 Mar 2020 10:48:25 +0100 Subject: [Numpy-discussion] Proposal: NEP 41 -- First step towards a new Datatype System In-Reply-To: References: <3ee04012570e418a874e4c436b5b6f8ac5f56cdb.camel@sipsolutions.net> <162390781577dd2c2916f4dc600b65cf7f09abb1.camel@sipsolutions.net> Message-ID: On Mon, Mar 23, 2020 at 9:49 PM Sebastian Berg wrote: > On Mon, 2020-03-23 at 18:23 +0100, Francesc Alted wrote: > > > > If we were designing a new programming language around array > > > computing > > > principles, I do think that would be the approach I would want to > > > take/consider. But I simply lack the vision of how marrying the > > > idea > > > with the scalar language Python would work out well... > > > > > > > I have had a glance at what you are after, and it seems challenging > > indeed. IMO, trying to cope with everybody's need regarding data > > types is > > extremely costly (or more simply, not even possible). I think that a > > better approach would be to decouple the storage part of a container > > from > > its data type system. In the storage there should go things needed > > to cope > > with the data retrieval, like the itemsize, the shape, or even other > > sub-shapes for chunked datasets. Then, in the data type layer, one > > should > > be able to add meaning to the raw data: is that an integer? speed? > > temperature? a compound type? > > > > I am struggling a bit fully understand the lessons to learn. > > There seems some overlap of storage and DTypes? That is mainly > `itemsize` and more tricky `is/has_object`. Which is about how the data > is stored but depends on which data is stored? > In my current view these are part of the `dtype` instance, e.g. the > class `np.dtype[np.string]` (a DTypeMeta instance), will have > instances: `np.dtype[np.string](length=5, byteorder="=")` (which is > identical to `np.dtype("U5")`). > Or is it that `np.ndarray` would actually use an `np.naivearray` > internally, which is told the itemsize at construction time? > In principle, the DType class could also be much more basic, and NumPy > could subclass it (or something similar) to tag on the things it needs > to efficiently use the DTypes (outside the computational engine/UFuncs, > which cover a lot, but unfortunately I do not think everything). > What I am trying to say is that NumPy should be rather agnostic about providing data types beyond the relatively simple set that already supports. I am suggesting that focusing on providing a way to allow the storage (not only in-memory, but also persisted arrays via .npy/.npz files) of user-defined data types (or any other kind of metadata) and let 3rd party libraries use this machinery to serialize/deserialize them might be a better use of resources. I am envisioning making life easier for libraries like e.g. 
xarray, which already extends NumPy in a number of ways, and that can
make use of computational kernels different from NumPy itself (dask,
probably numba too) in order to implement functionality not present in
NumPy. Allowing an easy way to serialize library-defined data types
would open the door to using NumPy itself as a storage layer for
persistence too, bringing an important complement to the NetCDF or
zarr formats (remember that every format comes with its own pros and
cons). But xarray is just an example; why not think of other kinds of
libraries that would provide their own types, leveraging NumPy for
storage and e.g. numba for building a library of efficient functions,
specific to the new types? If done properly, these datasets can still
be shared efficiently with other libraries, as long as the basic data
type system existing in NumPy is used to access them.

Cheers,
Francesc

> - Sebastian
>
> > Indeed the data storage layer should be able to provide a way to
> > store the data type representation so that a container can be
> > serialized and deserialized correctly. But the important thing here
> > is that this decoupling between storage and types allows for
> > different data type systems, so that anyone can come up with a
> > specific type system depending on her needs. One can envision here
> > even a basic data type system (e.g. a version of what's now
> > supported in NumPy) that can be extended with other layers,
> > depending on the needs, so that every community can interchange
> > data at a basic level at least.
> >
> > As an example, this is the basic spirit behind the
> > under-construction Caterva array container
> > (https://github.com/Blosc/Caterva). Blosc2
> > (https://github.com/Blosc/C-Blosc2) will be providing the low-level
> > storage layer, with no information about dimensionality. Caterva
> > will be building the multidimensional layer, but with no
> > information about the types at all. On top of this scaffold,
> > third-party layers will be free to build their own data dtypes,
> > specific for every domain (the concept is imaged in slide 18 of
> > this presentation:
> > https://blosc.org/docs/Caterva-HDF5-Workshop.pdf). There is nothing
> > to prevent adding more layers, or even a two-layer (and preferably
> > no more than two-level) data type system: one for simple data types
> > (e.g. NumPy ones) and one meant to be more domain-specific.
> >
> > Right now, I think that the limitation that is keeping the NumPy
> > community thinking in terms of blending storage and types in the
> > same layer is that NumPy is providing a computational engine too,
> > and for doing computations, one needs to provide both storage
> > (including dimensionality info) and type information indeed. By
> > using the multi-layer approach, there should be a computational
> > layer that is laid out on top of the storage and the type layers,
> > and hence specific for leveraging them. Sure, that is a big
> > departure from what we are used to, but as long as one can keep the
> > architecture of the different layers simple, one could see
> > interesting results in not that long a time.
> >
> > Just my 2 cents,
> > Francesc
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion

-- 
Francesc Alted

From matti.picus at gmail.com Tue Mar 24 07:12:10 2020
From: matti.picus at gmail.com (Matti Picus)
Date: Tue, 24 Mar 2020 13:12:10 +0200
Subject: [Numpy-discussion] Proposal: NEP 41 -- First step towards a new Datatype System
In-Reply-To: 
References: <3ee04012570e418a874e4c436b5b6f8ac5f56cdb.camel@sipsolutions.net> <162390781577dd2c2916f4dc600b65cf7f09abb1.camel@sipsolutions.net>
Message-ID: <20436868-bc36-939f-9f71-dbb00b642cc5@gmail.com>

On 24/3/20 11:48 am, Francesc Alted wrote:
>
> What I am trying to say is that NumPy should be rather agnostic about
> providing data types beyond the relatively simple set that already
> supports. I am suggesting that focusing on providing a way to allow
> the storage (not only in-memory, but also persisted arrays via
> .npy/.npz files) of user-defined data types (or any other kind of
> metadata) and let 3rd party libraries use this machinery to
> serialize/deserialize them might be a better use of resources.
>
> ...
> Cheers,
> Francesc
>

I agree that the goal is to enable user-defined data types, and even
make the creation of them from Python possible (with some caveats
about performance). But I think this should be done in steps, and as
the subject line says this is the first step.
URL: From matti.picus at gmail.com Tue Mar 24 07:12:10 2020 From: matti.picus at gmail.com (Matti Picus) Date: Tue, 24 Mar 2020 13:12:10 +0200 Subject: [Numpy-discussion] Proposal: NEP 41 -- First step towards a new Datatype System In-Reply-To: References: <3ee04012570e418a874e4c436b5b6f8ac5f56cdb.camel@sipsolutions.net> <162390781577dd2c2916f4dc600b65cf7f09abb1.camel@sipsolutions.net> Message-ID: <20436868-bc36-939f-9f71-dbb00b642cc5@gmail.com> On 24/3/20 11:48 am, Francesc Alted wrote: > > What I am trying to say is that NumPy should be rather agnostic about > providing data types beyond the relatively simple set that already > supports.? I am suggesting that focusing on providing a way to allow > the storage (not only in-memory, but also persisted arrays via > .npy/.npz files) of user-defined data types (or any other kind of > metadata)? and let 3rd party libraries use this machinery to > serialize/deserialize them might be a better use of resources. > > ... > Cheers, > Francesc > I agree that the goal is to enable user-defined data types, and even make the creation of them from python possible (with some caveats about performance). But I think this should be done in steps, and as the subject line says this is the first step. There are many scary details to work out around the problems of promotion and casting, what to do when the output might overflow, how to mark missing values and more. The question at hand is, as I understand it, one of finding the right way to create a data type object that will enable exactly what you propose. I think this is the correct path, as most large refactor-in-one-step efforts I have seem leave both the old code and the new code in an unusable state for years until the bugs are worked out. As for serialization protocols: I think that is a separate issue. We already have the npy/npz protocol, PEP3118 buffer protocol, and the pickle 5 buffering protocol. Each of them handle user-defined data types in different ways, with differing amounts of success. Matti From faltet at gmail.com Tue Mar 24 09:31:30 2020 From: faltet at gmail.com (Francesc Alted) Date: Tue, 24 Mar 2020 14:31:30 +0100 Subject: [Numpy-discussion] Proposal: NEP 41 -- First step towards a new Datatype System In-Reply-To: <20436868-bc36-939f-9f71-dbb00b642cc5@gmail.com> References: <3ee04012570e418a874e4c436b5b6f8ac5f56cdb.camel@sipsolutions.net> <162390781577dd2c2916f4dc600b65cf7f09abb1.camel@sipsolutions.net> <20436868-bc36-939f-9f71-dbb00b642cc5@gmail.com> Message-ID: On Tue, Mar 24, 2020 at 12:12 PM Matti Picus wrote: > > On 24/3/20 11:48 am, Francesc Alted wrote: > > > > What I am trying to say is that NumPy should be rather agnostic about > > providing data types beyond the relatively simple set that already > > supports. I am suggesting that focusing on providing a way to allow > > the storage (not only in-memory, but also persisted arrays via > > .npy/.npz files) of user-defined data types (or any other kind of > > metadata) and let 3rd party libraries use this machinery to > > serialize/deserialize them might be a better use of resources. > > > > ... > > Cheers, > > Francesc > > > I agree that the goal is to enable user-defined data types, and even > make the creation of them from python possible (with some caveats about > performance). But I think this should be done in steps, and as the > subject line says this is the first step. 
There are many scary details > to work out around the problems of promotion and casting, what to do > when the output might overflow, how to mark missing values and more. The > question at hand is, as I understand it, one of finding the right way to > create a data type object that will enable exactly what you propose. I > think this is the correct path, as most large refactor-in-one-step > efforts I have seen leave both the old code and the new code in an > unusable state for years until the bugs are worked out. > Thanks Matti for clarifying the goals of the NEP; having the sentence "New Datatype System" in the title sounded scary to my ears indeed, and I share your concerns about new code largely undergoing a 'beta' stage for a long time. Before shutting up, I'll just reiterate that providing pretty shallow machinery for allowing the integration with user-defined data types should avoid big headaches: the simpler, the better. But this is of course up to the maintainers. > > As for serialization protocols: I think that is a separate issue. We > already have the npy/npz protocol, PEP3118 buffer protocol, and the > pickle 5 buffering protocol. Each of them handles user-defined data types > in different ways, with differing amounts of success. > Yup, I forgot the buffer protocol and pickle 5. Thanks for the reminder. Cheers, -- Francesc Alted -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Tue Mar 24 13:00:19 2020 From: shoyer at gmail.com (Stephan Hoyer) Date: Tue, 24 Mar 2020 10:00:19 -0700 Subject: [Numpy-discussion] Put type annotations in NumPy proper? Message-ID: When we started numpy-stubs [1] a few years ago, putting type annotations in NumPy itself seemed premature. We still supported Python 2, which meant that we would need to use awkward comments for type annotations. Over the past few years, using type annotations has become increasingly popular, even in the scientific Python stack. For example, off-hand I know that at least SciPy, pandas and xarray have at least part of their APIs type annotated. Even without annotations for shapes or dtypes, it would be valuable to have near complete annotations for NumPy, the project at the bottom of the scientific stack. Unfortunately, numpy-stubs never really took off. I can think of a few reasons for that: 1. Missing high level guidance on how to write type annotations, particularly for how (or if) to annotate particularly dynamic parts of NumPy (e.g., consider __array_function__), and whether we should prioritize strictness or faithfulness [2]. 2. We didn't have a good experience for new contributors. Due to the relatively low level of interest in the project, when a contributor would occasionally drop in, I often didn't even notice their PR for a few weeks. 3. Developing type annotations separately from the main codebase makes them a little harder to keep in sync. This means that type annotations couldn't serve their typical purpose of self-documenting code. Part of this may be necessary for NumPy (due to our use of C extensions), but large parts of NumPy's user facing APIs are written in Python. We no longer support Python 2, so at least we no longer need to worry about putting annotations in comments. We eventually could probably use a formal NEP (or several) on how we want to use type annotations in NumPy, but I think a good first step would be to think about how to start moving the annotations from numpy-stubs into numpy proper. Any thoughts? Anyone interested in taking the lead on this? 
Cheers, Stephan [1] https://github.com/numpy/numpy-stubs [2] https://github.com/numpy/numpy-stubs/issues/12 -------------- next part -------------- An HTML attachment was scrubbed... URL: From rth.yurchak at gmail.com Tue Mar 24 13:28:04 2020 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Tue, 24 Mar 2020 18:28:04 +0100 Subject: [Numpy-discussion] Put type annotations in NumPy proper? In-Reply-To: References: Message-ID: <950157a0-5548-9065-0b39-d877e51c824f@gmail.com> Thanks for re-starting this discussion, Stephan! I think there is definitely significant interest in this topic: https://github.com/numpy/numpy/issues/7370 is the issue with the largest number of user likes in the issue tracker (FWIW). Having them in numpy, as opposed to a separate numpy-stubs repository would indeed be ideal from a user perspective. When looking into it in the past, I was never sure how well in sync numpy-stubs was. Putting aside ndarray, as more challenging, even annotations for numpy functions and method parameters with built-in types would help, as a start. To add to the previously listed projects that would benefit from this, we are currently considering to start using some (minimal) type annotations in scikit-learn. -- Roman Yurchak On 24/03/2020 18:00, Stephan Hoyer wrote: > When we started numpy-stubs [1] a few years ago, putting type > annotations in NumPy itself seemed premature. We still supported Python > 2, which meant that we would need to use awkward comments for type > annotations. > > Over the past few years, using type annotations has become increasingly > popular, even in the scientific Python stack. For example, off-hand I > know that at least SciPy, pandas and xarray have at least part of their > APIs type annotated. Even without annotations for shapes or dtypes, it > would be valuable to have near complete annotations for NumPy, the > project at the bottom of the scientific stack. > > Unfortunately, numpy-stubs never really took off. I can think of a few > reasons for that: > 1. Missing high level guidance on how to write type annotations, > particularly for how (or if) to annotate particularly dynamic parts of > NumPy (e.g., consider __array_function__), and whether we should > prioritize strictness or faithfulness [2]. > 2. We didn't have a good experience for new contributors. Due to the > relatively low level of interest in the project, when a contributor > would occasionally drop in, I often didn't even notice their PR for a > few weeks. > 3. Developing type annotations separately from the main codebase makes > them a little harder to keep in sync. This means that type annotations > couldn't serve their typical purpose of self-documenting code. Part of > this may be necessary for NumPy (due to our use of C extensions), but > large parts of NumPy's user facing APIs are written in Python. We no > longer support Python 2, so at least we no longer need to worry about > putting annotations in comments. > > We eventually could probably use a formal NEP (or several) on how we > want to use type annotations in NumPy, but I think a good first step > would be to think about how to start moving the annotations from > numpy-stubs into numpy proper. > > Any thoughts? Anyone interested in taking the lead on this? 
> > Cheers, > Stephan > > [1] https://github.com/numpy/numpy-stubs > [2] https://github.com/numpy/numpy-stubs/issues/12 > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > From quantkeyvis at gmail.com Tue Mar 24 13:59:40 2020 From: quantkeyvis at gmail.com (Keyvis Damptey) Date: Tue, 24 Mar 2020 13:59:40 -0400 Subject: [Numpy-discussion] Numpy doesn't use RAM Message-ID: Hi Numpy dev community, I'm keyvis, a statistical data scientist. I'm currently using numpy in python 3.8.2 64-bit for a clustering problem, on a machine with 1.9 TB RAM. When I try using np.zeros to create a 600,000 by 600,000 matrix of dtype=np.float32 it says "Unable to allocate 1.31 TiB for an array with shape (600000, 600000) and data type float32" I used psutil to determine how much RAM python thinks it has access to and it returned 1.8 TB approx. Is there some way I can fix numpy to create these large arrays? Thanks for your time and consideration -------------- next part -------------- An HTML attachment was scrubbed... URL: From wieser.eric+numpy at gmail.com Tue Mar 24 14:14:26 2020 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Tue, 24 Mar 2020 18:14:26 +0000 Subject: [Numpy-discussion] Put type annotations in NumPy proper? In-Reply-To: <950157a0-5548-9065-0b39-d877e51c824f@gmail.com> References: <950157a0-5548-9065-0b39-d877e51c824f@gmail.com> Message-ID: > Putting > aside ndarray, as more challenging, even annotations for numpy functions > and method parameters with built-in types would help, as a start. This is a good idea in principle, but one thing concerns me. If we add type annotations to numpy, does it become an error to have numpy-stubs installed? That is, is this an all-or-nothing thing where as soon as we start, numpy-stubs becomes unusable? Eric On Tue, 24 Mar 2020 at 17:28, Roman Yurchak wrote: > Thanks for re-starting this discussion, Stephan! I think there is > definitely significant interest in this topic: > https://github.com/numpy/numpy/issues/7370 is the issue with the largest > number of user likes in the issue tracker (FWIW). > > Having them in numpy, as opposed to a separate numpy-stubs repository > would indeed be ideal from a user perspective. When looking into it in > the past, I was never sure how well in sync numpy-stubs was. Putting > aside ndarray, as more challenging, even annotations for numpy functions > and method parameters with built-in types would help, as a start. > > To add to the previously listed projects that would benefit from this, > we are currently considering to start using some (minimal) type > annotations in scikit-learn. > > -- > Roman Yurchak > > On 24/03/2020 18:00, Stephan Hoyer wrote: > > When we started numpy-stubs [1] a few years ago, putting type > > annotations in NumPy itself seemed premature. We still supported Python > > 2, which meant that we would need to use awkward comments for type > > annotations. > > > > Over the past few years, using type annotations has become increasingly > > popular, even in the scientific Python stack. For example, off-hand I > > know that at least SciPy, pandas and xarray have at least part of their > > APIs type annotated. Even without annotations for shapes or dtypes, it > > would be valuable to have near complete annotations for NumPy, the > > project at the bottom of the scientific stack. > > > > Unfortunately, numpy-stubs never really took off. 
I can think of a few > > reasons for that: > > 1. Missing high level guidance on how to write type annotations, > > particularly for how (or if) to annotate particularly dynamic parts of > > NumPy (e.g., consider __array_function__), and whether we should > > prioritize strictness or faithfulness [2]. > > 2. We didn't have a good experience for new contributors. Due to the > > relatively low level of interest in the project, when a contributor > > would occasionally drop in, I often didn't even notice their PR for a > > few weeks. > > 3. Developing type annotations separately from the main codebase makes > > them a little harder to keep in sync. This means that type annotations > > couldn't serve their typical purpose of self-documenting code. Part of > > this may be necessary for NumPy (due to our use of C extensions), but > > large parts of NumPy's user facing APIs are written in Python. We no > > longer support Python 2, so at least we no longer need to worry about > > putting annotations in comments. > > > > We eventually could probably use a formal NEP (or several) on how we > > want to use type annotations in NumPy, but I think a good first step > > would be to think about how to start moving the annotations from > > numpy-stubs into numpy proper. > > > > Any thoughts? Anyone interested in taking the lead on this? > > > > Cheers, > > Stephan > > > > [1] https://github.com/numpy/numpy-stubs > > [2] https://github.com/numpy/numpy-stubs/issues/12 > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at python.org > > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion From sebastian at sipsolutions.net Tue Mar 24 14:15:47 2020 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Tue, 24 Mar 2020 13:15:47 -0500 Subject: [Numpy-discussion] Numpy doesn't use RAM In-Reply-To: References: Message-ID: On Tue, 2020-03-24 at 13:59 -0400, Keyvis Damptey wrote: > Hi Numpy dev community, > > I'm keyvis, a statistical data scientist. > > I'm currently using numpy in python 3.8.2 64-bit for a clustering > problem, > on a machine with 1.9 TB RAM. When I try using np.zeros to create a > 600,000 > by 600,000 matrix of dtype=np.float32 it says > "Unable to allocate 1.31 TiB for an array with shape (600000, 600000) > and > > data type float32" > If this error happens, allocating the memory failed. This should be pretty much a simple `malloc` call in C, so this is the kernel complaining, not Python/NumPy. I am not quite sure, but maybe memory fragmentation plays its part, or the process is simply out of memory; 1.44 TB is a significant portion of the total memory, after all. Not sure what to say, but I think you should probably look into other solutions, maybe using HDF5, zarr, or memory-mapping (although I am not sure the last actually helps). It will be tricky to work with arrays of a size that is close to the available total memory. Maybe someone who works more with such data here can give you tips on what projects can help you or what solutions to look into. - Sebastian > I used psutils to determine how much RAM python thinks it has access > to and > it return with 1.8 TB approx. > > Is there some way I can fix numpy to create these large arrays? 
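The size in the error message follows directly from the requested shape and dtype; a quick sanity check (plain arithmetic, nothing NumPy-specific):

    n = 600_000
    nbytes = n * n * 4          # float32 is 4 bytes per element
    print(nbytes / 2**40)       # ~1.31 (TiB), matching the error message
    print(nbytes / 10**12)      # ~1.44 (TB), the decimal figure quoted above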
> Thanks for your time and consideration > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From sseibert at anaconda.com Tue Mar 24 14:35:49 2020 From: sseibert at anaconda.com (Stanley Seibert) Date: Tue, 24 Mar 2020 13:35:49 -0500 Subject: [Numpy-discussion] Numpy doesn't use RAM In-Reply-To: References: Message-ID: In addition to what Sebastian said about memory fragmentation and OS limits about memory allocations, I do think it will be hard to work with an array that close to the memory limit in NumPy regardless. Almost any operation will need to make a temporary array and exceed your memory limit. You might want to look at Dask Array for a NumPy-like API for working with chunked arrays that can be staged in and out of memory: https://docs.dask.org/en/latest/array.html As a bonus, Dask will also let you make better use of the large number of CPU cores that you likely have in your 1.9 TB RAM system. :) On Tue, Mar 24, 2020 at 1:00 PM Keyvis Damptey wrote: > Hi Numpy dev community, > > I'm keyvis, a statistical data scientist. > > I'm currently using numpy in python 3.8.2 64-bit for a clustering problem, > on a machine with 1.9 TB RAM. When I try using np.zeros to create a 600,000 > by 600,000 matrix of dtype=np.float32 it says > "Unable to allocate 1.31 TiB for an array with shape (600000, 600000) and > data type float32" > > I used psutils to determine how much RAM python thinks it has access to > and it return with 1.8 TB approx. > > Is there some way I can fix numpy to create these large arrays? > Thanks for your time and consideration > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben.v.root at gmail.com Tue Mar 24 14:36:45 2020 From: ben.v.root at gmail.com (Benjamin Root) Date: Tue, 24 Mar 2020 14:36:45 -0400 Subject: [Numpy-discussion] Numpy doesn't use RAM In-Reply-To: References: Message-ID: Another thing to point out about having an array of that percentage of the available memory is that it severely restricts what you can do with it. Since you are above 50% of the available memory, you won't be able to create another array that would be the result of computing something with that array. So, you are restricted to querying (which you could do without having everything in-memory), or in-place operations. Dask arrays might be what you are really looking for. Ben Root On Tue, Mar 24, 2020 at 2:18 PM Sebastian Berg wrote: > On Tue, 2020-03-24 at 13:59 -0400, Keyvis Damptey wrote: > > Hi Numpy dev community, > > > > I'm keyvis, a statistical data scientist. > > > > I'm currently using numpy in python 3.8.2 64-bit for a clustering > > problem, > > on a machine with 1.9 TB RAM. When I try using np.zeros to create a > > 600,000 > > by 600,000 matrix of dtype=np.float32 it says > > "Unable to allocate 1.31 TiB for an array with shape (600000, 600000) > > and > > > > data type float32" > > > > If this error happens, allocating the memory failed. This should be > pretty much a simple `malloc` call in C, so this is the kernel > complaining, not Python/NumPy. 
> > I am not quite sure, but maybe memory fragmentation plays its part, or > simply are actually out of memory for that process, 1.44TB is a > significant portion of the total memory after all. > > Not sure what to say, but I think you should probably look into other > solutions, maybe using HDF5, zarr, or memory-mapping (although I am not > sure the last actually helps). It will be tricky to work with arrays of > a size that is close to the available total memory. > > Maybe someone who works more with such data here can give you tips on > what projects can help you or what solutions to look into. > > - Sebastian > > > > > I used psutils to determine how much RAM python thinks it has access > > to and > > it return with 1.8 TB approx. > > > > Is there some way I can fix numpy to create these large arrays? > > Thanks for your time and consideration > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at python.org > > https://mail.python.org/mailman/listinfo/numpy-discussion > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From josh.craig.wilson at gmail.com Tue Mar 24 18:42:27 2020 From: josh.craig.wilson at gmail.com (Joshua Wilson) Date: Tue, 24 Mar 2020 15:42:27 -0700 Subject: [Numpy-discussion] Put type annotations in NumPy proper? In-Reply-To: References: <950157a0-5548-9065-0b39-d877e51c824f@gmail.com> Message-ID: > That is, is this an all-or-nothing thing where as soon as we start, numpy-stubs becomes unusable? Until NumPy is made PEP 561 compatible by adding a `py.typed` file, type checkers will ignore the types in the repo, so in theory you can avoid the all or nothing. In practice it's maybe trickier because currently people can use the stubs, but they won't be able to use the types in the repo until the PEP 561 switch is flipped. So e.g. currently SciPy pulls the stubs from `numpy-stubs` master, allowing for a short find place where NumPy stubs are lacking -> improve stubs -> improve SciPy types loop. If all development moves into the main repo then SciPy is blocked on it becoming PEP 561 compatible before moving forward. But, you could complain that I put the cart before the horse with introducing typing in the SciPy repo before the NumPy types were more resolved, and that's probably a fair complaint. > Anyone interested in taking the lead on this? Not that I am a core developer or anything, but I am interested in helping to improve typing in NumPy. On Tue, Mar 24, 2020 at 11:15 AM Eric Wieser wrote: > > > Putting > > aside ndarray, as more challenging, even annotations for numpy functions > > and method parameters with built-in types would help, as a start. > > This is a good idea in principle, but one thing concerns me. > > If we add type annotations to numpy, does it become an error to have numpy-stubs installed? > That is, is this an all-or-nothing thing where as soon as we start, numpy-stubs becomes unusable? > > Eric > > On Tue, 24 Mar 2020 at 17:28, Roman Yurchak wrote: >> >> Thanks for re-starting this discussion, Stephan! I think there is >> definitely significant interest in this topic: >> https://github.com/numpy/numpy/issues/7370 is the issue with the largest >> number of user likes in the issue tracker (FWIW). 
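A hypothetical illustration of the PEP 561 mechanics described above; the stub content below is invented for illustration and is not the actual numpy-stubs code:

    # numpy/py.typed -- an empty marker file; its presence tells type
    # checkers to trust the annotations shipped inside the numpy package.

    # numpy/__init__.pyi -- a stub with annotations, for example:
    from typing import Any, Tuple

    class ndarray:
        @property
        def shape(self) -> Tuple[int, ...]: ...
        def tobytes(self, order: str = ...) -> bytes: ...

    def zeros(shape: Any, dtype: Any = ..., order: str = ...) -> ndarray: ...

Until the py.typed marker is added, type checkers keep ignoring such files, which is the switch Joshua refers to.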
>> >> Having them in numpy, as opposed to a separate numpy-stubs repository >> would indeed be ideal from a user perspective. When looking into it in >> the past, I was never sure how well in sync numpy-stubs was. Putting >> aside ndarray, as more challenging, even annotations for numpy functions >> and method parameters with built-in types would help, as a start. >> >> To add to the previously listed projects that would benefit from this, >> we are currently considering to start using some (minimal) type >> annotations in scikit-learn. >> >> -- >> Roman Yurchak >> >> On 24/03/2020 18:00, Stephan Hoyer wrote: >> > When we started numpy-stubs [1] a few years ago, putting type >> > annotations in NumPy itself seemed premature. We still supported Python >> > 2, which meant that we would need to use awkward comments for type >> > annotations. >> > >> > Over the past few years, using type annotations has become increasingly >> > popular, even in the scientific Python stack. For example, off-hand I >> > know that at least SciPy, pandas and xarray have at least part of their >> > APIs type annotated. Even without annotations for shapes or dtypes, it >> > would be valuable to have near complete annotations for NumPy, the >> > project at the bottom of the scientific stack. >> > >> > Unfortunately, numpy-stubs never really took off. I can think of a few >> > reasons for that: >> > 1. Missing high level guidance on how to write type annotations, >> > particularly for how (or if) to annotate particularly dynamic parts of >> > NumPy (e.g., consider __array_function__), and whether we should >> > prioritize strictness or faithfulness [2]. >> > 2. We didn't have a good experience for new contributors. Due to the >> > relatively low level of interest in the project, when a contributor >> > would occasionally drop in, I often didn't even notice their PR for a >> > few weeks. >> > 3. Developing type annotations separately from the main codebase makes >> > them a little harder to keep in sync. This means that type annotations >> > couldn't serve their typical purpose of self-documenting code. Part of >> > this may be necessary for NumPy (due to our use of C extensions), but >> > large parts of NumPy's user facing APIs are written in Python. We no >> > longer support Python 2, so at least we no longer need to worry about >> > putting annotations in comments. >> > >> > We eventually could probably use a formal NEP (or several) on how we >> > want to use type annotations in NumPy, but I think a good first step >> > would be to think about how to start moving the annotations from >> > numpy-stubs into numpy proper. >> > >> > Any thoughts? Anyone interested in taking the lead on this? 
>> > >> > Cheers, >> > Stephan >> > >> > [1] https://github.com/numpy/numpy-stubs >> > [2] https://github.com/numpy/numpy-stubs/issues/12 >> > >> > _______________________________________________ >> > NumPy-Discussion mailing list >> > NumPy-Discussion at python.org >> > https://mail.python.org/mailman/listinfo/numpy-discussion >> > >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion From quantkeyvis at gmail.com Tue Mar 24 18:50:02 2020 From: quantkeyvis at gmail.com (Keyvis Damptey) Date: Tue, 24 Mar 2020 18:50:02 -0400 Subject: [Numpy-discussion] NumPy-Discussion Digest, Vol 162, Issue 27 In-Reply-To: References: Message-ID: Thanks. This is soooo embarrassing, but I wasn't able to create a new matrix because I forgot to delete the original massive matrix. I was testing how big it could go in terms of rows/columns before reaching the limit and forgot to delete the last object before creating a new one. Sadly that data usage was not reflected in the task manager for the VM instance. -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Tue Mar 24 20:15:28 2020 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Tue, 24 Mar 2020 19:15:28 -0500 Subject: [Numpy-discussion] NumPy Development Meeting - Triage Focus Message-ID: <25e7ef22c9d7710a354922f036da8504ac7b0b90.camel@sipsolutions.net> Hi all, Our bi-weekly triage-focused NumPy development meeting is tomorrow (Wednesday, March 25) at 11 am Pacific Time. Everyone is invited to join in and edit the work-in-progress meeting topics and notes: https://hackmd.io/68i_JvOYQfy9ERiHgXMPvg I encourage everyone to notify us of issues or PRs that you feel should be prioritized or simply discussed briefly. Just comment on it so we can label it, or add your PR/issue to this week's topics for discussion. Best regards Sebastian -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From jni at fastmail.com Tue Mar 24 23:56:17 2020 From: jni at fastmail.com (Juan Nunez-Iglesias) Date: Tue, 24 Mar 2020 22:56:17 -0500 Subject: [Numpy-discussion] Put type annotations in NumPy proper? In-Reply-To: References: <950157a0-5548-9065-0b39-d877e51c824f@gmail.com> Message-ID: <59d01230-8e34-4971-89ae-8b569937e19e@www.fastmail.com> I'd like to offer a +1 from skimage's perspective (and napari's!) for having NumPy types directly in the repo. We have been wanting to start using type annotations, but the lack of types in NumPy proper, together with the uncertainty about whether numpy-stubs was "officially supported" and in-sync with NumPy itself, have been factors in holding us back. The __array_function__ problem is a major source of confusion for those of us unfamiliar with typing theory. 
Protocols would seem to solve this but they were only recently accepted and are only available by default in Python 3.8+: https://www.python.org/dev/peps/pep-0544/ https://mypy.readthedocs.io/en/stable/protocols.html I'd like to avoid driving this discussion off-topic, so it would be great if there was a "typing interest group" or similar list where we could discuss these issues. One thing that we would love to do in skimage is distinguish different kinds of arrays using NewType, e.g. Image = NewType('Image', np.ndarray) Coordinates = NewType('Coordinates', np.ndarray) def find_maxima(image : Image) -> Coordinates: ... def gaussian_filter(image : Image, sigma : float) -> Image: ... then find_maxima(gaussian_filter(image, num)) validates, but gaussian_filter(find_maxima(image), num) does not, even though the arguments are all NumPy arrays. However, the above cannot be combined with an array Protocol: https://www.python.org/dev/peps/pep-0544/#newtype-and-type-aliases I'd love to understand the reasons why, and whether this decision can be reversed, but I am out of my depth. Again, this is probably something for a different thread, but I just wanted to flag it as something to discuss as we develop a typing framework to suit the entire SciPy ecosystem. Also if someone has resources for "type theory for beginners", both in Python and more generally, they would be appreciated! Thanks Stephan for raising this issue! Juan. On Tue, 24 Mar 2020, at 5:42 PM, Joshua Wilson wrote: > > That is, is this an all-or-nothing thing where as soon as we start, numpy-stubs becomes unusable? > > Until NumPy is made PEP 561 compatible by adding a `py.typed` file, > type checkers will ignore the types in the repo, so in theory you can > avoid the all or nothing. In practice it's maybe trickier because > currently people can use the stubs, but they won't be able to use the > types in the repo until the PEP 561 switch is flipped. So e.g. > currently SciPy pulls the stubs from `numpy-stubs` master, allowing > for a short > > find place where NumPy stubs are lacking -> improve stubs -> improve SciPy types > > loop. If all development moves into the main repo then SciPy is > blocked on it becoming PEP 561 compatible before moving forward. But, > you could complain that I put the cart before the horse with > introducing typing in the SciPy repo before the NumPy types were more > resolved, and that's probably a fair complaint. > > > Anyone interested in taking the lead on this? > > Not that I am a core developer or anything, but I am interested in > helping to improve typing in NumPy. > > On Tue, Mar 24, 2020 at 11:15 AM Eric Wieser > wrote: > > > > > Putting > > > aside ndarray, as more challenging, even annotations for numpy functions > > > and method parameters with built-in types would help, as a start. > > > > This is a good idea in principle, but one thing concerns me. > > > > If we add type annotations to numpy, does it become an error to have numpy-stubs installed? > > That is, is this an all-or-nothing thing where as soon as we start, numpy-stubs becomes unusable? > > > > Eric > > > > On Tue, 24 Mar 2020 at 17:28, Roman Yurchak wrote: > >> > >> Thanks for re-starting this discussion, Stephan! I think there is > >> definitely significant interest in this topic: > >> https://github.com/numpy/numpy/issues/7370 is the issue with the largest > >> number of user likes in the issue tracker (FWIW). 
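A self-contained version of the NewType sketch from earlier in this message; the function bodies are placeholders for illustration, not skimage's real implementations:

    import numpy as np
    from typing import NewType

    Image = NewType('Image', np.ndarray)
    Coordinates = NewType('Coordinates', np.ndarray)

    def gaussian_filter(image: Image, sigma: float) -> Image:
        # placeholder body; a real filter would smooth the image
        return Image(np.asarray(image))

    def find_maxima(image: Image) -> Coordinates:
        # placeholder body; a real implementation would locate peaks
        return Coordinates(np.argwhere(image == image.max()))

    img = Image(np.zeros((4, 4)))
    find_maxima(gaussian_filter(img, 1.0))     # accepted by mypy
    # gaussian_filter(find_maxima(img), 1.0)   # rejected: Coordinates is not Image

At runtime NewType is just an identity function, so this costs nothing; the distinction only exists for the type checker.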
> >> > >> Having them in numpy, as opposed to a separate numpy-stubs repository > >> would indeed be ideal from a user perspective. When looking into it in > >> the past, I was never sure how well in sync numpy-stubs was. Putting > >> aside ndarray, as more challenging, even annotations for numpy functions > >> and method parameters with built-in types would help, as a start. > >> > >> To add to the previously listed projects that would benefit from this, > >> we are currently considering to start using some (minimal) type > >> annotations in scikit-learn. > >> > >> -- > >> Roman Yurchak > >> > >> On 24/03/2020 18:00, Stephan Hoyer wrote: > >> > When we started numpy-stubs [1] a few years ago, putting type > >> > annotations in NumPy itself seemed premature. We still supported Python > >> > 2, which meant that we would need to use awkward comments for type > >> > annotations. > >> > > >> > Over the past few years, using type annotations has become increasingly > >> > popular, even in the scientific Python stack. For example, off-hand I > >> > know that at least SciPy, pandas and xarray have at least part of their > >> > APIs type annotated. Even without annotations for shapes or dtypes, it > >> > would be valuable to have near complete annotations for NumPy, the > >> > project at the bottom of the scientific stack. > >> > > >> > Unfortunately, numpy-stubs never really took off. I can think of a few > >> > reasons for that: > >> > 1. Missing high level guidance on how to write type annotations, > >> > particularly for how (or if) to annotate particularly dynamic parts of > >> > NumPy (e.g., consider __array_function__), and whether we should > >> > prioritize strictness or faithfulness [2]. > >> > 2. We didn't have a good experience for new contributors. Due to the > >> > relatively low level of interest in the project, when a contributor > >> > would occasionally drop in, I often didn't even notice their PR for a > >> > few weeks. > >> > 3. Developing type annotations separately from the main codebase makes > >> > them a little harder to keep in sync. This means that type annotations > >> > couldn't serve their typical purpose of self-documenting code. Part of > >> > this may be necessary for NumPy (due to our use of C extensions), but > >> > large parts of NumPy's user facing APIs are written in Python. We no > >> > longer support Python 2, so at least we no longer need to worry about > >> > putting annotations in comments. > >> > > >> > We eventually could probably use a formal NEP (or several) on how we > >> > want to use type annotations in NumPy, but I think a good first step > >> > would be to think about how to start moving the annotations from > >> > numpy-stubs into numpy proper. > >> > > >> > Any thoughts? Anyone interested in taking the lead on this? 
> >> > > >> > Cheers, > >> > Stephan > >> > > >> > [1] https://github.com/numpy/numpy-stubs > >> > [2] https://github.com/numpy/numpy-stubs/issues/12 > >> > > >> > _______________________________________________ > >> > NumPy-Discussion mailing list > >> > NumPy-Discussion at python.org > >> > https://mail.python.org/mailman/listinfo/numpy-discussion > >> > > >> > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion at python.org > >> https://mail.python.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at python.org > > https://mail.python.org/mailman/listinfo/numpy-discussion > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > From compl.yue at icloud.com Wed Mar 25 02:18:28 2020 From: compl.yue at icloud.com (YueCompl) Date: Wed, 25 Mar 2020 14:18:28 +0800 Subject: [Numpy-discussion] Numpy doesn't use RAM In-Reply-To: References: Message-ID: An alternative solution may be https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html If you are sure your subsequent computation against the array data has enough locality to avoid thrashing, I think numpy.memmap would work for you, i.e. using an explicit disk file to serve as swap. My environment does a lot of mmap'ing of on-disk data files from C++ (after Python reads the metadata), then wraps them as ndarrays; that is enough to run out-of-core programs so long as the data access pattern fits in physical RAM at any instant, and then even scanning the whole dataset along the time axis is okay. Memory (address space) fragmentation is a problem, as is the OS' `nofile` limit (the number of file handles held open) when too many small data files are involved; we are switching to a solution with a FUSE-based filesystem that presents the many small files on a remote storage server as one virtual large file. Cheers, Compl > On 2020-03-25, at 02:35, Stanley Seibert wrote: > > In addition to what Sebastian said about memory fragmentation and OS limits about memory allocations, I do think it will be hard to work with an array that close to the memory limit in NumPy regardless. Almost any operation will need to make a temporary array and exceed your memory limit. You might want to look at Dask Array for a NumPy-like API for working with chunked arrays that can be staged in and out of memory: > > https://docs.dask.org/en/latest/array.html > > As a bonus, Dask will also let you make better use of the large number of CPU cores that you likely have in your 1.9 TB RAM system. :) > > On Tue, Mar 24, 2020 at 1:00 PM Keyvis Damptey > wrote: > Hi Numpy dev community, > > I'm keyvis, a statistical data scientist. > > I'm currently using numpy in python 3.8.2 64-bit for a clustering problem, on a machine with 1.9 TB RAM. When I try using np.zeros to create a 600,000 by 600,000 matrix of dtype=np.float32 it says > "Unable to allocate 1.31 TiB for an array with shape (600000, 600000) and data type float32" > > I used psutils to determine how much RAM python thinks it has access to and it return with 1.8 TB approx. > > Is there some way I can fix numpy to create these large arrays? 
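A minimal sketch of the memmap route suggested above; the file name is illustrative, and note that mode='w+' creates a file as large as the array (sparse on most filesystems):

    import numpy as np

    # Disk-backed array: only the pages actually touched need to be in RAM.
    big = np.memmap('big_matrix.dat', dtype=np.float32,
                    mode='w+', shape=(600_000, 600_000))

    big[0, :10] = 1.0    # writes go to the mapped file
    big.flush()          # push dirty pages out to disk

Whether this helps depends, as Compl says, on the locality of the access pattern; random access over the full 1.4 TB would thrash.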
> Thanks for your time and consideration > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- An HTML attachment was scrubbed... URL: From toddrjen at gmail.com Wed Mar 25 13:32:52 2020 From: toddrjen at gmail.com (Todd) Date: Wed, 25 Mar 2020 13:32:52 -0400 Subject: [Numpy-discussion] Put type annotations in NumPy proper? In-Reply-To: References: <950157a0-5548-9065-0b39-d877e51c824f@gmail.com> Message-ID: What about migrating numpy-stubs over time to just be a literal stub for numpy's own built-in implementation? Once numpy's version is done, a deprecation warning can then be added to numpy-stubs. On Tue, Mar 24, 2020, 14:15 Eric Wieser wrote: > > Putting > > aside ndarray, as more challenging, even annotations for numpy functions > > and method parameters with built-in types would help, as a start. > > This is a good idea in principle, but one thing concerns me. > > If we add type annotations to numpy, does it become an error to have > numpy-stubs installed? > That is, is this an all-or-nothing thing where as soon as we start, > numpy-stubs becomes unusable? > > Eric > > On Tue, 24 Mar 2020 at 17:28, Roman Yurchak wrote: > >> Thanks for re-starting this discussion, Stephan! I think there is >> definitely significant interest in this topic: >> https://github.com/numpy/numpy/issues/7370 is the issue with the largest >> number of user likes in the issue tracker (FWIW). >> >> Having them in numpy, as opposed to a separate numpy-stubs repository >> would indeed be ideal from a user perspective. When looking into it in >> the past, I was never sure how well in sync numpy-stubs was. Putting >> aside ndarray, as more challenging, even annotations for numpy functions >> and method parameters with built-in types would help, as a start. >> >> To add to the previously listed projects that would benefit from this, >> we are currently considering to start using some (minimal) type >> annotations in scikit-learn. >> >> -- >> Roman Yurchak >> >> On 24/03/2020 18:00, Stephan Hoyer wrote: >> > When we started numpy-stubs [1] a few years ago, putting type >> > annotations in NumPy itself seemed premature. We still supported Python >> > 2, which meant that we would need to use awkward comments for type >> > annotations. >> > >> > Over the past few years, using type annotations has become increasingly >> > popular, even in the scientific Python stack. For example, off-hand I >> > know that at least SciPy, pandas and xarray have at least part of their >> > APIs type annotated. Even without annotations for shapes or dtypes, it >> > would be valuable to have near complete annotations for NumPy, the >> > project at the bottom of the scientific stack. >> > >> > Unfortunately, numpy-stubs never really took off. I can think of a few >> > reasons for that: >> > 1. Missing high level guidance on how to write type annotations, >> > particularly for how (or if) to annotate particularly dynamic parts of >> > NumPy (e.g., consider __array_function__), and whether we should >> > prioritize strictness or faithfulness [2]. >> > 2. We didn't have a good experience for new contributors. 
Due to the >> > relatively low level of interest in the project, when a contributor >> > would occasionally drop in, I often didn't even notice their PR for a >> > few weeks. >> > 3. Developing type annotations separately from the main codebase makes >> > them a little harder to keep in sync. This means that type annotations >> > couldn't serve their typical purpose of self-documenting code. Part of >> > this may be necessary for NumPy (due to our use of C extensions), but >> > large parts of NumPy's user facing APIs are written in Python. We no >> > longer support Python 2, so at least we no longer need to worry about >> > putting annotations in comments. >> > >> > We eventually could probably use a formal NEP (or several) on how we >> > want to use type annotations in NumPy, but I think a good first step >> > would be to think about how to start moving the annotations from >> > numpy-stubs into numpy proper. >> > >> > Any thoughts? Anyone interested in taking the lead on this? >> > >> > Cheers, >> > Stephan >> > >> > [1] https://github.com/numpy/numpy-stubs >> > [2] https://github.com/numpy/numpy-stubs/issues/12 >> > >> > _______________________________________________ >> > NumPy-Discussion mailing list >> > NumPy-Discussion at python.org >> > https://mail.python.org/mailman/listinfo/numpy-discussion >> > >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at gmail.com Sun Mar 29 12:30:13 2020 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Sun, 29 Mar 2020 18:30:13 +0200 Subject: [Numpy-discussion] Put type annotations in NumPy proper? In-Reply-To: References: <950157a0-5548-9065-0b39-d877e51c824f@gmail.com> Message-ID: On Tue, Mar 24, 2020 at 11:43 PM Joshua Wilson wrote: > > That is, is this an all-or-nothing thing where as soon as we start, > numpy-stubs becomes unusable? > > Until NumPy is made PEP 561 compatible by adding a `py.typed` file, > type checkers will ignore the types in the repo, so in theory you can > avoid the all or nothing. In practice it's maybe trickier because > currently people can use the stubs, but they won't be able to use the > types in the repo until the PEP 561 switch is flipped. So e.g. > currently SciPy pulls the stubs from `numpy-stubs` master, allowing > for a short > > find place where NumPy stubs are lacking -> improve stubs -> improve SciPy > types > > loop. If all development moves into the main repo then SciPy is > blocked on it becoming PEP 561 compatible before moving forward. But, > you could complain that I put the cart before the horse with > introducing typing in the SciPy repo before the NumPy types were more > resolved, and that's probably a fair complaint. > I think it makes a lot of sense to add types to SciPy, and then improve numpy-stubs the moment you are blocked on something. This allows for finding relevant issues and iterating quickly. Moving things into the main NumPy repo could be done right before branching off 1.19.x, which is still a few months away. Until then there's probably only downsides to moving it into the main repo - it's easier to depend on master of numpy-stubs than on master of numpy in CI and dev workflows. 
> > > Anyone interested in taking the lead on this?
>
> Not that I am a core developer or anything, but I am interested in
> helping to improve typing in NumPy.
>

Thanks again Josh!

Cheers,
Ralf

> On Tue, Mar 24, 2020 at 11:15 AM Eric Wieser wrote:
> >
> > > Putting aside ndarray as more challenging, even annotations with
> > > built-in types for numpy functions and method parameters would
> > > help, as a start.
> >
> > This is a good idea in principle, but one thing concerns me.
> >
> > If we add type annotations to numpy, does it become an error to have
> > numpy-stubs installed? That is, is this an all-or-nothing thing where
> > as soon as we start, numpy-stubs becomes unusable?
> >
> > Eric
> >
> > On Tue, 24 Mar 2020 at 17:28, Roman Yurchak wrote:
> >>
> >> Thanks for re-starting this discussion, Stephan! I think there is
> >> definitely significant interest in this topic:
> >> https://github.com/numpy/numpy/issues/7370 is the issue with the
> >> largest number of user likes in the issue tracker (FWIW).
> >>
> >> Having them in numpy, as opposed to a separate numpy-stubs repository,
> >> would indeed be ideal from a user perspective. When looking into it in
> >> the past, I was never sure how well in sync numpy-stubs was. Putting
> >> aside ndarray as more challenging, even annotations with built-in
> >> types for numpy functions and method parameters would help, as a
> >> start.
> >>
> >> To add to the previously listed projects that would benefit from this,
> >> we are currently considering starting to use some (minimal) type
> >> annotations in scikit-learn.
> >>
> >> --
> >> Roman Yurchak
> >>
> >> On 24/03/2020 18:00, Stephan Hoyer wrote:
> >> > When we started numpy-stubs [1] a few years ago, putting type
> >> > annotations in NumPy itself seemed premature. We still supported
> >> > Python 2, which meant that we would need to use awkward comments
> >> > for type annotations.
> >> >
> >> > Over the past few years, using type annotations has become
> >> > increasingly popular, even in the scientific Python stack. For
> >> > example, off-hand I know that SciPy, pandas and xarray all have
> >> > type annotations for at least part of their APIs. Even without
> >> > annotations for shapes or dtypes, it would be valuable to have
> >> > near-complete annotations for NumPy, the project at the bottom of
> >> > the scientific stack.
> >> >
> >> > Unfortunately, numpy-stubs never really took off. I can think of a
> >> > few reasons for that:
> >> > 1. Missing high-level guidance on how to write type annotations,
> >> > particularly for how (or if) to annotate particularly dynamic parts
> >> > of NumPy (e.g., consider __array_function__), and whether we should
> >> > prioritize strictness or faithfulness [2].
> >> > 2. We didn't have a good experience for new contributors. Due to the
> >> > relatively low level of interest in the project, when a contributor
> >> > would occasionally drop in, I often didn't even notice their PR for
> >> > a few weeks.
> >> > 3. Developing type annotations separately from the main codebase
> >> > makes them a little harder to keep in sync. It also means that type
> >> > annotations couldn't serve their typical purpose of self-documenting
> >> > code. Part of this may be necessary for NumPy (due to our use of C
> >> > extensions), but large parts of NumPy's user-facing APIs are written
> >> > in Python. We no longer support Python 2, so at least we no longer
> >> > need to worry about putting annotations in comments.
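[For reference, the two annotation styles being contrasted above look like
this. An illustrative sketch only, not actual NumPy signatures:]

    from typing import Sequence

    # Python 2 compatible style: the annotation hides in a "type comment"
    # that type checkers read but the interpreter ignores.
    def head_py2(values, n):
        # type: (Sequence[int], int) -> Sequence[int]
        return values[:n]

    # Python 3 style: the same information lives in the signature itself,
    # where it also doubles as documentation.
    def head_py3(values: Sequence[int], n: int) -> Sequence[int]:
        return values[:n]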
> >> >
> >> > Eventually we could probably use a formal NEP (or several) on how
> >> > we want to use type annotations in NumPy, but I think a good first
> >> > step would be to think about how to start moving the annotations
> >> > from numpy-stubs into numpy proper.
> >> >
> >> > Any thoughts? Anyone interested in taking the lead on this?
> >> >
> >> > Cheers,
> >> > Stephan
> >> >
> >> > [1] https://github.com/numpy/numpy-stubs
> >> > [2] https://github.com/numpy/numpy-stubs/issues/12

From wieser.eric+numpy at gmail.com  Mon Mar 30 07:09:16 2020
From: wieser.eric+numpy at gmail.com (Eric Wieser)
Date: Mon, 30 Mar 2020 11:09:16 +0000
Subject: [Numpy-discussion] Deprecating ndarray.tostring()
Message-ID:

Hi all,

Just a heads up that in https://github.com/numpy/numpy/pull/15867 I plan
to deprecate ndarray.tostring(), which is just a confusing way to spell
ndarray.tobytes().

This function has been documented as a compatibility alias since NumPy
1.9, but has never emitted a warning upon use. Given that
array.array.tostring() has been warning as far back as Python 3.1, it
seems we should follow suit.

In order to reduce the impact of the deprecation, I've filed the
necessary SciPy PR: https://github.com/scipy/scipy/pull/11755.
It's unlikely we'll remove this function entirely any time soon, but the
act of deprecating it may cause a few failing CI runs in projects where
warnings are turned into errors.

Eric

From ralf.gommers at gmail.com  Tue Mar 31 07:32:39 2020
From: ralf.gommers at gmail.com (Ralf Gommers)
Date: Tue, 31 Mar 2020 13:32:39 +0200
Subject: [Numpy-discussion] Deprecating ndarray.tostring()
In-Reply-To:
References:
Message-ID:

On Mon, Mar 30, 2020 at 1:09 PM Eric Wieser wrote:

> Hi all,
>
> Just a heads up that in https://github.com/numpy/numpy/pull/15867 I plan
> to deprecate ndarray.tostring(), which is just a confusing way to spell
> ndarray.tobytes().
>
> This function has been documented as a compatibility alias since NumPy
> 1.9, but has never emitted a warning upon use. Given that
> array.array.tostring() has been warning as far back as Python 3.1, it
> seems we should follow suit.
>
> In order to reduce the impact of the deprecation, I've filed the
> necessary SciPy PR: https://github.com/scipy/scipy/pull/11755.
> It's unlikely we'll remove this function entirely any time soon, but the
> act of deprecating it may cause a few failing CI runs in projects where
> warnings are turned into errors.
>

+1. I agree with the deprecation, and with making it a long one, given the
handful of uses in Matplotlib and the several handfuls in SciPy.
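For projects that do hit the warning, the migration is a mechanical
rename; a small sketch of both spellings (the DeprecationWarning behaviour
assumes the linked PR lands as proposed):

    import warnings

    import numpy as np

    a = np.arange(4, dtype=np.uint8)

    # New spelling -- same bytes, clearer name:
    assert a.tobytes() == b"\x00\x01\x02\x03"

    # Old spelling -- silent today; once the deprecation lands it would
    # emit a DeprecationWarning, which CI setups that promote warnings
    # to errors will surface:
    with warnings.catch_warnings():
        warnings.simplefilter("error", DeprecationWarning)
        try:
            a.tostring()
        except DeprecationWarning:
            print("caught deprecated tostring() call")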
Cheers,
Ralf

From charlesr.harris at gmail.com  Tue Mar 31 09:57:50 2020
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Tue, 31 Mar 2020 07:57:50 -0600
Subject: [Numpy-discussion] Deprecating ndarray.tostring()
In-Reply-To:
References:
Message-ID:

On Tue, Mar 31, 2020 at 5:33 AM Ralf Gommers wrote:

> On Mon, Mar 30, 2020 at 1:09 PM Eric Wieser wrote:
>
>> Just a heads up that in https://github.com/numpy/numpy/pull/15867 I plan
>> to deprecate ndarray.tostring(), which is just a confusing way to spell
>> ndarray.tobytes().
>> [...]
>
> +1. I agree with the deprecation, and with making it a long one, given
> the handful of uses in Matplotlib and the several handfuls in SciPy.
>

+1 also.

Chuck

From sebastian at sipsolutions.net  Tue Mar 31 11:42:10 2020
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Tue, 31 Mar 2020 10:42:10 -0500
Subject: [Numpy-discussion] (Two hours later!) NumPy Community Meeting
 Wednesday, April 1
Message-ID: <536451648baa29c45917c97d63f81fda8d2e2f72.camel@sipsolutions.net>

Hi all,

There will be a NumPy Community meeting Wednesday April 1 at 1(!) pm
Pacific Time. Everyone is invited and encouraged to join in and edit the
work-in-progress meeting topics and notes:
https://hackmd.io/76o-IxCjQX2mOXO_wwkcpg?both

Best wishes

Sebastian

PS: Remember that we will try holding the meeting two hours later than
before, to accommodate more timezones. I realize that this will be very
late for those in Europe, but it still seemed workable in the poll (and
Matti suggested it), so it may be the best time. I see this as a trial
balloon for now; please complain here or to me if the time is bad for
you, and we can move it back.

From cimrman3 at ntc.zcu.cz  Tue Mar 31 13:26:49 2020
From: cimrman3 at ntc.zcu.cz (Robert Cimrman)
Date: Tue, 31 Mar 2020 19:26:49 +0200
Subject: [Numpy-discussion] ANN: SfePy 2020.1
Message-ID:

I am pleased to announce release 2020.1 of SfePy.

Description
-----------

SfePy (simple finite elements in Python) is software for solving systems
of coupled partial differential equations by the finite element method, or
by isogeometric analysis (with limited support). It is distributed under
the new BSD license.

Home page: http://sfepy.org
Mailing list: https://mail.python.org/mm3/mailman3/lists/sfepy.python.org/
Git (source) repository, issue tracker: https://github.com/sfepy/sfepy

Highlights of this release
--------------------------

- reading/writing of additional mesh formats by using meshio [2] (see the
  short example below)
- Python 3 only from now on

For full release notes see [1].
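A tiny example of what the meshio support enables (the file names are
made up for illustration):

    import meshio

    # meshio infers the format from the file extension.
    mesh = meshio.read("cylinder.vtk")
    print(mesh.points.shape)

    # The same mesh can be written back out in another supported format.
    meshio.write("cylinder.xdmf", mesh)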
Cheers,
Robert Cimrman

[1] http://docs.sfepy.org/doc/release_notes.html#id1
[2] https://github.com/nschloe/meshio

---

Contributors to this release in alphabetical order:

Robert Cimrman
Lubos Kejzlar
Vladimir Lukes