From matti.picus at gmail.com Tue Feb 4 10:08:25 2020
From: matti.picus at gmail.com (Matti Picus)
Date: Tue, 4 Feb 2020 17:08:25 +0200
Subject: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics
Message-ID: 

Together with Sayed Adel (cc) and Ralf, I am pleased to put the draft version of NEP 38 [0] up for discussion. As per NEP 0, this is the next step in the community accepting the approach laid out in the NEP. The NEP PR [1] has already garnered a fair amount of discussion about the viability of Universal SIMD Intrinsics, so I will try to capture some of that here as well.

Abstract

While compilers are getting better at using hardware-specific routines to optimize code, they sometimes do not produce optimal results. Also, we would like to be able to copy binary optimized C-extension modules from one machine to another with the same base architecture (x86, ARM, PowerPC) but with different capabilities, without recompiling. We already have a mechanism in the ufunc machinery to build alternative loops indexed by CPU feature name: at import (in InitOperators), the loop function that matches the run-time CPU info is chosen from the candidates. This NEP proposes a mechanism to build on that for many more features and architectures. The proposed steps are to:

- Establish a set of well-defined, architecture-agnostic, universal intrinsics which capture features available across architectures.
- Capture these universal intrinsics in a set of C macros and use the macros to build code paths for sets of features, from the baseline up to the maximum set of features available on that architecture. Offer these as a limited number of compiled alternative code paths.
- At runtime, discover which CPU features are available, and choose from among the possible code paths accordingly.

Motivation and Scope

Traditionally NumPy has counted on compilers to generate optimal code specifically for the target architecture. However, few users today compile NumPy locally for their machines. Most use the binary packages, which must provide run-time support for the lowest-common-denominator CPU architecture. Thus NumPy cannot take advantage of more advanced CPU features, since they may not be available on all users' systems. The ufunc machinery already has a loop-selection protocol based on dtypes, so it is easy to extend this to also select an optimal loop for the CPU features actually available at runtime.

Traditionally, these features have been exposed through intrinsics, which are compiler-specific instructions that map directly to assembly instructions. Recently there were discussions about the effectiveness of adding more intrinsics (e.g., `gh-11113`_ for AVX optimizations for floats). In the past, architecture-specific code was added to NumPy for fast avx512 routines in various ufuncs, using the mechanism described above to choose the best loop for the architecture. However, that code is not generic and does not generalize to other architectures. Recently, OpenCV moved to using universal intrinsics in its Hardware Abstraction Layer (HAL), which provides a nice abstraction for common shared Single Instruction Multiple Data (SIMD) constructs. This NEP proposes a similar mechanism for NumPy. There are three stages to using the mechanism:

- Infrastructure is provided in the code for abstract intrinsics. The ufunc machinery will be extended using sets of these abstract intrinsics, so that a single ufunc will be expressed as a set of loops, going from a minimal to a maximal set of possibly available intrinsics.
- At compile time, compiler macros and CPU detection are used to turn the abstract intrinsics into concrete intrinsic calls. Any intrinsic not available on the platform (either because the CPU does not support it, and so it cannot be tested, or because the abstract intrinsic has no parallel concrete intrinsic on that platform) will not cause an error; instead, the corresponding loop is simply not produced and not added to the set of possibilities.
- At runtime, the CPU detection code will further limit the set of loops available, and the optimal one will be chosen for the ufunc.

The current NEP proposes only to use the runtime feature detection and optimal loop selection mechanism for ufuncs. Future NEPs may propose other uses for the proposed solution.
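To illustrate the runtime stage, here is a rough Python sketch of the selection idea. The loop names and feature sets are made up for illustration; the real mechanism will live in C inside the ufunc machinery.

    # Illustration only: candidate loops for one ufunc, keyed by the CPU
    # features each compiled code path requires (empty set = baseline loop).
    candidate_loops = {
        frozenset(): "baseline_loop",
        frozenset({"SSE42"}): "sse42_loop",
        frozenset({"AVX2"}): "avx2_loop",
        frozenset({"AVX512F"}): "avx512f_loop",
    }

    def pick_loop(cpu_features):
        """Pick the most capable loop whose required features are all present."""
        supported = {name for name, present in cpu_features.items() if present}
        usable = [needed for needed in candidate_loops if needed <= supported]
        return candidate_loops[max(usable, key=len)]

    # Hypothetical runtime feature report for one machine.
    print(pick_loop({"SSE42": True, "AVX2": True, "AVX512F": False}))
    # -> "avx2_loop"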
Usage and Impact

The end user will be able to get a list of intrinsics available for their platform and compiler. Optionally, the user may be able to specify which of the loops available at runtime will be used, perhaps via an environment variable, to enable benchmarking the impact of the different loops. There should be no direct impact on naive end users: the results of all the loops should be identical to within a small number (1-3?) of ULPs. On the other hand, users with more powerful machines should notice a significant performance boost.

Binary releases - wheels on PyPI and conda packages

The binaries released by this process will be larger, since they include all possible loops for the architecture. Some packagers may prefer to limit the number of loops in order to limit the size of the binaries; we would hope they would still support a wide range of families of architectures. Note this problem already exists in the Intel MKL offering, where the binary package includes an extensive set of alternative shared objects (DLLs) for various CPU alternatives.

Source builds

See "Detailed Description" below. A source build where the packager knows the details of the target machine could theoretically produce a smaller binary by choosing, via command line arguments, to compile only the loops needed by the target.

How to run benchmarks to assess performance benefits

Adding more code which uses intrinsics will make the code harder to maintain. Therefore, such code should only be added if it yields a significant performance benefit. Assessing this performance benefit can be nontrivial. To aid with this, the implementation for this NEP will add a way to select which instruction sets can be used at runtime via environment variables (name TBD). This ability is critical for CI code verification.

Diagnostics

A new dictionary __cpu_features__ will be available to Python. The keys are the CPU feature names; the value for each is a boolean indicating whether that feature is available on the running machine. Various new private C functions will be used internally to query available features. These might be exposed via specific C-extension modules for testing.
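As a sketch of the intended usage (the exact attribute name and where it will live are implementation details not fixed by the NEP, so treat this as illustrative only):

    import numpy as np

    # The NEP proposes a dictionary mapping feature names to booleans.
    # Its final location is not settled here, so fall back to a stand-in dict.
    cpu_features = getattr(np, "__cpu_features__",
                           {"SSE42": True, "AVX2": True, "AVX512F": False})

    supported = sorted(name for name, present in cpu_features.items() if present)
    print("SIMD features detected at runtime:", supported)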
Workflow for adding a new CPU architecture-specific optimization

NumPy will always have a baseline C implementation for any code that may be a candidate for SIMD vectorization. If a contributor wants to add SIMD support for some architecture (typically the one of most interest to them), this is the proposed workflow: TODO (see https://github.com/numpy/numpy/pull/13516#issuecomment-558859638, needs to be worked out more).

Reuse by other projects

It would be nice if the universal intrinsics were available to other libraries like SciPy or Astropy that also build ufuncs, but that is not an explicit goal of the first implementation of this NEP.

-----------------------------------------------------------------------------------

My biased summary of select comments from the PR:

(Raghuveer): A very similar SIMD library has been proposed for C++. Here are the links to the details:
1. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0214r8.pdf
2. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/n4808.pdf
There is good discussion there on the minimal/common set of instructions across architectures (which narrows down to loads, stores, arithmetic, compare, bitwise and shuffle instructions). Based on my developer experience so far, these instructions aren't by themselves enough to implement and optimize NumPy ufuncs. As I pointed out earlier, I think I would find it useful to learn the workflow for using instructions that don't fit in the universal intrinsic framework.

(Raghuveer) gave a well laid out table of the currently proposed universal intrinsics by use: load/store, reorder, operators, conversions, arithmetic and misc [2], which led to a long response from Sayed [3] with some sample code, demonstrating how more complex operations can be built up from the primitives.
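(To give a flavour of what "building more complex operations from the primitives" means: this is not the code from [3], just an illustration with made-up Python stand-ins for the load/store/arithmetic primitives.)

    import numpy as np

    # Made-up stand-ins for abstract SIMD primitives; not real NumPy APIs.
    def vec_load(buf, i, width):        # load one "vector register"
        return np.asarray(buf[i:i + width])

    def vec_store(buf, i, vec):         # store one "vector register"
        buf[i:i + len(vec)] = vec

    def vec_muladd(a, b, c):            # multiply-add composed from mul + add
        return a * b + c

    def axpy(out, x, y, alpha, width=4):
        """out = alpha*x + y, one simulated vector register at a time."""
        n = len(x) - len(x) % width
        for i in range(0, n, width):
            vec_store(out, i, vec_muladd(alpha, vec_load(x, i, width),
                                         vec_load(y, i, width)))
        for i in range(n, len(x)):      # scalar tail for leftover elements
            out[i] = alpha * x[i] + y[i]

    x, y, out = np.arange(10.0), np.ones(10), np.empty(10)
    axpy(out, x, y, 2.0)                # out == 2*x + y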
(catree) mentioned the Simd Library [4] and Halide [5] and asked about maintainability.

(Ralf) responded [6] with concerns about competent developer bandwidth for code review. He also mentioned that our CI system currently supports all the architectures we are targeting (x86, aarch64, s390x, ppc64le), although some of these machines may not have the most advanced hardware to support the latest intrinsics.

I apologize if my summary is not accurate; please correct any mistakes or misconceptions.

----------------------------------------------------------------------------------------

Barring complete rejection of the idea here, we will be pushing forward with PRs to implement this. Comments either on the mailing list or in those PRs are welcome.

Matti

[0] https://numpy.org/neps/nep-0038-SIMD-optimizations.html
[1] https://github.com/numpy/numpy/pull/15228
[2] https://github.com/numpy/numpy/pull/15228#issuecomment-580479336
[3] https://github.com/numpy/numpy/pull/15228#issuecomment-580605718
[4] https://github.com/ermig1979/Simd
[5] https://halide-lang.org
[6] https://github.com/numpy/numpy/pull/15228#issuecomment-581029991

From daniele at grinta.net Tue Feb 4 13:00:36 2020
From: daniele at grinta.net (Daniele Nicolodi)
Date: Tue, 4 Feb 2020 11:00:36 -0700
Subject: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics
In-Reply-To: 
References: 
Message-ID: 

On 04-02-2020 08:08, Matti Picus wrote:
> Together with Sayed Adel (cc) and Ralf, I am pleased to put the draft
> version of NEP 38 [0] up for discussion. As per NEP 0, this is the next
> step in the community accepting the approach laid out in the NEP. The
> NEP PR [1] has already garnered a fair amount of discussion about the
> viability of Universal SIMD Intrinsics, so I will try to capture some of
> that here as well.

Hello,

more interesting prior art may be found in VOLK https://www.libvolk.org. VOLK is developed mainly to be used in GNU Radio, and this is reflected in the available kernels and in the supported data types; I think the approach used there may be of interest.

Cheers,
Dan

From raghuveer.devulapalli at intel.com Tue Feb 4 14:36:28 2020
From: raghuveer.devulapalli at intel.com (Devulapalli, Raghuveer)
Date: Tue, 4 Feb 2020 19:36:28 +0000
Subject: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics
In-Reply-To: 
References: 
Message-ID: 

Hi everyone,

I know I had raised these questions in the PR, but wanted to post them on the mailing list as well.

1) Once NumPy adds the framework and an initial set of universal intrinsics, if contributors want to leverage a new architecture-specific SIMD instruction, will they be expected to add a software implementation of this instruction for all other architectures too?

2) On whom does the burden lie to ensure that new implementations are benchmarked and show benefits on every architecture? What happens if optimizing a ufunc leads to improving performance on one architecture and worsens performance on another?

Thanks,
Raghuveer

-----Original Message-----
From: NumPy-Discussion On Behalf Of Daniele Nicolodi
Sent: Tuesday, February 4, 2020 10:01 AM
To: numpy-discussion at python.org
Subject: Re: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics

On 04-02-2020 08:08, Matti Picus wrote:
> Together with Sayed Adel (cc) and Ralf, I am pleased to put the draft
> version of NEP 38 [0] up for discussion. As per NEP 0, this is the
> next step in the community accepting the approach laid out in the
> NEP. The NEP PR [1] has already garnered a fair amount of discussion
> about the viability of Universal SIMD Intrinsics, so I will try to
> capture some of that here as well.

Hello,

more interesting prior art may be found in VOLK https://www.libvolk.org. VOLK is developed mainly to be used in GNU Radio, and this is reflected in the available kernels and in the supported data types; I think the approach used there may be of interest.

Cheers,
Dan
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion at python.org
https://mail.python.org/mailman/listinfo/numpy-discussion

From einstein.edison at gmail.com Tue Feb 4 14:59:54 2020
From: einstein.edison at gmail.com (Hameer Abbasi)
Date: Tue, 4 Feb 2020 19:59:54 +0000
Subject: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics
In-Reply-To: 
References: 
Message-ID: 

[snip]

> 1) Once NumPy adds the framework and an initial set of universal intrinsics, if contributors want to leverage a new architecture-specific SIMD instruction, will they be expected to add a software implementation of this instruction for all other architectures too?

In my opinion, if the instructions are lower, then yes. For example, one cannot add AVX-512 without also adding, for example, AVX-256 and AVX-128 and SSE*. However, I would not expect one person or team to be an expert in all assemblies, so intrinsics for one architecture can be developed independently of another.
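A minimal sketch of that implication chain (the feature names and the exact grouping are made up here to illustrate the point, not NumPy's actual build configuration):

    # Illustration: each SIMD feature builds on the one(s) listed for it.
    IMPLIED = {
        "AVX512F": ["AVX2"],
        "AVX2": ["AVX"],
        "AVX": ["SSE42"],
        "SSE42": ["SSE2"],
        "SSE2": [],
    }

    def required_features(target):
        """All features that must be implemented before 'target' can be added."""
        needed, stack = [], [target]
        while stack:
            feature = stack.pop()
            if feature not in needed:
                needed.append(feature)
                stack.extend(IMPLIED.get(feature, []))
        return needed

    print(required_features("AVX512F"))
    # -> ['AVX512F', 'AVX2', 'AVX', 'SSE42', 'SSE2']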
> 2) On whom does the burden lie to ensure that new implementations are benchmarked and show benefits on every architecture? What happens if optimizing a ufunc leads to improving performance on one architecture and worsens performance on another?

I would look at this from a maintainability point of view. If we are increasing the code size by 20% for a certain ufunc, there must be a demonstrable 20% increase in performance on any CPU. That is to say, micro-optimisation will be unwelcome, and code readability will be preferable. Usually we ask the submitter of the PR to test the PR with a machine they have on hand, and I would be inclined to keep this trend of self-reporting. Of course, if someone else came along and reported a performance regression of, say, 10%, then we have increased code by 20% with only a net 5% gain in performance, and the PR will have to be reverted.

[snip]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sebastian at sipsolutions.net Tue Feb 4 16:06:02 2020
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Tue, 04 Feb 2020 13:06:02 -0800
Subject: [Numpy-discussion] NumPy Community Meeting Wednesday, Feb. 05
Message-ID: 

Hi all,

There will be a NumPy Community meeting Wednesday February 5 at 11 am Pacific Time. Everyone is invited to join in and edit the work-in-progress meeting topics and notes: https://hackmd.io/76o-IxCjQX2mOXO_wwkcpg?both

Best wishes

Sebastian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part
URL: 

From charlesr.harris at gmail.com Tue Feb 4 16:17:04 2020
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Tue, 4 Feb 2020 14:17:04 -0700
Subject: [Numpy-discussion] manylinux upgrade for numpy wheels
Message-ID: 

Hi All,

Thought now would be a good time to decide on upgrading manylinux for the 1.19 release so that we can make sure that everything works as expected. The choices are

manylinux1 -- CentOS 5, currently used, gcc 4.2 (in practice 4.5), only supports i686, x86_64.
manylinux2010 -- CentOS 6, gcc 4.5, only supports i686, x86_64.
manylinux2014 -- CentOS 7, gcc 4.8, supports many more architectures.

The main advantage of manylinux2014 is that it supports many new architectures, some of which we are already testing against. The main disadvantage is that it requires pip >= 19.x, which may not be much of a problem 4 months from now but will undoubtedly cause some installation problems. Unfortunately, the compiler remains archaic, but folks interested in performance should be using a performance-oriented distribution or compiling for their native architecture.

Chuck
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From njs at pobox.com Tue Feb 4 17:36:42 2020
From: njs at pobox.com (Nathaniel Smith)
Date: Tue, 4 Feb 2020 14:36:42 -0800
Subject: [Numpy-discussion] manylinux upgrade for numpy wheels
In-Reply-To: 
References: 
Message-ID: 

Pretty sure the 2010 and 2014 images both have much newer compilers than that.

There are still a lot of users on CentOS 6, so I'd still stick to 2010 for now on x86_64 at least. We could potentially start adding 2014 wheels for the other platforms where we currently don't ship wheels -- gotta be better than nothing, right?

There probably still is some tail of end users whose pip is too old to know about 2010 wheels. I don't know how big that tail is. If we wanted to be really careful, we could ship both manylinux1 and manylinux2010 wheels for a bit -- pip will automatically pick the latest one it recognizes -- and see what the download numbers look like.

On Tue, Feb 4, 2020, 13:18 Charles R Harris wrote:
> Hi All,
>
> Thought now would be a good time to decide on upgrading manylinux for the
> 1.19 release so that we can make sure that everything works as expected.
> The choices are
>
> manylinux1 -- CentOS 5,
> currently used, gcc 4.2 (in practice 4.5), only supports i686, x86_64.
> manylinux2010 -- CentOS 6, > gcc 4.5, only supports i686, x86_64. > manylinux2014 -- CentOS 7, > gcc 4.8, supports many more architectures. > > The main advantage of manylinux2014 is that it supports many new > architectures, some of which we are already testing against. The main > disadvantage is that it requires pip >= 19.x, which may not be much of a > problem 4 months from now but will undoubtedly cause some installation > problems. Unfortunately, the compiler remains archaic, but folks interested > in performance should be using a performance oriented distribution or > compiling for their native architecture. > > Chuck > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Wed Feb 5 04:06:17 2020 From: matthew.brett at gmail.com (Matthew Brett) Date: Wed, 5 Feb 2020 09:06:17 +0000 Subject: [Numpy-discussion] manylinux upgrade for numpy wheels In-Reply-To: References: Message-ID: Hi, On Tue, Feb 4, 2020 at 10:38 PM Nathaniel Smith wrote: > > Pretty sure the 2010 and 2014 images both have much newer compilers than that. > > There are still a lot of users on CentOS 6, so I'd still stick to 2010 for now on x86_64 at least. We could potentially start adding 2014 wheels for the other platforms where we currently don't ship wheels ? gotta be better than nothing, right? > > There probably still is some tail of end users whose pip is too old to know about 2010 wheels. I don't know how big that tail is. If we wanted to be really careful, we could ship both manylinux1 and manylinux2010 wheels for a bit ? pip will automatically pick the latest one it recognizes ? and see what the download numbers look like. That all sounds right to me too. Cheers, Matthew > On Tue, Feb 4, 2020, 13:18 Charles R Harris wrote: >> >> Hi All, >> >> Thought now would be a good time to decide on upgrading manylinux for the 1.19 release so that we can make sure that everything works as expected. The choices are >> >> manylinux1 -- CentOS 5, currently used, gcc 4.2 (in practice 4.5), only supports i686, x86_64. >> manylinux2010 -- CentOS 6, gcc 4.5, only supports i686, x86_64. >> manylinux2014 -- CentOS 7, gcc 4.8, supports many more architectures. >> >> The main advantage of manylinux2014 is that it supports many new architectures, some of which we are already testing against. The main disadvantage is that it requires pip >= 19.x, which may not be much of a problem 4 months from now but will undoubtedly cause some installation problems. Unfortunately, the compiler remains archaic, but folks interested in performance should be using a performance oriented distribution or compiling for their native architecture. 
>>
>> Chuck
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion

From t3kcit at gmail.com Wed Feb 5 11:01:04 2020
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 5 Feb 2020 11:01:04 -0500
Subject: [Numpy-discussion] NEP 37: A dispatch protocol for NumPy-like modules
In-Reply-To: 
References: 
Message-ID: <21692339-9f4b-029c-d422-ea549acbe6c3@gmail.com>

A bit late to the NEP 37 party.

I just wanted to say that at least from my perspective it seems a great solution that will help sklearn move towards more flexible compute engines. I think one of the biggest issues is array creation (including random arrays), and that's handled quite nicely with NEP 37.

There's some discussion on the scikit-learn side here:
https://github.com/scikit-learn/scikit-learn/pull/14963
https://github.com/scikit-learn/scikit-learn/issues/11447

Two different groups of people tried to use __array_function__ to delegate to MxNet and CuPy respectively in scikit-learn, and ran into the same issues.

There are some remaining issues in sklearn that will not be handled by NEP 37, but they go beyond NumPy in some sense. Just to briefly bring them up:

- We use scipy.linalg in many places, and we would need to do a separate dispatch to check whether we can use module.linalg instead (that might be an issue for many libraries, but I'm not sure).

- Some models have several possible optimization algorithms, some of which are pure numpy and some of which are Cython. If someone provides a different array module, we might want to choose an algorithm that is actually supported by that module. While this exact issue is maybe sklearn-specific, a similar issue could appear for most downstream libs that use Cython in some places. Many Cython algorithms could be implemented in pure numpy with a potential slowdown, but once we have NEP 37 there might be a benefit to having a pure NumPy implementation as an alternative code path.

Anyway, NEP 37 seems a great step in the right direction and would enable sklearn to actually dispatch in some places. Dispatching based just on __array_function__ does not seem really feasible so far.

Best,
Andreas Mueller

On 1/6/20 11:29 PM, Stephan Hoyer wrote:
> I am pleased to present a new NumPy Enhancement Proposal for
> discussion: "NEP-37: A dispatch protocol for NumPy-like modules."
> Feedback would be very welcome!
>
> The full text follows. The rendered proposal can also be found online
> at https://numpy.org/neps/nep-0037-array-module.html
>
> Best,
> Stephan Hoyer
>
> ===================================================
> NEP 37 -- A dispatch protocol for NumPy-like modules
> ===================================================
>
> :Author: Stephan Hoyer
> :Author: Hameer Abbasi
> :Author: Sebastian Berg
> :Status: Draft
> :Type: Standards Track
> :Created: 2019-12-29
>
> Abstract
> --------
>
> NEP-18's ``__array_function__`` has been a mixed success. Some projects (e.g.,
> dask, CuPy, xarray, sparse, Pint) have enthusiastically adopted it. Others
> (e.g., PyTorch, JAX, SciPy) have been more reluctant. Here we propose a new
> protocol, ``__array_module__``, that we expect could eventually subsume most
> use-cases for ``__array_function__``.
The protocol requires explicit > adoption > by both users and library authors, which ensures backwards > compatibility, and > is also significantly simpler than ``__array_function__``, both of > which we > expect will make it easier to adopt. > > Why ``__array_function__`` hasn't been enough > --------------------------------------------- > > There are two broad ways in which NEP-18 has fallen short of its goals: > > 1. **Maintainability concerns**. `__array_function__` has significant > ? ?implications for libraries that use it: > > ? ?- Projects like `PyTorch > ? ? ?`_, `JAX > ? ? ?`_ and even `scipy.sparse > ? ? ?`_ have been > reluctant to > ? ? ?implement `__array_function__` in part because they are concerned > about > ? ? ?**breaking existing code**: users expect NumPy functions like > ? ? ?``np.concatenate`` to return NumPy arrays. This is a fundamental > ? ? ?limitation of the ``__array_function__`` design, which we chose > to allow > ? ? ?overriding the existing ``numpy`` namespace. > ? ?- ``__array_function__`` currently requires an "all or nothing" > approach to > ? ? ?implementing NumPy's API. There is no good pathway for **incremental > ? ? ?adoption**, which is particularly problematic for established > projects > ? ? ?for which adopting ``__array_function__`` would result in breaking > ? ? ?changes. > ? ?- It is no longer possible to use **aliases to NumPy functions** within > ? ? ?modules that support overrides. For example, both CuPy and JAX set > ? ? ?``result_type = np.result_type``. > ? ?- Implementing **fall-back mechanisms** for unimplemented NumPy > functions > ? ? ?by using NumPy's implementation is hard to get right (but see the > ? ? ?`version from dask `_), > because > ? ? ?``__array_function__`` does not present a consistent interface. > ? ? ?Converting all arguments of array type requires recursing into > generic > ? ? ?arguments of the form ``*args, **kwargs``. > > 2. **Limitations on what can be overridden.** ``__array_function__`` > has some > ? ?important gaps, most notably array creation and coercion functions: > > ? ?- **Array creation** routines (e.g., ``np.arange`` and those in > ? ? ?``np.random``) need some other mechanism for indicating what type of > ? ? ?arrays to create. `NEP 36 > `_ > ? ? ?proposed adding optional ``like=`` arguments to functions without > ? ? ?existing array arguments. However, we still lack any mechanism to > ? ? ?override methods on objects, such as those needed by > ? ? ?``np.random.RandomState``. > ? ?- **Array conversion** can't reuse the existing coercion functions like > ? ? ?``np.asarray``, because ``np.asarray`` sometimes means "convert to an > ? ? ?exact ``np.ndarray``" and other times means "convert to something > _like_ > ? ? ?a NumPy array." This led to the `NEP 30 > ? ? ?`_ > proposal for > ? ? ?a separate ``np.duckarray`` function, but this still does not > resolve how > ? ? ?to cast one duck array into a type matching another duck array. > > ``get_array_module`` and the ``__array_module__`` protocol > ---------------------------------------------------------- > > We propose a new user-facing mechanism for dispatching to a duck-array > implementation, ``numpy.get_array_module``. ``get_array_module`` > performs the > same type resolution as ``__array_function__`` and returns a module > with an API > promised to match the standard interface of ``numpy`` that can implement > operations on all provided array types. 
> > The protocol itself is both simpler and more powerful than > ``__array_function__``, because it doesn't need to worry about actually > implementing functions. We believe it resolves most of the > maintainability and > functionality limitations of ``__array_function__``. > > The new protocol is opt-in, explicit and with local control; see > :ref:`appendix-design-choices` for discussion on the importance of > these design > features. > > The array module contract > ========================= > > Modules returned by ``get_array_module``/``__array_module__`` should > make a > best effort to implement NumPy's core functionality on new array types(s). > Unimplemented functionality should simply be omitted (e.g., accessing an > unimplemented function should raise ``AttributeError``). In the future, we > anticipate codifying a protocol for requesting restricted subsets of > ``numpy``; > see :ref:`requesting-restricted-subsets` for more details. > > How to use ``get_array_module`` > =============================== > > Code that wants to support generic duck arrays should explicitly call > ``get_array_module`` to determine an appropriate array module from > which to > call functions, rather than using the ``numpy`` namespace directly. For > example: > > .. code:: python > > ? ? # calls the appropriate version of np.something for x and y > ? ? module = np.get_array_module(x, y) > ? ? module.something(x, y) > > Both array creation and array conversion are supported, because > dispatching is > handled by ``get_array_module`` rather than via the types of function > arguments. For example, to use random number generation functions or > methods, > we can simply pull out the appropriate submodule: > > .. code:: python > > ? ? def duckarray_add_random(array): > ? ? ? ? module = np.get_array_module(array) > ? ? ? ? noise = module.random.randn(*array.shape) > ? ? ? ? return array + noise > > We can also write the duck-array ``stack`` function from `NEP 30 > `_, without > the need > for a new ``np.duckarray`` function: > > .. code:: python > > ? ? def duckarray_stack(arrays): > ? ? ? ? module = np.get_array_module(*arrays) > ? ? ? ? arrays = [module.asarray(arr) for arr in arrays] > ? ? ? ? shapes = {arr.shape for arr in arrays} > ? ? ? ? if len(shapes) != 1: > ? ? ? ? ? ? raise ValueError('all input arrays must have the same shape') > ? ? ? ? expanded_arrays = [arr[module.newaxis, ...] for arr in arrays] > ? ? ? ? return module.concatenate(expanded_arrays, axis=0) > > By default, ``get_array_module`` will return the ``numpy`` module if no > arguments are arrays. This fall-back can be explicitly controlled by > providing > the ``module`` keyword-only argument. It is also possible to indicate > that an > exception should be raised instead of returning a default array module by > setting ``module=None``. > > How to implement ``__array_module__`` > ===================================== > > Libraries implementing a duck array type that want to support > ``get_array_module`` need to implement the corresponding protocol, > ``__array_module__``. This new protocol is based on Python's dispatch > protocol > for arithmetic, and is essentially a simpler version of > ``__array_function__``. > > Only one argument is passed into ``__array_module__``, a Python > collection of > unique array types passed into ``get_array_module``, i.e., all > arguments with > an ``__array_module__`` attribute. 
> > The special method should either return an namespace with an API matching > ``numpy``, or ``NotImplemented``, indicating that it does not know how to > handle the operation: > > .. code:: python > > ? ? class MyArray: > ? ? ? ? def __array_module__(self, types): > ? ? ? ? ? ? if not all(issubclass(t, MyArray) for t in types): > ? ? ? ? ? ? ? ? return NotImplemented > ? ? ? ? ? ? return my_array_module > > Returning custom objects from ``__array_module__`` > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > ``my_array_module`` will typically, but need not always, be a Python > module. > Returning a custom objects (e.g., with functions implemented via > ``__getattr__``) may be useful for some advanced use cases. > > For example, custom objects could allow for partial implementations of > duck > array modules that fall-back to NumPy (although this is not recommended in > general because such fall-back behavior can be error prone): > > .. code:: python > > ? ? class MyArray: > ? ? ? ? def __array_module__(self, types): > ? ? ? ? ? ? if all(issubclass(t, MyArray) for t in types): > ? ? ? ? ? ? ? ? return ArrayModule() > ? ? ? ? ? ? else: > ? ? ? ? ? ? ? ? return NotImplemented > > ? ? class ArrayModule: > ? ? ? ? def __getattr__(self, name): > ? ? ? ? ? ? import base_module > ? ? ? ? ? ? return getattr(base_module, name, getattr(numpy, name)) > > Subclassing from ``numpy.ndarray`` > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > All of the same guidance about well-defined type casting hierarchies from > NEP-18 still applies. ``numpy.ndarray`` itself contains a matching > implementation of ``__array_module__``, ?which is convenient for > subclasses: > > .. code:: python > > ? ? class ndarray: > ? ? ? ? def __array_module__(self, types): > ? ? ? ? ? ? if all(issubclass(t, ndarray) for t in types): > ? ? ? ? ? ? ? ? return numpy > ? ? ? ? ? ? else: > ? ? ? ? ? ? ? ? return NotImplemented > > NumPy's internal machinery > ========================== > > The type resolution rules of ``get_array_module`` follow the same model as > Python and NumPy's existing dispatch protocols: subclasses are called > before > super-classes, and otherwise left to right. ``__array_module__`` is > guaranteed > to be called only ?a single time on each unique type. > > The actual implementation of `get_array_module` will be in C, but > should be > equivalent to this Python code: > > .. code:: python > > ? ? def get_array_module(*arrays, default=numpy): > ? ? ? ? implementing_arrays, types = > _implementing_arrays_and_types(arrays) > ? ? ? ? if not implementing_arrays and default is not None: > ? ? ? ? ? ? return default > ? ? ? ? for array in implementing_arrays: > ? ? ? ? ? ? module = array.__array_module__(types) > ? ? ? ? ? ? if module is not NotImplemented: > ? ? ? ? ? ? ? ? return module > ? ? ? ? raise TypeError("no common array module found") > > ? ? def _implementing_arrays_and_types(relevant_arrays): > ? ? ? ? types = [] > ? ? ? ? implementing_arrays = [] > ? ? ? ? for array in relevant_arrays: > ? ? ? ? ? ? t = type(array) > ? ? ? ? ? ? if t not in types and hasattr(t, '__array_module__'): > ? ? ? ? ? ? ? ? types.append(t) > ? ? ? ? ? ? ? ? # Subclasses before superclasses, otherwise left to right > ? ? ? ? ? ? ? ? index = len(implementing_arrays) > ? ? ? ? ? ? ? ? for i, old_array in enumerate(implementing_arrays): > ? ? ? ? ? ? ? ? ? ? if issubclass(t, type(old_array)): > ? ? ? ? ? ? ? ? ? ? ? ? index = i > ? ? ? ? ? ? ? ? ? ? ? ? break > ? ? ? ? ? ? ? ? implementing_arrays.insert(index, array) > ? ? ? ? 
return implementing_arrays, types > > Relationship with ``__array_ufunc__`` and ``__array_function__`` > ---------------------------------------------------------------- > > These older protocols have distinct use-cases and should remain > =============================================================== > > ``__array_module__`` is intended to resolve limitations of > ``__array_function__``, so it is natural to consider whether it could > entirely > replace ``__array_function__``. This would offer dual benefits: (1) > simplifying > the user-story about how to override NumPy and (2) removing the slowdown > associated with checking for dispatch when calling every NumPy function. > > However, ``__array_module__`` and ``__array_function__`` are pretty > different > from a user perspective: it requires explicit calls to > ``get_array_function``, > rather than simply reusing original ``numpy`` functions. This is > probably fine > for *libraries* that rely on duck-arrays, but may be frustratingly > verbose for > interactive use. > > Some of the dispatching use-cases for ``__array_ufunc__`` are also > solved by > ``__array_module__``, but not all of them. For example, it is still > useful to > be able to define non-NumPy ufuncs (e.g., from Numba or SciPy) in a > generic way > on non-NumPy arrays (e.g., with dask.array). > > Given their existing adoption and distinct use cases, we don't think > it makes > sense to remove or deprecate ``__array_function__`` and > ``__array_ufunc__`` at > this time. > > Mixin classes to implement ``__array_function__`` and ``__array_ufunc__`` > ========================================================================= > > Despite the user-facing differences, ``__array_module__`` and a module > implementing NumPy's API still contain sufficient functionality needed to > implement dispatching with the existing duck array protocols. > > For example, the following mixin classes would provide sensible > defaults for > these special methods in terms of ``get_array_module`` and > ``__array_module__``: > > .. code:: python > > ? ? class ArrayUfuncFromModuleMixin: > > ? ? ? ? def __array_ufunc__(self, ufunc, method, *inputs, **kwargs): > ? ? ? ? ? ? arrays = inputs + kwargs.get('out', ()) > ? ? ? ? ? ? try: > ? ? ? ? ? ? ? ? array_module = np.get_array_module(*arrays) > ? ? ? ? ? ? except TypeError: > ? ? ? ? ? ? ? ? return NotImplemented > > ? ? ? ? ? ? try: > ? ? ? ? ? ? ? ? # Note this may have false positive matches, if > ufunc.__name__ > ? ? ? ? ? ? ? ? # matches the name of a ufunc defined by NumPy. > Unfortunately > ? ? ? ? ? ? ? ? # there is no way to determine in which module a ufunc was > ? ? ? ? ? ? ? ? # defined. > ? ? ? ? ? ? ? ? new_ufunc = getattr(array_module, ufunc.__name__) > ? ? ? ? ? ? except AttributeError: > ? ? ? ? ? ? ? ? return NotImplemented > > ? ? ? ? ? ? try: > ? ? ? ? ? ? ? ? callable = getattr(new_ufunc, method) > ? ? ? ? ? ? except AttributeError: > ? ? ? ? ? ? ? ? return NotImplemented > > ? ? ? ? ? ? return callable(*inputs, **kwargs) > > ? ? class ArrayFunctionFromModuleMixin: > > ? ? ? ? def __array_function__(self, func, types, args, kwargs): > ? ? ? ? ? ? array_module = self.__array_module__(types) > ? ? ? ? ? ? if array_module is NotImplemented: > ? ? ? ? ? ? ? ? return NotImplemented > > ? ? ? ? ? ? # Traverse submodules to find the appropriate function > ? ? ? ? ? ? modules = func.__module__.split('.') > ? ? ? ? ? ? assert modules[0] == 'numpy' > ? ? ? ? ? ? for submodule in modules[1:]: > ? ? ? ? ? ? ? ? 
module = getattr(module, submodule, None) > ? ? ? ? ? ? new_func = getattr(module, func.__name__, None) > ? ? ? ? ? ? if new_func is None: > ? ? ? ? ? ? ? ? return NotImplemented > > ? ? ? ? ? ? return new_func(*args, **kwargs) > > To make it easier to write duck arrays, we could also add these mixin > classes > into ``numpy.lib.mixins`` (but the examples above may suffice). > > Alternatives considered > ----------------------- > > Naming > ====== > > We like the name ``__array_module__`` because it mirrors the existing > ``__array_function__`` and ``__array_ufunc__`` protocols. Another > reasonable > choice could be ``__array_namespace__``. > > It is less clear what the NumPy function that calls this protocol > should be > called (``get_array_module`` in this proposal). Some possible > alternatives: > ``array_module``, ``common_array_module``, ``resolve_array_module``, > ``get_namespace``, ``get_numpy``, ``get_numpylike_module``, > ``get_duck_array_module``. > > .. _requesting-restricted-subsets: > > Requesting restricted subsets of NumPy's API > ============================================ > > Over time, NumPy has accumulated a very large API surface, with over 600 > attributes in the top level ``numpy`` module alone. It is unlikely > that any > duck array library could or would want to implement all of these > functions and > classes, because the frequently used subset of NumPy is much smaller. > > We think it would be useful exercise to define "minimal" subset(s) of > NumPy's > API, omitting rarely used or non-recommended functionality. For example, > minimal NumPy might include ``stack``, but not the other stacking > functions > ``column_stack``, ``dstack``, ``hstack`` and ``vstack``. This could > clearly > indicate to duck array authors and users want functionality is core > and what > functionality they can skip. > > Support for requesting a restricted subset of NumPy's API would be a > natural > feature to include in ?``get_array_function`` and > ``__array_module__``, e.g., > > .. code:: python > > ? ? # array_module is only guaranteed to contain "minimal" NumPy > ? ? array_module = np.get_array_module(*arrays, request='minimal') > > To facilitate testing with NumPy and use with any valid duck array > library, > NumPy itself would return restricted versions of the ``numpy`` module when > ``get_array_module`` is called only on NumPy arrays. Omitted functions > would > simply not exist. > > Unfortunately, we have not yet figured out what these restricted > subsets should > be, so it doesn't make sense to do this yet. When/if we do, we could > either add > new keyword arguments to ``get_array_module`` or add new top level > functions, > e.g., ``get_minimal_array_module``. We would also need to add either a new > protocol patterned off of ``__array_module__`` (e.g., > ``__array_module_minimal__``), or could add an optional second argument to > ``__array_module__`` (catching errors with ``try``/``except``). > > A new namespace for implicit dispatch > ===================================== > > Instead of supporting overrides in the main `numpy` namespace with > ``__array_function__``, we could create a new opt-in namespace, e.g., > ``numpy.api``, with versions of NumPy functions that support > dispatching. These > overrides would need new opt-in protocols, e.g., > ``__array_function_api__`` > patterned off of ``__array_function__``. 
> > This would resolve the biggest limitations of ``__array_function__`` > by being > opt-in and would also allow for unambiguously overriding functions like > ``asarray``, because ``np.api.asarray`` would always mean "convert an > array-like object." ?But it wouldn't solve all the dispatching needs > met by > ``__array_module__``, and would leave us with supporting a > considerably more > complex protocol both for array users and implementors. > > We could potentially implement such a new namespace *via* the > ``__array_module__`` protocol. Certainly some users would find this > convenient, > because it is slightly less boilerplate. But this would leave users with a > confusing choice: when should they use `get_array_module` vs. > `np.api.something`. Also, we would have to add and maintain a whole > new module, > which is considerably more expensive than merely adding a function. > > Dispatching on both types and arrays instead of only types > ========================================================== > > Instead of supporting dispatch only via unique array types, we could also > support dispatch via array objects, e.g., by passing an ``arrays`` > argument as > part of the ``__array_module__`` protocol. This could potentially be > useful for > dispatch for arrays with metadata, such provided by Dask and Pint, but > would > impose costs in terms of type safety and complexity. > > For example, a library that supports arrays on both CPUs and GPUs > might decide > on which device to create a new arrays from functions like ``ones`` > based on > input arguments: > > .. code:: python > > ? ? class Array: > ? ? ? ? def __array_module__(self, types, arrays): > ? ? ? ? ? ? useful_arrays = tuple(a in arrays if isinstance(a, Array)) > ? ? ? ? ? ? if not useful_arrays: > ? ? ? ? ? ? ? ? return NotImplemented > ? ? ? ? ? ? prefer_gpu = any(a.prefer_gpu for a in useful_arrays) > ? ? ? ? ? ? return ArrayModule(prefer_gpu) > > ? ? class ArrayModule: > ? ? ? ? def __init__(self, prefer_gpu): > ? ? ? ? ? ? self.prefer_gpu = prefer_gpu > > ? ? ? ? def __getattr__(self, name): > ? ? ? ? ? ? import base_module > ? ? ? ? ? ? base_func = getattr(base_module, name) > ? ? ? ? ? ? return functools.partial(base_func, > prefer_gpu=self.prefer_gpu) > > This might be useful, but it's not clear if we really need it. Pint > seems to > get along OK without any explicit array creation routines (favoring > multiplication by units, e.g., ``np.ones(5) * ureg.m``), and for the > most part > Dask is also OK with existing ``__array_function__`` style overides (e.g., > favoring ``np.ones_like`` over ``np.ones``). Choosing whether to place > an array > on the CPU or GPU could be solved by `making array creation lazy > `_. > > .. _appendix-design-choices: > > Appendix: design choices for API overrides > ------------------------------------------ > > There is a large range of possible design choices for overriding > NumPy's API. > Here we discuss three major axes of the design decision that guided > our design > for ``__array_module__``. > > Opt-in vs. opt-out for users > ============================ > > The ``__array_ufunc__`` and ``__array_function__`` protocols provide a > mechanism for overriding NumPy functions *within NumPy's existing > namespace*. > This means that users need to explicitly opt-out if they do not want any > overridden behavior, e.g., by casting arrays with ``np.asarray()``. 
> > In theory, this approach lowers the barrier for adopting these > protocols in > user code and libraries, because code that uses the standard NumPy > namespace is > automatically compatible. But in practice, this hasn't worked out. For > example, > most well-maintained libraries that use NumPy follow the best practice of > casting all inputs with ``np.asarray()``, which they would have to > explicitly > relax to use ``__array_function__``. Our experience has been that making a > library compatible with a new duck array type typically requires at > least a > small amount of work to accommodate differences in the data model and > operations > that can be implemented efficiently. > > These opt-out approaches also considerably complicate backwards > compatibility > for libraries that adopt these protocols, because by opting in as a > library > they also opt-in their users, whether they expect it or not. For > winning over > libraries that have been unable to adopt ``__array_function__``, an opt-in > approach seems like a must. > > Explicit vs. implicit choice of implementation > ============================================== > > Both ``__array_ufunc__`` and ``__array_function__`` have implicit > control over > dispatching: the dispatched functions are determined via the appropriate > protocols in every function call. This generalizes well to handling many > different types of objects, as evidenced by its use for implementing > arithmetic > operators in Python, but it has two downsides: > > 1. *Speed*: it imposes additional overhead in every function call, > because each > ? ?function call needs to inspect each of its arguments for overrides. > This is > ? ?why arithmetic on builtin Python numbers is slow. > 2. *Readability*: it is not longer immediately evident to readers of > code what > ? ?happens when a function is called, because the function's > implementation > ? ?could be overridden by any of its arguments. > > In contrast, importing a new library (e.g., ``import ?dask.array as > da``) with > an API matching NumPy is entirely explicit. There is no overhead from > dispatch > or ambiguity about which implementation is being used. > > Explicit and implicit choice of implementations are not mutually exclusive > options. Indeed, most implementations of NumPy API overrides via > ``__array_function__`` that we are familiar with (namely, dask, CuPy and > sparse, but not Pint) also include an explicit way to use their version of > NumPy's API by importing a module directly (``dask.array``, ``cupy`` or > ``sparse``, respectively). > > Local vs. non-local vs. global control > ====================================== > > The final design axis is how users control the choice of API: > > - **Local control**, as exemplified by multiple dispatch and Python > protocols for > ? arithmetic, determines which implementation to use either by > checking types > ? or calling methods on the direct arguments of a function. > - **Non-local control** such as `np.errstate > ? > `_ > ? overrides behavior with global-state via function decorators or > ? context-managers. Control is determined hierarchically, via the > inner-most > ? context. > - **Global control** provides a mechanism for users to set default > behavior, > ? either via function calls or configuration files. For example, > matplotlib > ? allows setting a global choice of plotting backend. > > Local control is generally considered a best practice for API design, > because > control flow is entirely explicit, which makes it the easiest to > understand. 
> Non-local and global control are occasionally used, but generally > either due to > ignorance or a lack of better alternatives. > > In the case of duck typing for NumPy's public API, we think non-local > or global > control would be mistakes, mostly because they **don't compose well**. > If one > library sets/needs one set of overrides and then internally calls a > routine > that expects another set of overrides, the resulting behavior may be very > surprising. Higher order functions are especially problematic, because the > context in which functions are evaluated may not be the context in > which they > are defined. > > One class of override use cases where we think non-local and global > control are > appropriate is for choosing a backend system that is guaranteed to have an > entirely consistent interface, such as a faster alternative > implementation of > ``numpy.fft`` on NumPy arrays. However, these are out of scope for the > current > proposal, which is focused on duck arrays. > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at gmail.com Wed Feb 5 12:37:53 2020 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Wed, 5 Feb 2020 11:37:53 -0600 Subject: [Numpy-discussion] NEP 37: A dispatch protocol for NumPy-like modules In-Reply-To: <21692339-9f4b-029c-d422-ea549acbe6c3@gmail.com> References: <21692339-9f4b-029c-d422-ea549acbe6c3@gmail.com> Message-ID: On Wed, Feb 5, 2020 at 10:01 AM Andreas Mueller wrote: > A bit late to the NEP 37 party. > I just wanted to say that at least from my perspective it seems a great > solution that will help sklearn move towards more flexible compute engines. > I think one of the biggest issues is array creation (including random > arrays), and that's handled quite nicely with NEP 37. > > There's some discussion on the scikit-learn side here: > https://github.com/scikit-learn/scikit-learn/pull/14963 > https://github.com/scikit-learn/scikit-learn/issues/11447 > > Two different groups of people tried to use __array_function__ to delegate > to MxNet and CuPy respectively in scikit-learn, and ran into the same > issues. > > There's some remaining issues in sklearn that will not be handled by NEP > 37 but they go beyond NumPy in some sense. > Just to briefly bring them up: > > - We use scipy.linalg in many places, and we would need to do a separate > dispatching to check whether we can use module.linalg instead > (that might be an issue for many libraries but I'm not sure). > That is an issue, and goes in the opposite direction we need - scipy.linalg is a superset of numpy.linalg, so we'd like to encourage using scipy. This is something we may want to consider fixing by making the dispatch decorator public in numpy and adopting in scipy. Cheers, Ralf > > - Some models have several possible optimization algorithms, some of which > are pure numpy and some which are Cython. If someone provides a different > array module, > we might want to choose an algorithm that is actually supported by that > module. While this exact issue is maybe sklearn specific, a similar issue > could appear for most downstream libs that use Cython in some places. 
> Many Cython algorithms could be implemented in pure numpy with a > potential slowdown, but once we have NEP 37 there might be a benefit to > having a pure NumPy implementation as an alternative code path. > > > Anyway, NEP 37 seems a great step in the right direction and would enable > sklearn to actually dispatch in some places. Dispatching just based on > __array_function__ seems not really feasible so far. > > Best, > Andreas Mueller > > > On 1/6/20 11:29 PM, Stephan Hoyer wrote: > > I am pleased to present a new NumPy Enhancement Proposal for discussion: > "NEP-37: A dispatch protocol for NumPy-like modules." Feedback would be > very welcome! > > The full text follows. The rendered proposal can also be found online at > https://numpy.org/neps/nep-0037-array-module.html > > Best, > Stephan Hoyer > > =================================================== > NEP 37 ? A dispatch protocol for NumPy-like modules > =================================================== > > :Author: Stephan Hoyer > :Author: Hameer Abbasi > :Author: Sebastian Berg > :Status: Draft > :Type: Standards Track > :Created: 2019-12-29 > > Abstract > -------- > > NEP-18's ``__array_function__`` has been a mixed success. Some projects > (e.g., > dask, CuPy, xarray, sparse, Pint) have enthusiastically adopted it. Others > (e.g., PyTorch, JAX, SciPy) have been more reluctant. Here we propose a new > protocol, ``__array_module__``, that we expect could eventually subsume > most > use-cases for ``__array_function__``. The protocol requires explicit > adoption > by both users and library authors, which ensures backwards compatibility, > and > is also significantly simpler than ``__array_function__``, both of which we > expect will make it easier to adopt. > > Why ``__array_function__`` hasn't been enough > --------------------------------------------- > > There are two broad ways in which NEP-18 has fallen short of its goals: > > 1. **Maintainability concerns**. `__array_function__` has significant > implications for libraries that use it: > > - Projects like `PyTorch > `_, `JAX > `_ and even `scipy.sparse > `_ have been reluctant > to > implement `__array_function__` in part because they are concerned > about > **breaking existing code**: users expect NumPy functions like > ``np.concatenate`` to return NumPy arrays. This is a fundamental > limitation of the ``__array_function__`` design, which we chose to > allow > overriding the existing ``numpy`` namespace. > - ``__array_function__`` currently requires an "all or nothing" > approach to > implementing NumPy's API. There is no good pathway for **incremental > adoption**, which is particularly problematic for established projects > for which adopting ``__array_function__`` would result in breaking > changes. > - It is no longer possible to use **aliases to NumPy functions** within > modules that support overrides. For example, both CuPy and JAX set > ``result_type = np.result_type``. > - Implementing **fall-back mechanisms** for unimplemented NumPy > functions > by using NumPy's implementation is hard to get right (but see the > `version from dask `_), > because > ``__array_function__`` does not present a consistent interface. > Converting all arguments of array type requires recursing into generic > arguments of the form ``*args, **kwargs``. > > 2. 
**Limitations on what can be overridden.** ``__array_function__`` has > some > important gaps, most notably array creation and coercion functions: > > - **Array creation** routines (e.g., ``np.arange`` and those in > ``np.random``) need some other mechanism for indicating what type of > arrays to create. `NEP 36 >`_ > proposed adding optional ``like=`` arguments to functions without > existing array arguments. However, we still lack any mechanism to > override methods on objects, such as those needed by > ``np.random.RandomState``. > - **Array conversion** can't reuse the existing coercion functions like > ``np.asarray``, because ``np.asarray`` sometimes means "convert to an > exact ``np.ndarray``" and other times means "convert to something > _like_ > a NumPy array." This led to the `NEP 30 > `_ > proposal for > a separate ``np.duckarray`` function, but this still does not resolve > how > to cast one duck array into a type matching another duck array. > > ``get_array_module`` and the ``__array_module__`` protocol > ---------------------------------------------------------- > > We propose a new user-facing mechanism for dispatching to a duck-array > implementation, ``numpy.get_array_module``. ``get_array_module`` performs > the > same type resolution as ``__array_function__`` and returns a module with > an API > promised to match the standard interface of ``numpy`` that can implement > operations on all provided array types. > > The protocol itself is both simpler and more powerful than > ``__array_function__``, because it doesn't need to worry about actually > implementing functions. We believe it resolves most of the maintainability > and > functionality limitations of ``__array_function__``. > > The new protocol is opt-in, explicit and with local control; see > :ref:`appendix-design-choices` for discussion on the importance of these > design > features. > > The array module contract > ========================= > > Modules returned by ``get_array_module``/``__array_module__`` should make a > best effort to implement NumPy's core functionality on new array types(s). > Unimplemented functionality should simply be omitted (e.g., accessing an > unimplemented function should raise ``AttributeError``). In the future, we > anticipate codifying a protocol for requesting restricted subsets of > ``numpy``; > see :ref:`requesting-restricted-subsets` for more details. > > How to use ``get_array_module`` > =============================== > > Code that wants to support generic duck arrays should explicitly call > ``get_array_module`` to determine an appropriate array module from which to > call functions, rather than using the ``numpy`` namespace directly. For > example: > > .. code:: python > > # calls the appropriate version of np.something for x and y > module = np.get_array_module(x, y) > module.something(x, y) > > Both array creation and array conversion are supported, because > dispatching is > handled by ``get_array_module`` rather than via the types of function > arguments. For example, to use random number generation functions or > methods, > we can simply pull out the appropriate submodule: > > .. code:: python > > def duckarray_add_random(array): > module = np.get_array_module(array) > noise = module.random.randn(*array.shape) > return array + noise > > We can also write the duck-array ``stack`` function from `NEP 30 > `_, without the > need > for a new ``np.duckarray`` function: > > .. 
code:: python > > def duckarray_stack(arrays): > module = np.get_array_module(*arrays) > arrays = [module.asarray(arr) for arr in arrays] > shapes = {arr.shape for arr in arrays} > if len(shapes) != 1: > raise ValueError('all input arrays must have the same shape') > expanded_arrays = [arr[module.newaxis, ...] for arr in arrays] > return module.concatenate(expanded_arrays, axis=0) > > By default, ``get_array_module`` will return the ``numpy`` module if no > arguments are arrays. This fall-back can be explicitly controlled by > providing > the ``module`` keyword-only argument. It is also possible to indicate that > an > exception should be raised instead of returning a default array module by > setting ``module=None``. > > How to implement ``__array_module__`` > ===================================== > > Libraries implementing a duck array type that want to support > ``get_array_module`` need to implement the corresponding protocol, > ``__array_module__``. This new protocol is based on Python's dispatch > protocol > for arithmetic, and is essentially a simpler version of > ``__array_function__``. > > Only one argument is passed into ``__array_module__``, a Python collection > of > unique array types passed into ``get_array_module``, i.e., all arguments > with > an ``__array_module__`` attribute. > > The special method should either return a namespace with an API matching > ``numpy``, or ``NotImplemented``, indicating that it does not know how to > handle the operation: > > .. code:: python > > class MyArray: > def __array_module__(self, types): > if not all(issubclass(t, MyArray) for t in types): > return NotImplemented > return my_array_module > > Returning custom objects from ``__array_module__`` > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > ``my_array_module`` will typically, but need not always, be a Python > module. > Returning custom objects (e.g., with functions implemented via > ``__getattr__``) may be useful for some advanced use cases. > > For example, custom objects could allow for partial implementations of duck > array modules that fall-back to NumPy (although this is not recommended in > general because such fall-back behavior can be error prone): > > .. code:: python > > class MyArray: > def __array_module__(self, types): > if all(issubclass(t, MyArray) for t in types): > return ArrayModule() > else: > return NotImplemented > > class ArrayModule: > def __getattr__(self, name): > import base_module > return getattr(base_module, name, getattr(numpy, name)) > > Subclassing from ``numpy.ndarray`` > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > All of the same guidance about well-defined type casting hierarchies from > NEP-18 still applies. ``numpy.ndarray`` itself contains a matching > implementation of ``__array_module__``, which is convenient for > subclasses: > > .. code:: python > > class ndarray: > def __array_module__(self, types): > if all(issubclass(t, ndarray) for t in types): > return numpy > else: > return NotImplemented > > NumPy's internal machinery > ========================== > > The type resolution rules of ``get_array_module`` follow the same model as > Python and NumPy's existing dispatch protocols: subclasses are called > before > super-classes, and otherwise left to right. ``__array_module__`` is > guaranteed > to be called only a single time on each unique type. > > The actual implementation of `get_array_module` will be in C, but should be > equivalent to this Python code: > > ..
code:: python > > def get_array_module(*arrays, default=numpy): > implementing_arrays, types = _implementing_arrays_and_types(arrays) > if not implementing_arrays and default is not None: > return default > for array in implementing_arrays: > module = array.__array_module__(types) > if module is not NotImplemented: > return module > raise TypeError("no common array module found") > > def _implementing_arrays_and_types(relevant_arrays): > types = [] > implementing_arrays = [] > for array in relevant_arrays: > t = type(array) > if t not in types and hasattr(t, '__array_module__'): > types.append(t) > # Subclasses before superclasses, otherwise left to right > index = len(implementing_arrays) > for i, old_array in enumerate(implementing_arrays): > if issubclass(t, type(old_array)): > index = i > break > implementing_arrays.insert(index, array) > return implementing_arrays, types > > Relationship with ``__array_ufunc__`` and ``__array_function__`` > ---------------------------------------------------------------- > > These older protocols have distinct use-cases and should remain > =============================================================== > > ``__array_module__`` is intended to resolve limitations of > ``__array_function__``, so it is natural to consider whether it could > entirely > replace ``__array_function__``. This would offer dual benefits: (1) > simplifying > the user-story about how to override NumPy and (2) removing the slowdown > associated with checking for dispatch when calling every NumPy function. > > However, ``__array_module__`` and ``__array_function__`` are pretty > different > from a user perspective: it requires explicit calls to > ``get_array_module``, > rather than simply reusing original ``numpy`` functions. This is probably > fine > for *libraries* that rely on duck-arrays, but may be frustratingly verbose > for > interactive use. > > Some of the dispatching use-cases for ``__array_ufunc__`` are also solved > by > ``__array_module__``, but not all of them. For example, it is still useful > to > be able to define non-NumPy ufuncs (e.g., from Numba or SciPy) in a > generic way > on non-NumPy arrays (e.g., with dask.array). > > Given their existing adoption and distinct use cases, we don't think it > makes > sense to remove or deprecate ``__array_function__`` and > ``__array_ufunc__`` at > this time. > > Mixin classes to implement ``__array_function__`` and ``__array_ufunc__`` > ========================================================================== > > Despite the user-facing differences, ``__array_module__`` and a module > implementing NumPy's API still contain sufficient functionality needed to > implement dispatching with the existing duck array protocols. > > For example, the following mixin classes would provide sensible defaults > for > these special methods in terms of ``get_array_module`` and > ``__array_module__``: > > .. code:: python > > class ArrayUfuncFromModuleMixin: > > def __array_ufunc__(self, ufunc, method, *inputs, **kwargs): > arrays = inputs + kwargs.get('out', ()) > try: > array_module = np.get_array_module(*arrays) > except TypeError: > return NotImplemented > > try: > # Note this may have false positive matches, if > ufunc.__name__ > # matches the name of a ufunc defined by NumPy. > Unfortunately > # there is no way to determine in which module a ufunc was > # defined.
> new_ufunc = getattr(array_module, ufunc.__name__) > except AttributeError: > return NotImplemented > > try: > callable = getattr(new_ufunc, method) > except AttributeError: > return NotImplemented > > return callable(*inputs, **kwargs) > > class ArrayFunctionFromModuleMixin: > > def __array_function__(self, func, types, args, kwargs): > array_module = self.__array_module__(types) > if array_module is NotImplemented: > return NotImplemented > > # Traverse submodules to find the appropriate function > modules = func.__module__.split('.') > assert modules[0] == 'numpy' > module = array_module > for submodule in modules[1:]: > module = getattr(module, submodule, None) > new_func = getattr(module, func.__name__, None) > if new_func is None: > return NotImplemented > > return new_func(*args, **kwargs) > > To make it easier to write duck arrays, we could also add these mixin > classes > into ``numpy.lib.mixins`` (but the examples above may suffice). > > Alternatives considered > ----------------------- > > Naming > ====== > > We like the name ``__array_module__`` because it mirrors the existing > ``__array_function__`` and ``__array_ufunc__`` protocols. Another > reasonable > choice could be ``__array_namespace__``. > > It is less clear what the NumPy function that calls this protocol should be > called (``get_array_module`` in this proposal). Some possible alternatives: > ``array_module``, ``common_array_module``, ``resolve_array_module``, > ``get_namespace``, ``get_numpy``, ``get_numpylike_module``, > ``get_duck_array_module``. > > .. _requesting-restricted-subsets: > > Requesting restricted subsets of NumPy's API > ============================================ > > Over time, NumPy has accumulated a very large API surface, with over 600 > attributes in the top level ``numpy`` module alone. It is unlikely that any > duck array library could or would want to implement all of these functions > and > classes, because the frequently used subset of NumPy is much smaller. > > We think it would be a useful exercise to define "minimal" subset(s) of > NumPy's > API, omitting rarely used or non-recommended functionality. For example, > minimal NumPy might include ``stack``, but not the other stacking functions > ``column_stack``, ``dstack``, ``hstack`` and ``vstack``. This could clearly > indicate to duck array authors and users what functionality is core and > what > functionality they can skip. > > Support for requesting a restricted subset of NumPy's API would be a > natural > feature to include in ``get_array_module`` and ``__array_module__``, > e.g., > > .. code:: python > > # array_module is only guaranteed to contain "minimal" NumPy > array_module = np.get_array_module(*arrays, request='minimal') > > To facilitate testing with NumPy and use with any valid duck array library, > NumPy itself would return restricted versions of the ``numpy`` module when > ``get_array_module`` is called only on NumPy arrays. Omitted functions > would > simply not exist. > > Unfortunately, we have not yet figured out what these restricted subsets > should > be, so it doesn't make sense to do this yet. When/if we do, we could > either add > new keyword arguments to ``get_array_module`` or add new top level > functions, > e.g., ``get_minimal_array_module``. We would also need to add either a new > protocol patterned off of ``__array_module__`` (e.g., > ``__array_module_minimal__``), or could add an optional second argument to > ``__array_module__`` (catching errors with ``try``/``except``).
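As a rough sketch of that last option, the ``try``/``except`` probing could look like the helper below; the ``request`` keyword and the helper itself are hypothetical, not part of the proposal:

.. code:: python

    def _call_array_module(array, types, request=None):
        # Hypothetical: ask for a restricted subset when the implementation
        # accepts a second argument, otherwise fall back to the one-argument
        # form of __array_module__.
        try:
            return array.__array_module__(types, request=request)
        except TypeError:
            return array.__array_module__(types)

One drawback of this pattern is that it cannot distinguish "does not accept a second argument" from a genuine ``TypeError`` raised inside the implementation, which would be an argument for a separate protocol name instead.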
> > A new namespace for implicit dispatch > ===================================== > > Instead of supporting overrides in the main `numpy` namespace with > ``__array_function__``, we could create a new opt-in namespace, e.g., > ``numpy.api``, with versions of NumPy functions that support dispatching. > These > overrides would need new opt-in protocols, e.g., ``__array_function_api__`` > patterned off of ``__array_function__``. > > This would resolve the biggest limitations of ``__array_function__`` by > being > opt-in and would also allow for unambiguously overriding functions like > ``asarray``, because ``np.api.asarray`` would always mean "convert an > array-like object." But it wouldn't solve all the dispatching needs met by > ``__array_module__``, and would leave us with supporting a considerably > more > complex protocol both for array users and implementors. > > We could potentially implement such a new namespace *via* the > ``__array_module__`` protocol. Certainly some users would find this > convenient, > because it is slightly less boilerplate. But this would leave users with a > confusing choice: when should they use `get_array_module` vs. > `np.api.something`. Also, we would have to add and maintain a whole new > module, > which is considerably more expensive than merely adding a function. > > Dispatching on both types and arrays instead of only types > =========================================================== > > Instead of supporting dispatch only via unique array types, we could also > support dispatch via array objects, e.g., by passing an ``arrays`` > argument as > part of the ``__array_module__`` protocol. This could potentially be > useful for > dispatch for arrays with metadata, such as those provided by Dask and Pint, > but would > impose costs in terms of type safety and complexity. > > For example, a library that supports arrays on both CPUs and GPUs might > decide > on which device to create new arrays from functions like ``ones`` based > on > input arguments: > > .. code:: python > > class Array: > def __array_module__(self, types, arrays): > useful_arrays = tuple(a for a in arrays if isinstance(a, Array)) > if not useful_arrays: > return NotImplemented > prefer_gpu = any(a.prefer_gpu for a in useful_arrays) > return ArrayModule(prefer_gpu) > > class ArrayModule: > def __init__(self, prefer_gpu): > self.prefer_gpu = prefer_gpu > > def __getattr__(self, name): > import base_module > base_func = getattr(base_module, name) > return functools.partial(base_func, prefer_gpu=self.prefer_gpu) > > This might be useful, but it's not clear if we really need it. Pint seems > to > get along OK without any explicit array creation routines (favoring > multiplication by units, e.g., ``np.ones(5) * ureg.m``), and for the most > part > Dask is also OK with existing ``__array_function__`` style overrides (e.g., > favoring ``np.ones_like`` over ``np.ones``). Choosing whether to place an > array > on the CPU or GPU could be solved by `making array creation lazy > `_. > > .. _appendix-design-choices: > > Appendix: design choices for API overrides > ------------------------------------------ > > There is a large range of possible design choices for overriding NumPy's > API. > Here we discuss three major axes of the design decision that guided our > design > for ``__array_module__``. > > Opt-in vs. opt-out for users > ============================ > > The ``__array_ufunc__`` and ``__array_function__`` protocols provide a > mechanism for overriding NumPy functions *within NumPy's existing > namespace*.
> This means that users need to explicitly opt-out if they do not want any > overridden behavior, e.g., by casting arrays with ``np.asarray()``. > > In theory, this approach lowers the barrier for adopting these protocols in > user code and libraries, because code that uses the standard NumPy > namespace is > automatically compatible. But in practice, this hasn't worked out. For > example, > most well-maintained libraries that use NumPy follow the best practice of > casting all inputs with ``np.asarray()``, which they would have to > explicitly > relax to use ``__array_function__``. Our experience has been that making a > library compatible with a new duck array type typically requires at least a > small amount of work to accommodate differences in the data model and > operations > that can be implemented efficiently. > > These opt-out approaches also considerably complicate backwards > compatibility > for libraries that adopt these protocols, because by opting in as a library > they also opt in their users, whether they expect it or not. For winning > over > libraries that have been unable to adopt ``__array_function__``, an opt-in > approach seems like a must. > > Explicit vs. implicit choice of implementation > ============================================== > > Both ``__array_ufunc__`` and ``__array_function__`` have implicit control > over > dispatching: the dispatched functions are determined via the appropriate > protocols in every function call. This generalizes well to handling many > different types of objects, as evidenced by its use for implementing > arithmetic > operators in Python, but it has two downsides: > > 1. *Speed*: it imposes additional overhead in every function call, because > each > function call needs to inspect each of its arguments for overrides. > This is > why arithmetic on builtin Python numbers is slow. > 2. *Readability*: it is no longer immediately evident to readers of code > what > happens when a function is called, because the function's implementation > could be overridden by any of its arguments. > > In contrast, importing a new library (e.g., ``import dask.array as da``) > with > an API matching NumPy is entirely explicit. There is no overhead from > dispatch > or ambiguity about which implementation is being used. > > Explicit and implicit choice of implementations are not mutually exclusive > options. Indeed, most implementations of NumPy API overrides via > ``__array_function__`` that we are familiar with (namely, dask, CuPy and > sparse, but not Pint) also include an explicit way to use their version of > NumPy's API by importing a module directly (``dask.array``, ``cupy`` or > ``sparse``, respectively). > > Local vs. non-local vs. global control > ====================================== > > The final design axis is how users control the choice of API: > > - **Local control**, as exemplified by multiple dispatch and Python > protocols for > arithmetic, determines which implementation to use either by checking > types > or calling methods on the direct arguments of a function. > - **Non-local control** such as `np.errstate > <https://docs.scipy.org/doc/numpy/reference/generated/numpy.errstate.html>`_ > overrides behavior with global state via function decorators or > context-managers. Control is determined hierarchically, via the > inner-most > context. > - **Global control** provides a mechanism for users to set default > behavior, > either via function calls or configuration files.
For example, matplotlib > allows setting a global choice of plotting backend. > > Local control is generally considered a best practice for API design, > because > control flow is entirely explicit, which makes it the easiest to > understand. > Non-local and global control are occasionally used, but generally either > due to > ignorance or a lack of better alternatives. > > In the case of duck typing for NumPy's public API, we think non-local or > global > control would be mistakes, mostly because they **don't compose well**. If > one > library sets/needs one set of overrides and then internally calls a routine > that expects another set of overrides, the resulting behavior may be very > surprising. Higher order functions are especially problematic, because the > context in which functions are evaluated may not be the context in which > they > are defined. > > One class of override use cases where we think non-local and global > control are > appropriate is for choosing a backend system that is guaranteed to have an > entirely consistent interface, such as a faster alternative implementation > of > ``numpy.fft`` on NumPy arrays. However, these are out of scope for the > current > proposal, which is focused on duck arrays.

From wieser.eric+numpy at gmail.com Wed Feb 5 13:13:52 2020 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Wed, 5 Feb 2020 18:13:52 +0000 Subject: [Numpy-discussion] NEP 37: A dispatch protocol for NumPy-like modules In-Reply-To: References: <21692339-9f4b-029c-d422-ea549acbe6c3@gmail.com> Message-ID: > scipy.linalg is a superset of numpy.linalg This isn't completely accurate - numpy.linalg supports almost all operations* over stacks of matrices via gufuncs, but scipy.linalg does not appear to. Eric *: not lstsq due to an ungeneralizable public API On Wed, 5 Feb 2020 at 17:38, Ralf Gommers wrote: > > > On Wed, Feb 5, 2020 at 10:01 AM Andreas Mueller wrote: > >> A bit late to the NEP 37 party. >> I just wanted to say that at least from my perspective it seems a great >> solution that will help sklearn move towards more flexible compute engines. >> I think one of the biggest issues is array creation (including random >> arrays), and that's handled quite nicely with NEP 37. >> >> There's some discussion on the scikit-learn side here: >> https://github.com/scikit-learn/scikit-learn/pull/14963 >> https://github.com/scikit-learn/scikit-learn/issues/11447 >> >> Two different groups of people tried to use __array_function__ to >> delegate to MxNet and CuPy respectively in scikit-learn, and ran into the >> same issues. >> >> There's some remaining issues in sklearn that will not be handled by NEP >> 37 but they go beyond NumPy in some sense. >> Just to briefly bring them up: >> >> - We use scipy.linalg in many places, and we would need to do a separate >> dispatching to check whether we can use module.linalg instead >> (that might be an issue for many libraries but I'm not sure). >> > > That is an issue, and goes in the opposite direction we need - > scipy.linalg is a superset of numpy.linalg, so we'd like to encourage using > scipy.
This is something we may want to consider fixing by making the > dispatch decorator public in numpy and adopting in scipy. > > Cheers, > Ralf > > > >> >> - Some models have several possible optimization algorithms, some of >> which are pure numpy and some which are Cython. If someone provides a >> different array module, >> we might want to choose an algorithm that is actually supported by that >> module. While this exact issue is maybe sklearn specific, a similar issue >> could appear for most downstream libs that use Cython in some places. >> Many Cython algorithms could be implemented in pure numpy with a >> potential slowdown, but once we have NEP 37 there might be a benefit to >> having a pure NumPy implementation as an alternative code path. >> >> >> Anyway, NEP 37 seems a great step in the right direction and would enable >> sklearn to actually dispatch in some places. Dispatching just based on >> __array_function__ seems not really feasible so far. >> >> Best, >> Andreas Mueller
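For context on the dispatch-decorator idea above: the decorator NumPy uses to attach ``__array_function__`` dispatch to its own functions currently lives in the private ``numpy.core.overrides`` module. A rough sketch of what adopting it for a ``scipy.linalg`` function could look like follows; it is purely illustrative, since neither a public location for the decorator nor its use in SciPy has been agreed:

.. code:: python

    # Illustrative only: array_function_dispatch is a private NumPy helper today.
    from numpy.core.overrides import array_function_dispatch

    def _lu_dispatcher(a, permute_l=None, overwrite_a=None, check_finite=None):
        # The dispatcher only tells the override machinery which arguments
        # to search for __array_function__ implementations.
        return (a,)

    @array_function_dispatch(_lu_dispatcher)
    def lu(a, permute_l=False, overwrite_a=False, check_finite=True):
        ...  # existing scipy.linalg implementation, unchanged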
From ralf.gommers at gmail.com Wed Feb 5 13:22:19 2020 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Wed, 5 Feb 2020 12:22:19 -0600 Subject: [Numpy-discussion] NEP 37: A dispatch protocol for NumPy-like modules In-Reply-To: References: <21692339-9f4b-029c-d422-ea549acbe6c3@gmail.com> Message-ID: On Wed, Feb 5, 2020 at 12:14 PM Eric Wieser wrote: > > scipy.linalg is a superset of numpy.linalg > > This isn't completely accurate - numpy.linalg supports almost all > operations* over stacks of matrices via gufuncs, but scipy.linalg does not > appear to. > > Eric > > *: not lstsq due to an ungeneralizable public API > That's true for `qr` as well I believe. Indeed some functions have diverged slightly, but that's not on purpose, more like a lack of time to coordinate. We would like to fix that so everything is in sync and fully API-compatible again. Ralf
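To make the stacked-matrices point concrete: almost all ``numpy.linalg`` operations are gufuncs that broadcast over leading dimensions, while their ``scipy.linalg`` counterparts operate on a single 2-D matrix. A minimal illustration, with ``inv`` chosen because it generalizes cleanly and the shapes arbitrary:

.. code:: python

    import numpy as np
    import scipy.linalg

    stacked = np.random.rand(4, 3, 3)   # a stack of four 3x3 matrices
    np.linalg.inv(stacked).shape        # (4, 3, 3): inverts each matrix in the stack
    # scipy.linalg.inv(stacked)         # raises ValueError: expects a single 2-D square matrix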
>>> Many Cython algorithms could be implemented in pure numpy with a >>> potential slowdown, but once we have NEP 37 there might be a benefit to >>> having a pure NumPy implementation as an alternative code path. >>> >>> >>> Anyway, NEP 37 seems a great step in the right direction and would >>> enable sklearn to actually dispatch in some places. Dispatching just based >>> on __array_function__ seems not really feasible so far. >>> >>> Best, >>> Andreas Mueller >>> >>> >>> On 1/6/20 11:29 PM, Stephan Hoyer wrote: >>> >>> I am pleased to present a new NumPy Enhancement Proposal for discussion: >>> "NEP-37: A dispatch protocol for NumPy-like modules." Feedback would be >>> very welcome! >>> >>> The full text follows. The rendered proposal can also be found online at >>> https://numpy.org/neps/nep-0037-array-module.html >>> >>> Best, >>> Stephan Hoyer >>> >>> =================================================== >>> NEP 37 ? A dispatch protocol for NumPy-like modules >>> =================================================== >>> >>> :Author: Stephan Hoyer >>> :Author: Hameer Abbasi >>> :Author: Sebastian Berg >>> :Status: Draft >>> :Type: Standards Track >>> :Created: 2019-12-29 >>> >>> Abstract >>> -------- >>> >>> NEP-18's ``__array_function__`` has been a mixed success. Some projects >>> (e.g., >>> dask, CuPy, xarray, sparse, Pint) have enthusiastically adopted it. >>> Others >>> (e.g., PyTorch, JAX, SciPy) have been more reluctant. Here we propose a >>> new >>> protocol, ``__array_module__``, that we expect could eventually subsume >>> most >>> use-cases for ``__array_function__``. The protocol requires explicit >>> adoption >>> by both users and library authors, which ensures backwards >>> compatibility, and >>> is also significantly simpler than ``__array_function__``, both of which >>> we >>> expect will make it easier to adopt. >>> >>> Why ``__array_function__`` hasn't been enough >>> --------------------------------------------- >>> >>> There are two broad ways in which NEP-18 has fallen short of its goals: >>> >>> 1. **Maintainability concerns**. `__array_function__` has significant >>> implications for libraries that use it: >>> >>> - Projects like `PyTorch >>> `_, `JAX >>> `_ and even >>> `scipy.sparse >>> `_ have been >>> reluctant to >>> implement `__array_function__` in part because they are concerned >>> about >>> **breaking existing code**: users expect NumPy functions like >>> ``np.concatenate`` to return NumPy arrays. This is a fundamental >>> limitation of the ``__array_function__`` design, which we chose to >>> allow >>> overriding the existing ``numpy`` namespace. >>> - ``__array_function__`` currently requires an "all or nothing" >>> approach to >>> implementing NumPy's API. There is no good pathway for **incremental >>> adoption**, which is particularly problematic for established >>> projects >>> for which adopting ``__array_function__`` would result in breaking >>> changes. >>> - It is no longer possible to use **aliases to NumPy functions** >>> within >>> modules that support overrides. For example, both CuPy and JAX set >>> ``result_type = np.result_type``. >>> - Implementing **fall-back mechanisms** for unimplemented NumPy >>> functions >>> by using NumPy's implementation is hard to get right (but see the >>> `version from dask `_), >>> because >>> ``__array_function__`` does not present a consistent interface. >>> Converting all arguments of array type requires recursing into >>> generic >>> arguments of the form ``*args, **kwargs``. >>> >>> 2. 
**Limitations on what can be overridden.** ``__array_function__`` has >>> some >>> important gaps, most notably array creation and coercion functions: >>> >>> - **Array creation** routines (e.g., ``np.arange`` and those in >>> ``np.random``) need some other mechanism for indicating what type of >>> arrays to create. `NEP 36 < >>> https://github.com/numpy/numpy/pull/14715>`_ >>> proposed adding optional ``like=`` arguments to functions without >>> existing array arguments. However, we still lack any mechanism to >>> override methods on objects, such as those needed by >>> ``np.random.RandomState``. >>> - **Array conversion** can't reuse the existing coercion functions >>> like >>> ``np.asarray``, because ``np.asarray`` sometimes means "convert to >>> an >>> exact ``np.ndarray``" and other times means "convert to something >>> _like_ >>> a NumPy array." This led to the `NEP 30 >>> `_ >>> proposal for >>> a separate ``np.duckarray`` function, but this still does not >>> resolve how >>> to cast one duck array into a type matching another duck array. >>> >>> ``get_array_module`` and the ``__array_module__`` protocol >>> ---------------------------------------------------------- >>> >>> We propose a new user-facing mechanism for dispatching to a duck-array >>> implementation, ``numpy.get_array_module``. ``get_array_module`` >>> performs the >>> same type resolution as ``__array_function__`` and returns a module with >>> an API >>> promised to match the standard interface of ``numpy`` that can implement >>> operations on all provided array types. >>> >>> The protocol itself is both simpler and more powerful than >>> ``__array_function__``, because it doesn't need to worry about actually >>> implementing functions. We believe it resolves most of the >>> maintainability and >>> functionality limitations of ``__array_function__``. >>> >>> The new protocol is opt-in, explicit and with local control; see >>> :ref:`appendix-design-choices` for discussion on the importance of these >>> design >>> features. >>> >>> The array module contract >>> ========================= >>> >>> Modules returned by ``get_array_module``/``__array_module__`` should >>> make a >>> best effort to implement NumPy's core functionality on new array >>> types(s). >>> Unimplemented functionality should simply be omitted (e.g., accessing an >>> unimplemented function should raise ``AttributeError``). In the future, >>> we >>> anticipate codifying a protocol for requesting restricted subsets of >>> ``numpy``; >>> see :ref:`requesting-restricted-subsets` for more details. >>> >>> How to use ``get_array_module`` >>> =============================== >>> >>> Code that wants to support generic duck arrays should explicitly call >>> ``get_array_module`` to determine an appropriate array module from which >>> to >>> call functions, rather than using the ``numpy`` namespace directly. For >>> example: >>> >>> .. code:: python >>> >>> # calls the appropriate version of np.something for x and y >>> module = np.get_array_module(x, y) >>> module.something(x, y) >>> >>> Both array creation and array conversion are supported, because >>> dispatching is >>> handled by ``get_array_module`` rather than via the types of function >>> arguments. For example, to use random number generation functions or >>> methods, >>> we can simply pull out the appropriate submodule: >>> >>> .. 
code:: python >>> >>> def duckarray_add_random(array): >>> module = np.get_array_module(array) >>> noise = module.random.randn(*array.shape) >>> return array + noise >>> >>> We can also write the duck-array ``stack`` function from `NEP 30 >>> `_, without >>> the need >>> for a new ``np.duckarray`` function: >>> >>> .. code:: python >>> >>> def duckarray_stack(arrays): >>> module = np.get_array_module(*arrays) >>> arrays = [module.asarray(arr) for arr in arrays] >>> shapes = {arr.shape for arr in arrays} >>> if len(shapes) != 1: >>> raise ValueError('all input arrays must have the same shape') >>> expanded_arrays = [arr[module.newaxis, ...] for arr in arrays] >>> return module.concatenate(expanded_arrays, axis=0) >>> >>> By default, ``get_array_module`` will return the ``numpy`` module if no >>> arguments are arrays. This fall-back can be explicitly controlled by >>> providing >>> the ``module`` keyword-only argument. It is also possible to indicate >>> that an >>> exception should be raised instead of returning a default array module by >>> setting ``module=None``. >>> >>> How to implement ``__array_module__`` >>> ===================================== >>> >>> Libraries implementing a duck array type that want to support >>> ``get_array_module`` need to implement the corresponding protocol, >>> ``__array_module__``. This new protocol is based on Python's dispatch >>> protocol >>> for arithmetic, and is essentially a simpler version of >>> ``__array_function__``. >>> >>> Only one argument is passed into ``__array_module__``, a Python >>> collection of >>> unique array types passed into ``get_array_module``, i.e., all arguments >>> with >>> an ``__array_module__`` attribute. >>> >>> The special method should either return an namespace with an API matching >>> ``numpy``, or ``NotImplemented``, indicating that it does not know how to >>> handle the operation: >>> >>> .. code:: python >>> >>> class MyArray: >>> def __array_module__(self, types): >>> if not all(issubclass(t, MyArray) for t in types): >>> return NotImplemented >>> return my_array_module >>> >>> Returning custom objects from ``__array_module__`` >>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ >>> >>> ``my_array_module`` will typically, but need not always, be a Python >>> module. >>> Returning a custom objects (e.g., with functions implemented via >>> ``__getattr__``) may be useful for some advanced use cases. >>> >>> For example, custom objects could allow for partial implementations of >>> duck >>> array modules that fall-back to NumPy (although this is not recommended >>> in >>> general because such fall-back behavior can be error prone): >>> >>> .. code:: python >>> >>> class MyArray: >>> def __array_module__(self, types): >>> if all(issubclass(t, MyArray) for t in types): >>> return ArrayModule() >>> else: >>> return NotImplemented >>> >>> class ArrayModule: >>> def __getattr__(self, name): >>> import base_module >>> return getattr(base_module, name, getattr(numpy, name)) >>> >>> Subclassing from ``numpy.ndarray`` >>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ >>> >>> All of the same guidance about well-defined type casting hierarchies from >>> NEP-18 still applies. ``numpy.ndarray`` itself contains a matching >>> implementation of ``__array_module__``, which is convenient for >>> subclasses: >>> >>> .. 
code:: python >>> >>> class ndarray: >>> def __array_module__(self, types): >>> if all(issubclass(t, ndarray) for t in types): >>> return numpy >>> else: >>> return NotImplemented >>> >>> NumPy's internal machinery >>> ========================== >>> >>> The type resolution rules of ``get_array_module`` follow the same model >>> as >>> Python and NumPy's existing dispatch protocols: subclasses are called >>> before >>> super-classes, and otherwise left to right. ``__array_module__`` is >>> guaranteed >>> to be called only a single time on each unique type. >>> >>> The actual implementation of `get_array_module` will be in C, but should >>> be >>> equivalent to this Python code: >>> >>> .. code:: python >>> >>> def get_array_module(*arrays, default=numpy): >>> implementing_arrays, types = >>> _implementing_arrays_and_types(arrays) >>> if not implementing_arrays and default is not None: >>> return default >>> for array in implementing_arrays: >>> module = array.__array_module__(types) >>> if module is not NotImplemented: >>> return module >>> raise TypeError("no common array module found") >>> >>> def _implementing_arrays_and_types(relevant_arrays): >>> types = [] >>> implementing_arrays = [] >>> for array in relevant_arrays: >>> t = type(array) >>> if t not in types and hasattr(t, '__array_module__'): >>> types.append(t) >>> # Subclasses before superclasses, otherwise left to right >>> index = len(implementing_arrays) >>> for i, old_array in enumerate(implementing_arrays): >>> if issubclass(t, type(old_array)): >>> index = i >>> break >>> implementing_arrays.insert(index, array) >>> return implementing_arrays, types >>> >>> Relationship with ``__array_ufunc__`` and ``__array_function__`` >>> ---------------------------------------------------------------- >>> >>> These older protocols have distinct use-cases and should remain >>> =============================================================== >>> >>> ``__array_module__`` is intended to resolve limitations of >>> ``__array_function__``, so it is natural to consider whether it could >>> entirely >>> replace ``__array_function__``. This would offer dual benefits: (1) >>> simplifying >>> the user-story about how to override NumPy and (2) removing the slowdown >>> associated with checking for dispatch when calling every NumPy function. >>> >>> However, ``__array_module__`` and ``__array_function__`` are pretty >>> different >>> from a user perspective: it requires explicit calls to >>> ``get_array_function``, >>> rather than simply reusing original ``numpy`` functions. This is >>> probably fine >>> for *libraries* that rely on duck-arrays, but may be frustratingly >>> verbose for >>> interactive use. >>> >>> Some of the dispatching use-cases for ``__array_ufunc__`` are also >>> solved by >>> ``__array_module__``, but not all of them. For example, it is still >>> useful to >>> be able to define non-NumPy ufuncs (e.g., from Numba or SciPy) in a >>> generic way >>> on non-NumPy arrays (e.g., with dask.array). >>> >>> Given their existing adoption and distinct use cases, we don't think it >>> makes >>> sense to remove or deprecate ``__array_function__`` and >>> ``__array_ufunc__`` at >>> this time. 
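A minimal sketch of the ``__array_ufunc__`` use-case mentioned above, assuming dask and scipy are installed: ``scipy.special.erf`` is a true ufunc but lives outside the ``numpy`` namespace, so a namespace returned by ``get_array_module`` would not contain it, while ``__array_ufunc__`` already covers it today.

.. code:: python

    import dask.array as da
    import scipy.special

    x = da.ones((1000,), chunks=100)
    # dispatches through dask.array's __array_ufunc__ and stays lazy
    y = scipy.special.erf(x)
    print(type(y))   # dask.array.core.Array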
>>> >>> Mixin classes to implement ``__array_function__`` and ``__array_ufunc__`` >>> ========================================================================= >>> >>> Despite the user-facing differences, ``__array_module__`` and a module >>> implementing NumPy's API still contain sufficient functionality needed to >>> implement dispatching with the existing duck array protocols. >>> >>> For example, the following mixin classes would provide sensible defaults >>> for >>> these special methods in terms of ``get_array_module`` and >>> ``__array_module__``: >>> >>> .. code:: python >>> >>> class ArrayUfuncFromModuleMixin: >>> >>> def __array_ufunc__(self, ufunc, method, *inputs, **kwargs): >>> arrays = inputs + kwargs.get('out', ()) >>> try: >>> array_module = np.get_array_module(*arrays) >>> except TypeError: >>> return NotImplemented >>> >>> try: >>> # Note this may have false positive matches, if >>> ufunc.__name__ >>> # matches the name of a ufunc defined by NumPy. >>> Unfortunately >>> # there is no way to determine in which module a ufunc >>> was >>> # defined. >>> new_ufunc = getattr(array_module, ufunc.__name__) >>> except AttributeError: >>> return NotImplemented >>> >>> try: >>> callable = getattr(new_ufunc, method) >>> except AttributeError: >>> return NotImplemented >>> >>> return callable(*inputs, **kwargs) >>> >>> class ArrayFunctionFromModuleMixin: >>> >>> def __array_function__(self, func, types, args, kwargs): >>> array_module = self.__array_module__(types) >>> if array_module is NotImplemented: >>> return NotImplemented >>> >>> # Traverse submodules to find the appropriate function >>> modules = func.__module__.split('.') >>> assert modules[0] == 'numpy' >>> for submodule in modules[1:]: >>> module = getattr(module, submodule, None) >>> new_func = getattr(module, func.__name__, None) >>> if new_func is None: >>> return NotImplemented >>> >>> return new_func(*args, **kwargs) >>> >>> To make it easier to write duck arrays, we could also add these mixin >>> classes >>> into ``numpy.lib.mixins`` (but the examples above may suffice). >>> >>> Alternatives considered >>> ----------------------- >>> >>> Naming >>> ====== >>> >>> We like the name ``__array_module__`` because it mirrors the existing >>> ``__array_function__`` and ``__array_ufunc__`` protocols. Another >>> reasonable >>> choice could be ``__array_namespace__``. >>> >>> It is less clear what the NumPy function that calls this protocol should >>> be >>> called (``get_array_module`` in this proposal). Some possible >>> alternatives: >>> ``array_module``, ``common_array_module``, ``resolve_array_module``, >>> ``get_namespace``, ``get_numpy``, ``get_numpylike_module``, >>> ``get_duck_array_module``. >>> >>> .. _requesting-restricted-subsets: >>> >>> Requesting restricted subsets of NumPy's API >>> ============================================ >>> >>> Over time, NumPy has accumulated a very large API surface, with over 600 >>> attributes in the top level ``numpy`` module alone. It is unlikely that >>> any >>> duck array library could or would want to implement all of these >>> functions and >>> classes, because the frequently used subset of NumPy is much smaller. >>> >>> We think it would be useful exercise to define "minimal" subset(s) of >>> NumPy's >>> API, omitting rarely used or non-recommended functionality. For example, >>> minimal NumPy might include ``stack``, but not the other stacking >>> functions >>> ``column_stack``, ``dstack``, ``hstack`` and ``vstack``. 
This could >>> clearly >>> indicate to duck array authors and users want functionality is core and >>> what >>> functionality they can skip. >>> >>> Support for requesting a restricted subset of NumPy's API would be a >>> natural >>> feature to include in ``get_array_function`` and ``__array_module__``, >>> e.g., >>> >>> .. code:: python >>> >>> # array_module is only guaranteed to contain "minimal" NumPy >>> array_module = np.get_array_module(*arrays, request='minimal') >>> >>> To facilitate testing with NumPy and use with any valid duck array >>> library, >>> NumPy itself would return restricted versions of the ``numpy`` module >>> when >>> ``get_array_module`` is called only on NumPy arrays. Omitted functions >>> would >>> simply not exist. >>> >>> Unfortunately, we have not yet figured out what these restricted subsets >>> should >>> be, so it doesn't make sense to do this yet. When/if we do, we could >>> either add >>> new keyword arguments to ``get_array_module`` or add new top level >>> functions, >>> e.g., ``get_minimal_array_module``. We would also need to add either a >>> new >>> protocol patterned off of ``__array_module__`` (e.g., >>> ``__array_module_minimal__``), or could add an optional second argument >>> to >>> ``__array_module__`` (catching errors with ``try``/``except``). >>> >>> A new namespace for implicit dispatch >>> ===================================== >>> >>> Instead of supporting overrides in the main `numpy` namespace with >>> ``__array_function__``, we could create a new opt-in namespace, e.g., >>> ``numpy.api``, with versions of NumPy functions that support >>> dispatching. These >>> overrides would need new opt-in protocols, e.g., >>> ``__array_function_api__`` >>> patterned off of ``__array_function__``. >>> >>> This would resolve the biggest limitations of ``__array_function__`` by >>> being >>> opt-in and would also allow for unambiguously overriding functions like >>> ``asarray``, because ``np.api.asarray`` would always mean "convert an >>> array-like object." But it wouldn't solve all the dispatching needs met >>> by >>> ``__array_module__``, and would leave us with supporting a considerably >>> more >>> complex protocol both for array users and implementors. >>> >>> We could potentially implement such a new namespace *via* the >>> ``__array_module__`` protocol. Certainly some users would find this >>> convenient, >>> because it is slightly less boilerplate. But this would leave users with >>> a >>> confusing choice: when should they use `get_array_module` vs. >>> `np.api.something`. Also, we would have to add and maintain a whole new >>> module, >>> which is considerably more expensive than merely adding a function. >>> >>> Dispatching on both types and arrays instead of only types >>> ========================================================== >>> >>> Instead of supporting dispatch only via unique array types, we could also >>> support dispatch via array objects, e.g., by passing an ``arrays`` >>> argument as >>> part of the ``__array_module__`` protocol. This could potentially be >>> useful for >>> dispatch for arrays with metadata, such provided by Dask and Pint, but >>> would >>> impose costs in terms of type safety and complexity. >>> >>> For example, a library that supports arrays on both CPUs and GPUs might >>> decide >>> on which device to create a new arrays from functions like ``ones`` >>> based on >>> input arguments: >>> >>> .. 
code:: python >>> >>> class Array: >>> def __array_module__(self, types, arrays): >>> useful_arrays = tuple(a in arrays if isinstance(a, Array)) >>> if not useful_arrays: >>> return NotImplemented >>> prefer_gpu = any(a.prefer_gpu for a in useful_arrays) >>> return ArrayModule(prefer_gpu) >>> >>> class ArrayModule: >>> def __init__(self, prefer_gpu): >>> self.prefer_gpu = prefer_gpu >>> >>> def __getattr__(self, name): >>> import base_module >>> base_func = getattr(base_module, name) >>> return functools.partial(base_func, >>> prefer_gpu=self.prefer_gpu) >>> >>> This might be useful, but it's not clear if we really need it. Pint >>> seems to >>> get along OK without any explicit array creation routines (favoring >>> multiplication by units, e.g., ``np.ones(5) * ureg.m``), and for the >>> most part >>> Dask is also OK with existing ``__array_function__`` style overides >>> (e.g., >>> favoring ``np.ones_like`` over ``np.ones``). Choosing whether to place >>> an array >>> on the CPU or GPU could be solved by `making array creation lazy >>> `_. >>> >>> .. _appendix-design-choices: >>> >>> Appendix: design choices for API overrides >>> ------------------------------------------ >>> >>> There is a large range of possible design choices for overriding NumPy's >>> API. >>> Here we discuss three major axes of the design decision that guided our >>> design >>> for ``__array_module__``. >>> >>> Opt-in vs. opt-out for users >>> ============================ >>> >>> The ``__array_ufunc__`` and ``__array_function__`` protocols provide a >>> mechanism for overriding NumPy functions *within NumPy's existing >>> namespace*. >>> This means that users need to explicitly opt-out if they do not want any >>> overridden behavior, e.g., by casting arrays with ``np.asarray()``. >>> >>> In theory, this approach lowers the barrier for adopting these protocols >>> in >>> user code and libraries, because code that uses the standard NumPy >>> namespace is >>> automatically compatible. But in practice, this hasn't worked out. For >>> example, >>> most well-maintained libraries that use NumPy follow the best practice of >>> casting all inputs with ``np.asarray()``, which they would have to >>> explicitly >>> relax to use ``__array_function__``. Our experience has been that making >>> a >>> library compatible with a new duck array type typically requires at >>> least a >>> small amount of work to accommodate differences in the data model and >>> operations >>> that can be implemented efficiently. >>> >>> These opt-out approaches also considerably complicate backwards >>> compatibility >>> for libraries that adopt these protocols, because by opting in as a >>> library >>> they also opt-in their users, whether they expect it or not. For winning >>> over >>> libraries that have been unable to adopt ``__array_function__``, an >>> opt-in >>> approach seems like a must. >>> >>> Explicit vs. implicit choice of implementation >>> ============================================== >>> >>> Both ``__array_ufunc__`` and ``__array_function__`` have implicit >>> control over >>> dispatching: the dispatched functions are determined via the appropriate >>> protocols in every function call. This generalizes well to handling many >>> different types of objects, as evidenced by its use for implementing >>> arithmetic >>> operators in Python, but it has two downsides: >>> >>> 1. *Speed*: it imposes additional overhead in every function call, >>> because each >>> function call needs to inspect each of its arguments for overrides. 
>>> This is >>> why arithmetic on builtin Python numbers is slow. >>> 2. *Readability*: it is not longer immediately evident to readers of >>> code what >>> happens when a function is called, because the function's >>> implementation >>> could be overridden by any of its arguments. >>> >>> In contrast, importing a new library (e.g., ``import dask.array as >>> da``) with >>> an API matching NumPy is entirely explicit. There is no overhead from >>> dispatch >>> or ambiguity about which implementation is being used. >>> >>> Explicit and implicit choice of implementations are not mutually >>> exclusive >>> options. Indeed, most implementations of NumPy API overrides via >>> ``__array_function__`` that we are familiar with (namely, dask, CuPy and >>> sparse, but not Pint) also include an explicit way to use their version >>> of >>> NumPy's API by importing a module directly (``dask.array``, ``cupy`` or >>> ``sparse``, respectively). >>> >>> Local vs. non-local vs. global control >>> ====================================== >>> >>> The final design axis is how users control the choice of API: >>> >>> - **Local control**, as exemplified by multiple dispatch and Python >>> protocols for >>> arithmetic, determines which implementation to use either by checking >>> types >>> or calling methods on the direct arguments of a function. >>> - **Non-local control** such as `np.errstate >>> < >>> https://docs.scipy.org/doc/numpy/reference/generated/numpy.errstate.html >>> >`_ >>> overrides behavior with global-state via function decorators or >>> context-managers. Control is determined hierarchically, via the >>> inner-most >>> context. >>> - **Global control** provides a mechanism for users to set default >>> behavior, >>> either via function calls or configuration files. For example, >>> matplotlib >>> allows setting a global choice of plotting backend. >>> >>> Local control is generally considered a best practice for API design, >>> because >>> control flow is entirely explicit, which makes it the easiest to >>> understand. >>> Non-local and global control are occasionally used, but generally either >>> due to >>> ignorance or a lack of better alternatives. >>> >>> In the case of duck typing for NumPy's public API, we think non-local or >>> global >>> control would be mistakes, mostly because they **don't compose well**. >>> If one >>> library sets/needs one set of overrides and then internally calls a >>> routine >>> that expects another set of overrides, the resulting behavior may be very >>> surprising. Higher order functions are especially problematic, because >>> the >>> context in which functions are evaluated may not be the context in which >>> they >>> are defined. >>> >>> One class of override use cases where we think non-local and global >>> control are >>> appropriate is for choosing a backend system that is guaranteed to have >>> an >>> entirely consistent interface, such as a faster alternative >>> implementation of >>> ``numpy.fft`` on NumPy arrays. However, these are out of scope for the >>> current >>> proposal, which is focused on duck arrays. 
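As a concrete illustration of the backend-style, non-local control mentioned in the last paragraph (explicitly out of scope for the NEP itself): ``scipy.fft`` already offers context-managed backend switching. The sketch below assumes a pyFFTW version that provides the ``pyfftw.interfaces.scipy_fft`` backend module.

.. code:: python

    import numpy as np
    import scipy.fft
    import pyfftw.interfaces.scipy_fft as pyfftw_fft

    x = np.arange(8, dtype=float)
    with scipy.fft.set_backend(pyfftw_fft):
        # inside the context, scipy.fft.* calls are served by pyFFTW
        y = scipy.fft.fft(x)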
>>> >>> _______________________________________________ >>> NumPy-Discussion mailing listNumPy-Discussion at python.orghttps://mail.python.org/mailman/listinfo/numpy-discussion >>> >>> >>> _______________________________________________ >>> NumPy-Discussion mailing list >>> NumPy-Discussion at python.org >>> https://mail.python.org/mailman/listinfo/numpy-discussion >>> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ndbecker2 at gmail.com Thu Feb 6 08:00:59 2020 From: ndbecker2 at gmail.com (Neal Becker) Date: Thu, 6 Feb 2020 08:00:59 -0500 Subject: [Numpy-discussion] manylinux upgrade for numpy wheels In-Reply-To: References: Message-ID: Slightly off topic perhaps, it is recommended to perform custom compilation for best performance, yet is there an easy way to do this? I don't think a simple pip will do. On Wed, Feb 5, 2020 at 4:07 AM Matthew Brett wrote: > Hi, > > On Tue, Feb 4, 2020 at 10:38 PM Nathaniel Smith wrote: > > > > Pretty sure the 2010 and 2014 images both have much newer compilers than > that. > > > > There are still a lot of users on CentOS 6, so I'd still stick to 2010 > for now on x86_64 at least. We could potentially start adding 2014 wheels > for the other platforms where we currently don't ship wheels ? gotta be > better than nothing, right? > > > > There probably still is some tail of end users whose pip is too old to > know about 2010 wheels. I don't know how big that tail is. If we wanted to > be really careful, we could ship both manylinux1 and manylinux2010 wheels > for a bit ? pip will automatically pick the latest one it recognizes ? and > see what the download numbers look like. > > That all sounds right to me too. > > Cheers, > > Matthew > > > On Tue, Feb 4, 2020, 13:18 Charles R Harris > wrote: > >> > >> Hi All, > >> > >> Thought now would be a good time to decide on upgrading manylinux for > the 1.19 release so that we can make sure that everything works as > expected. The choices are > >> > >> manylinux1 -- CentOS 5, currently used, gcc 4.2 (in practice 4.5), only > supports i686, x86_64. > >> manylinux2010 -- CentOS 6, gcc 4.5, only supports i686, x86_64. > >> manylinux2014 -- CentOS 7, gcc 4.8, supports many more architectures. > >> > >> The main advantage of manylinux2014 is that it supports many new > architectures, some of which we are already testing against. The main > disadvantage is that it requires pip >= 19.x, which may not be much of a > problem 4 months from now but will undoubtedly cause some installation > problems. Unfortunately, the compiler remains archaic, but folks interested > in performance should be using a performance oriented distribution or > compiling for their native architecture. 
> >> > >> Chuck > >> > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion at python.org > >> https://mail.python.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at python.org > > https://mail.python.org/mailman/listinfo/numpy-discussion > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -- *Those who don't understand recursion are doomed to repeat it* -------------- next part -------------- An HTML attachment was scrubbed... URL: From aldcroft at head.cfa.harvard.edu Thu Feb 6 08:24:16 2020 From: aldcroft at head.cfa.harvard.edu (Aldcroft, Thomas) Date: Thu, 6 Feb 2020 08:24:16 -0500 Subject: [Numpy-discussion] manylinux upgrade for numpy wheels In-Reply-To: References: Message-ID: Our organization is still using CentOS-6, so my vote is for that. Thanks, Tom On Tue, Feb 4, 2020 at 5:38 PM Nathaniel Smith wrote: > Pretty sure the 2010 and 2014 images both have much newer compilers than > that. > > There are still a lot of users on CentOS 6, so I'd still stick to 2010 for > now on x86_64 at least. We could potentially start adding 2014 wheels for > the other platforms where we currently don't ship wheels ? gotta be better > than nothing, right? > > There probably still is some tail of end users whose pip is too old to > know about 2010 wheels. I don't know how big that tail is. If we wanted to > be really careful, we could ship both manylinux1 and manylinux2010 wheels > for a bit ? pip will automatically pick the latest one it recognizes ? and > see what the download numbers look like. > > On Tue, Feb 4, 2020, 13:18 Charles R Harris > wrote: > >> Hi All, >> >> Thought now would be a good time to decide on upgrading manylinux for the >> 1.19 release so that we can make sure that everything works as expected. >> The choices are >> >> manylinux1 -- CentOS 5, >> currently used, gcc 4.2 (in practice 4.5), only supports i686, x86_64. >> manylinux2010 -- CentOS 6, >> gcc 4.5, only supports i686, x86_64. >> manylinux2014 -- CentOS 7, >> gcc 4.8, supports many more architectures. >> >> The main advantage of manylinux2014 is that it supports many new >> architectures, some of which we are already testing against. The main >> disadvantage is that it requires pip >= 19.x, which may not be much of a >> problem 4 months from now but will undoubtedly cause some installation >> problems. Unfortunately, the compiler remains archaic, but folks interested >> in performance should be using a performance oriented distribution or >> compiling for their native architecture. >> >> Chuck >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
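(Regarding Neal Becker's question earlier in this thread about an easy way to get a locally compiled NumPy: one possible route, not the only one, is to have pip build from the source distribution instead of installing a manylinux wheel, so the local compiler and BLAS are used.)

    pip uninstall numpy
    pip install numpy --no-binary numpy

A ``site.cfg`` / ``~/.numpy-site.cfg`` file can then point the build at an optimized BLAS/LAPACK if one is available.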
URL: From shoyer at gmail.com Thu Feb 6 12:35:31 2020 From: shoyer at gmail.com (Stephan Hoyer) Date: Thu, 6 Feb 2020 09:35:31 -0800 Subject: [Numpy-discussion] NEP 37: A dispatch protocol for NumPy-like modules In-Reply-To: <21692339-9f4b-029c-d422-ea549acbe6c3@gmail.com> References: <21692339-9f4b-029c-d422-ea549acbe6c3@gmail.com> Message-ID: On Wed, Feb 5, 2020 at 8:02 AM Andreas Mueller wrote: > A bit late to the NEP 37 party. > I just wanted to say that at least from my perspective it seems a great > solution that will help sklearn move towards more flexible compute engines. > I think one of the biggest issues is array creation (including random > arrays), and that's handled quite nicely with NEP 37. > Andreas, thanks for sharing your feedback here! Your perspective is really appreciated. > - We use scipy.linalg in many places, and we would need to do a separate > dispatching to check whether we can use module.linalg instead > (that might be an issue for many libraries but I'm not sure). > This brings up a good question -- obviously the final decision here is up to SciPy maintainers, but how should we encourage SciPy to support dispatching? We could pretty easily make __array_function__ cover SciPy by simply exposing NumPy's internal utilities. SciPy could simply use the np.array_function_dispatch decorator internally and that would be enough. It is less clear how this could work for __array_module__, because __array_module__ and get_array_module() are not generic -- they refers explicitly to a NumPy like module. If we want to extend it to SciPy (for which I agree there are good use-cases), what should that look like? The obvious choices would be to either add a new protocol, e.g., __scipy_module__ (but then NumPy needs to know about SciPy), or to add some sort of "module request" parameter to np.get_array_module(), to indicate the requested API, e.g., np.get_array_module(*arrays, matching='scipy'). This is pretty similar to the "default" argument but would need to get passed into the __array_module__ protocol, too. > - Some models have several possible optimization algorithms, some of which > are pure numpy and some which are Cython. If someone provides a different > array module, > we might want to choose an algorithm that is actually supported by that > module. While this exact issue is maybe sklearn specific, a similar issue > could appear for most downstream libs that use Cython in some places. > Many Cython algorithms could be implemented in pure numpy with a > potential slowdown, but once we have NEP 37 there might be a benefit to > having a pure NumPy implementation as an alternative code path. > > > Anyway, NEP 37 seems a great step in the right direction and would enable > sklearn to actually dispatch in some places. Dispatching just based on > __array_function__ seems not really feasible so far. > > Best, > Andreas Mueller > > > On 1/6/20 11:29 PM, Stephan Hoyer wrote: > > I am pleased to present a new NumPy Enhancement Proposal for discussion: > "NEP-37: A dispatch protocol for NumPy-like modules." Feedback would be > very welcome! > > The full text follows. The rendered proposal can also be found online at > https://numpy.org/neps/nep-0037-array-module.html > > Best, > Stephan Hoyer > > =================================================== > NEP 37 ? 
A dispatch protocol for NumPy-like modules > =================================================== > > :Author: Stephan Hoyer > :Author: Hameer Abbasi > :Author: Sebastian Berg > :Status: Draft > :Type: Standards Track > :Created: 2019-12-29 > > Abstract > -------- > > NEP-18's ``__array_function__`` has been a mixed success. Some projects > (e.g., > dask, CuPy, xarray, sparse, Pint) have enthusiastically adopted it. Others > (e.g., PyTorch, JAX, SciPy) have been more reluctant. Here we propose a new > protocol, ``__array_module__``, that we expect could eventually subsume > most > use-cases for ``__array_function__``. The protocol requires explicit > adoption > by both users and library authors, which ensures backwards compatibility, > and > is also significantly simpler than ``__array_function__``, both of which we > expect will make it easier to adopt. > > Why ``__array_function__`` hasn't been enough > --------------------------------------------- > > There are two broad ways in which NEP-18 has fallen short of its goals: > > 1. **Maintainability concerns**. `__array_function__` has significant > implications for libraries that use it: > > - Projects like `PyTorch > `_, `JAX > `_ and even `scipy.sparse > `_ have been reluctant > to > implement `__array_function__` in part because they are concerned > about > **breaking existing code**: users expect NumPy functions like > ``np.concatenate`` to return NumPy arrays. This is a fundamental > limitation of the ``__array_function__`` design, which we chose to > allow > overriding the existing ``numpy`` namespace. > - ``__array_function__`` currently requires an "all or nothing" > approach to > implementing NumPy's API. There is no good pathway for **incremental > adoption**, which is particularly problematic for established projects > for which adopting ``__array_function__`` would result in breaking > changes. > - It is no longer possible to use **aliases to NumPy functions** within > modules that support overrides. For example, both CuPy and JAX set > ``result_type = np.result_type``. > - Implementing **fall-back mechanisms** for unimplemented NumPy > functions > by using NumPy's implementation is hard to get right (but see the > `version from dask `_), > because > ``__array_function__`` does not present a consistent interface. > Converting all arguments of array type requires recursing into generic > arguments of the form ``*args, **kwargs``. > > 2. **Limitations on what can be overridden.** ``__array_function__`` has > some > important gaps, most notably array creation and coercion functions: > > - **Array creation** routines (e.g., ``np.arange`` and those in > ``np.random``) need some other mechanism for indicating what type of > arrays to create. `NEP 36 >`_ > proposed adding optional ``like=`` arguments to functions without > existing array arguments. However, we still lack any mechanism to > override methods on objects, such as those needed by > ``np.random.RandomState``. > - **Array conversion** can't reuse the existing coercion functions like > ``np.asarray``, because ``np.asarray`` sometimes means "convert to an > exact ``np.ndarray``" and other times means "convert to something > _like_ > a NumPy array." This led to the `NEP 30 > `_ > proposal for > a separate ``np.duckarray`` function, but this still does not resolve > how > to cast one duck array into a type matching another duck array. 
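A small sketch of the coercion gap described above, assuming dask is installed: ``np.asarray`` always produces a concrete ``numpy.ndarray``, so a library that wants "an array of the same kind as its input" cannot use it. ``np.get_array_module`` in the comment is the function this NEP proposes and does not exist yet.

.. code:: python

    import numpy as np
    import dask.array as da

    x = da.ones((4, 4), chunks=2)
    print(type(np.asarray(x)))   # numpy.ndarray -- the dask graph is computed

    # with this NEP one could write instead:
    #     module = np.get_array_module(x)
    #     module.asarray(x)        # would stay a dask array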
> > ``get_array_module`` and the ``__array_module__`` protocol > ---------------------------------------------------------- > > We propose a new user-facing mechanism for dispatching to a duck-array > implementation, ``numpy.get_array_module``. ``get_array_module`` performs > the > same type resolution as ``__array_function__`` and returns a module with > an API > promised to match the standard interface of ``numpy`` that can implement > operations on all provided array types. > > The protocol itself is both simpler and more powerful than > ``__array_function__``, because it doesn't need to worry about actually > implementing functions. We believe it resolves most of the maintainability > and > functionality limitations of ``__array_function__``. > > The new protocol is opt-in, explicit and with local control; see > :ref:`appendix-design-choices` for discussion on the importance of these > design > features. > > The array module contract > ========================= > > Modules returned by ``get_array_module``/``__array_module__`` should make a > best effort to implement NumPy's core functionality on new array types(s). > Unimplemented functionality should simply be omitted (e.g., accessing an > unimplemented function should raise ``AttributeError``). In the future, we > anticipate codifying a protocol for requesting restricted subsets of > ``numpy``; > see :ref:`requesting-restricted-subsets` for more details. > > How to use ``get_array_module`` > =============================== > > Code that wants to support generic duck arrays should explicitly call > ``get_array_module`` to determine an appropriate array module from which to > call functions, rather than using the ``numpy`` namespace directly. For > example: > > .. code:: python > > # calls the appropriate version of np.something for x and y > module = np.get_array_module(x, y) > module.something(x, y) > > Both array creation and array conversion are supported, because > dispatching is > handled by ``get_array_module`` rather than via the types of function > arguments. For example, to use random number generation functions or > methods, > we can simply pull out the appropriate submodule: > > .. code:: python > > def duckarray_add_random(array): > module = np.get_array_module(array) > noise = module.random.randn(*array.shape) > return array + noise > > We can also write the duck-array ``stack`` function from `NEP 30 > `_, without the > need > for a new ``np.duckarray`` function: > > .. code:: python > > def duckarray_stack(arrays): > module = np.get_array_module(*arrays) > arrays = [module.asarray(arr) for arr in arrays] > shapes = {arr.shape for arr in arrays} > if len(shapes) != 1: > raise ValueError('all input arrays must have the same shape') > expanded_arrays = [arr[module.newaxis, ...] for arr in arrays] > return module.concatenate(expanded_arrays, axis=0) > > By default, ``get_array_module`` will return the ``numpy`` module if no > arguments are arrays. This fall-back can be explicitly controlled by > providing > the ``module`` keyword-only argument. It is also possible to indicate that > an > exception should be raised instead of returning a default array module by > setting ``module=None``. > > How to implement ``__array_module__`` > ===================================== > > Libraries implementing a duck array type that want to support > ``get_array_module`` need to implement the corresponding protocol, > ``__array_module__``. 
This new protocol is based on Python's dispatch > protocol > for arithmetic, and is essentially a simpler version of > ``__array_function__``. > > Only one argument is passed into ``__array_module__``, a Python collection > of > unique array types passed into ``get_array_module``, i.e., all arguments > with > an ``__array_module__`` attribute. > > The special method should either return an namespace with an API matching > ``numpy``, or ``NotImplemented``, indicating that it does not know how to > handle the operation: > > .. code:: python > > class MyArray: > def __array_module__(self, types): > if not all(issubclass(t, MyArray) for t in types): > return NotImplemented > return my_array_module > > Returning custom objects from ``__array_module__`` > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > ``my_array_module`` will typically, but need not always, be a Python > module. > Returning a custom objects (e.g., with functions implemented via > ``__getattr__``) may be useful for some advanced use cases. > > For example, custom objects could allow for partial implementations of duck > array modules that fall-back to NumPy (although this is not recommended in > general because such fall-back behavior can be error prone): > > .. code:: python > > class MyArray: > def __array_module__(self, types): > if all(issubclass(t, MyArray) for t in types): > return ArrayModule() > else: > return NotImplemented > > class ArrayModule: > def __getattr__(self, name): > import base_module > return getattr(base_module, name, getattr(numpy, name)) > > Subclassing from ``numpy.ndarray`` > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > All of the same guidance about well-defined type casting hierarchies from > NEP-18 still applies. ``numpy.ndarray`` itself contains a matching > implementation of ``__array_module__``, which is convenient for > subclasses: > > .. code:: python > > class ndarray: > def __array_module__(self, types): > if all(issubclass(t, ndarray) for t in types): > return numpy > else: > return NotImplemented > > NumPy's internal machinery > ========================== > > The type resolution rules of ``get_array_module`` follow the same model as > Python and NumPy's existing dispatch protocols: subclasses are called > before > super-classes, and otherwise left to right. ``__array_module__`` is > guaranteed > to be called only a single time on each unique type. > > The actual implementation of `get_array_module` will be in C, but should be > equivalent to this Python code: > > .. 
code:: python > > def get_array_module(*arrays, default=numpy): > implementing_arrays, types = _implementing_arrays_and_types(arrays) > if not implementing_arrays and default is not None: > return default > for array in implementing_arrays: > module = array.__array_module__(types) > if module is not NotImplemented: > return module > raise TypeError("no common array module found") > > def _implementing_arrays_and_types(relevant_arrays): > types = [] > implementing_arrays = [] > for array in relevant_arrays: > t = type(array) > if t not in types and hasattr(t, '__array_module__'): > types.append(t) > # Subclasses before superclasses, otherwise left to right > index = len(implementing_arrays) > for i, old_array in enumerate(implementing_arrays): > if issubclass(t, type(old_array)): > index = i > break > implementing_arrays.insert(index, array) > return implementing_arrays, types > > Relationship with ``__array_ufunc__`` and ``__array_function__`` > ---------------------------------------------------------------- > > These older protocols have distinct use-cases and should remain > =============================================================== > > ``__array_module__`` is intended to resolve limitations of > ``__array_function__``, so it is natural to consider whether it could > entirely > replace ``__array_function__``. This would offer dual benefits: (1) > simplifying > the user-story about how to override NumPy and (2) removing the slowdown > associated with checking for dispatch when calling every NumPy function. > > However, ``__array_module__`` and ``__array_function__`` are pretty > different > from a user perspective: it requires explicit calls to > ``get_array_function``, > rather than simply reusing original ``numpy`` functions. This is probably > fine > for *libraries* that rely on duck-arrays, but may be frustratingly verbose > for > interactive use. > > Some of the dispatching use-cases for ``__array_ufunc__`` are also solved > by > ``__array_module__``, but not all of them. For example, it is still useful > to > be able to define non-NumPy ufuncs (e.g., from Numba or SciPy) in a > generic way > on non-NumPy arrays (e.g., with dask.array). > > Given their existing adoption and distinct use cases, we don't think it > makes > sense to remove or deprecate ``__array_function__`` and > ``__array_ufunc__`` at > this time. > > Mixin classes to implement ``__array_function__`` and ``__array_ufunc__`` > ========================================================================= > > Despite the user-facing differences, ``__array_module__`` and a module > implementing NumPy's API still contain sufficient functionality needed to > implement dispatching with the existing duck array protocols. > > For example, the following mixin classes would provide sensible defaults > for > these special methods in terms of ``get_array_module`` and > ``__array_module__``: > > .. code:: python > > class ArrayUfuncFromModuleMixin: > > def __array_ufunc__(self, ufunc, method, *inputs, **kwargs): > arrays = inputs + kwargs.get('out', ()) > try: > array_module = np.get_array_module(*arrays) > except TypeError: > return NotImplemented > > try: > # Note this may have false positive matches, if > ufunc.__name__ > # matches the name of a ufunc defined by NumPy. > Unfortunately > # there is no way to determine in which module a ufunc was > # defined. 
> new_ufunc = getattr(array_module, ufunc.__name__) > except AttributeError: > return NotImplemented > > try: > callable = getattr(new_ufunc, method) > except AttributeError: > return NotImplemented > > return callable(*inputs, **kwargs) > > class ArrayFunctionFromModuleMixin: > > def __array_function__(self, func, types, args, kwargs): > array_module = self.__array_module__(types) > if array_module is NotImplemented: > return NotImplemented > > # Traverse submodules to find the appropriate function > modules = func.__module__.split('.') > assert modules[0] == 'numpy' > for submodule in modules[1:]: > module = getattr(module, submodule, None) > new_func = getattr(module, func.__name__, None) > if new_func is None: > return NotImplemented > > return new_func(*args, **kwargs) > > To make it easier to write duck arrays, we could also add these mixin > classes > into ``numpy.lib.mixins`` (but the examples above may suffice). > > Alternatives considered > ----------------------- > > Naming > ====== > > We like the name ``__array_module__`` because it mirrors the existing > ``__array_function__`` and ``__array_ufunc__`` protocols. Another > reasonable > choice could be ``__array_namespace__``. > > It is less clear what the NumPy function that calls this protocol should be > called (``get_array_module`` in this proposal). Some possible alternatives: > ``array_module``, ``common_array_module``, ``resolve_array_module``, > ``get_namespace``, ``get_numpy``, ``get_numpylike_module``, > ``get_duck_array_module``. > > .. _requesting-restricted-subsets: > > Requesting restricted subsets of NumPy's API > ============================================ > > Over time, NumPy has accumulated a very large API surface, with over 600 > attributes in the top level ``numpy`` module alone. It is unlikely that any > duck array library could or would want to implement all of these functions > and > classes, because the frequently used subset of NumPy is much smaller. > > We think it would be useful exercise to define "minimal" subset(s) of > NumPy's > API, omitting rarely used or non-recommended functionality. For example, > minimal NumPy might include ``stack``, but not the other stacking functions > ``column_stack``, ``dstack``, ``hstack`` and ``vstack``. This could clearly > indicate to duck array authors and users want functionality is core and > what > functionality they can skip. > > Support for requesting a restricted subset of NumPy's API would be a > natural > feature to include in ``get_array_function`` and ``__array_module__``, > e.g., > > .. code:: python > > # array_module is only guaranteed to contain "minimal" NumPy > array_module = np.get_array_module(*arrays, request='minimal') > > To facilitate testing with NumPy and use with any valid duck array library, > NumPy itself would return restricted versions of the ``numpy`` module when > ``get_array_module`` is called only on NumPy arrays. Omitted functions > would > simply not exist. > > Unfortunately, we have not yet figured out what these restricted subsets > should > be, so it doesn't make sense to do this yet. When/if we do, we could > either add > new keyword arguments to ``get_array_module`` or add new top level > functions, > e.g., ``get_minimal_array_module``. We would also need to add either a new > protocol patterned off of ``__array_module__`` (e.g., > ``__array_module_minimal__``), or could add an optional second argument to > ``__array_module__`` (catching errors with ``try``/``except``). 
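One way such a restricted namespace could be exposed, sketched here with a made-up allow-list purely for illustration (the NEP deliberately leaves the actual subset undefined): a thin proxy object on which omitted functions simply do not exist.

.. code:: python

    import numpy

    # hypothetical allow-list; the real "minimal" subset is not yet defined
    _MINIMAL_NAMES = {'asarray', 'stack', 'concatenate', 'mean', 'newaxis'}

    class MinimalNumPy:
        def __getattr__(self, name):
            if name not in _MINIMAL_NAMES:
                raise AttributeError(name)
            return getattr(numpy, name)

    minimal = MinimalNumPy()
    minimal.stack([numpy.zeros(3), numpy.ones(3)])   # works
    # minimal.hstack(...) raises AttributeError, as the contract requires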
> > A new namespace for implicit dispatch > ===================================== > > Instead of supporting overrides in the main `numpy` namespace with > ``__array_function__``, we could create a new opt-in namespace, e.g., > ``numpy.api``, with versions of NumPy functions that support dispatching. > These > overrides would need new opt-in protocols, e.g., ``__array_function_api__`` > patterned off of ``__array_function__``. > > This would resolve the biggest limitations of ``__array_function__`` by > being > opt-in and would also allow for unambiguously overriding functions like > ``asarray``, because ``np.api.asarray`` would always mean "convert an > array-like object." But it wouldn't solve all the dispatching needs met by > ``__array_module__``, and would leave us with supporting a considerably > more > complex protocol both for array users and implementors. > > We could potentially implement such a new namespace *via* the > ``__array_module__`` protocol. Certainly some users would find this > convenient, > because it is slightly less boilerplate. But this would leave users with a > confusing choice: when should they use `get_array_module` vs. > `np.api.something`. Also, we would have to add and maintain a whole new > module, > which is considerably more expensive than merely adding a function. > > Dispatching on both types and arrays instead of only types > ========================================================== > > Instead of supporting dispatch only via unique array types, we could also > support dispatch via array objects, e.g., by passing an ``arrays`` > argument as > part of the ``__array_module__`` protocol. This could potentially be > useful for > dispatch for arrays with metadata, such provided by Dask and Pint, but > would > impose costs in terms of type safety and complexity. > > For example, a library that supports arrays on both CPUs and GPUs might > decide > on which device to create a new arrays from functions like ``ones`` based > on > input arguments: > > .. code:: python > > class Array: > def __array_module__(self, types, arrays): > useful_arrays = tuple(a in arrays if isinstance(a, Array)) > if not useful_arrays: > return NotImplemented > prefer_gpu = any(a.prefer_gpu for a in useful_arrays) > return ArrayModule(prefer_gpu) > > class ArrayModule: > def __init__(self, prefer_gpu): > self.prefer_gpu = prefer_gpu > > def __getattr__(self, name): > import base_module > base_func = getattr(base_module, name) > return functools.partial(base_func, prefer_gpu=self.prefer_gpu) > > This might be useful, but it's not clear if we really need it. Pint seems > to > get along OK without any explicit array creation routines (favoring > multiplication by units, e.g., ``np.ones(5) * ureg.m``), and for the most > part > Dask is also OK with existing ``__array_function__`` style overides (e.g., > favoring ``np.ones_like`` over ``np.ones``). Choosing whether to place an > array > on the CPU or GPU could be solved by `making array creation lazy > `_. > > .. _appendix-design-choices: > > Appendix: design choices for API overrides > ------------------------------------------ > > There is a large range of possible design choices for overriding NumPy's > API. > Here we discuss three major axes of the design decision that guided our > design > for ``__array_module__``. > > Opt-in vs. opt-out for users > ============================ > > The ``__array_ufunc__`` and ``__array_function__`` protocols provide a > mechanism for overriding NumPy functions *within NumPy's existing > namespace*. 
> This means that users need to explicitly opt-out if they do not want any > overridden behavior, e.g., by casting arrays with ``np.asarray()``. > > In theory, this approach lowers the barrier for adopting these protocols in > user code and libraries, because code that uses the standard NumPy > namespace is > automatically compatible. But in practice, this hasn't worked out. For > example, > most well-maintained libraries that use NumPy follow the best practice of > casting all inputs with ``np.asarray()``, which they would have to > explicitly > relax to use ``__array_function__``. Our experience has been that making a > library compatible with a new duck array type typically requires at least a > small amount of work to accommodate differences in the data model and > operations > that can be implemented efficiently. > > These opt-out approaches also considerably complicate backwards > compatibility > for libraries that adopt these protocols, because by opting in as a library > they also opt-in their users, whether they expect it or not. For winning > over > libraries that have been unable to adopt ``__array_function__``, an opt-in > approach seems like a must. > > Explicit vs. implicit choice of implementation > ============================================== > > Both ``__array_ufunc__`` and ``__array_function__`` have implicit control > over > dispatching: the dispatched functions are determined via the appropriate > protocols in every function call. This generalizes well to handling many > different types of objects, as evidenced by its use for implementing > arithmetic > operators in Python, but it has two downsides: > > 1. *Speed*: it imposes additional overhead in every function call, because > each > function call needs to inspect each of its arguments for overrides. > This is > why arithmetic on builtin Python numbers is slow. > 2. *Readability*: it is not longer immediately evident to readers of code > what > happens when a function is called, because the function's implementation > could be overridden by any of its arguments. > > In contrast, importing a new library (e.g., ``import dask.array as da``) > with > an API matching NumPy is entirely explicit. There is no overhead from > dispatch > or ambiguity about which implementation is being used. > > Explicit and implicit choice of implementations are not mutually exclusive > options. Indeed, most implementations of NumPy API overrides via > ``__array_function__`` that we are familiar with (namely, dask, CuPy and > sparse, but not Pint) also include an explicit way to use their version of > NumPy's API by importing a module directly (``dask.array``, ``cupy`` or > ``sparse``, respectively). > > Local vs. non-local vs. global control > ====================================== > > The final design axis is how users control the choice of API: > > - **Local control**, as exemplified by multiple dispatch and Python > protocols for > arithmetic, determines which implementation to use either by checking > types > or calling methods on the direct arguments of a function. > - **Non-local control** such as `np.errstate > < > https://docs.scipy.org/doc/numpy/reference/generated/numpy.errstate.html > >`_ > overrides behavior with global-state via function decorators or > context-managers. Control is determined hierarchically, via the > inner-most > context. > - **Global control** provides a mechanism for users to set default > behavior, > either via function calls or configuration files. 
For example, matplotlib > allows setting a global choice of plotting backend. > > Local control is generally considered a best practice for API design, > because > control flow is entirely explicit, which makes it the easiest to > understand. > Non-local and global control are occasionally used, but generally either > due to > ignorance or a lack of better alternatives. > > In the case of duck typing for NumPy's public API, we think non-local or > global > control would be mistakes, mostly because they **don't compose well**. If > one > library sets/needs one set of overrides and then internally calls a routine > that expects another set of overrides, the resulting behavior may be very > surprising. Higher order functions are especially problematic, because the > context in which functions are evaluated may not be the context in which > they > are defined. > > One class of override use cases where we think non-local and global > control are > appropriate is for choosing a backend system that is guaranteed to have an > entirely consistent interface, such as a faster alternative implementation > of > ``numpy.fft`` on NumPy arrays. However, these are out of scope for the > current > proposal, which is focused on duck arrays. > > _______________________________________________ > NumPy-Discussion mailing listNumPy-Discussion at python.orghttps://mail.python.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Thu Feb 6 15:20:15 2020 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Thu, 06 Feb 2020 12:20:15 -0800 Subject: [Numpy-discussion] NEP 37: A dispatch protocol for NumPy-like modules In-Reply-To: References: <21692339-9f4b-029c-d422-ea549acbe6c3@gmail.com> Message-ID: <1cfce715d48b847e91739c2a56b9750f15b1958f.camel@sipsolutions.net> On Thu, 2020-02-06 at 09:35 -0800, Stephan Hoyer wrote: > On Wed, Feb 5, 2020 at 8:02 AM Andreas Mueller > wrote: > > > - We use scipy.linalg in many places, and we would need to do a > > separate dispatching to check whether we can use module.linalg > > instead > > (that might be an issue for many libraries but I'm not sure). > > > > This brings up a good question -- obviously the final decision here > is up to SciPy maintainers, but how should we encourage SciPy to > support dispatching? > We could pretty easily make __array_function__ cover SciPy by simply > exposing NumPy's internal utilities. SciPy could simply use the > np.array_function_dispatch decorator internally and that would be > enough. Hmmm, in NumPy we can easily force basically 100% of (desired) coverage, i.e. JAX can return a namespace that implements everything. With SciPy that is already muss less feasible, and as you go to domain specific tools it seems implausible. `get_array_module` solves the issue of a library that wants to support all array likes. As long as: * most functions rely only on the NumPy API * the domain specific library is expected to implement support for specific array objects if necessary. E.g. sklearn can include special code for Dask support. Dask does not replace sklearn code. > It is less clear how this could work for __array_module__, because > __array_module__ and get_array_module() are not generic -- they > refers explicitly to a NumPy like module. 
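A minimal sketch of that division of labour, as a downstream library might write it once NEP 37 lands. ``get_array_module`` is stubbed out below because the proposed ``np.get_array_module`` does not exist yet, and the ``_fit_*`` helpers are invented names used only for illustration.

.. code:: python

    import numpy as np

    def get_array_module(*arrays, default=np):
        # stand-in for the proposed np.get_array_module
        for a in arrays:
            method = getattr(type(a), '__array_module__', None)
            if method is not None:
                module = method(a, {type(x) for x in arrays})
                if module is not NotImplemented:
                    return module
        return default

    def _fit_compiled(X):
        # stand-in for a Cython/compiled path that only accepts ndarrays
        return X.mean(axis=0)

    def _fit_generic(X, module):
        # relies only on names promised by the NumPy API contract
        return module.mean(module.asarray(X), axis=0)

    def fit(X):
        module = get_array_module(X)
        if module is np:
            return _fit_compiled(X)        # fast path for plain ndarrays
        return _fit_generic(X, module)     # generic duck-array path

    fit(np.arange(12.0).reshape(3, 4))     # plain ndarray: compiled path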
If we want to extend it to > SciPy (for which I agree there are good use-cases), what should that > look __array_module__` I suppose the question is here, where should the code reside? For SciPy, I agree there is a good reason why you may want to "reverse" the implementation. The code to support JAX arrays, should live inside JAX. One, probably silly, option is to return a "global" namespace, so that: np = get_array_module(*arrays).numpy` We have to distinct issues: Where should e.g. SciPy put a generic implementation (assuming they to provide implementations that only require NumPy-API support to not require overriding)? And, also if a library provides generic support, should we define a standard of how the context/namespace may be passed in/provided? sklearn's main namespace is expected to support many array objects/types, but it could be nice to pass in an already known context/namespace (say scikit-image already found it, and then calls scikit-learn internally). A "generic" namespace may even require this to infer the correct output array object. Another thing about backward compatibility: What is our vision there actually? This NEP will *not* give the *end user* the option to opt-in! Here, opt-in is really reserved to the *library user* (e.g. sklearn). (I did not realize this clearly before) Thinking about that for a bit now, that seems like the right choice. But it also means that the library requires an easy way of giving a FutureWarning, to notify the end-user of the upcoming change. The end- user will easily be able to convert to a NumPy array to keep the old behaviour. Once this warning is given (maybe during `get_array_module()`, the array module object/context would preferably be passed around, hopefully even between libraries. That provides a reasonable way to opt-in to the new behaviour without a warning (mainly for library users, end-users can silence the warning if they wish so). - Sebastian > The obvious choices would be to either add a new protocol, e.g., > __scipy_module__ (but then NumPy needs to know about SciPy), or to > add some sort of "module request" parameter to np.get_array_module(), > to indicate the requested API, e.g., np.get_array_module(*arrays, > matching='scipy'). This is pretty similar to the "default" argument > but would need to get passed into the __array_module__ protocol, too. > > > - Some models have several possible optimization algorithms, some > > of which are pure numpy and some which are Cython. If someone > > provides a different array module, > > we might want to choose an algorithm that is actually supported by > > that module. While this exact issue is maybe sklearn specific, a > > similar issue could appear for most downstream libs that use Cython > > in some places. > > Many Cython algorithms could be implemented in pure numpy with a > > potential slowdown, but once we have NEP 37 there might be a > > benefit to having a pure NumPy implementation as an alternative > > code path. > > > > > > Anyway, NEP 37 seems a great step in the right direction and would > > enable sklearn to actually dispatch in some places. Dispatching > > just based on __array_function__ seems not really feasible so far. > > > > Best, > > Andreas Mueller > > > > > > On 1/6/20 11:29 PM, Stephan Hoyer wrote: > > > I am pleased to present a new NumPy Enhancement Proposal for > > > discussion: "NEP-37: A dispatch protocol for NumPy-like modules." > > > Feedback would be very welcome! > > > > > > The full text follows. 
The rendered proposal can also be found > > > online at https://numpy.org/neps/nep-0037-array-module.html > > > > > > Best, > > > Stephan Hoyer > > > > > > =================================================== > > > NEP 37 ? A dispatch protocol for NumPy-like modules > > > =================================================== > > > > > > :Author: Stephan Hoyer > > > :Author: Hameer Abbasi > > > :Author: Sebastian Berg > > > :Status: Draft > > > :Type: Standards Track > > > :Created: 2019-12-29 > > > > > > Abstract > > > -------- > > > > > > NEP-18's ``__array_function__`` has been a mixed success. Some > > > projects (e.g., > > > dask, CuPy, xarray, sparse, Pint) have enthusiastically adopted > > > it. Others > > > (e.g., PyTorch, JAX, SciPy) have been more reluctant. Here we > > > propose a new > > > protocol, ``__array_module__``, that we expect could eventually > > > subsume most > > > use-cases for ``__array_function__``. The protocol requires > > > explicit adoption > > > by both users and library authors, which ensures backwards > > > compatibility, and > > > is also significantly simpler than ``__array_function__``, both > > > of which we > > > expect will make it easier to adopt. > > > > > > Why ``__array_function__`` hasn't been enough > > > --------------------------------------------- > > > > > > There are two broad ways in which NEP-18 has fallen short of its > > > goals: > > > > > > 1. **Maintainability concerns**. `__array_function__` has > > > significant > > > implications for libraries that use it: > > > > > > - Projects like `PyTorch > > > `_, `JAX > > > `_ and even > > > `scipy.sparse > > > `_ have been > > > reluctant to > > > implement `__array_function__` in part because they are > > > concerned about > > > **breaking existing code**: users expect NumPy functions > > > like > > > ``np.concatenate`` to return NumPy arrays. This is a > > > fundamental > > > limitation of the ``__array_function__`` design, which we > > > chose to allow > > > overriding the existing ``numpy`` namespace. > > > - ``__array_function__`` currently requires an "all or > > > nothing" approach to > > > implementing NumPy's API. There is no good pathway for > > > **incremental > > > adoption**, which is particularly problematic for > > > established projects > > > for which adopting ``__array_function__`` would result in > > > breaking > > > changes. > > > - It is no longer possible to use **aliases to NumPy > > > functions** within > > > modules that support overrides. For example, both CuPy and > > > JAX set > > > ``result_type = np.result_type``. > > > - Implementing **fall-back mechanisms** for unimplemented > > > NumPy functions > > > by using NumPy's implementation is hard to get right (but > > > see the > > > `version from dask < > > > https://github.com/dask/dask/pull/5043>`_), because > > > ``__array_function__`` does not present a consistent > > > interface. > > > Converting all arguments of array type requires recursing > > > into generic > > > arguments of the form ``*args, **kwargs``. > > > > > > 2. **Limitations on what can be overridden.** > > > ``__array_function__`` has some > > > important gaps, most notably array creation and coercion > > > functions: > > > > > > - **Array creation** routines (e.g., ``np.arange`` and those > > > in > > > ``np.random``) need some other mechanism for indicating what > > > type of > > > arrays to create. 
`NEP 36 < > > > https://github.com/numpy/numpy/pull/14715>`_ > > > proposed adding optional ``like=`` arguments to functions > > > without > > > existing array arguments. However, we still lack any > > > mechanism to > > > override methods on objects, such as those needed by > > > ``np.random.RandomState``. > > > - **Array conversion** can't reuse the existing coercion > > > functions like > > > ``np.asarray``, because ``np.asarray`` sometimes means > > > "convert to an > > > exact ``np.ndarray``" and other times means "convert to > > > something _like_ > > > a NumPy array." This led to the `NEP 30 > > > `_ > > > proposal for > > > a separate ``np.duckarray`` function, but this still does > > > not resolve how > > > to cast one duck array into a type matching another duck > > > array. > > > > > > ``get_array_module`` and the ``__array_module__`` protocol > > > ---------------------------------------------------------- > > > > > > We propose a new user-facing mechanism for dispatching to a duck- > > > array > > > implementation, ``numpy.get_array_module``. ``get_array_module`` > > > performs the > > > same type resolution as ``__array_function__`` and returns a > > > module with an API > > > promised to match the standard interface of ``numpy`` that can > > > implement > > > operations on all provided array types. > > > > > > The protocol itself is both simpler and more powerful than > > > ``__array_function__``, because it doesn't need to worry about > > > actually > > > implementing functions. We believe it resolves most of the > > > maintainability and > > > functionality limitations of ``__array_function__``. > > > > > > The new protocol is opt-in, explicit and with local control; see > > > :ref:`appendix-design-choices` for discussion on the importance > > > of these design > > > features. > > > > > > The array module contract > > > ========================= > > > > > > Modules returned by ``get_array_module``/``__array_module__`` > > > should make a > > > best effort to implement NumPy's core functionality on new array > > > types(s). > > > Unimplemented functionality should simply be omitted (e.g., > > > accessing an > > > unimplemented function should raise ``AttributeError``). In the > > > future, we > > > anticipate codifying a protocol for requesting restricted subsets > > > of ``numpy``; > > > see :ref:`requesting-restricted-subsets` for more details. > > > > > > How to use ``get_array_module`` > > > =============================== > > > > > > Code that wants to support generic duck arrays should explicitly > > > call > > > ``get_array_module`` to determine an appropriate array module > > > from which to > > > call functions, rather than using the ``numpy`` namespace > > > directly. For > > > example: > > > > > > .. code:: python > > > > > > # calls the appropriate version of np.something for x and y > > > module = np.get_array_module(x, y) > > > module.something(x, y) > > > > > > Both array creation and array conversion are supported, because > > > dispatching is > > > handled by ``get_array_module`` rather than via the types of > > > function > > > arguments. For example, to use random number generation functions > > > or methods, > > > we can simply pull out the appropriate submodule: > > > > > > .. 
code:: python > > > > > > def duckarray_add_random(array): > > > module = np.get_array_module(array) > > > noise = module.random.randn(*array.shape) > > > return array + noise > > > > > > We can also write the duck-array ``stack`` function from `NEP 30 > > > `_, > > > without the need > > > for a new ``np.duckarray`` function: > > > > > > .. code:: python > > > > > > def duckarray_stack(arrays): > > > module = np.get_array_module(*arrays) > > > arrays = [module.asarray(arr) for arr in arrays] > > > shapes = {arr.shape for arr in arrays} > > > if len(shapes) != 1: > > > raise ValueError('all input arrays must have the same > > > shape') > > > expanded_arrays = [arr[module.newaxis, ...] for arr in > > > arrays] > > > return module.concatenate(expanded_arrays, axis=0) > > > > > > By default, ``get_array_module`` will return the ``numpy`` module > > > if no > > > arguments are arrays. This fall-back can be explicitly controlled > > > by providing > > > the ``module`` keyword-only argument. It is also possible to > > > indicate that an > > > exception should be raised instead of returning a default array > > > module by > > > setting ``module=None``. > > > > > > How to implement ``__array_module__`` > > > ===================================== > > > > > > Libraries implementing a duck array type that want to support > > > ``get_array_module`` need to implement the corresponding > > > protocol, > > > ``__array_module__``. This new protocol is based on Python's > > > dispatch protocol > > > for arithmetic, and is essentially a simpler version of > > > ``__array_function__``. > > > > > > Only one argument is passed into ``__array_module__``, a Python > > > collection of > > > unique array types passed into ``get_array_module``, i.e., all > > > arguments with > > > an ``__array_module__`` attribute. > > > > > > The special method should either return an namespace with an API > > > matching > > > ``numpy``, or ``NotImplemented``, indicating that it does not > > > know how to > > > handle the operation: > > > > > > .. code:: python > > > > > > class MyArray: > > > def __array_module__(self, types): > > > if not all(issubclass(t, MyArray) for t in types): > > > return NotImplemented > > > return my_array_module > > > > > > Returning custom objects from ``__array_module__`` > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > ``my_array_module`` will typically, but need not always, be a > > > Python module. > > > Returning a custom objects (e.g., with functions implemented via > > > ``__getattr__``) may be useful for some advanced use cases. > > > > > > For example, custom objects could allow for partial > > > implementations of duck > > > array modules that fall-back to NumPy (although this is not > > > recommended in > > > general because such fall-back behavior can be error prone): > > > > > > .. code:: python > > > > > > class MyArray: > > > def __array_module__(self, types): > > > if all(issubclass(t, MyArray) for t in types): > > > return ArrayModule() > > > else: > > > return NotImplemented > > > > > > class ArrayModule: > > > def __getattr__(self, name): > > > import base_module > > > return getattr(base_module, name, getattr(numpy, > > > name)) > > > > > > Subclassing from ``numpy.ndarray`` > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > All of the same guidance about well-defined type casting > > > hierarchies from > > > NEP-18 still applies. 
``numpy.ndarray`` itself contains a > > > matching > > > implementation of ``__array_module__``, which is convenient for > > > subclasses: > > > > > > .. code:: python > > > > > > class ndarray: > > > def __array_module__(self, types): > > > if all(issubclass(t, ndarray) for t in types): > > > return numpy > > > else: > > > return NotImplemented > > > > > > NumPy's internal machinery > > > ========================== > > > > > > The type resolution rules of ``get_array_module`` follow the same > > > model as > > > Python and NumPy's existing dispatch protocols: subclasses are > > > called before > > > super-classes, and otherwise left to right. ``__array_module__`` > > > is guaranteed > > > to be called only a single time on each unique type. > > > > > > The actual implementation of `get_array_module` will be in C, but > > > should be > > > equivalent to this Python code: > > > > > > .. code:: python > > > > > > def get_array_module(*arrays, default=numpy): > > > implementing_arrays, types = > > > _implementing_arrays_and_types(arrays) > > > if not implementing_arrays and default is not None: > > > return default > > > for array in implementing_arrays: > > > module = array.__array_module__(types) > > > if module is not NotImplemented: > > > return module > > > raise TypeError("no common array module found") > > > > > > def _implementing_arrays_and_types(relevant_arrays): > > > types = [] > > > implementing_arrays = [] > > > for array in relevant_arrays: > > > t = type(array) > > > if t not in types and hasattr(t, '__array_module__'): > > > types.append(t) > > > # Subclasses before superclasses, otherwise left > > > to right > > > index = len(implementing_arrays) > > > for i, old_array in > > > enumerate(implementing_arrays): > > > if issubclass(t, type(old_array)): > > > index = i > > > break > > > implementing_arrays.insert(index, array) > > > return implementing_arrays, types > > > > > > Relationship with ``__array_ufunc__`` and ``__array_function__`` > > > ---------------------------------------------------------------- > > > > > > These older protocols have distinct use-cases and should remain > > > =============================================================== > > > > > > ``__array_module__`` is intended to resolve limitations of > > > ``__array_function__``, so it is natural to consider whether it > > > could entirely > > > replace ``__array_function__``. This would offer dual benefits: > > > (1) simplifying > > > the user-story about how to override NumPy and (2) removing the > > > slowdown > > > associated with checking for dispatch when calling every NumPy > > > function. > > > > > > However, ``__array_module__`` and ``__array_function__`` are > > > pretty different > > > from a user perspective: it requires explicit calls to > > > ``get_array_function``, > > > rather than simply reusing original ``numpy`` functions. This is > > > probably fine > > > for *libraries* that rely on duck-arrays, but may be > > > frustratingly verbose for > > > interactive use. > > > > > > Some of the dispatching use-cases for ``__array_ufunc__`` are > > > also solved by > > > ``__array_module__``, but not all of them. For example, it is > > > still useful to > > > be able to define non-NumPy ufuncs (e.g., from Numba or SciPy) in > > > a generic way > > > on non-NumPy arrays (e.g., with dask.array). > > > > > > Given their existing adoption and distinct use cases, we don't > > > think it makes > > > sense to remove or deprecate ``__array_function__`` and > > > ``__array_ufunc__`` at > > > this time. 
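(A concrete, self-contained sketch of the resolution order described above -- the
class names and returned strings are purely illustrative, not part of the
proposal:)

.. code:: python

    class DiskArray:
        # toy duck array handled by some hypothetical "diskarray" namespace
        def __array_module__(self, types):
            if all(issubclass(t, DiskArray) for t in types):
                return "diskarray namespace"   # stand-in for a real module
            return NotImplemented

    class ChunkedDiskArray(DiskArray):
        # subclass with its own namespace; consulted before DiskArray
        def __array_module__(self, types):
            if all(issubclass(t, DiskArray) for t in types):
                return "chunked diskarray namespace"
            return NotImplemented

    # With the reference implementation above, the subclass wins regardless of
    # argument order, and each unique type is consulted only once:
    #     get_array_module(DiskArray(), ChunkedDiskArray())
    #         -> "chunked diskarray namespace"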
> > > > > > Mixin classes to implement ``__array_function__`` and > > > ``__array_ufunc__`` > > > ================================================================= > > > ======== > > > > > > Despite the user-facing differences, ``__array_module__`` and a > > > module > > > implementing NumPy's API still contain sufficient functionality > > > needed to > > > implement dispatching with the existing duck array protocols. > > > > > > For example, the following mixin classes would provide sensible > > > defaults for > > > these special methods in terms of ``get_array_module`` and > > > ``__array_module__``: > > > > > > .. code:: python > > > > > > class ArrayUfuncFromModuleMixin: > > > > > > def __array_ufunc__(self, ufunc, method, *inputs, > > > **kwargs): > > > arrays = inputs + kwargs.get('out', ()) > > > try: > > > array_module = np.get_array_module(*arrays) > > > except TypeError: > > > return NotImplemented > > > > > > try: > > > # Note this may have false positive matches, if > > > ufunc.__name__ > > > # matches the name of a ufunc defined by NumPy. > > > Unfortunately > > > # there is no way to determine in which module a > > > ufunc was > > > # defined. > > > new_ufunc = getattr(array_module, ufunc.__name__) > > > except AttributeError: > > > return NotImplemented > > > > > > try: > > > callable = getattr(new_ufunc, method) > > > except AttributeError: > > > return NotImplemented > > > > > > return callable(*inputs, **kwargs) > > > > > > class ArrayFunctionFromModuleMixin: > > > > > > def __array_function__(self, func, types, args, kwargs): > > > array_module = self.__array_module__(types) > > > if array_module is NotImplemented: > > > return NotImplemented > > > > > > # Traverse submodules to find the appropriate > > > function > > > modules = func.__module__.split('.') > > > assert modules[0] == 'numpy' > > > for submodule in modules[1:]: > > > module = getattr(module, submodule, None) > > > new_func = getattr(module, func.__name__, None) > > > if new_func is None: > > > return NotImplemented > > > > > > return new_func(*args, **kwargs) > > > > > > To make it easier to write duck arrays, we could also add these > > > mixin classes > > > into ``numpy.lib.mixins`` (but the examples above may suffice). > > > > > > Alternatives considered > > > ----------------------- > > > > > > Naming > > > ====== > > > > > > We like the name ``__array_module__`` because it mirrors the > > > existing > > > ``__array_function__`` and ``__array_ufunc__`` protocols. Another > > > reasonable > > > choice could be ``__array_namespace__``. > > > > > > It is less clear what the NumPy function that calls this protocol > > > should be > > > called (``get_array_module`` in this proposal). Some possible > > > alternatives: > > > ``array_module``, ``common_array_module``, > > > ``resolve_array_module``, > > > ``get_namespace``, ``get_numpy``, ``get_numpylike_module``, > > > ``get_duck_array_module``. > > > > > > .. _requesting-restricted-subsets: > > > > > > Requesting restricted subsets of NumPy's API > > > ============================================ > > > > > > Over time, NumPy has accumulated a very large API surface, with > > > over 600 > > > attributes in the top level ``numpy`` module alone. It is > > > unlikely that any > > > duck array library could or would want to implement all of these > > > functions and > > > classes, because the frequently used subset of NumPy is much > > > smaller. 
> > > > > > We think it would be useful exercise to define "minimal" > > > subset(s) of NumPy's > > > API, omitting rarely used or non-recommended functionality. For > > > example, > > > minimal NumPy might include ``stack``, but not the other stacking > > > functions > > > ``column_stack``, ``dstack``, ``hstack`` and ``vstack``. This > > > could clearly > > > indicate to duck array authors and users want functionality is > > > core and what > > > functionality they can skip. > > > > > > Support for requesting a restricted subset of NumPy's API would > > > be a natural > > > feature to include in ``get_array_function`` and > > > ``__array_module__``, e.g., > > > > > > .. code:: python > > > > > > # array_module is only guaranteed to contain "minimal" NumPy > > > array_module = np.get_array_module(*arrays, > > > request='minimal') > > > > > > To facilitate testing with NumPy and use with any valid duck > > > array library, > > > NumPy itself would return restricted versions of the ``numpy`` > > > module when > > > ``get_array_module`` is called only on NumPy arrays. Omitted > > > functions would > > > simply not exist. > > > > > > Unfortunately, we have not yet figured out what these restricted > > > subsets should > > > be, so it doesn't make sense to do this yet. When/if we do, we > > > could either add > > > new keyword arguments to ``get_array_module`` or add new top > > > level functions, > > > e.g., ``get_minimal_array_module``. We would also need to add > > > either a new > > > protocol patterned off of ``__array_module__`` (e.g., > > > ``__array_module_minimal__``), or could add an optional second > > > argument to > > > ``__array_module__`` (catching errors with ``try``/``except``). > > > > > > A new namespace for implicit dispatch > > > ===================================== > > > > > > Instead of supporting overrides in the main `numpy` namespace > > > with > > > ``__array_function__``, we could create a new opt-in namespace, > > > e.g., > > > ``numpy.api``, with versions of NumPy functions that support > > > dispatching. These > > > overrides would need new opt-in protocols, e.g., > > > ``__array_function_api__`` > > > patterned off of ``__array_function__``. > > > > > > This would resolve the biggest limitations of > > > ``__array_function__`` by being > > > opt-in and would also allow for unambiguously overriding > > > functions like > > > ``asarray``, because ``np.api.asarray`` would always mean > > > "convert an > > > array-like object." But it wouldn't solve all the dispatching > > > needs met by > > > ``__array_module__``, and would leave us with supporting a > > > considerably more > > > complex protocol both for array users and implementors. > > > > > > We could potentially implement such a new namespace *via* the > > > ``__array_module__`` protocol. Certainly some users would find > > > this convenient, > > > because it is slightly less boilerplate. But this would leave > > > users with a > > > confusing choice: when should they use `get_array_module` vs. > > > `np.api.something`. Also, we would have to add and maintain a > > > whole new module, > > > which is considerably more expensive than merely adding a > > > function. 
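(A rough sketch of the "optional second argument plus ``try``/``except``" idea
mentioned above -- the function name, the ``request`` keyword and the calling
convention are hypothetical, not part of the proposal, and the subclass-ordering
refinement is omitted for brevity:)

.. code:: python

    def get_array_module_with_request(*arrays, request='minimal', default=None):
        # collect the unique types that participate in the protocol
        types = [type(a) for a in arrays if hasattr(type(a), '__array_module__')]
        types = list(dict.fromkeys(types))   # de-duplicate, preserving order
        for array in arrays:
            if not hasattr(type(array), '__array_module__'):
                continue
            try:
                # newer implementations could accept a restriction request...
                module = array.__array_module__(types, request)
            except TypeError:
                # ...while existing single-argument implementations keep working
                module = array.__array_module__(types)
            if module is not NotImplemented:
                return module
        if default is not None:
            return default
        raise TypeError("no common array module found")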
> > > > > > Dispatching on both types and arrays instead of only types > > > ========================================================== > > > > > > Instead of supporting dispatch only via unique array types, we > > > could also > > > support dispatch via array objects, e.g., by passing an > > > ``arrays`` argument as > > > part of the ``__array_module__`` protocol. This could potentially > > > be useful for > > > dispatch for arrays with metadata, such provided by Dask and > > > Pint, but would > > > impose costs in terms of type safety and complexity. > > > > > > For example, a library that supports arrays on both CPUs and GPUs > > > might decide > > > on which device to create a new arrays from functions like > > > ``ones`` based on > > > input arguments: > > > > > > .. code:: python > > > > > > class Array: > > > def __array_module__(self, types, arrays): > > > useful_arrays = tuple(a in arrays if isinstance(a, > > > Array)) > > > if not useful_arrays: > > > return NotImplemented > > > prefer_gpu = any(a.prefer_gpu for a in useful_arrays) > > > return ArrayModule(prefer_gpu) > > > > > > class ArrayModule: > > > def __init__(self, prefer_gpu): > > > self.prefer_gpu = prefer_gpu > > > > > > def __getattr__(self, name): > > > import base_module > > > base_func = getattr(base_module, name) > > > return functools.partial(base_func, > > > prefer_gpu=self.prefer_gpu) > > > > > > This might be useful, but it's not clear if we really need it. > > > Pint seems to > > > get along OK without any explicit array creation routines > > > (favoring > > > multiplication by units, e.g., ``np.ones(5) * ureg.m``), and for > > > the most part > > > Dask is also OK with existing ``__array_function__`` style > > > overides (e.g., > > > favoring ``np.ones_like`` over ``np.ones``). Choosing whether to > > > place an array > > > on the CPU or GPU could be solved by `making array creation lazy > > > `_. > > > > > > .. _appendix-design-choices: > > > > > > Appendix: design choices for API overrides > > > ------------------------------------------ > > > > > > There is a large range of possible design choices for overriding > > > NumPy's API. > > > Here we discuss three major axes of the design decision that > > > guided our design > > > for ``__array_module__``. > > > > > > Opt-in vs. opt-out for users > > > ============================ > > > > > > The ``__array_ufunc__`` and ``__array_function__`` protocols > > > provide a > > > mechanism for overriding NumPy functions *within NumPy's existing > > > namespace*. > > > This means that users need to explicitly opt-out if they do not > > > want any > > > overridden behavior, e.g., by casting arrays with > > > ``np.asarray()``. > > > > > > In theory, this approach lowers the barrier for adopting these > > > protocols in > > > user code and libraries, because code that uses the standard > > > NumPy namespace is > > > automatically compatible. But in practice, this hasn't worked > > > out. For example, > > > most well-maintained libraries that use NumPy follow the best > > > practice of > > > casting all inputs with ``np.asarray()``, which they would have > > > to explicitly > > > relax to use ``__array_function__``. Our experience has been that > > > making a > > > library compatible with a new duck array type typically requires > > > at least a > > > small amount of work to accommodate differences in the data model > > > and operations > > > that can be implemented efficiently. 
> > > > > > These opt-out approaches also considerably complicate backwards > > > compatibility > > > for libraries that adopt these protocols, because by opting in as > > > a library > > > they also opt-in their users, whether they expect it or not. For > > > winning over > > > libraries that have been unable to adopt ``__array_function__``, > > > an opt-in > > > approach seems like a must. > > > > > > Explicit vs. implicit choice of implementation > > > ============================================== > > > > > > Both ``__array_ufunc__`` and ``__array_function__`` have implicit > > > control over > > > dispatching: the dispatched functions are determined via the > > > appropriate > > > protocols in every function call. This generalizes well to > > > handling many > > > different types of objects, as evidenced by its use for > > > implementing arithmetic > > > operators in Python, but it has two downsides: > > > > > > 1. *Speed*: it imposes additional overhead in every function > > > call, because each > > > function call needs to inspect each of its arguments for > > > overrides. This is > > > why arithmetic on builtin Python numbers is slow. > > > 2. *Readability*: it is not longer immediately evident to readers > > > of code what > > > happens when a function is called, because the function's > > > implementation > > > could be overridden by any of its arguments. > > > > > > In contrast, importing a new library (e.g., ``import dask.array > > > as da``) with > > > an API matching NumPy is entirely explicit. There is no overhead > > > from dispatch > > > or ambiguity about which implementation is being used. > > > > > > Explicit and implicit choice of implementations are not mutually > > > exclusive > > > options. Indeed, most implementations of NumPy API overrides via > > > ``__array_function__`` that we are familiar with (namely, dask, > > > CuPy and > > > sparse, but not Pint) also include an explicit way to use their > > > version of > > > NumPy's API by importing a module directly (``dask.array``, > > > ``cupy`` or > > > ``sparse``, respectively). > > > > > > Local vs. non-local vs. global control > > > ====================================== > > > > > > The final design axis is how users control the choice of API: > > > > > > - **Local control**, as exemplified by multiple dispatch and > > > Python protocols for > > > arithmetic, determines which implementation to use either by > > > checking types > > > or calling methods on the direct arguments of a function. > > > - **Non-local control** such as `np.errstate > > > < > > > https://docs.scipy.org/doc/numpy/reference/generated/numpy.errstate.html>`_ > > > overrides behavior with global-state via function decorators or > > > context-managers. Control is determined hierarchically, via the > > > inner-most > > > context. > > > - **Global control** provides a mechanism for users to set > > > default behavior, > > > either via function calls or configuration files. For example, > > > matplotlib > > > allows setting a global choice of plotting backend. > > > > > > Local control is generally considered a best practice for API > > > design, because > > > control flow is entirely explicit, which makes it the easiest to > > > understand. > > > Non-local and global control are occasionally used, but generally > > > either due to > > > ignorance or a lack of better alternatives. 
> > > > > > In the case of duck typing for NumPy's public API, we think non- > > > local or global > > > control would be mistakes, mostly because they **don't compose > > > well**. If one > > > library sets/needs one set of overrides and then internally calls > > > a routine > > > that expects another set of overrides, the resulting behavior may > > > be very > > > surprising. Higher order functions are especially problematic, > > > because the > > > context in which functions are evaluated may not be the context > > > in which they > > > are defined. > > > > > > One class of override use cases where we think non-local and > > > global control are > > > appropriate is for choosing a backend system that is guaranteed > > > to have an > > > entirely consistent interface, such as a faster alternative > > > implementation of > > > ``numpy.fft`` on NumPy arrays. However, these are out of scope > > > for the current > > > proposal, which is focused on duck arrays. > > > > > > > > > _______________________________________________ > > > NumPy-Discussion mailing list > > > NumPy-Discussion at python.org > > > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at python.org > > https://mail.python.org/mailman/listinfo/numpy-discussion > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From lxndr.rvs at gmail.com Mon Feb 10 21:07:40 2020 From: lxndr.rvs at gmail.com (Alexander Reeves) Date: Mon, 10 Feb 2020 18:07:40 -0800 Subject: [Numpy-discussion] Proposal - extend histograms api to allow uneven bins Message-ID: Greetings, I have a PR that warrants discussion according to @seberg. See https://github.com/numpy/numpy/pull/14278. It is an enhancement that fixes a bug. The original bug is that when using the fd estimator on a dataset with small inter-quartile range and large outliers, the current codebase produces more bins than memory allows. There are several related bug reports (see #11879, #10297, #8203). In terms of scope, I restricted my changes to conditions where np.histogram(bins='auto') defaults to the 'fd'. For the actual fix, I actually enhanced the API. I used a suggestion from @eric-wieser to merge empty histogram bins. In practice this solves the outsized bins issue. However @seberg is concerned that extending the API in this way may not be the way to go. For example, if you use "auto" once, and then re-use the bins, the uneven bins may not be what you want. Furthermore @eric-wieser is concerned that there may be a floating-point devil in the details. He advocates using the hypothesis testing package to increase our confidence that the current implementation adequately handles corner cases. I would like to do my part in improving the code base. I don't have strong opinions but I have to admit that I would like to eventually make a PR that resolves these bugs. This has been a PR half a year in the making after all. Thoughts? -areeves87 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ralf.gommers at gmail.com Tue Feb 11 00:16:44 2020 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Mon, 10 Feb 2020 23:16:44 -0600 Subject: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics In-Reply-To: References: Message-ID: On Tue, Feb 4, 2020 at 2:00 PM Hameer Abbasi wrote: > ?snip? > > > 1) Once NumPy adds the framework and initial set of Universal Intrinsic, > if contributors want to leverage a new architecture specific SIMD > instruction, will they be expected to add software implementation of this > instruction for all other architectures too? > > In my opinion, if the instructions are lower, then yes. For example, one > cannot add AVX-512 without adding, for example adding AVX-256 and AVX-128 > and SSE*. However, I would not expect one person or team to be an expert > in all assemblies, so intrinsics for one architecture can be developed > independently of another. > I think this doesn't quite answer the question. If I understand correctly, it's about a single instruction (e.g. one needs "VEXP2PD" and it's missing from the supported AVX512 instructions in master). I think the answer is yes, it needs to be added for other architectures as well. Otherwise, if universal intrinsics are added ad-hoc and there's no guarantee that a universal instruction is available for all main supported platforms, then over time there won't be much that's "universal" about the framework. This is a different question though from adding a new ufunc implementation. I would expect accelerating ufuncs via intrinsics that are already supported to be much more common than having to add new intrinsics. Does that sound right? > > 2) On whom does the burden lie to ensure that new implementations are > benchmarked and shows benefits on every architecture? What happens if > optimizing an Ufunc leads to improving performance on one architecture and > worsens performance on another? > This is slightly hard to provide a recipe for. I suspect it may take a while before this becomes an issue, since we don't have much SIMD code to begin with. So adding new code with benchmarks will likely show improvements on all architectures (we should ensure benchmarks can be run via CI, otherwise it's too onerous). And if not and it's not easily fixable, the problematic platform could be skipped so performance there is unchanged. Only once there's existing universal intrinsics and then they're tweaked will we have to be much more careful I'd think. Cheers, Ralf > > I would look at this from a maintainability point of view. If we are > increasing the code size by 20% for a certain ufunc, there must be a > domonstrable 20% increase in performance on any CPU. That is to say, > micro-optimisation will be unwelcome, and code readability will be > preferable. Usually we ask the submitter of the PR to test the PR with a > machine they have on hand, and I would be inclined to keep this trend of > self-reporting. Of course, if someone else came along and reported a > performance regression of, say, 10%, then we have increased code by 20%, > with only a net 5% gain in performance, and the PR will have to be reverted. > > ?snip? > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From matti.picus at gmail.com Tue Feb 11 01:53:18 2020 From: matti.picus at gmail.com (Matti Picus) Date: Tue, 11 Feb 2020 08:53:18 +0200 Subject: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics In-Reply-To: References: Message-ID: <11f71397-0bac-aaba-3ee9-133ad098b894@gmail.com> On 11/2/20 7:16 am, Ralf Gommers wrote: > > > On Tue, Feb 4, 2020 at 2:00 PM Hameer Abbasi > > wrote: > > ?snip? > > > 1) Once NumPy adds the framework and initial set of Universal Intrinsic, if > contributors want to leverage a new architecture specific SIMD > instruction, will they be expected to add software implementation > of this instruction for all other architectures too? > > In my opinion, if the instructions are lower, then yes. For > example, one cannot add AVX-512 without adding, for example adding > AVX-256 and AVX-128 and SSE*.? However, I would not expect one > person or team to be an expert in all assemblies, so intrinsics > for one architecture can be developed independently of another. > > > I think this doesn't quite answer the question. If I understand > correctly, it's about a single instruction (e.g. one needs > |"VEXP2PD"and it's missing from the supported AVX512 instructions in > master). I think the answer is yes, it needs to be added for other > architectures as well. Otherwise, if universal intrinsics are added > ad-hoc and there's no guarantee that a universal instruction is > available for all main supported platforms, then over time there won't > be much that's "universal" about the framework.| > | > | > |This is a different question though from adding a new ufunc > implementation. I would expect accelerating ufuncs via intrinsics that > are already supported to be much more common than having to add new > intrinsics. Does that sound right? > | |Yes. Universal intrinsics are cross-platform. However the NEP is open to the possibility that certain architectures may have SIMD intrinsics that cannot be expressed in terms of intrinsics for other platforms, and so there may be a use case for architecture-specific loops. This is explicitly stated in the latest PR to the NEP: "|If the regression is not minimal, we may choose to keep the X86-specific code for that platform and use the universal intrisic code for other platforms." > | > | > > > > 2) On whom does the burden lie to ensure that new > implementations are benchmarked and shows benefits on every > architecture? What happens if optimizing an Ufunc leads to > improving performance on one architecture and worsens performance > on another? > > > This is slightly hard to provide a recipe for. I suspect it may take a > while before this becomes an issue, since we don't have much SIMD code > to begin with. So adding new code with benchmarks will likely show > improvements on all architectures (we should ensure benchmarks can be > run via CI, otherwise it's too onerous). And if not and it's not > easily fixable, the problematic platform could be skipped so > performance there is unchanged. 
On HEAD, out of the 89 ufuncs in numpy.core.code_generators.generate_umath.defdict, 34 have X86-specific simd loops: >>> [x for x in defdict.keys() if any([td.simd for td in defdict[x].type_descriptions])] ['add', 'subtract', 'multiply', 'conjugate', 'square', 'reciprocal', 'absolute', 'negative', 'greater', 'greater_equal', 'less', 'less_equal', 'equal', 'not_equal', 'logical_and', 'logical_not', 'logical_or', 'maximum', 'minimum', 'bitwise_and', 'bitwise_or', 'bitwise_xor', 'invert', 'left_shift', 'right_shift', 'cos', 'sin', 'exp', 'log', 'sqrt', 'ceil', 'trunc', 'floor', 'rint'] They would be the first targets for universal intrinsics. Of them I estimate that the ones with more than one loop for at least one dtype signature would be the most difficult, since these have different optimizations for avx2, fma, and/or avx512f: ['square', 'reciprocal', 'absolute', 'cos', 'sin', 'exp', 'log', 'sqrt', 'ceil', 'trunc', 'floor', 'rint'] The other 55 ufuncs, for completeness, are ['floor_divide', 'true_divide', 'fmod', '_ones_like', 'power', 'float_power', '_arg', 'positive', 'sign', 'logical_xor', 'clip', 'fmax', 'fmin', 'logaddexp', 'logaddexp2', 'heaviside', 'degrees', 'rad2deg', 'radians', 'deg2rad', 'arccos', 'arccosh', 'arcsin', 'arcsinh', 'arctan', 'arctanh', 'tan', 'cosh', 'sinh', 'tanh', 'exp2', 'expm1', 'log2', 'log10', 'log1p', 'cbrt', 'fabs', 'arctan2', 'remainder', 'divmod', 'hypot', 'isnan', 'isnat', 'isinf', 'isfinite', 'signbit', 'copysign', 'nextafter', 'spacing', 'modf', 'ldexp', 'frexp', 'gcd', 'lcm', 'matmul'] As for testing accuracy: we recently added a framework for testing ulp variation of ufuncs against "golden results" in numpy/core/tests/test_umath_accuracy. So far float32 is tested for exp, log, cos, sin. Others may be tested elsewhere by specific tests, for instance numpy/core/test/test_half.py has test_half_ufuncs. It is difficult to do benchmarking on CI: the machines that run CI vary too much. We would need to set aside a machine for this and carefully set it up to keep CPU speed and temperature constant. We do have benchmarks for ufuncs (they could always be improved). I think Pauli runs the benchmarks carefully on X86, and may even makes the results public, but that resource is not really on PR reviewers' radar. We could run benchmarks on the gcc build farm machines for other architectures. Those machines are shared but not heavily utilized. > Only once there's existing universal intrinsics and then they're > tweaked will we have to be much more careful I'd think. > > > > I would look at this from a maintainability point of view. If we > are increasing the code size by 20% for a certain ufunc, there > must be a domonstrable 20% increase in performance on any CPU. > That is to say, micro-optimisation will be unwelcome, and code > readability will be preferable. Usually we ask the submitter of > the PR to test the PR with a machine they have on hand, and I > would be inclined to keep this trend of self-reporting. Of course, > if someone else came along and reported a performance regression > of, say, 10%, then we have increased code by 20%, with only a net > 5% gain in performance, and the PR will have to be reverted. > > ?snip? > I think we should be careful not to increase the reviewer burden, and try to automate as much as possible. It would be nice if we could at some point set up a set of bots that can be triggered to run benchmarks for us and report in the PR the results. 
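As an aside, a rough sketch of the kind of ULP comparison used in the
accuracy-testing paragraph above (illustrative only, not the actual
test_umath_accuracy code):

    import numpy as np

    def max_ulp_error(actual, expected):
        # approximate ULP distance: absolute error divided by the spacing of
        # representable floats at the expected value's magnitude
        diff = np.abs(actual.astype(np.float64) - expected.astype(np.float64))
        return (diff / np.spacing(np.abs(expected))).max()

    x = np.linspace(0.01, 10.0, 10_000, dtype=np.float32)
    golden = np.exp(x.astype(np.float64)).astype(np.float32)   # reference values
    print(max_ulp_error(np.exp(x), golden))   # expected to be at most a few ULP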
Matti From raghuveer.devulapalli at intel.com Tue Feb 11 13:02:09 2020 From: raghuveer.devulapalli at intel.com (Devulapalli, Raghuveer) Date: Tue, 11 Feb 2020 18:02:09 +0000 Subject: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics In-Reply-To: References: Message-ID: >> I think this doesn't quite answer the question. If I understand correctly, it's about a single instruction (e.g. one needs "VEXP2PD" and it's missing from the supported AVX512 instructions in master). I think the answer is yes, it needs to be added for other architectures as well. That adds a lot of overhead to write SIMD based optimizations which can discourage contributors. It?s also an unreasonable expectation that a developer be familiar with SIMD of all the architectures. On top of that the performance implications aren?t clear. Software implementations of hardware instructions might perform worse and might not even produce the same result. From: NumPy-Discussion On Behalf Of Ralf Gommers Sent: Monday, February 10, 2020 9:17 PM To: Discussion of Numerical Python Subject: Re: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics On Tue, Feb 4, 2020 at 2:00 PM Hameer Abbasi > wrote: ?snip? > 1) Once NumPy adds the framework and initial set of Universal Intrinsic, if contributors want to leverage a new architecture specific SIMD instruction, will they be expected to add software implementation of this instruction for all other architectures too? In my opinion, if the instructions are lower, then yes. For example, one cannot add AVX-512 without adding, for example adding AVX-256 and AVX-128 and SSE*. However, I would not expect one person or team to be an expert in all assemblies, so intrinsics for one architecture can be developed independently of another. I think this doesn't quite answer the question. If I understand correctly, it's about a single instruction (e.g. one needs "VEXP2PD" and it's missing from the supported AVX512 instructions in master). I think the answer is yes, it needs to be added for other architectures as well. Otherwise, if universal intrinsics are added ad-hoc and there's no guarantee that a universal instruction is available for all main supported platforms, then over time there won't be much that's "universal" about the framework. This is a different question though from adding a new ufunc implementation. I would expect accelerating ufuncs via intrinsics that are already supported to be much more common than having to add new intrinsics. Does that sound right? > 2) On whom does the burden lie to ensure that new implementations are benchmarked and shows benefits on every architecture? What happens if optimizing an Ufunc leads to improving performance on one architecture and worsens performance on another? This is slightly hard to provide a recipe for. I suspect it may take a while before this becomes an issue, since we don't have much SIMD code to begin with. So adding new code with benchmarks will likely show improvements on all architectures (we should ensure benchmarks can be run via CI, otherwise it's too onerous). And if not and it's not easily fixable, the problematic platform could be skipped so performance there is unchanged. Only once there's existing universal intrinsics and then they're tweaked will we have to be much more careful I'd think. Cheers, Ralf I would look at this from a maintainability point of view. If we are increasing the code size by 20% for a certain ufunc, there must be a domonstrable 20% increase in performance on any CPU. 
That is to say, micro-optimisation will be unwelcome, and code readability will be preferable. Usually we ask the submitter of the PR to test the PR with a machine they have on hand, and I would be inclined to keep this trend of self-reporting. Of course, if someone else came along and reported a performance regression of, say, 10%, then we have increased code by 20%, with only a net 5% gain in performance, and the PR will have to be reverted. ?snip? _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion at python.org https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Tue Feb 11 13:09:20 2020 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Tue, 11 Feb 2020 10:09:20 -0800 Subject: [Numpy-discussion] NumPy Development Meeting - Triage Focus Message-ID: Hi all, Our bi-weekly triage-focused NumPy development meeting is tomorrow (Wednesday, Februrary 12) at 11 am Pacific Time. Everyone is invited to join in and edit the work-in-progress meeting topics and notes: https://hackmd.io/68i_JvOYQfy9ERiHgXMPvg I encourage everyone to notify us of issues or PRs that you feel should be prioritized or simply discussed briefly. Just comment on it so we can label it, or add your PR/issue to this weeks topics for discussion. Best regards, Sebastian -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From rossbar15 at gmail.com Tue Feb 11 16:12:13 2020 From: rossbar15 at gmail.com (Ross Barnowski) Date: Tue, 11 Feb 2020 13:12:13 -0800 Subject: [Numpy-discussion] Proposal - extend histograms api to allow uneven bins In-Reply-To: References: Message-ID: Just a few thoughts re: the changes proposed in https://github.com/numpy/numpy/pull/14278 1. Though the PR is limited to the 'auto' kwarg, the issue of potential memory problems for the automated binning methods is a more general one (e.g. #15332 ). 2. The main concern that jumps out to me is downstream users who are relying on the implicit assumption of regular binning. This is of course bad practice and makes even less sense when using one of the bin estimators, so I'm not sure how big of a concern it is. However, there is likely downstream user code that relies on the regular binning assumption, especially since, as far as I know, NumPy has never implemented binning techniques that return irregular bins. 3. The astropy project have at least one estimator that returns irregular bins . I checked for issues related to irregular binning: though they have many of the same problems with the automatic bin estimators (i.e. memory problems for inputs with outliers), I didn't see anything specifically related to irregular binning I just wanted to add my two cents. The binning-data-with-outliers problem is very common in high-resolution spectroscopy, and I have seen practitioners rely on the assumption of regular binning (e.g. divide the `range` by the number of bins) to specify bin centers even though this is not the right way to do things. Thanks for taking the time to write up your work! 
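To make the regular-binning assumption in point 2 concrete, here is an
illustrative snippet (not code from the PR): reconstructing bin centers from
the `range` and the bin count quietly assumes evenly spaced bins, while
computing them from the returned edges stays correct even if an estimator ever
produces uneven bins:

    import numpy as np

    data = np.random.default_rng(0).normal(size=1000)
    counts, edges = np.histogram(data, bins='auto')

    # downstream pattern that quietly assumes evenly spaced bins
    width = (data.max() - data.min()) / len(counts)
    centers_assumed = data.min() + width * (np.arange(len(counts)) + 0.5)

    # pattern that works for any monotonically increasing edges, even or uneven
    centers_from_edges = 0.5 * (edges[:-1] + edges[1:])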
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From ralf.gommers at gmail.com Tue Feb 11 16:33:19 2020 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Tue, 11 Feb 2020 15:33:19 -0600 Subject: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics In-Reply-To: References: Message-ID: On Tue, Feb 11, 2020 at 12:03 PM Devulapalli, Raghuveer < raghuveer.devulapalli at intel.com> wrote: > >> I think this doesn't quite answer the question. If I understand > correctly, it's about a single instruction (e.g. one needs "VEXP2PD" and > it's missing from the supported AVX512 instructions in master). I think > the answer is yes, it needs to be added for other architectures as well. > > > > That adds a lot of overhead to write SIMD based optimizations which can > discourage contributors. > Keep in mind that a new universal intrinsics instruction is just a bunch of defines. That is way less work than writing a ufunc that uses that instruction. We can also ping a platform expert in case it's not obvious what the corresponding arch-specific instruction is - that's a bit of a chicken-and-egg problem; once we get going we hopefully get more interested people that can help each other out. > It?s also an unreasonable expectation that a developer be familiar with > SIMD of all the architectures. On top of that the performance implications > aren?t clear. Software implementations of hardware instructions might > perform worse and might not even produce the same result. > I think you are worrying about writing ufuncs here, not about adding an instruction. If the same result is not produced, we have CI that should fail - and if it does, we can deal with that by (if it's not easy to figure out) making that platform fall back to the generic non-SIMD version of the ufunc. Cheers, Ralf > > > *From:* NumPy-Discussion intel.com at python.org> *On Behalf Of *Ralf Gommers > *Sent:* Monday, February 10, 2020 9:17 PM > *To:* Discussion of Numerical Python > *Subject:* Re: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics > > > > > > > > On Tue, Feb 4, 2020 at 2:00 PM Hameer Abbasi > wrote: > > ?snip? > > > > > 1) Once NumPy adds the framework and initial set of Universal Intrinsic, > if contributors want to leverage a new architecture specific SIMD > instruction, will they be expected to add software implementation of this > instruction for all other architectures too? > > > > In my opinion, if the instructions are lower, then yes. For example, one > cannot add AVX-512 without adding, for example adding AVX-256 and AVX-128 > and SSE*. However, I would not expect one person or team to be an expert > in all assemblies, so intrinsics for one architecture can be developed > independently of another. > > > > I think this doesn't quite answer the question. If I understand correctly, > it's about a single instruction (e.g. one needs "VEXP2PD" and it's > missing from the supported AVX512 instructions in master). I think the > answer is yes, it needs to be added for other architectures as well. > Otherwise, if universal intrinsics are added ad-hoc and there's no > guarantee that a universal instruction is available for all main supported > platforms, then over time there won't be much that's "universal" about the > framework. > > > > This is a different question though from adding a new ufunc > implementation. I would expect accelerating ufuncs via intrinsics that are > already supported to be much more common than having to add new intrinsics. > Does that sound right? 
> > > > > > 2) On whom does the burden lie to ensure that new implementations are > benchmarked and shows benefits on every architecture? What happens if > optimizing an Ufunc leads to improving performance on one architecture and > worsens performance on another? > > > > This is slightly hard to provide a recipe for. I suspect it may take a > while before this becomes an issue, since we don't have much SIMD code to > begin with. So adding new code with benchmarks will likely show > improvements on all architectures (we should ensure benchmarks can be run > via CI, otherwise it's too onerous). And if not and it's not easily > fixable, the problematic platform could be skipped so performance there is > unchanged. > > > > Only once there's existing universal intrinsics and then they're tweaked > will we have to be much more careful I'd think. > > > > Cheers, > > Ralf > > > > > > > > I would look at this from a maintainability point of view. If we are > increasing the code size by 20% for a certain ufunc, there must be a > domonstrable 20% increase in performance on any CPU. That is to say, > micro-optimisation will be unwelcome, and code readability will be > preferable. Usually we ask the submitter of the PR to test the PR with a > machine they have on hand, and I would be inclined to keep this trend of > self-reporting. Of course, if someone else came along and reported a > performance regression of, say, 10%, then we have increased code by 20%, > with only a net 5% gain in performance, and the PR will have to be reverted. > > > > ?snip? > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matti.picus at gmail.com Wed Feb 12 02:19:14 2020 From: matti.picus at gmail.com (Matti Picus) Date: Wed, 12 Feb 2020 09:19:14 +0200 Subject: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics In-Reply-To: References: Message-ID: <5475fdf3-b435-2266-4c0e-b45b38ebe21e@gmail.com> On 11/2/20 8:02 pm, Devulapalli, Raghuveer wrote: > > On top of that the performance implications aren?t clear. Software > implementations of hardware instructions might perform worse and might > not even produce the same result. > The proposal for universal intrinsics does not enable replacing an intrinsic on one platform with a software emulation on another: the intrinsics are meant to be compile-time defines that overlay the universal intrinsic with a platform specific one. In order to use a new intrinsic, it must have parallel intrinsics on the other platforms, or cannot be used there: "NPY_CPU_HAVE(FEATURE_NAME)" will always return false so the compiler will not even build a loop for that platform. I will try to clarify that intention in the NEP. I hope there will not be a demand to use many non-universal intrinsics in ufuncs, we will need to work this out on a case-by-case basis in each ufunc. Does that sound reasonable? Are there intrinsics you have already used that have no parallel on other platforms? 
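As an editorial aside: the two gates Matti describes (a loop is only built if every universal intrinsic it needs has a platform-specific parallel, and the built loops are then narrowed by the CPU features detected at run time) can be made concrete with a toy model. The sketch below is plain Python rather than the actual C-preprocessor machinery, and every loop name, feature name, and intrinsic name in it is purely illustrative, not taken from NumPy's sources.

```python
# Toy model of the gating described above, not the real implementation.
platform_intrinsics = {"load", "store", "add", "multiply"}   # what this arch can map to
runtime_features = {"baseline", "AVX2"}                      # what the running CPU reports

candidate_loops = [
    {"name": "baseline C loop", "feature": "baseline", "needs": set()},
    {"name": "AVX2 loop",       "feature": "AVX2",     "needs": {"load", "add", "store"}},
    {"name": "AVX512F loop",    "feature": "AVX512F",  "needs": {"load", "getexp", "store"}},
]

# "Compile time": drop any loop whose universal intrinsics have no parallel
# on this platform (here the hypothetical "getexp" has none, so that loop
# is simply never built).
built = [l for l in candidate_loops if l["needs"] <= platform_intrinsics]

# "Run time": of the loops that were built, keep those the CPU supports and
# pick the last, i.e. most specialized, one (the list is ordered that way).
usable = [l for l in built if l["feature"] in runtime_features]
print(usable[-1]["name"])   # -> "AVX2 loop"
```

In the real proposal both steps are macros and CPU-detection code in C; the snippet is only meant to make the two filtering stages easy to follow when reading the thread.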
Matti From charlesr.harris at gmail.com Wed Feb 12 11:08:02 2020 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 12 Feb 2020 09:08:02 -0700 Subject: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics In-Reply-To: <5475fdf3-b435-2266-4c0e-b45b38ebe21e@gmail.com> References: <5475fdf3-b435-2266-4c0e-b45b38ebe21e@gmail.com> Message-ID: On Wed, Feb 12, 2020 at 12:19 AM Matti Picus wrote: > On 11/2/20 8:02 pm, Devulapalli, Raghuveer wrote: > > > > On top of that the performance implications aren?t clear. Software > > implementations of hardware instructions might perform worse and might > > not even produce the same result. > > > > The proposal for universal intrinsics does not enable replacing an > intrinsic on one platform with a software emulation on another: the > intrinsics are meant to be compile-time defines that overlay the > universal intrinsic with a platform specific one. In order to use a new > intrinsic, it must have parallel intrinsics on the other platforms, or > cannot be used there: "NPY_CPU_HAVE(FEATURE_NAME)" will always return > false so the compiler will not even build a loop for that platform. I > will try to clarify that intention in the NEP. > > > I hope there will not be a demand to use many non-universal intrinsics > in ufuncs, we will need to work this out on a case-by-case basis in each > ufunc. Does that sound reasonable? Are there intrinsics you have already > used that have no parallel on other platforms? > > Intrinsics are not an irreversible change, they are, after all, private. The question is whether they are sufficiently useful to justify the time spent on them. I don't think we will know that until we attempt actual implementations. There will probably be some changes as a result of experience, but that is normal. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From melissawm at gmail.com Wed Feb 12 08:55:09 2020 From: melissawm at gmail.com (=?UTF-8?Q?Melissa_Mendon=C3=A7a?=) Date: Wed, 12 Feb 2020 10:55:09 -0300 Subject: [Numpy-discussion] NEP 44 - Restructuring the NumPy Documentation Message-ID: Hi all, Please see the NEP below for a proposal to restructure the documentation of NumPy. The main goal here is to make the documentation more visible and organized, and also make contributions easier. Comments and feedback are welcome! See https://github.com/numpy/numpy/pull/15554 for details. Best, Melissa ---- NEP 44 ? Restructuring the NumPy Documentation Authors: Ralf Gommers, Melissa Mendon?a, Mars Lee Status: Draft Type: Process Created: 2020-02-11 Abstract ====== This document proposes a restructuring of the NumPy Documentation, both in form and content, with the goal of making it more organized and discoverable for beginners and experienced users. Motivation and Scope ================= See [here](numpy.org/devdocs) for the front page of the latest docs. The organization is quite confusing and illogical (e.g. user and developer docs are mixed). We propose the following: - Reorganizing the docs into the four categories mentioned in [1]; - Creating dedicated sections for Tutorials and How-Tos, including orientation on how to create new content; - Adding an Explanations section for key concepts and techniques that require deeper descriptions, some of which will be rearranged from the Reference Guide. Usage and Impact ============== The documentation is a fundamental part of any software project, especially open source projects. 
In the case of NumPy, many beginners might feel demotivated by the current structure of the documentation, since it is difficult to discover what to learn (unless the user has a clear view of what to look for in the Reference docs, which is not always the case). Looking at the results of a ?NumPy Tutorial? search on any search engine also gives an idea of the demand for this kind of content. Having official high-level documentation written using up-to-date content and techniques will certainly mean more users (and developers/contributors) are involved in the NumPy community. Backward compatibility ================== The restructuring will effectively demand a complete rewrite of links and some of the current content. Input from the community will be useful for identifying key links and pages that should not be broken. Detailed description =============== As discussed in the article [1], there are four categories of doc content: - Tutorials - How-to guides - Explanations - Reference guide We propose to use those categories as the ones we use (for writing and reviewing) whenever we add a new documentation section. The reasoning for this is that it is clearer both for developers/documentation writers and to users where each information should go, and the scope and tone of each document. For example, if explanations are mixed with basic tutorials, beginners might be overwhelmed and alienated. On the other hand, if the reference guide contains basic how-tos, it might be difficult for experienced users to find the information they need, quickly. Currently, there are many blogs and tutorials on the internet about NumPy or using NumPy. One of the issues with this is that if users search for this information and end up in an outdated (unofficial) tutorial before they find the current official documentation, they end up creating content that is confusing, especially for beginners. Having a better infrastructure for the documentation also aims to solve this problem by giving users high-level, up-to-date official documentation that can be easily updated. Status and ideas of each type of doc content ------------------------------------------------------------ * Reference guide NumPy has a quite complete reference guide. All functions are documented, most have examples, and most are cross-linked well with See Also sections. Further improving the reference guide is incremental work that can be done (and is being done) by many people. There are, however, many explanations in the reference guide. These can be moved to a more dedicated Explanations section on the docs. * How-to guides NumPy does not have many how-to?s. The subclassing and array ducktyping section may be an example of a how-to. Others that could be added are: - Parallelization (controlling BLAS multithreading with threadpoolctl, using multiprocessing, random number generation, etc.) - Storing and loading data (.npy/.npz format, text formats, Zarr, HDF5, Bloscpack, etc.) - Performance (memory layout, profiling, use with Numba, Cython, or Pythran) - Writing generic code that works with NumPy, Dask, CuPy, pydata/sparse, etc. * Explanations There is a reasonable amount of content on fundamental NumPy concepts such as indexing, vectorization, broadcasting, (g)ufuncs, and dtypes. This could be organized better and clarified to ensure it?s really about explaining the concepts and not mixed with tutorial or how-to like content. There are few explanations about anything other than those fundamental NumPy concepts. 
Some examples of concepts that could be expanded: - Copies vs. Views; - BLAS and other linear algebra libraries; - Fancy indexing. In addition, there are many explanations in the Reference Guide, which should be moved to this new dedicated Explanations section. * Tutorials There?s a lot of scope for writing better tutorials. We have a new NumPy for absolute beginners tutorial [3] (GSoD project of Anne Bonner). In addition we need a number of tutorials addressing different levels of experience with Python and NumPy. This could be done using engaging data sets, ideas or stories. For example, curve fitting with polynomials and functions in numpy.linalg could be done with the Keeling curve (decades worth of CO2 concentration in air measurements) rather than with synthetic random data. Ideas for tutorials (these capture the types of things that make sense, they?re not necessarily the exact topics we propose to implement): - Conway?s game of life with only NumPy (note: already in Nicolas Rougier?s book) - Using masked arrays to deal with missing data in time series measurements - Using Fourier transforms to analyze the Keeling curve data, and extrapolate it. - Geospatial data (e.g. lat/lon/time to create maps for every year via a stacked array, like gridMet data) - Using text data and dtypes (e.g. use speeches from different people, shape (n_speech, n_sentences, n_words)) The Preparing to Teach document [2] from the Software Carpentry Instructor Training materials is a nice summary of how to write effective lesson plans (and tutorials would be very similar). In addition to adding new tutorials, we also propose a How to write a tutorial document, which would help users contribute new high-quality content to the documentation. Data sets ------------- Using interesting data in the NumPy docs requires giving all users access to that data, either inside NumPy or in a separate package. The former is not the best idea, since it?s hard to do without increasing the size of NumPy significantly. Even for SciPy there has so far been no consensus on this (see scipy PR 8707 on adding a new scipy.datasets subpackage). So we?ll aim for a new (pure Python) package, named numpy-datasets or scipy-datasets or something similar. That package can take some lessons from how, e.g., scikit-learn ships data sets. Small data sets can be included in the repo, large data sets can be accessed via a downloader class or function. Related Work =========== Some examples of documentation organization in other projects: - Documentation for Jupyter: https://jupyter.org/documentation - Documentation for Python: https://docs.python.org/3/ - Documentation for TensorFlow: https://www.tensorflow.org/learn These projects make the intended audience for each part of the documentation more explicit, as well as previewing some of the content in each section. Implementation ============ Besides rewriting the current documentation to some extent, it would be ideal to have a technical infrastructure that would allow more contributions from the community. For example, if Jupyter Notebooks could be submitted as-is as tutorials or How-Tos, this might create more contributors and broaden the NumPy community. Similarly, if people could download some of the documentation in Notebook format, this would certainly mean people would use less outdated material for learning NumPy. It would also be interesting if the new structure for the documentation makes translations easier. 
Currently, the documentation for NumPy can be confusing, especially for beginners. Our proposal is to reorganize the docs in the following structure: * For users: - Absolute Beginners Tutorial - main Tutorials section - How To?s for common tasks with NumPy - Reference Guide - Explanations - F2Py Guide - Glossary * For developers/contributors: - Contributor?s Guide - Building and extending the documentation - Benchmarking - NumPy Enhancement Proposals * Meta information - Reporting bugs - Release Notes - About NumPy - License References and Footnotes ==================== [1] What nobody tells you about documentation. https://www.divio.com/blog/documentation/ [2] Preparing to Teach (from the Software Carpentry Instructor Training materials). https://carpentries.github.io/instructor-training/15-lesson-study/index.html [3] NumPy for absolute beginners Tutorial by Anne Bonner. https://numpy.org/devdocs/user/absolute_beginners.html Copyright ======== This document has been placed in the public domain. -- Melissa Weber Mendon?a -------------- next part -------------- An HTML attachment was scrubbed... URL: From raghuveer.devulapalli at intel.com Wed Feb 12 14:36:10 2020 From: raghuveer.devulapalli at intel.com (Devulapalli, Raghuveer) Date: Wed, 12 Feb 2020 19:36:10 +0000 Subject: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics In-Reply-To: <5475fdf3-b435-2266-4c0e-b45b38ebe21e@gmail.com> References: <5475fdf3-b435-2266-4c0e-b45b38ebe21e@gmail.com> Message-ID: >> I hope there will not be a demand to use many non-universal intrinsics in ufuncs, we will need to work this out on a case-by-case basis in each ufunc. Does that sound reasonable? Are there intrinsics you have already used that have no parallel on other platforms? I think that is reasonable. It's hard to anticipate the future need and benefit of specialized intrinsics but I tried to make a list of some of the specialized intrinsics that are currently in use in NumPy that I don?t believe exist on other platforms (most of these actually don?t exist on AVX2 either). I am not an expert in ARM or VSX architecture, so please correct me if I am wrong. a. _mm512_mask_i32gather_ps b. _mm512_mask_i32scatter_ps/_mm512_mask_i32scatter_pd c. _mm512_maskz_loadu_pd/_mm512_maskz_loadu_ps d. _mm512_getexp_ps e. _mm512_getmant_ps f. _mm512_scalef_ps g. _mm512_permutex2var_ps, _mm512_permutex2var_pd h. _mm512_maskz_div_ps, _mm512_maskz_div_pd i. _mm512_permute_ps/_mm512_permute_pd j. _mm512_sqrt_ps/pd (I could be wrong on this one, but from the little google search I did, it seems like power ISA doesn?t have a vectorized sqrt instruction) Software implementations of these instructions is definitely possible. But some of them are not trivial to implement and are surely not going to be one line macro's either. I am also unsure of what implications this has on performance, but we will hopefully find out once we convert these to universal intrinsic and then benchmark. Raghuveer -----Original Message----- From: NumPy-Discussion On Behalf Of Matti Picus Sent: Tuesday, February 11, 2020 11:19 PM To: numpy-discussion at python.org Subject: Re: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics On 11/2/20 8:02 pm, Devulapalli, Raghuveer wrote: > > On top of that the performance implications aren?t clear. Software > implementations of hardware instructions might perform worse and might > not even produce the same result. 
> The proposal for universal intrinsics does not enable replacing an intrinsic on one platform with a software emulation on another: the intrinsics are meant to be compile-time defines that overlay the universal intrinsic with a platform specific one. In order to use a new intrinsic, it must have parallel intrinsics on the other platforms, or cannot be used there: "NPY_CPU_HAVE(FEATURE_NAME)" will always return false so the compiler will not even build a loop for that platform. I will try to clarify that intention in the NEP. I hope there will not be a demand to use many non-universal intrinsics in ufuncs, we will need to work this out on a case-by-case basis in each ufunc. Does that sound reasonable? Are there intrinsics you have already used that have no parallel on other platforms? Matti _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion at python.org https://mail.python.org/mailman/listinfo/numpy-discussion From Jerome.Kieffer at esrf.fr Thu Feb 13 04:02:51 2020 From: Jerome.Kieffer at esrf.fr (Jerome Kieffer) Date: Thu, 13 Feb 2020 10:02:51 +0100 Subject: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics In-Reply-To: References: <5475fdf3-b435-2266-4c0e-b45b38ebe21e@gmail.com> Message-ID: <20200213100251.71742e2e@lintaillefer.esrf.fr> On Wed, 12 Feb 2020 19:36:10 +0000 "Devulapalli, Raghuveer" wrote: > j. _mm512_sqrt_ps/pd (I could be wrong on this one, but from the little google search I did, it seems like power ISA doesn?t have a vectorized sqrt instruction) Hi, starting at Power7 (we are at Power9), the sqrt is available both in single and double precision: https://www.ibm.com/support/knowledgecenter/SSGH2K_12.1.0/com.ibm.xlc121.aix.doc/compiler_ref/vec_sqrt.html Cheers, -- J?r?me Kieffer tel +33 476 882 445 From opossumnano at gmail.com Thu Feb 13 10:41:28 2020 From: opossumnano at gmail.com (ASPP) Date: Thu, 13 Feb 2020 07:41:28 -0800 (PST) Subject: [Numpy-discussion] =?utf-8?b?W0FOTl0gMTPhtZfKsCBBZHZhbmNlZCBT?= =?utf-8?q?cientific_Programming_in_Python_in_Ghent=2C_Belgium=2C_31_Augus?= =?utf-8?b?dOKAlDUgU2VwdGVtYmVyLCAyMDIw?= Message-ID: <5e456e28.1c69fb81.9303c.f5bd@mx.google.com> 13?? Advanced Scientific Programming in Python ============================================== a Summer School by the ASPP faculty and the Ghent University https://aspp.school Scientists spend more and more time writing, maintaining, and debugging software. While techniques for doing this efficiently have evolved, only few scientists have been trained to use them. As a result, instead of doing their research, they spend far too much time writing deficient code and reinventing the wheel. In this course we will present a selection of advanced programming techniques and best practices which are standard in the industry, but especially tailored to the needs of a programming scientist. Lectures are devised to be interactive and to give the students enough time to acquire direct hands-on experience with the materials. Students will work in pairs throughout the school and will team up to practice the newly learned skills in a real programming project ? an entertaining computer game. We use the Python programming language for the entire course. Python works as a simple programming language for beginners, but more importantly, it also works great in scientific simulations and data analysis. 
We show how clean language design, ease of extensibility, and the great wealth of open source libraries for scientific computing and data visualization are driving Python to become a standard tool for the programming scientist. This school is targeted at Master or PhD students and Post-docs from all areas of science. Competence in Python or in another language such as Java, C/C++, MATLAB, or R is absolutely required. Basic knowledge of Python and of a version control system such as git, subversion, mercurial, or bazaar is assumed. Participants without any prior experience with Python and/or git should work through the proposed introductory material before the course. We are striving hard to get a pool of students which is international and gender-balanced. Date & Location =============== 31 August?5 September, 2020. Ghent, Belgium. Application =========== You can apply online: https://aspp.school/wiki/applications Application deadline: 23:59 UTC, Sunday 24 May, 2020 There will be no deadline extension, so be sure to apply on time. Be sure to read the FAQ before applying: https://aspp.school/wiki/faq Participation is for free, i.e. no fee is charged! Accommodation in the student residence comes at no costs for participants. We are trying to arrange financial coverage for food expenses too, but this may not work. Participants however should take care of travel expenses by themselves. Program ======= ? Version control with git and how to contribute to open source projects with GitHub ? Best practices in data visualization ? Testing and debugging scientific code ? Advanced NumPy ? Organizing, documenting, and distributing scientific code ? Advanced scientific Python: context managers and generators ? Writing parallel applications in Python ? Profiling and speeding up scientific code with Cython and numba ? Programming in teams Faculty ======= ? Caterina Buizza, Personal Robotics Lab, Imperial College London UK ? Lisa Schwetlick, Experimental and Biological Psychology, Universit?t Potsdam Germany ? Nelle Varoquaux, CNRS, TIMC-IMAG, University Grenoble Alpes France ? Nicolas P. Rougier, Inria Bordeaux Sud-Ouest, Institute of Neurodegenerative Disease, University of Bordeaux France ? Pamela Hathway, Neural Reckoning, Imperial College London UK ? Pietro Berkes, NAGRA Kudelski, Lausanne Switzerland ? Rike-Benjamin Schuppner, Institute for Theoretical Biology, Humboldt-Universit?t zu Berlin Germany ? Tiziano Zito, Department of Psychology, Humboldt-Universit?t zu Berlin Germany ? Zbigniew J?drzejewski-Szmek, Red Hat Inc., Warsaw Poland Organizers ========== Head of the organization for ASPP and responsible for the scientific program: ? Tiziano Zito, Department of Psychology, Humboldt-Universit?t zu Berlin Germany Local team: ? Nina Turk, Photonics Research Group, INTEC, Ghent University ? imec Belgium ? Freya Acar, Office for Data and Information, City of Ghent Belgium ? Joan Juvert Institutional organizers: ? Wim Bogaerts, Photonics Research Group, INTEC, Ghent University ?imec Belgium ? Sven Degroeve, VIB-UGent Center for Medical Biotechnology, Ghent Belgium ? Jeroen Famaey, Department of Mathematics and Computer Science, University of Antwerp ? iMinds Belgium ? 
Bernard Manderick, Artificial Intelligence Lab, Vrije Universiteit Brussel Belgium

Website: https://aspp.school
Contact: info at aspp.school

From ralf.gommers at gmail.com Thu Feb 13 12:27:13 2020
From: ralf.gommers at gmail.com (Ralf Gommers)
Date: Thu, 13 Feb 2020 11:27:13 -0600
Subject: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics
In-Reply-To: References: <5475fdf3-b435-2266-4c0e-b45b38ebe21e@gmail.com>
Message-ID:

On Wed, Feb 12, 2020 at 1:37 PM Devulapalli, Raghuveer < raghuveer.devulapalli at intel.com> wrote:

> >> I hope there will not be a demand to use many non-universal intrinsics > in ufuncs, we will need to work this out on a case-by-case basis in each > ufunc. Does that sound reasonable? Are there intrinsics you have already > used that have no parallel on other platforms?
>
> I think that is reasonable. It's hard to anticipate the future need and > benefit of specialized intrinsics but I tried to make a list of some of the > specialized intrinsics that are currently in use in NumPy that I don't > believe exist on other platforms (most of these actually don't exist on > AVX2 either). I am not an expert in ARM or VSX architecture, so please > correct me if I am wrong.
>
> a. _mm512_mask_i32gather_ps
> b. _mm512_mask_i32scatter_ps/_mm512_mask_i32scatter_pd
> c. _mm512_maskz_loadu_pd/_mm512_maskz_loadu_ps
> d. _mm512_getexp_ps
> e. _mm512_getmant_ps
> f. _mm512_scalef_ps
> g. _mm512_permutex2var_ps, _mm512_permutex2var_pd
> h. _mm512_maskz_div_ps, _mm512_maskz_div_pd
> i. _mm512_permute_ps/_mm512_permute_pd
> j. _mm512_sqrt_ps/pd (I could be wrong on this one, but from the little > google search I did, it seems like power ISA doesn't have a vectorized sqrt > instruction)
>
> Software implementations of these instructions is definitely possible. But > some of them are not trivial to implement and are surely not going to be > one line macro's either. I am also unsure of what implications this has on > performance, but we will hopefully find out once we convert these to > universal intrinsic and then benchmark. 
> For these it seems like we don't want software implementations of the universal intrinsics - if there's no equivalent on PPC/ARM and there's enough value (performance gain given additional code complexity) in the additional AVX instructions, then we should still simply use AVX instructions directly. Ralf > Raghuveer > > -----Original Message----- > From: NumPy-Discussion intel.com at python.org> On Behalf Of Matti Picus > Sent: Tuesday, February 11, 2020 11:19 PM > To: numpy-discussion at python.org > Subject: Re: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics > > On 11/2/20 8:02 pm, Devulapalli, Raghuveer wrote: > > > > On top of that the performance implications aren?t clear. Software > > implementations of hardware instructions might perform worse and might > > not even produce the same result. > > > > The proposal for universal intrinsics does not enable replacing an > intrinsic on one platform with a software emulation on another: the > intrinsics are meant to be compile-time defines that overlay the universal > intrinsic with a platform specific one. In order to use a new intrinsic, it > must have parallel intrinsics on the other platforms, or cannot be used > there: "NPY_CPU_HAVE(FEATURE_NAME)" will always return false so the > compiler will not even build a loop for that platform. I will try to > clarify that intention in the NEP. > > > I hope there will not be a demand to use many non-universal intrinsics in > ufuncs, we will need to work this out on a case-by-case basis in each > ufunc. Does that sound reasonable? Are there intrinsics you have already > used that have no parallel on other platforms? > > > Matti > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Fri Feb 14 14:03:11 2020 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 14 Feb 2020 12:03:11 -0700 Subject: [Numpy-discussion] manylinux2010. Message-ID: Hi All, Just a note that I've moved the nightly NumPy wheels builds to manylinux2010. Downstream projects testing against those wheels should check that they are using pip >= 19.0 in order to fetch those wheels. If there are any problems, please note them here. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark.harfouche at gmail.com Fri Feb 14 14:47:42 2020 From: mark.harfouche at gmail.com (Mark Harfouche) Date: Fri, 14 Feb 2020 11:47:42 -0800 Subject: [Numpy-discussion] manylinux2010. In-Reply-To: References: Message-ID: Chuck, Cool stuff! Will manylinux1 wheels compiled with older numpy (say 1.14) work with manylinux2010 wheels? What is your recommendation for downstream projects that depend on Cython/numpy to do? Do you have a document we can read? Mark On Fri, Feb 14, 2020 at 11:05 AM Charles R Harris wrote: > Hi All, > > Just a note that I've moved the nightly NumPy wheels builds to > manylinux2010. Downstream projects testing against those wheels should > check that they are using pip >= 19.0 in order to fetch those wheels. If > there are any problems, please note them here. 
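For downstream testers reading along, a minimal way to confirm that a given environment can actually pick up the manylinux2010 wheels is to check the pip version in that interpreter; this is only an illustrative check, not an official recommendation from the thread.

```python
# manylinux2010 wheel tags are only recognized by pip >= 19.0; an older pip
# will simply not match the new nightly wheels.
import pip
print(pip.__version__)   # want >= 19.0
# to upgrade: python -m pip install --upgrade pip
```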
> > Chuck > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Fri Feb 14 15:55:21 2020 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 14 Feb 2020 13:55:21 -0700 Subject: [Numpy-discussion] manylinux2010. In-Reply-To: References: Message-ID: On Fri, Feb 14, 2020 at 12:48 PM Mark Harfouche wrote: > Chuck, > > Cool stuff! > > Will manylinux1 wheels compiled with older numpy (say 1.14) work with > manylinux2010 wheels? > > What is your recommendation for downstream projects that depend on > Cython/numpy to do? > > Do you have a document we can read? > > I don't think there will be any problems apart from pip versions, the reason for using manylinux2010 in the pre-release wheels is to discover if I'm wrong about that :) For documentation of the manylinux project, see PyPA , PEP 571 , and PEP 599 . Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Fri Feb 14 16:27:29 2020 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Fri, 14 Feb 2020 13:27:29 -0800 Subject: [Numpy-discussion] Py-API: Deprecate `np.dtype(np.floating)` and similar dtype creation Message-ID: Hi all, In https://github.com/numpy/numpy/pull/15534 I would like to start deprecating creating dtypes from "abstract" scalar classes, such as: np.dtype(np.floating) is np.dtype(np.float64) While, at the same time, `isinstance(np.float32, np.floating)` is true. Right now `arr.astype(np.floating, copy=False)` and, more obviously, `arr.astype(np.dtype(np.floating), copy=False)` will cast a float32 array to float64. I think we should deprecate this, to consistently enable that in the future `dtype=np.floating` may choose to not cast a float32 array. Of course for the `astype` call the DeprecationWarning would be changed to a FutureWarning before we change the result value. A slight (but hopefully rare) annoyance is that `np.integer` might be used since it reads fairly well compared to `np.int_`. The large upstream packages such as SciPy or astropy seem to be clean in this regard, though (at least almost clean). Does anyone think this is a bad idea? To me these deprecations seem fairly straight forward, possibly flush out bugs/unintended behaviour, and necessary for consistent future behaviour. (More similar ones may have to follow). If there is some, but not much, hesitation, I can also add this to the NEP 41 draft. Although I currently feel it is the right thing to do even if we never had any new dtypes. - Sebastian -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From nathan.goldbaum at gmail.com Fri Feb 14 16:39:21 2020 From: nathan.goldbaum at gmail.com (Nathan) Date: Fri, 14 Feb 2020 14:39:21 -0700 Subject: [Numpy-discussion] Py-API: Deprecate `np.dtype(np.floating)` and similar dtype creation In-Reply-To: References: Message-ID: For what it's worth, github search only finds two instances of this usage: https://github.com/search?q=%22np.dtype%28np.floating%29%22&type=Code On Fri, Feb 14, 2020 at 2:28 PM Sebastian Berg wrote: > Hi all, > > In https://github.com/numpy/numpy/pull/15534 I would like to start > deprecating creating dtypes from "abstract" scalar classes, such as: > > np.dtype(np.floating) is np.dtype(np.float64) > > While, at the same time, `isinstance(np.float32, np.floating)` is true. > > Right now `arr.astype(np.floating, copy=False)` and, more obviously, > `arr.astype(np.dtype(np.floating), copy=False)` will cast a float32 > array to float64. > > I think we should deprecate this, to consistently enable that in the > future `dtype=np.floating` may choose to not cast a float32 array. Of > course for the `astype` call the DeprecationWarning would be changed to > a FutureWarning before we change the result value. > > A slight (but hopefully rare) annoyance is that `np.integer` might be > used since it reads fairly well compared to `np.int_`. The large > upstream packages such as SciPy or astropy seem to be clean in this > regard, though (at least almost clean). > > Does anyone think this is a bad idea? To me these deprecations seem > fairly straight forward, possibly flush out bugs/unintended behaviour, > and necessary for consistent future behaviour. (More similar ones may > have to follow). > > If there is some, but not much, hesitation, I can also add this to the > NEP 41 draft. Although I currently feel it is the right thing to do > even if we never had any new dtypes. > > - Sebastian > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Fri Feb 14 16:44:08 2020 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Fri, 14 Feb 2020 13:44:08 -0800 Subject: [Numpy-discussion] Py-API: Deprecate `np.dtype(np.floating)` and similar dtype creation In-Reply-To: References: Message-ID: <39cb4177ff1c61fe35fe6c8efb107c4c8d940ec0.camel@sipsolutions.net> On Fri, 2020-02-14 at 14:39 -0700, Nathan wrote: > For what it's worth, github search only finds two instances of this > usage: > > https://github.com/search?q=%22np.dtype%28np.floating%29%22&type=Code > In most common thing I would expect to be `dtype=np.integer` (possibly without the `dtype` as a positional argument). The call your search finds is nice because it must delete `np.dtype` call. As is, it is doing the incorrect thing so the deprecation would flush out a bug. - Sebastian > On Fri, Feb 14, 2020 at 2:28 PM Sebastian Berg < > sebastian at sipsolutions.net> wrote: > > Hi all, > > > > In https://github.com/numpy/numpy/pull/15534 I would like to start > > deprecating creating dtypes from "abstract" scalar classes, such > > as: > > > > np.dtype(np.floating) is np.dtype(np.float64) > > > > While, at the same time, `isinstance(np.float32, np.floating)` is > > true. 
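To make the behaviour described in this message easier to skim, here is a small editorial sketch of what the proposal is about. The expected results follow Sebastian's description and the NumPy of this era; the default integer size in the last line is platform dependent, and the snippet is an illustration rather than part of the original mail.

```python
import numpy as np

np.dtype(np.floating) is np.dtype(np.float64)   # True: the abstract class maps to float64
issubclass(np.float32, np.floating)             # also True, which is the surprising part

x = np.zeros(3, dtype=np.float32)
x.astype(np.floating).dtype                     # dtype('float64'), i.e. a silent upcast

np.zeros(3, dtype=np.integer).dtype             # platform default integer (e.g. int64 on
                                                # most Linux builds) - the kind of silent
                                                # choice the deprecation would flush out
```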
> > > > Right now `arr.astype(np.floating, copy=False)` and, more > > obviously, > > `arr.astype(np.dtype(np.floating), copy=False)` will cast a float32 > > array to float64. > > > > I think we should deprecate this, to consistently enable that in > > the > > future `dtype=np.floating` may choose to not cast a float32 array. > > Of > > course for the `astype` call the DeprecationWarning would be > > changed to > > a FutureWarning before we change the result value. > > > > A slight (but hopefully rare) annoyance is that `np.integer` might > > be > > used since it reads fairly well compared to `np.int_`. The large > > upstream packages such as SciPy or astropy seem to be clean in this > > regard, though (at least almost clean). > > > > Does anyone think this is a bad idea? To me these deprecations seem > > fairly straight forward, possibly flush out bugs/unintended > > behaviour, > > and necessary for consistent future behaviour. (More similar ones > > may > > have to follow). > > > > If there is some, but not much, hesitation, I can also add this to > > the > > NEP 41 draft. Although I currently feel it is the right thing to do > > even if we never had any new dtypes. > > > > - Sebastian > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at python.org > > https://mail.python.org/mailman/listinfo/numpy-discussion > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From nathan.goldbaum at gmail.com Fri Feb 14 17:07:07 2020 From: nathan.goldbaum at gmail.com (Nathan) Date: Fri, 14 Feb 2020 15:07:07 -0700 Subject: [Numpy-discussion] Py-API: Deprecate `np.dtype(np.floating)` and similar dtype creation In-Reply-To: <39cb4177ff1c61fe35fe6c8efb107c4c8d940ec0.camel@sipsolutions.net> References: <39cb4177ff1c61fe35fe6c8efb107c4c8d940ec0.camel@sipsolutions.net> Message-ID: Yeah, that seems to be more popular: https://github.com/search?q=%22dtype%3Dnp.integer%22&type=Code On Fri, Feb 14, 2020 at 2:45 PM Sebastian Berg wrote: > On Fri, 2020-02-14 at 14:39 -0700, Nathan wrote: > > For what it's worth, github search only finds two instances of this > > usage: > > > > https://github.com/search?q=%22np.dtype%28np.floating%29%22&type=Code > > > > In most common thing I would expect to be `dtype=np.integer` (possibly > without the `dtype` as a positional argument). > The call your search finds is nice because it must delete `np.dtype` > call. > As is, it is doing the incorrect thing so the deprecation would flush > out a bug. > > - Sebastian > > > > On Fri, Feb 14, 2020 at 2:28 PM Sebastian Berg < > > sebastian at sipsolutions.net> wrote: > > > Hi all, > > > > > > In https://github.com/numpy/numpy/pull/15534 I would like to start > > > deprecating creating dtypes from "abstract" scalar classes, such > > > as: > > > > > > np.dtype(np.floating) is np.dtype(np.float64) > > > > > > While, at the same time, `isinstance(np.float32, np.floating)` is > > > true. > > > > > > Right now `arr.astype(np.floating, copy=False)` and, more > > > obviously, > > > `arr.astype(np.dtype(np.floating), copy=False)` will cast a float32 > > > array to float64. 
> > > > > > I think we should deprecate this, to consistently enable that in > > > the > > > future `dtype=np.floating` may choose to not cast a float32 array. > > > Of > > > course for the `astype` call the DeprecationWarning would be > > > changed to > > > a FutureWarning before we change the result value. > > > > > > A slight (but hopefully rare) annoyance is that `np.integer` might > > > be > > > used since it reads fairly well compared to `np.int_`. The large > > > upstream packages such as SciPy or astropy seem to be clean in this > > > regard, though (at least almost clean). > > > > > > Does anyone think this is a bad idea? To me these deprecations seem > > > fairly straight forward, possibly flush out bugs/unintended > > > behaviour, > > > and necessary for consistent future behaviour. (More similar ones > > > may > > > have to follow). > > > > > > If there is some, but not much, hesitation, I can also add this to > > > the > > > NEP 41 draft. Although I currently feel it is the right thing to do > > > even if we never had any new dtypes. > > > > > > - Sebastian > > > _______________________________________________ > > > NumPy-Discussion mailing list > > > NumPy-Discussion at python.org > > > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at python.org > > https://mail.python.org/mailman/listinfo/numpy-discussion > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From albuscode at gmail.com Fri Feb 14 17:46:45 2020 From: albuscode at gmail.com (Inessa Pawson) Date: Sat, 15 Feb 2020 08:46:45 +1000 Subject: [Numpy-discussion] NumPy-Discussion Digest, Vol 161, Issue 14 In-Reply-To: References: Message-ID: Documentation is imperative for project sustainability, yet often overlooked. Millions of NumPy stakeholders will benefit from this initiative. Melissa, Mars and Ralf, thank you for taking a lead on this! On Thu, Feb 13, 2020 at 3:05 AM wrote: > Send NumPy-Discussion mailing list submissions to > numpy-discussion at python.org > > To subscribe or unsubscribe via the World Wide Web, visit > https://mail.python.org/mailman/listinfo/numpy-discussion > or, via email, send a message with subject or body 'help' to > numpy-discussion-request at python.org > > You can reach the person managing the list at > numpy-discussion-owner at python.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of NumPy-Discussion digest..." > Today's Topics: > > 1. NEP 44 - Restructuring the NumPy Documentation (Melissa Mendon?a) > > > > ---------- Forwarded message ---------- > From: "Melissa Mendon?a" > To: numpy-discussion at python.org > Cc: > Bcc: > Date: Wed, 12 Feb 2020 10:55:09 -0300 > Subject: [Numpy-discussion] NEP 44 - Restructuring the NumPy Documentation > Hi all, > > Please see the NEP below for a proposal to restructure the documentation > of NumPy. The main goal here is to make the documentation more visible and > organized, and also make contributions easier. > > Comments and feedback are welcome! > > > See https://github.com/numpy/numpy/pull/15554 for details. > > Best, > > Melissa > > ---- > > NEP 44 ? 
Restructuring the NumPy Documentation > > Authors: Ralf Gommers, Melissa Mendon?a, Mars Lee > Status: Draft > Type: Process > Created: 2020-02-11 > > Abstract > ====== > > This document proposes a restructuring of the NumPy Documentation, both in > form and content, with the goal of making it more organized and > discoverable for beginners and experienced users. > > Motivation and Scope > ================= > > See [here](numpy.org/devdocs) for the front page of the latest docs. The > organization is quite confusing and illogical (e.g. user and developer docs > are mixed). We propose the following: > > - Reorganizing the docs into the four categories mentioned in [1]; > - Creating dedicated sections for Tutorials and How-Tos, including > orientation on how to create new content; > - Adding an Explanations section for key concepts and techniques that > require deeper descriptions, some of which will be rearranged from the > Reference Guide. > > Usage and Impact > ============== > > The documentation is a fundamental part of any software project, > especially open source projects. In the case of NumPy, many beginners might > feel demotivated by the current structure of the documentation, since it is > difficult to discover what to learn (unless the user has a clear view of > what to look for in the Reference docs, which is not always the case). > > Looking at the results of a ?NumPy Tutorial? search on any search engine > also gives an idea of the demand for this kind of content. Having official > high-level documentation written using up-to-date content and techniques > will certainly mean more users (and developers/contributors) are involved > in the NumPy community. > > Backward compatibility > ================== > > The restructuring will effectively demand a complete rewrite of links and > some of the current content. Input from the community will be useful for > identifying key links and pages that should not be broken. > > Detailed description > =============== > > As discussed in the article [1], there are four categories of doc content: > - Tutorials > - How-to guides > - Explanations > - Reference guide > > We propose to use those categories as the ones we use (for writing and > reviewing) whenever we add a new documentation section. > > The reasoning for this is that it is clearer both for > developers/documentation writers and to users where each information should > go, and the scope and tone of each document. For example, if explanations > are mixed with basic tutorials, beginners might be overwhelmed and > alienated. On the other hand, if the reference guide contains basic > how-tos, it might be difficult for experienced users to find the > information they need, quickly. > > Currently, there are many blogs and tutorials on the internet about NumPy > or using NumPy. One of the issues with this is that if users search for > this information and end up in an outdated (unofficial) tutorial before > they find the current official documentation, they end up creating content > that is confusing, especially for beginners. Having a better infrastructure > for the documentation also aims to solve this problem by giving users > high-level, up-to-date official documentation that can be easily updated. > > Status and ideas of each type of doc content > ------------------------------------------------------------ > > * Reference guide > > NumPy has a quite complete reference guide. All functions are documented, > most have examples, and most are cross-linked well with See Also sections. 
> Further improving the reference guide is incremental work that can be done > (and is being done) by many people. There are, however, many explanations > in the reference guide. These can be moved to a more dedicated Explanations > section on the docs. > > * How-to guides > > NumPy does not have many how-to's. The subclassing and array ducktyping > section may be an example of a how-to. Others that could be added are: > - Parallelization (controlling BLAS multithreading with threadpoolctl, > using multiprocessing, random number generation, etc.) > - Storing and loading data (.npy/.npz format, text formats, Zarr, HDF5, > Bloscpack, etc.) > - Performance (memory layout, profiling, use with Numba, Cython, or > Pythran) > - Writing generic code that works with NumPy, Dask, CuPy, pydata/sparse, > etc. > > * Explanations > > There is a reasonable amount of content on fundamental NumPy concepts such > as indexing, vectorization, broadcasting, (g)ufuncs, and dtypes. This could > be organized better and clarified to ensure it's really about explaining > the concepts and not mixed with tutorial or how-to like content. > > There are few explanations about anything other than those fundamental > NumPy concepts. > > Some examples of concepts that could be expanded: > - Copies vs. Views; > - BLAS and other linear algebra libraries; > - Fancy indexing. > > In addition, there are many explanations in the Reference Guide, which > should be moved to this new dedicated Explanations section. > > * Tutorials > > There's a lot of scope for writing better tutorials. We have a new NumPy > for absolute beginners tutorial [3] (GSoD project of Anne Bonner). In > addition we need a number of tutorials addressing different levels of > experience with Python and NumPy. This could be done using engaging data > sets, ideas or stories. For example, curve fitting with polynomials and > functions in numpy.linalg could be done with the Keeling curve (decades > worth of CO2 concentration in air measurements) rather than with synthetic > random data. > > Ideas for tutorials (these capture the types of things that make sense, > they're not necessarily the exact topics we propose to implement): > - Conway's game of life with only NumPy (note: already in Nicolas > Rougier's book) > - Using masked arrays to deal with missing data in time series measurements > - Using Fourier transforms to analyze the Keeling curve data, and > extrapolate it. > - Geospatial data (e.g. lat/lon/time to create maps for every year via a > stacked array, like gridMet data) > - Using text data and dtypes (e.g. use speeches from different people, > shape (n_speech, n_sentences, n_words)) > > The Preparing to Teach document [2] from the Software Carpentry Instructor > Training materials is a nice summary of how to write effective lesson plans > (and tutorials would be very similar). In addition to adding new tutorials, > we also propose a How to write a tutorial document, which would help users > contribute new high-quality content to the documentation. > > Data sets > ------------- > > Using interesting data in the NumPy docs requires giving all users access > to that data, either inside NumPy or in a separate package. The former is > not the best idea, since it's hard to do without increasing the size of > NumPy significantly. Even for SciPy there has so far been no consensus on > this (see scipy PR 8707 on adding a new scipy.datasets subpackage). 
> > So we?ll aim for a new (pure Python) package, named numpy-datasets or > scipy-datasets or something similar. That package can take some lessons > from how, e.g., scikit-learn ships data sets. Small data sets can be > included in the repo, large data sets can be accessed via a downloader > class or function. > > Related Work > =========== > > Some examples of documentation organization in other projects: > - Documentation for Jupyter: https://jupyter.org/documentation > - Documentation for Python: https://docs.python.org/3/ > - Documentation for TensorFlow: https://www.tensorflow.org/learn > > These projects make the intended audience for each part of the > documentation more explicit, as well as previewing some of the content in > each section. > > Implementation > ============ > > Besides rewriting the current documentation to some extent, it would be > ideal to have a technical infrastructure that would allow more > contributions from the community. For example, if Jupyter Notebooks could > be submitted as-is as tutorials or How-Tos, this might create more > contributors and broaden the NumPy community. > > Similarly, if people could download some of the documentation in Notebook > format, this would certainly mean people would use less outdated material > for learning NumPy. > > It would also be interesting if the new structure for the documentation > makes translations easier. > > Currently, the documentation for NumPy can be confusing, especially for > beginners. Our proposal is to reorganize the docs in the following > structure: > > * For users: > - Absolute Beginners Tutorial > - main Tutorials section > - How To?s for common tasks with NumPy > - Reference Guide > - Explanations > - F2Py Guide > - Glossary > > * For developers/contributors: > - Contributor?s Guide > - Building and extending the documentation > - Benchmarking > - NumPy Enhancement Proposals > > * Meta information > - Reporting bugs > - Release Notes > - About NumPy > - License > > References and Footnotes > ==================== > > [1] What nobody tells you about documentation. > https://www.divio.com/blog/documentation/ > [2] Preparing to Teach (from the Software Carpentry Instructor Training > materials). > https://carpentries.github.io/instructor-training/15-lesson-study/index.html > [3] NumPy for absolute beginners Tutorial by Anne Bonner. > https://numpy.org/devdocs/user/absolute_beginners.html > > Copyright > ======== > > This document has been placed in the public domain. > > -- > Melissa Weber Mendon?a > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -- Every good wish, *Inessa Pawson* Executive Director Albus Code inessa at albuscode.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Tue Feb 18 10:14:59 2020 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Tue, 18 Feb 2020 10:14:59 -0500 Subject: [Numpy-discussion] recent changes in np.maximum.accumulate ? Message-ID: I'm trying to track down test failures of statsmodels against recent master dev versions of numpy and scipy. 
The core computation is the following in one set of tests that fail pvals_corrected_raw = pvals * np.arange(ntests, 0, -1) pvals_corrected = np.maximum.accumulate(pvals_corrected_raw) this numpy version numpy-1.19.0.dev0%2B20200214184618_1f9ab28-cp38-cp38-manylinux2010_x86_64.whl is in the test run with failures (the first time statsmodel master failed) the previous version in the test runs didn't have these failures numpy-1.19.0.dev0%2B20200212232857_af0dfce-cp38-cp38-manylinux1_x86_64.whl I'm right now just fishing for candidates for the failures. And I'm not running any dev versions on my computer. Were there any recent changes that affect np.maximum.accumulate? Josef -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Tue Feb 18 10:41:58 2020 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Tue, 18 Feb 2020 07:41:58 -0800 Subject: [Numpy-discussion] recent changes in np.maximum.accumulate ? In-Reply-To: References: Message-ID: <553d467fc241e1b44b9881ada555dad90400d6a7.camel@sipsolutions.net> On Tue, 2020-02-18 at 10:14 -0500, josef.pktd at gmail.com wrote: > I'm trying to track down test failures of statsmodels against recent > master dev versions of numpy and scipy. > > The core computation is the following in one set of tests that fail > > pvals_corrected_raw = pvals * np.arange(ntests, 0, -1) > pvals_corrected = np.maximum.accumulate(pvals_corrected_raw) > Hmmm, the two git hashes indicate few changes between the two versions (mainly unicode related). However, recently there was also the addition of AVX-512F loops to maximum, so that seems like the most reasonable candidate (although I am unsure it changed exactly between those versions, it is also more complex maybe due to needing a machine that supports the instructions). Some details about the input could be nice. But if this is all that is as input, it sounds like it should be a contiguous array? I guess it might include subnormal numbers or NaN? Can you open an issue with some of those details if you have them? - Sebastian > this numpy version > numpy-1.19.0.dev0%2B20200214184618_1f9ab28-cp38-cp38- > manylinux2010_x86_64.whl > is in the test run with failures (the first time statsmodel master > failed) > > the previous version in the test runs didn't have these failures > numpy-1.19.0.dev0%2B20200212232857_af0dfce-cp38-cp38- > manylinux1_x86_64.whl > > > I'm right now just fishing for candidates for the failures. And I'm > not running any dev versions on my computer. > > Were there any recent changes that affect np.maximum.accumulate? > > Josef > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From kevin.k.sheppard at gmail.com Tue Feb 18 11:24:16 2020 From: kevin.k.sheppard at gmail.com (Kevin Sheppard) Date: Tue, 18 Feb 2020 16:24:16 +0000 Subject: [Numpy-discussion] recent changes in np.maximum.accumulate ? Message-ID: <5e4c0fb3.1c69fb81.14035.1b9f@mx.google.com> An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Tue Feb 18 11:29:25 2020 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Tue, 18 Feb 2020 11:29:25 -0500 Subject: [Numpy-discussion] recent changes in np.maximum.accumulate ? 
In-Reply-To: <553d467fc241e1b44b9881ada555dad90400d6a7.camel@sipsolutions.net> References: <553d467fc241e1b44b9881ada555dad90400d6a7.camel@sipsolutions.net> Message-ID: On Tue, Feb 18, 2020 at 10:43 AM Sebastian Berg wrote: > On Tue, 2020-02-18 at 10:14 -0500, josef.pktd at gmail.com wrote: > > I'm trying to track down test failures of statsmodels against recent > > master dev versions of numpy and scipy. > > > > The core computation is the following in one set of tests that fail > > > > pvals_corrected_raw = pvals * np.arange(ntests, 0, -1) > > pvals_corrected = np.maximum.accumulate(pvals_corrected_raw) > > > > Hmmm, the two git hashes indicate few changes between the two versions > (mainly unicode related). > > However, recently there was also the addition of AVX-512F loops to > maximum, so that seems like the most reasonable candidate (although I > am unsure it changed exactly between those versions, it is also more > complex maybe due to needing a machine that supports the instructions). > > Some details about the input could be nice. But if this is all that is > as input, it sounds like it should be a contiguous array? I guess it > might include subnormal numbers or NaN? > The test failures are on a Travis machine https://travis-ci.org/statsmodels/statsmodels/jobs/650430129 I can extract the numbers and examples from the unit test on my Windows computer. But, if it's machine dependent, then that might not be enough. The main reason why maximum.accumulate might be the problem is that in some tests we don't get monotonically increasing values, e.g. E x: array([0.012, 0.02 , 0.024, 0.024, 0.02 , 0.012]) E y: array([0.012, 0.02 , 0.024, 0.024, 0.024, 0.024]) first row is computed, second row is expected Josef > > Can you open an issue with some of those details if you have them? > > - Sebastian > > > > > this numpy version > > numpy-1.19.0.dev0%2B20200214184618_1f9ab28-cp38-cp38- > > manylinux2010_x86_64.whl > > is in the test run with failures (the first time statsmodel master > > failed) > > > > the previous version in the test runs didn't have these failures > > numpy-1.19.0.dev0%2B20200212232857_af0dfce-cp38-cp38- > > manylinux1_x86_64.whl > > > > > > I'm right now just fishing for candidates for the failures. And I'm > > not running any dev versions on my computer. > > > > Were there any recent changes that affect np.maximum.accumulate? > > > > Josef > > > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at python.org > > https://mail.python.org/mailman/listinfo/numpy-discussion > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Tue Feb 18 13:47:52 2020 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Tue, 18 Feb 2020 10:47:52 -0800 Subject: [Numpy-discussion] NumPy Community Meeting Wednesday, Feb. 19 Message-ID: <993781f29624a5dc47907afdcb85b5db49222138.camel@sipsolutions.net> Hi all, There will be a NumPy Community meeting Wednesday February 19 at 11 am Pacific Time. Everyone is invited to join in and edit the work-in- progress meeting topics and notes: https://hackmd.io/76o-IxCjQX2mOXO_wwkcpg?both Best wishes Sebastian -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From jordan at jordanmartel.com Tue Feb 18 16:32:52 2020 From: jordan at jordanmartel.com (Jordan) Date: Tue, 18 Feb 2020 21:32:52 +0000 Subject: [Numpy-discussion] numpy_financial functions Message-ID: I teach finance at IU Bloomington and use the numpy_financial module pretty heavily. I've written a couple of functions for bond math for my own use (duration, convexity, forward rates, etc.). Is there any appetite for expanding numpy_financial beyond the core Excel functions? Is there a point-person for numpy_financial with whom I could correspond about contributing? Thanks, Jordan -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Tue Feb 18 16:54:36 2020 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 18 Feb 2020 14:54:36 -0700 Subject: [Numpy-discussion] numpy_financial functions In-Reply-To: References: Message-ID: On Tue, Feb 18, 2020 at 2:33 PM Jordan wrote: > I teach finance at IU Bloomington and use the numpy_financial module > pretty heavily. I've written a couple of functions for bond math for my own > use (duration, convexity, forward rates, etc.). Is there any appetite for > expanding numpy_financial beyond the core Excel functions? Is there a > point-person for numpy_financial with whom I could correspond about > contributing? > Thanks, > Jordan > The financial package is separate from NumPy at this point. I don't see a problem with it being extended as long as it is maintained. Your best bet might be to make a PR at https://github.com/numpy/numpy-financial and initiate a conversation. I suspect the current maintainer(s) would be happy for help. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefanv at berkeley.edu Tue Feb 18 20:06:00 2020 From: stefanv at berkeley.edu (Stefan van der Walt) Date: Tue, 18 Feb 2020 17:06:00 -0800 Subject: [Numpy-discussion] Tensor Developer Summit Message-ID: <167a1c2b-4288-4bc5-bae8-64f2d800e11d@www.fastmail.com> Hi all, This has been mentioned on the community calls, but not on the mailing list, so a reminder about the Tensor Developer Summit happening at March in Berkeley: https://xd-con.org/tensor-2020/ We would love to have developers and advanced users of NumPy (or other array libraries with Python interfaces) attend. Registration closes 20 February. Best regards, St?fan From melissawm at gmail.com Wed Feb 19 06:58:52 2020 From: melissawm at gmail.com (=?UTF-8?Q?Melissa_Mendon=C3=A7a?=) Date: Wed, 19 Feb 2020 08:58:52 -0300 Subject: [Numpy-discussion] Proposal to accept NEP #44: Restructuring the NumPy Documentation Message-ID: Hi all, I am proposing the acceptance of NEP 44 - Restructuring the NumPy Documentation. https://numpy.org/neps/nep-0044-restructuring-numpy-docs.html There were some comments about reorganizing the text to make it clearer, and some rewording regarding Reference Guides, developer "under-the-hoods" documentation and the possible mechanism for community contributions in the form of new tutorials or how-tos. Overall I believe we have answered all comments and there were no other points of concern. If there are no substantive objections within 7 days from this email, then the NEP will be accepted; see NEP 0 for more details: see http://www.numpy.org/neps/nep-0000.html Thanks for the comments! 
Looking forward to starting work on this :) -- Melissa Weber Mendonça -------------- next part -------------- An HTML attachment was scrubbed... URL:
From ndbecker2 at gmail.com Wed Feb 19 14:43:38 2020 From: ndbecker2 at gmail.com (Neal Becker) Date: Wed, 19 Feb 2020 14:43:38 -0500 Subject: [Numpy-discussion] Tensor Developer Summit References: <167a1c2b-4288-4bc5-bae8-64f2d800e11d@www.fastmail.com> Message-ID: Stefan van der Walt wrote: > Hi all, > > This has been mentioned on the community calls, but not on the mailing > list, so a reminder about the Tensor Developer Summit happening at March > in Berkeley: > > https://xd-con.org/tensor-2020/ > > We would love to have developers and advanced users of NumPy (or other > array libraries with Python interfaces) attend. Registration closes 20 > February. > > Best regards, > Stéfan Sounds like an exciting group! Will it be streamed? Thanks, Neal
From stefanv at berkeley.edu Wed Feb 19 15:36:51 2020 From: stefanv at berkeley.edu (Stefan van der Walt) Date: Wed, 19 Feb 2020 12:36:51 -0800 Subject: [Numpy-discussion] Tensor Developer Summit In-Reply-To: References: <167a1c2b-4288-4bc5-bae8-64f2d800e11d@www.fastmail.com> Message-ID: Hi Neal, On Wed, Feb 19, 2020, at 11:43, Neal Becker wrote: > Sounds like an exciting group! Will it be streamed? Due to the highly interactive nature of this event, we will not be streaming. Best regards, Stéfan
From sebastian at sipsolutions.net Wed Feb 19 18:00:25 2020 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Wed, 19 Feb 2020 15:00:25 -0800 Subject: [Numpy-discussion] Proposal to accept NEP #44: Restructuring the NumPy Documentation In-Reply-To: References: Message-ID: <335487fc63226509f5686a2eab2e8cda65919d4c.camel@sipsolutions.net> On Wed, 2020-02-19 at 08:58 -0300, Melissa Mendonça wrote: > Hi all, > > I am proposing the acceptance of NEP 44 - Restructuring the NumPy > Documentation. > > https://numpy.org/neps/nep-0044-restructuring-numpy-docs.html > > There were some comments about reorganizing the text to make it > clearer, and some rewording regarding Reference Guides, developer > "under-the-hoods" documentation and the possible mechanism for > community contributions in the form of new tutorials or how-tos. > Overall I believe we have answered all comments and there were no > other points of concern. > If there are no substantive objections within 7 days from this email, > then the NEP will be accepted; see NEP 0 for more details: > see http://www.numpy.org/neps/nep-0000.html > > Thanks for the comments! Looking forward to starting work on this :) Looks like a good structure to me, I think go ahead and start! - Sebastian > -- > Melissa Weber Mendonça > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From stefanv at berkeley.edu Fri Feb 21 13:29:16 2020 From: stefanv at berkeley.edu (Stefan van der Walt) Date: Fri, 21 Feb 2020 10:29:16 -0800 Subject: [Numpy-discussion] =?utf-8?q?Proposal_to_accept_NEP_=2344=3A_Res?= =?utf-8?q?tructuring_the_NumPy_Documentation?= In-Reply-To: References: Message-ID: <004c39a1-49dd-4a77-a6c1-3bdce2fbfad3@www.fastmail.com> On Wed, Feb 19, 2020, at 03:58, Melissa Mendon?a wrote: > I am proposing the acceptance of NEP 44 - Restructuring the NumPy Documentation. > > https://numpy.org/neps/nep-0044-restructuring-numpy-docs.html Thanks, Melissa, for developing this NEP! The plan makes sense to me. Best regards, St?fan -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Fri Feb 21 14:05:05 2020 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 21 Feb 2020 12:05:05 -0700 Subject: [Numpy-discussion] Scikit-learn in the news. Message-ID: Hi All, Just thought I mention a new paper where scikit-learn was used: A Deep Learning Approach to Antibiotic Discovery . Congratulations to the scikit-learn team. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Fri Feb 21 20:37:01 2020 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Fri, 21 Feb 2020 17:37:01 -0800 Subject: [Numpy-discussion] New DTypes: Are scalars a central concept in NumPy or not? Message-ID: Hi all, When we create new datatypes, we have the option to make new choices for the new datatypes [0] (not the existing ones). The question is: Should every NumPy datatype have a scalar associated and should operations like indexing return a scalar or a 0-D array? This is in my opinion a complex, almost philosophical, question, and we do not have to settle anything for a long time. But, if we do not decide a direction before we have many new datatypes the decision will make itself... So happy about any ideas, even if its just a gut feeling :). There are various points. I would like to mostly ignore the technical ones, but I am listing them anyway here: * Scalars are faster (although that can be optimized likely) * Scalars have a lower memory footprint * The current implementation incurs a technical debt in NumPy. (I do not think that is a general issue, though. We could automatically create scalars for each new datatype probably.) Advantages of having no scalars: * No need to keep track of scalars to preserve them in ufuncs, or libraries using `np.asarray`, do they need `np.asarray_or_scalar`? (or decide they return always arrays, although ufuncs may not) * Seems simpler in many ways, you always know the output will be an array if it has to do with NumPy. Advantages of having scalars: * Scalars are immutable and we are used to them from Python. A 0-D array cannot be used as a dictionary key consistently [1]. I.e. without scalars as first class citizen `dict[arr1d[0]]` cannot work, `dict[arr1d[0].item()]` may (if `.item()` is defined, and e.g. `dict[arr1d[0].frozen()]` could make a copy to work. [2] * Object arrays as we have them now make sense, `arr1d[0]` can reasonably return a Python object. I.e. arrays feel more like container if you can take elements out easily. Could go both ways: * Scalar math `scalar = arr1d[0]; scalar += 1` modifies the array without scalars. With scalars `arr1d[0, ...]` clarifies the meaning. 
(In principle it is good to never use `arr2d[0]` to get a 1D slice, probably more-so if scalars exist.) Note: array-scalars (the current NumPy scalars) are not useful in my opinion [3]. A scalar should not be indexed or have a shape. I do not believe in scalars pretending to be arrays. I personally tend towards liking scalars. If Python was a language where the array (array-programming) concept was ingrained into the language itself, I would lean the other way. But users are used to scalars, and they "put" scalars into arrays. Array objects are in some ways strange in Python, and I feel not having scalars detaches them further. Having scalars, however also means we should preserve them. I feel in principle that is actually fairly straight forward. E.g. for ufuncs: * np.add(scalar, scalar) -> scalar * np.add.reduce(arr, axis=None) -> scalar * np.add.reduce(arr, axis=1) -> array (even if arr is 1d) * np.add.reduce(scalar, axis=()) -> array Of course libraries that do `np.asarray` would/could basically chose to not preserve scalars: Their signature is defined as taking strictly array input. Cheers, Sebastian [0] At best this can be a vision to decide which way they may evolve. [1] E.g. PyTorch uses `hash(tensor) == id(tensor)` which is arguably strange. E.g. Quantity defines hash correctly, but does not fully ensure immutability for 0-D Quantities. Ensuring immutability in a world where "views" are a central concept requires a write-only copy. [2] Arguably `.item()` would always return a scalar, but it would be a second class citizen. (Although if it returns a scalar, at least we already have a scalar implementation.) [3] They are necessary due to technical debt for NumPy datatypes though. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From jni at fastmail.com Fri Feb 21 21:15:07 2020 From: jni at fastmail.com (Juan Nunez-Iglesias) Date: Fri, 21 Feb 2020 20:15:07 -0600 Subject: [Numpy-discussion] =?utf-8?q?New_DTypes=3A_Are_scalars_a_central?= =?utf-8?q?_concept_in_NumPy_or_not=3F?= In-Reply-To: References: Message-ID: I personally have always found it weird and annoying to deal with 0-D arrays, so +1 for scalars!* Juan *: admittedly, I have almost no grasp of the underlying NumPy implementation complexities, but I will happily take Sebastian's word that scalars can be consistent with the library. On Fri, 21 Feb 2020, at 7:37 PM, Sebastian Berg wrote: > Hi all, > > When we create new datatypes, we have the option to make new choices > for the new datatypes [0] (not the existing ones). > > The question is: Should every NumPy datatype have a scalar associated > and should operations like indexing return a scalar or a 0-D array? > > This is in my opinion a complex, almost philosophical, question, and we > do not have to settle anything for a long time. But, if we do not > decide a direction before we have many new datatypes the decision will > make itself... > So happy about any ideas, even if its just a gut feeling :). > > There are various points. I would like to mostly ignore the technical > ones, but I am listing them anyway here: > > * Scalars are faster (although that can be optimized likely) > > * Scalars have a lower memory footprint > > * The current implementation incurs a technical debt in NumPy. > (I do not think that is a general issue, though. We could > automatically create scalars for each new datatype probably.) 
> > Advantages of having no scalars: > > * No need to keep track of scalars to preserve them in ufuncs, or > libraries using `np.asarray`, do they need `np.asarray_or_scalar`? > (or decide they return always arrays, although ufuncs may not) > > * Seems simpler in many ways, you always know the output will be an > array if it has to do with NumPy. > > Advantages of having scalars: > > * Scalars are immutable and we are used to them from Python. > A 0-D array cannot be used as a dictionary key consistently [1]. > > I.e. without scalars as first class citizen `dict[arr1d[0]]` > cannot work, `dict[arr1d[0].item()]` may (if `.item()` is defined, > and e.g. `dict[arr1d[0].frozen()]` could make a copy to work. [2] > > * Object arrays as we have them now make sense, `arr1d[0]` can > reasonably return a Python object. I.e. arrays feel more like > container if you can take elements out easily. > > Could go both ways: > > * Scalar math `scalar = arr1d[0]; scalar += 1` modifies the array > without scalars. With scalars `arr1d[0, ...]` clarifies the > meaning. (In principle it is good to never use `arr2d[0]` to > get a 1D slice, probably more-so if scalars exist.) > > Note: array-scalars (the current NumPy scalars) are not useful in my > opinion [3]. A scalar should not be indexed or have a shape. I do not > believe in scalars pretending to be arrays. > > I personally tend towards liking scalars. If Python was a language > where the array (array-programming) concept was ingrained into the > language itself, I would lean the other way. But users are used to > scalars, and they "put" scalars into arrays. Array objects are in some > ways strange in Python, and I feel not having scalars detaches them > further. > > Having scalars, however also means we should preserve them. I feel in > principle that is actually fairly straight forward. E.g. for ufuncs: > > * np.add(scalar, scalar) -> scalar > * np.add.reduce(arr, axis=None) -> scalar > * np.add.reduce(arr, axis=1) -> array (even if arr is 1d) > * np.add.reduce(scalar, axis=()) -> array > > Of course libraries that do `np.asarray` would/could basically chose to > not preserve scalars: Their signature is defined as taking strictly > array input. > > Cheers, > > Sebastian > > > [0] At best this can be a vision to decide which way they may evolve. > > [1] E.g. PyTorch uses `hash(tensor) == id(tensor)` which is arguably > strange. E.g. Quantity defines hash correctly, but does not fully > ensure immutability for 0-D Quantities. Ensuring immutability in a > world where "views" are a central concept requires a write-only copy. > > [2] Arguably `.item()` would always return a scalar, but it would be a > second class citizen. (Although if it returns a scalar, at least we > already have a scalar implementation.) > > [3] They are necessary due to technical debt for NumPy datatypes > though. > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > > *Attachments:* > * signature.asc -------------- next part -------------- An HTML attachment was scrubbed... URL: From evgeny.burovskiy at gmail.com Sat Feb 22 09:27:01 2020 From: evgeny.burovskiy at gmail.com (Evgeni Burovski) Date: Sat, 22 Feb 2020 17:27:01 +0300 Subject: [Numpy-discussion] New DTypes: Are scalars a central concept in NumPy or not? 
In-Reply-To: References: Message-ID: Hi Sebastian, Just to clarify the difference: >>> x = np.float64(42) >>> y = np.array(42, dtype=float) Here `x` is a scalar and `y` is a 0D array, correct? If that's the case, not having the former would be very confusing for users (at least, that would be very confusing to me, FWIW). If anything, I think it'd be cleaner to not have the latter, and only have either scalars or 1D arrays (i.e., N-D arrays with N>=1), but it is probably way too late to even think about it anyway. Cheers, Evgeni On Sat, Feb 22, 2020 at 4:37 AM Sebastian Berg wrote: > > Hi all, > > When we create new datatypes, we have the option to make new choices > for the new datatypes [0] (not the existing ones). > > The question is: Should every NumPy datatype have a scalar associated > and should operations like indexing return a scalar or a 0-D array? > > This is in my opinion a complex, almost philosophical, question, and we > do not have to settle anything for a long time. But, if we do not > decide a direction before we have many new datatypes the decision will > make itself... > So happy about any ideas, even if its just a gut feeling :). > > There are various points. I would like to mostly ignore the technical > ones, but I am listing them anyway here: > > * Scalars are faster (although that can be optimized likely) > > * Scalars have a lower memory footprint > > * The current implementation incurs a technical debt in NumPy. > (I do not think that is a general issue, though. We could > automatically create scalars for each new datatype probably.) > > Advantages of having no scalars: > > * No need to keep track of scalars to preserve them in ufuncs, or > libraries using `np.asarray`, do they need `np.asarray_or_scalar`? > (or decide they return always arrays, although ufuncs may not) > > * Seems simpler in many ways, you always know the output will be an > array if it has to do with NumPy. > > Advantages of having scalars: > > * Scalars are immutable and we are used to them from Python. > A 0-D array cannot be used as a dictionary key consistently [1]. > > I.e. without scalars as first class citizen `dict[arr1d[0]]` > cannot work, `dict[arr1d[0].item()]` may (if `.item()` is defined, > and e.g. `dict[arr1d[0].frozen()]` could make a copy to work. [2] > > * Object arrays as we have them now make sense, `arr1d[0]` can > reasonably return a Python object. I.e. arrays feel more like > container if you can take elements out easily. > > Could go both ways: > > * Scalar math `scalar = arr1d[0]; scalar += 1` modifies the array > without scalars. With scalars `arr1d[0, ...]` clarifies the > meaning. (In principle it is good to never use `arr2d[0]` to > get a 1D slice, probably more-so if scalars exist.) > > Note: array-scalars (the current NumPy scalars) are not useful in my > opinion [3]. A scalar should not be indexed or have a shape. I do not > believe in scalars pretending to be arrays. > > I personally tend towards liking scalars. If Python was a language > where the array (array-programming) concept was ingrained into the > language itself, I would lean the other way. But users are used to > scalars, and they "put" scalars into arrays. Array objects are in some > ways strange in Python, and I feel not having scalars detaches them > further. > > Having scalars, however also means we should preserve them. I feel in > principle that is actually fairly straight forward. E.g. 
for ufuncs: > > * np.add(scalar, scalar) -> scalar > * np.add.reduce(arr, axis=None) -> scalar > * np.add.reduce(arr, axis=1) -> array (even if arr is 1d) > * np.add.reduce(scalar, axis=()) -> array > > Of course libraries that do `np.asarray` would/could basically chose to > not preserve scalars: Their signature is defined as taking strictly > array input. > > Cheers, > > Sebastian > > > [0] At best this can be a vision to decide which way they may evolve. > > [1] E.g. PyTorch uses `hash(tensor) == id(tensor)` which is arguably > strange. E.g. Quantity defines hash correctly, but does not fully > ensure immutability for 0-D Quantities. Ensuring immutability in a > world where "views" are a central concept requires a write-only copy. > > [2] Arguably `.item()` would always return a scalar, but it would be a > second class citizen. (Although if it returns a scalar, at least we > already have a scalar implementation.) > > [3] They are necessary due to technical debt for NumPy datatypes > though. > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion From josef.pktd at gmail.com Sat Feb 22 09:34:55 2020 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Sat, 22 Feb 2020 09:34:55 -0500 Subject: [Numpy-discussion] New DTypes: Are scalars a central concept in NumPy or not? In-Reply-To: References: Message-ID: not having a hashable tuple conversion would be a strong limitation a = tuple(np.arange(5)) versus a = tuple([np.array(i) for i in range(5)]) {a:5} Josef On Sat, Feb 22, 2020 at 9:28 AM Evgeni Burovski wrote: > Hi Sebastian, > > Just to clarify the difference: > > >>> x = np.float64(42) > >>> y = np.array(42, dtype=float) > > Here `x` is a scalar and `y` is a 0D array, correct? > If that's the case, not having the former would be very confusing for > users (at least, that would be very confusing to me, FWIW). > > If anything, I think it'd be cleaner to not have the latter, and only > have either scalars or 1D arrays (i.e., N-D arrays with N>=1), but it > is probably way too late to even think about it anyway. > > Cheers, > > Evgeni > > On Sat, Feb 22, 2020 at 4:37 AM Sebastian Berg > wrote: > > > > Hi all, > > > > When we create new datatypes, we have the option to make new choices > > for the new datatypes [0] (not the existing ones). > > > > The question is: Should every NumPy datatype have a scalar associated > > and should operations like indexing return a scalar or a 0-D array? > > > > This is in my opinion a complex, almost philosophical, question, and we > > do not have to settle anything for a long time. But, if we do not > > decide a direction before we have many new datatypes the decision will > > make itself... > > So happy about any ideas, even if its just a gut feeling :). > > > > There are various points. I would like to mostly ignore the technical > > ones, but I am listing them anyway here: > > > > * Scalars are faster (although that can be optimized likely) > > > > * Scalars have a lower memory footprint > > > > * The current implementation incurs a technical debt in NumPy. > > (I do not think that is a general issue, though. We could > > automatically create scalars for each new datatype probably.) > > > > Advantages of having no scalars: > > > > * No need to keep track of scalars to preserve them in ufuncs, or > > libraries using `np.asarray`, do they need `np.asarray_or_scalar`? 
> > (or decide they return always arrays, although ufuncs may not) > > > > * Seems simpler in many ways, you always know the output will be an > > array if it has to do with NumPy. > > > > Advantages of having scalars: > > > > * Scalars are immutable and we are used to them from Python. > > A 0-D array cannot be used as a dictionary key consistently [1]. > > > > I.e. without scalars as first class citizen `dict[arr1d[0]]` > > cannot work, `dict[arr1d[0].item()]` may (if `.item()` is defined, > > and e.g. `dict[arr1d[0].frozen()]` could make a copy to work. [2] > > > > * Object arrays as we have them now make sense, `arr1d[0]` can > > reasonably return a Python object. I.e. arrays feel more like > > container if you can take elements out easily. > > > > Could go both ways: > > > > * Scalar math `scalar = arr1d[0]; scalar += 1` modifies the array > > without scalars. With scalars `arr1d[0, ...]` clarifies the > > meaning. (In principle it is good to never use `arr2d[0]` to > > get a 1D slice, probably more-so if scalars exist.) > > > > Note: array-scalars (the current NumPy scalars) are not useful in my > > opinion [3]. A scalar should not be indexed or have a shape. I do not > > believe in scalars pretending to be arrays. > > > > I personally tend towards liking scalars. If Python was a language > > where the array (array-programming) concept was ingrained into the > > language itself, I would lean the other way. But users are used to > > scalars, and they "put" scalars into arrays. Array objects are in some > > ways strange in Python, and I feel not having scalars detaches them > > further. > > > > Having scalars, however also means we should preserve them. I feel in > > principle that is actually fairly straight forward. E.g. for ufuncs: > > > > * np.add(scalar, scalar) -> scalar > > * np.add.reduce(arr, axis=None) -> scalar > > * np.add.reduce(arr, axis=1) -> array (even if arr is 1d) > > * np.add.reduce(scalar, axis=()) -> array > > > > Of course libraries that do `np.asarray` would/could basically chose to > > not preserve scalars: Their signature is defined as taking strictly > > array input. > > > > Cheers, > > > > Sebastian > > > > > > [0] At best this can be a vision to decide which way they may evolve. > > > > [1] E.g. PyTorch uses `hash(tensor) == id(tensor)` which is arguably > > strange. E.g. Quantity defines hash correctly, but does not fully > > ensure immutability for 0-D Quantities. Ensuring immutability in a > > world where "views" are a central concept requires a write-only copy. > > > > [2] Arguably `.item()` would always return a scalar, but it would be a > > second class citizen. (Although if it returns a scalar, at least we > > already have a scalar implementation.) > > > > [3] They are necessary due to technical debt for NumPy datatypes > > though. > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at python.org > > https://mail.python.org/mailman/listinfo/numpy-discussion > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Sat Feb 22 09:41:10 2020 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Sat, 22 Feb 2020 09:41:10 -0500 Subject: [Numpy-discussion] New DTypes: Are scalars a central concept in NumPy or not? 
In-Reply-To: References: Message-ID: On Sat, Feb 22, 2020 at 9:34 AM wrote: > not having a hashable tuple conversion would be a strong limitation > > a = tuple(np.arange(5)) > versus > a = tuple([np.array(i) for i in range(5)]) > {a:5} > also there is the question of which scalar .item() versus [()] This was used in the old times in scipy.stats, and I just saw https://github.com/scipy/scipy/pull/11165#issuecomment-589952838 aside: AFAIR, I use 0-dim arrays also to ensure that I have a numpy dtype and not, e.g. some equivalent python type Josef > > Josef > > On Sat, Feb 22, 2020 at 9:28 AM Evgeni Burovski < > evgeny.burovskiy at gmail.com> wrote: > >> Hi Sebastian, >> >> Just to clarify the difference: >> >> >>> x = np.float64(42) >> >>> y = np.array(42, dtype=float) >> >> Here `x` is a scalar and `y` is a 0D array, correct? >> If that's the case, not having the former would be very confusing for >> users (at least, that would be very confusing to me, FWIW). >> >> If anything, I think it'd be cleaner to not have the latter, and only >> have either scalars or 1D arrays (i.e., N-D arrays with N>=1), but it >> is probably way too late to even think about it anyway. >> >> Cheers, >> >> Evgeni >> >> On Sat, Feb 22, 2020 at 4:37 AM Sebastian Berg >> wrote: >> > >> > Hi all, >> > >> > When we create new datatypes, we have the option to make new choices >> > for the new datatypes [0] (not the existing ones). >> > >> > The question is: Should every NumPy datatype have a scalar associated >> > and should operations like indexing return a scalar or a 0-D array? >> > >> > This is in my opinion a complex, almost philosophical, question, and we >> > do not have to settle anything for a long time. But, if we do not >> > decide a direction before we have many new datatypes the decision will >> > make itself... >> > So happy about any ideas, even if its just a gut feeling :). >> > >> > There are various points. I would like to mostly ignore the technical >> > ones, but I am listing them anyway here: >> > >> > * Scalars are faster (although that can be optimized likely) >> > >> > * Scalars have a lower memory footprint >> > >> > * The current implementation incurs a technical debt in NumPy. >> > (I do not think that is a general issue, though. We could >> > automatically create scalars for each new datatype probably.) >> > >> > Advantages of having no scalars: >> > >> > * No need to keep track of scalars to preserve them in ufuncs, or >> > libraries using `np.asarray`, do they need `np.asarray_or_scalar`? >> > (or decide they return always arrays, although ufuncs may not) >> > >> > * Seems simpler in many ways, you always know the output will be an >> > array if it has to do with NumPy. >> > >> > Advantages of having scalars: >> > >> > * Scalars are immutable and we are used to them from Python. >> > A 0-D array cannot be used as a dictionary key consistently [1]. >> > >> > I.e. without scalars as first class citizen `dict[arr1d[0]]` >> > cannot work, `dict[arr1d[0].item()]` may (if `.item()` is defined, >> > and e.g. `dict[arr1d[0].frozen()]` could make a copy to work. [2] >> > >> > * Object arrays as we have them now make sense, `arr1d[0]` can >> > reasonably return a Python object. I.e. arrays feel more like >> > container if you can take elements out easily. >> > >> > Could go both ways: >> > >> > * Scalar math `scalar = arr1d[0]; scalar += 1` modifies the array >> > without scalars. With scalars `arr1d[0, ...]` clarifies the >> > meaning. 
(In principle it is good to never use `arr2d[0]` to >> > get a 1D slice, probably more-so if scalars exist.) >> > >> > Note: array-scalars (the current NumPy scalars) are not useful in my >> > opinion [3]. A scalar should not be indexed or have a shape. I do not >> > believe in scalars pretending to be arrays. >> > >> > I personally tend towards liking scalars. If Python was a language >> > where the array (array-programming) concept was ingrained into the >> > language itself, I would lean the other way. But users are used to >> > scalars, and they "put" scalars into arrays. Array objects are in some >> > ways strange in Python, and I feel not having scalars detaches them >> > further. >> > >> > Having scalars, however also means we should preserve them. I feel in >> > principle that is actually fairly straight forward. E.g. for ufuncs: >> > >> > * np.add(scalar, scalar) -> scalar >> > * np.add.reduce(arr, axis=None) -> scalar >> > * np.add.reduce(arr, axis=1) -> array (even if arr is 1d) >> > * np.add.reduce(scalar, axis=()) -> array >> > >> > Of course libraries that do `np.asarray` would/could basically chose to >> > not preserve scalars: Their signature is defined as taking strictly >> > array input. >> > >> > Cheers, >> > >> > Sebastian >> > >> > >> > [0] At best this can be a vision to decide which way they may evolve. >> > >> > [1] E.g. PyTorch uses `hash(tensor) == id(tensor)` which is arguably >> > strange. E.g. Quantity defines hash correctly, but does not fully >> > ensure immutability for 0-D Quantities. Ensuring immutability in a >> > world where "views" are a central concept requires a write-only copy. >> > >> > [2] Arguably `.item()` would always return a scalar, but it would be a >> > second class citizen. (Although if it returns a scalar, at least we >> > already have a scalar implementation.) >> > >> > [3] They are necessary due to technical debt for NumPy datatypes >> > though. >> > _______________________________________________ >> > NumPy-Discussion mailing list >> > NumPy-Discussion at python.org >> > https://mail.python.org/mailman/listinfo/numpy-discussion >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Sat Feb 22 09:53:29 2020 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Sat, 22 Feb 2020 09:53:29 -0500 Subject: [Numpy-discussion] New DTypes: Are scalars a central concept in NumPy or not? In-Reply-To: References: Message-ID: On Sat, Feb 22, 2020 at 9:41 AM wrote: > > > On Sat, Feb 22, 2020 at 9:34 AM wrote: > >> not having a hashable tuple conversion would be a strong limitation >> >> a = tuple(np.arange(5)) >> versus >> a = tuple([np.array(i) for i in range(5)]) >> {a:5} >> > > also there is the question of which scalar > > .item() versus [()] > > This was used in the old times in scipy.stats, and I just saw > https://github.com/scipy/scipy/pull/11165#issuecomment-589952838 > > aside: > AFAIR, I use 0-dim arrays also to ensure that I have a numpy dtype and > not, e.g. 
some equivalent python type > 0-dim as mutable pseudo-scalar a = np.asarray(5) a, id(a) (array(5), 844574884528) a[()] = 1 a, id(a) (array(1), 844574884528) maybe I never used that, In a recent similar case, I could use just a 1-d list or array to work around python's muting or mutability behavior > Josef > > >> >> Josef >> >> On Sat, Feb 22, 2020 at 9:28 AM Evgeni Burovski < >> evgeny.burovskiy at gmail.com> wrote: >> >>> Hi Sebastian, >>> >>> Just to clarify the difference: >>> >>> >>> x = np.float64(42) >>> >>> y = np.array(42, dtype=float) >>> >>> Here `x` is a scalar and `y` is a 0D array, correct? >>> If that's the case, not having the former would be very confusing for >>> users (at least, that would be very confusing to me, FWIW). >>> >>> If anything, I think it'd be cleaner to not have the latter, and only >>> have either scalars or 1D arrays (i.e., N-D arrays with N>=1), but it >>> is probably way too late to even think about it anyway. >>> >>> Cheers, >>> >>> Evgeni >>> >>> On Sat, Feb 22, 2020 at 4:37 AM Sebastian Berg >>> wrote: >>> > >>> > Hi all, >>> > >>> > When we create new datatypes, we have the option to make new choices >>> > for the new datatypes [0] (not the existing ones). >>> > >>> > The question is: Should every NumPy datatype have a scalar associated >>> > and should operations like indexing return a scalar or a 0-D array? >>> > >>> > This is in my opinion a complex, almost philosophical, question, and we >>> > do not have to settle anything for a long time. But, if we do not >>> > decide a direction before we have many new datatypes the decision will >>> > make itself... >>> > So happy about any ideas, even if its just a gut feeling :). >>> > >>> > There are various points. I would like to mostly ignore the technical >>> > ones, but I am listing them anyway here: >>> > >>> > * Scalars are faster (although that can be optimized likely) >>> > >>> > * Scalars have a lower memory footprint >>> > >>> > * The current implementation incurs a technical debt in NumPy. >>> > (I do not think that is a general issue, though. We could >>> > automatically create scalars for each new datatype probably.) >>> > >>> > Advantages of having no scalars: >>> > >>> > * No need to keep track of scalars to preserve them in ufuncs, or >>> > libraries using `np.asarray`, do they need `np.asarray_or_scalar`? >>> > (or decide they return always arrays, although ufuncs may not) >>> > >>> > * Seems simpler in many ways, you always know the output will be an >>> > array if it has to do with NumPy. >>> > >>> > Advantages of having scalars: >>> > >>> > * Scalars are immutable and we are used to them from Python. >>> > A 0-D array cannot be used as a dictionary key consistently [1]. >>> > >>> > I.e. without scalars as first class citizen `dict[arr1d[0]]` >>> > cannot work, `dict[arr1d[0].item()]` may (if `.item()` is defined, >>> > and e.g. `dict[arr1d[0].frozen()]` could make a copy to work. [2] >>> > >>> > * Object arrays as we have them now make sense, `arr1d[0]` can >>> > reasonably return a Python object. I.e. arrays feel more like >>> > container if you can take elements out easily. >>> > >>> > Could go both ways: >>> > >>> > * Scalar math `scalar = arr1d[0]; scalar += 1` modifies the array >>> > without scalars. With scalars `arr1d[0, ...]` clarifies the >>> > meaning. (In principle it is good to never use `arr2d[0]` to >>> > get a 1D slice, probably more-so if scalars exist.) >>> > >>> > Note: array-scalars (the current NumPy scalars) are not useful in my >>> > opinion [3]. 
A scalar should not be indexed or have a shape. I do not >>> > believe in scalars pretending to be arrays. >>> > >>> > I personally tend towards liking scalars. If Python was a language >>> > where the array (array-programming) concept was ingrained into the >>> > language itself, I would lean the other way. But users are used to >>> > scalars, and they "put" scalars into arrays. Array objects are in some >>> > ways strange in Python, and I feel not having scalars detaches them >>> > further. >>> > >>> > Having scalars, however also means we should preserve them. I feel in >>> > principle that is actually fairly straight forward. E.g. for ufuncs: >>> > >>> > * np.add(scalar, scalar) -> scalar >>> > * np.add.reduce(arr, axis=None) -> scalar >>> > * np.add.reduce(arr, axis=1) -> array (even if arr is 1d) >>> > * np.add.reduce(scalar, axis=()) -> array >>> > >>> > Of course libraries that do `np.asarray` would/could basically chose to >>> > not preserve scalars: Their signature is defined as taking strictly >>> > array input. >>> > >>> > Cheers, >>> > >>> > Sebastian >>> > >>> > >>> > [0] At best this can be a vision to decide which way they may evolve. >>> > >>> > [1] E.g. PyTorch uses `hash(tensor) == id(tensor)` which is arguably >>> > strange. E.g. Quantity defines hash correctly, but does not fully >>> > ensure immutability for 0-D Quantities. Ensuring immutability in a >>> > world where "views" are a central concept requires a write-only copy. >>> > >>> > [2] Arguably `.item()` would always return a scalar, but it would be a >>> > second class citizen. (Although if it returns a scalar, at least we >>> > already have a scalar implementation.) >>> > >>> > [3] They are necessary due to technical debt for NumPy datatypes >>> > though. >>> > _______________________________________________ >>> > NumPy-Discussion mailing list >>> > NumPy-Discussion at python.org >>> > https://mail.python.org/mailman/listinfo/numpy-discussion >>> _______________________________________________ >>> NumPy-Discussion mailing list >>> NumPy-Discussion at python.org >>> https://mail.python.org/mailman/listinfo/numpy-discussion >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Sat Feb 22 16:28:16 2020 From: njs at pobox.com (Nathaniel Smith) Date: Sat, 22 Feb 2020 13:28:16 -0800 Subject: [Numpy-discussion] New DTypes: Are scalars a central concept in NumPy or not? In-Reply-To: References: Message-ID: Off the cuff, my intuition is that dtypes will want to be able to define how scalar indexing works, and let it return objects other than arrays. So e.g.: - some dtypes might just return a zero-d array - some dtypes might want to return some arbitrary domain-appropriate type, like a datetime dtype might want to return datetime.datetime objects (like how dtype(object) works now) - some dtypes might want to go to all the trouble to define immutable duck-array "scalar" types (like how dtype(float) and friends work now) But I don't think we need to give that last case any special privileges in the dtype system. For example, I don't think we need to mandate that everyone who defines their own dtype MUST also implement a custom duck-array type to act as the scalars, or build a whole complex system to auto-generate such types given an arbitrary user-defined dtype. -n On Fri, Feb 21, 2020 at 5:37 PM Sebastian Berg wrote: > > Hi all, > > When we create new datatypes, we have the option to make new choices > for the new datatypes [0] (not the existing ones). 
> > The question is: Should every NumPy datatype have a scalar associated > and should operations like indexing return a scalar or a 0-D array? > > This is in my opinion a complex, almost philosophical, question, and we > do not have to settle anything for a long time. But, if we do not > decide a direction before we have many new datatypes the decision will > make itself... > So happy about any ideas, even if its just a gut feeling :). > > There are various points. I would like to mostly ignore the technical > ones, but I am listing them anyway here: > > * Scalars are faster (although that can be optimized likely) > > * Scalars have a lower memory footprint > > * The current implementation incurs a technical debt in NumPy. > (I do not think that is a general issue, though. We could > automatically create scalars for each new datatype probably.) > > Advantages of having no scalars: > > * No need to keep track of scalars to preserve them in ufuncs, or > libraries using `np.asarray`, do they need `np.asarray_or_scalar`? > (or decide they return always arrays, although ufuncs may not) > > * Seems simpler in many ways, you always know the output will be an > array if it has to do with NumPy. > > Advantages of having scalars: > > * Scalars are immutable and we are used to them from Python. > A 0-D array cannot be used as a dictionary key consistently [1]. > > I.e. without scalars as first class citizen `dict[arr1d[0]]` > cannot work, `dict[arr1d[0].item()]` may (if `.item()` is defined, > and e.g. `dict[arr1d[0].frozen()]` could make a copy to work. [2] > > * Object arrays as we have them now make sense, `arr1d[0]` can > reasonably return a Python object. I.e. arrays feel more like > container if you can take elements out easily. > > Could go both ways: > > * Scalar math `scalar = arr1d[0]; scalar += 1` modifies the array > without scalars. With scalars `arr1d[0, ...]` clarifies the > meaning. (In principle it is good to never use `arr2d[0]` to > get a 1D slice, probably more-so if scalars exist.) > > Note: array-scalars (the current NumPy scalars) are not useful in my > opinion [3]. A scalar should not be indexed or have a shape. I do not > believe in scalars pretending to be arrays. > > I personally tend towards liking scalars. If Python was a language > where the array (array-programming) concept was ingrained into the > language itself, I would lean the other way. But users are used to > scalars, and they "put" scalars into arrays. Array objects are in some > ways strange in Python, and I feel not having scalars detaches them > further. > > Having scalars, however also means we should preserve them. I feel in > principle that is actually fairly straight forward. E.g. for ufuncs: > > * np.add(scalar, scalar) -> scalar > * np.add.reduce(arr, axis=None) -> scalar > * np.add.reduce(arr, axis=1) -> array (even if arr is 1d) > * np.add.reduce(scalar, axis=()) -> array > > Of course libraries that do `np.asarray` would/could basically chose to > not preserve scalars: Their signature is defined as taking strictly > array input. > > Cheers, > > Sebastian > > > [0] At best this can be a vision to decide which way they may evolve. > > [1] E.g. PyTorch uses `hash(tensor) == id(tensor)` which is arguably > strange. E.g. Quantity defines hash correctly, but does not fully > ensure immutability for 0-D Quantities. Ensuring immutability in a > world where "views" are a central concept requires a write-only copy. 
> > [2] Arguably `.item()` would always return a scalar, but it would be a > second class citizen. (Although if it returns a scalar, at least we > already have a scalar implementation.) > > [3] They are necessary due to technical debt for NumPy datatypes > though. > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion -- Nathaniel J. Smith -- https://vorpus.org From einstein.edison at gmail.com Sun Feb 23 05:04:10 2020 From: einstein.edison at gmail.com (Hameer Abbasi) Date: Sun, 23 Feb 2020 10:04:10 +0000 Subject: [Numpy-discussion] New DTypes: Are scalars a central concept in NumPy or not? References: Message-ID: Hi, Sebastian, ?On 22.02.20, 02:37, "NumPy-Discussion on behalf of Sebastian Berg" wrote: Hi all, When we create new datatypes, we have the option to make new choices for the new datatypes [0] (not the existing ones). The question is: Should every NumPy datatype have a scalar associated and should operations like indexing return a scalar or a 0-D array? This is in my opinion a complex, almost philosophical, question, and we do not have to settle anything for a long time. But, if we do not decide a direction before we have many new datatypes the decision will make itself... So happy about any ideas, even if its just a gut feeling :). There are various points. I would like to mostly ignore the technical ones, but I am listing them anyway here: * Scalars are faster (although that can be optimized likely) * Scalars have a lower memory footprint * The current implementation incurs a technical debt in NumPy. (I do not think that is a general issue, though. We could automatically create scalars for each new datatype probably.) Advantages of having no scalars: * No need to keep track of scalars to preserve them in ufuncs, or libraries using `np.asarray`, do they need `np.asarray_or_scalar`? (or decide they return always arrays, although ufuncs may not) * Seems simpler in many ways, you always know the output will be an array if it has to do with NumPy. Advantages of having scalars: * Scalars are immutable and we are used to them from Python. A 0-D array cannot be used as a dictionary key consistently [1]. I.e. without scalars as first class citizen `dict[arr1d[0]]` cannot work, `dict[arr1d[0].item()]` may (if `.item()` is defined, and e.g. `dict[arr1d[0].frozen()]` could make a copy to work. [2] * Object arrays as we have them now make sense, `arr1d[0]` can reasonably return a Python object. I.e. arrays feel more like container if you can take elements out easily. Could go both ways: * Scalar math `scalar = arr1d[0]; scalar += 1` modifies the array without scalars. With scalars `arr1d[0, ...]` clarifies the meaning. (In principle it is good to never use `arr2d[0]` to get a 1D slice, probably more-so if scalars exist.) From a usability perspective, one could argue that if the dimension of the array one is indexing into is known and the user isn't advanced, then the behavior expected is one of scalars and not 0D arrays. If, however, the input dimension is unknown, then the behavior switch at 0D and the need for an extra ellipsis to ensure array-ness makes things confusing to regular users. I am file with the current behavior of indexing, as anything else would likely be a large backwards-compat break. Note: array-scalars (the current NumPy scalars) are not useful in my opinion [3]. A scalar should not be indexed or have a shape. 
I do not believe in scalars pretending to be arrays. I personally tend towards liking scalars. If Python was a language where the array (array-programming) concept was ingrained into the language itself, I would lean the other way. But users are used to scalars, and they "put" scalars into arrays. Array objects are in some ways strange in Python, and I feel not having scalars detaches them further. Having scalars, however also means we should preserve them. I feel in principle that is actually fairly straight forward. E.g. for ufuncs: * np.add(scalar, scalar) -> scalar * np.add.reduce(arr, axis=None) -> scalar * np.add.reduce(arr, axis=1) -> array (even if arr is 1d) * np.add.reduce(scalar, axis=()) -> array I love this idea. Of course libraries that do `np.asarray` would/could basically chose to not preserve scalars: Their signature is defined as taking strictly array input. Cheers, Sebastian [0] At best this can be a vision to decide which way they may evolve. [1] E.g. PyTorch uses `hash(tensor) == id(tensor)` which is arguably strange. E.g. Quantity defines hash correctly, but does not fully ensure immutability for 0-D Quantities. Ensuring immutability in a world where "views" are a central concept requires a write-only copy. [2] Arguably `.item()` would always return a scalar, but it would be a second class citizen. (Although if it returns a scalar, at least we already have a scalar implementation.) [3] They are necessary due to technical debt for NumPy datatypes though. _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion at python.org https://mail.python.org/mailman/listinfo/numpy-discussion From sebastian at sipsolutions.net Sun Feb 23 16:56:55 2020 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Sun, 23 Feb 2020 13:56:55 -0800 Subject: [Numpy-discussion] New DTypes: Are scalars a central concept in NumPy or not? In-Reply-To: References: Message-ID: <4fecc19c9dcc40d935214dd3b997b2cf769cb7d7.camel@sipsolutions.net> On Sat, 2020-02-22 at 13:28 -0800, Nathaniel Smith wrote: > Off the cuff, my intuition is that dtypes will want to be able to > define how scalar indexing works, and let it return objects other > than > arrays. So e.g.: > > - some dtypes might just return a zero-d array > - some dtypes might want to return some arbitrary domain-appropriate > type, like a datetime dtype might want to return datetime.datetime > objects (like how dtype(object) works now) > - some dtypes might want to go to all the trouble to define immutable > duck-array "scalar" types (like how dtype(float) and friends work > now) Right, my assumption is that whatever we suggest is going to be what most will choose, so we have the chance to move in a certain direction and set a standard. This is to make code which may or may not deal with 0-D arrays more reliable (more below). > > But I don't think we need to give that last case any special > privileges in the dtype system. For example, I don't think we need to > mandate that everyone who defines their own dtype MUST also implement > a custom duck-array type to act as the scalars, or build a whole > complex system to auto-generate such types given an arbitrary > user-defined dtype. (Note that "autogenerating" would be nothing more than a write-only 0-D array, which does not implement indexing.) There are also categoricals, for which the type may just be "object" in practice (you could define it closer, but it seems unlikely to be useful). 
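To make the object-dtype case concrete, here is how element extraction behaves in current NumPy (plain illustration of existing behaviour, nothing hypothetical):

    import numpy as np

    a = np.empty(2, dtype=object)
    a[0] = [1, 2, 3]                  # a list stored as a single element
    a[1] = None
    elem = a[0]                       # comes back as a plain Python list
    print(type(elem))                 # <class 'list'>
    print(np.asarray(elem).shape)     # (3,) -- one element became three, no round-trip

    b = np.array([1.0, 2.0])
    print(type(b[0]))                 # <class 'numpy.float64'> -- an array-scalar
    print(np.asarray(b[0]).shape)     # ()   -- this one does round-trip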
And for simple numerical types, if we go the `.item()` path, it is arguably fine if the type is just a python type. Maybe the crux of the problem is actuall that in general `np.asarray(arr1d[0])` does not roundtrip for the current object dtype, and only partially for a categorical above. As such that is fine, but right now it is hard to tell when you will have a scalar and when a 0D array. Maybe it is better to talk about a potentially new `np.pyobject[type]` datatype (i.e. an object datatype with all elements having the same python type). Currently writing generic code with the object dtype is tricky, because we randomly return the object instead of arrays. What would be the preference for such a specific dtype? * arr1d[0] -> scalar or array? * np.add(scalar, scalar) -> scalar or array * np.add.reduce(arr) -> scalar or array? I think the `np.add` case we can decide fairly independently. The main thing is the indexing. Would we want to force a `.item()` call or not? Forcing `.item()` is in many ways simpler, I am unsure whether it would be inconvenient often. And, maybe the answer is just that for datatypes that do not round-trip easily, `.item()` is probably preferable, and for datatypes that do round-trip scalars are fine. - Sebastian > > On Fri, Feb 21, 2020 at 5:37 PM Sebastian Berg > wrote: > > Hi all, > > > > When we create new datatypes, we have the option to make new > > choices > > for the new datatypes [0] (not the existing ones). > > > > The question is: Should every NumPy datatype have a scalar > > associated > > and should operations like indexing return a scalar or a 0-D array? > > > > This is in my opinion a complex, almost philosophical, question, > > and we > > do not have to settle anything for a long time. But, if we do not > > decide a direction before we have many new datatypes the decision > > will > > make itself... > > So happy about any ideas, even if its just a gut feeling :). > > > > There are various points. I would like to mostly ignore the > > technical > > ones, but I am listing them anyway here: > > > > * Scalars are faster (although that can be optimized likely) > > > > * Scalars have a lower memory footprint > > > > * The current implementation incurs a technical debt in NumPy. > > (I do not think that is a general issue, though. We could > > automatically create scalars for each new datatype probably.) > > > > Advantages of having no scalars: > > > > * No need to keep track of scalars to preserve them in ufuncs, or > > libraries using `np.asarray`, do they need > > `np.asarray_or_scalar`? > > (or decide they return always arrays, although ufuncs may not) > > > > * Seems simpler in many ways, you always know the output will be > > an > > array if it has to do with NumPy. > > > > Advantages of having scalars: > > > > * Scalars are immutable and we are used to them from Python. > > A 0-D array cannot be used as a dictionary key consistently > > [1]. > > > > I.e. without scalars as first class citizen `dict[arr1d[0]]` > > cannot work, `dict[arr1d[0].item()]` may (if `.item()` is > > defined, > > and e.g. `dict[arr1d[0].frozen()]` could make a copy to work. > > [2] > > > > * Object arrays as we have them now make sense, `arr1d[0]` can > > reasonably return a Python object. I.e. arrays feel more like > > container if you can take elements out easily. > > > > Could go both ways: > > > > * Scalar math `scalar = arr1d[0]; scalar += 1` modifies the array > > without scalars. With scalars `arr1d[0, ...]` clarifies the > > meaning. 
(In principle it is good to never use `arr2d[0]` to > > get a 1D slice, probably more-so if scalars exist.) > > > > Note: array-scalars (the current NumPy scalars) are not useful in > > my > > opinion [3]. A scalar should not be indexed or have a shape. I do > > not > > believe in scalars pretending to be arrays. > > > > I personally tend towards liking scalars. If Python was a language > > where the array (array-programming) concept was ingrained into the > > language itself, I would lean the other way. But users are used to > > scalars, and they "put" scalars into arrays. Array objects are in > > some > > ways strange in Python, and I feel not having scalars detaches them > > further. > > > > Having scalars, however also means we should preserve them. I feel > > in > > principle that is actually fairly straight forward. E.g. for > > ufuncs: > > > > * np.add(scalar, scalar) -> scalar > > * np.add.reduce(arr, axis=None) -> scalar > > * np.add.reduce(arr, axis=1) -> array (even if arr is 1d) > > * np.add.reduce(scalar, axis=()) -> array > > > > Of course libraries that do `np.asarray` would/could basically > > chose to > > not preserve scalars: Their signature is defined as taking strictly > > array input. > > > > Cheers, > > > > Sebastian > > > > > > [0] At best this can be a vision to decide which way they may > > evolve. > > > > [1] E.g. PyTorch uses `hash(tensor) == id(tensor)` which is > > arguably > > strange. E.g. Quantity defines hash correctly, but does not fully > > ensure immutability for 0-D Quantities. Ensuring immutability in a > > world where "views" are a central concept requires a write-only > > copy. > > > > [2] Arguably `.item()` would always return a scalar, but it would > > be a > > second class citizen. (Although if it returns a scalar, at least we > > already have a scalar implementation.) > > > > [3] They are necessary due to technical debt for NumPy datatypes > > though. > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at python.org > > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From shoyer at gmail.com Sun Feb 23 18:30:54 2020 From: shoyer at gmail.com (Stephan Hoyer) Date: Sun, 23 Feb 2020 15:30:54 -0800 Subject: [Numpy-discussion] NEP 37: A dispatch protocol for NumPy-like modules In-Reply-To: <1cfce715d48b847e91739c2a56b9750f15b1958f.camel@sipsolutions.net> References: <21692339-9f4b-029c-d422-ea549acbe6c3@gmail.com> <1cfce715d48b847e91739c2a56b9750f15b1958f.camel@sipsolutions.net> Message-ID: On Thu, Feb 6, 2020 at 12:20 PM Sebastian Berg wrote: > > It is less clear how this could work for __array_module__, because > > __array_module__ and get_array_module() are not generic -- they > > refers explicitly to a NumPy like module. If we want to extend it to > > SciPy (for which I agree there are good use-cases), what should that > > look __array_module__` > > I suppose the question is here, where should the code reside? For > SciPy, I agree there is a good reason why you may want to "reverse" the > implementation. The code to support JAX arrays, should live inside JAX. 
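(For readers new to the draft NEP, the consumer-side pattern under discussion looks roughly like the sketch below; `get_array_module` is the draft NEP 37 spelling and is not part of any released NumPy, and the duck-array class and module names are purely hypothetical.)

    import numpy as np

    def normalize(x):
        # Dispatches on the arguments' __array_module__ methods; plain
        # ndarrays (and scalars) fall back to the numpy namespace itself.
        xp = np.get_array_module(x)
        return (x - xp.mean(x)) / xp.std(x)

    # A duck-array library would opt in by returning its own NumPy-like namespace:
    class MyDuckArray:
        def __array_module__(self, types):
            import my_duck_library    # hypothetical module with a NumPy-compatible API
            if all(issubclass(t, (MyDuckArray, np.ndarray)) for t in types):
                return my_duck_library
            return NotImplemented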
> > One, probably silly, option is to return a "global" namespace, so that: > > np = get_array_module(*arrays).numpy` > > My main concern with a "global namespace" is that it adds boilerplate to the typical usage of fetching a duck-array version of NumPy. I think the simplest proposal is to add a "module" argument to both get_array_module and __array_module__, with a default value of "numpy". This adds flexibility with minimal additional complexity. The main question is what the type of arguments for "module" should be: 1. Modules could be specified as strings, e.g., "numpy" 2. Module could be specified as actual namespace, e.g., numpy from import numpy. The advantage of (1) is that in theory you could write np.get_array_module(*arrays, module='scipy.linalg') without the overhead of actually importing scipy.linalg or without even needing scipy to be installed, if all the arrays use a different scipy.linalg implementation. But in practice, this seems a little far-fetched. All alternative implementations of scipy that I know of (e.g., in JAX or conceivably in Dask) import the original library. The main downside of (1) is that it would would mean that NumPy's ndarray.__array_module__ would need to use importlib.import_module() to dynamically import modules. It also adds a potentially awkward asymmetry between the "module" and "default" arguments, unless we also switched default to specify modules with strings. Either way, the "default" argument will probably need to be adjusted so that by default it matches whatever value is passed into "module", instead of always defaulting to "numpy". Any thoughts on which of these options makes most sense? We could also put off making any changes to the protocol now, but this change seems pretty safe and appear to have real use-cases (e.g., for sklearn) so I am inclined to go ahead with it now before finalizing the NEP. > We have to distinct issues: Where should e.g. SciPy put a generic > implementation (assuming they to provide implementations that only > require NumPy-API support to not require overriding)? > And, also if a library provides generic support, should we define a > standard of how the context/namespace may be passed in/provided? > > sklearn's main namespace is expected to support many array > objects/types, but it could be nice to pass in an already known > context/namespace (say scikit-image already found it, and then calls > scikit-learn internally). A "generic" namespace may even require this > to infer the correct output array object. > > > Another thing about backward compatibility: What is our vision there > actually? > This NEP will *not* give the *end user* the option to opt-in! Here, > opt-in is really reserved to the *library user* (e.g. sklearn). (I did > not realize this clearly before) > > Thinking about that for a bit now, that seems like the right choice. > But it also means that the library requires an easy way of giving a > FutureWarning, to notify the end-user of the upcoming change. The end- > user will easily be able to convert to a NumPy array to keep the old > behaviour. > Once this warning is given (maybe during `get_array_module()`, the > array module object/context would preferably be passed around, > hopefully even between libraries. That provides a reasonable way to > opt-in to the new behaviour without a warning (mainly for library > users, end-users can silence the warning if they wish so). > I don't think NumPy needs to do anything about warnings. 
It is straightforward for libraries that want to use use get_array_module() to issue their own warnings before calling get_array_module(), if desired. Or alternatively, if a library is about to add a new __array_module__ method, it is straightforward to issue a warning inside the new __array_module__ method before returning the NumPy functions. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at gmail.com Sun Feb 23 18:59:42 2020 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Sun, 23 Feb 2020 15:59:42 -0800 Subject: [Numpy-discussion] NEP 37: A dispatch protocol for NumPy-like modules In-Reply-To: References: <21692339-9f4b-029c-d422-ea549acbe6c3@gmail.com> <1cfce715d48b847e91739c2a56b9750f15b1958f.camel@sipsolutions.net> Message-ID: On Sun, Feb 23, 2020 at 3:31 PM Stephan Hoyer wrote: > On Thu, Feb 6, 2020 at 12:20 PM Sebastian Berg > wrote: > >> >> Another thing about backward compatibility: What is our vision there >> actually? >> This NEP will *not* give the *end user* the option to opt-in! Here, >> opt-in is really reserved to the *library user* (e.g. sklearn). (I did >> not realize this clearly before) >> >> Thinking about that for a bit now, that seems like the right choice. >> But it also means that the library requires an easy way of giving a >> FutureWarning, to notify the end-user of the upcoming change. The end- >> user will easily be able to convert to a NumPy array to keep the old >> behaviour. >> Once this warning is given (maybe during `get_array_module()`, the >> array module object/context would preferably be passed around, >> hopefully even between libraries. That provides a reasonable way to >> opt-in to the new behaviour without a warning (mainly for library >> users, end-users can silence the warning if they wish so). >> > > I don't think NumPy needs to do anything about warnings. It is > straightforward for libraries that want to use use get_array_module() to > issue their own warnings before calling get_array_module(), if desired. > > Or alternatively, if a library is about to add a new __array_module__ > method, it is straightforward to issue a warning inside the new > __array_module__ method before returning the NumPy functions. > I don't think this is quite enough. Sebastian points out a fairly important issue. One of the main rationales for the whole NEP, and the argument in multiple places ( https://numpy.org/neps/nep-0037-array-module.html#opt-in-vs-opt-out-for-users) is that it's now opt-in while __array_function__ was opt-out. This isn't really true - the problem is simply *moved*, from the duck array libraries to the array-consuming libraries. The end user will still see the backwards incompatible change, with no way to turn it off. It will be easier with __array_module__ to warn users, but this should be expanded on in the NEP. Also, I'm still not sure I agree with the tone of the discussion on this topic. It's very heavily inspired by what the JAX devs are telling you (the NEP still says PyTorch and scipy.sparse as well, but that's not true in both cases). If you ask Dask and CuPy for example, they're quite happy with __array_function__ and there haven't been many complaints about backwards compat breakage. Cheers, Ralf _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
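(A minimal sketch of the kind of library-side warning being discussed here; the function, the message, and the `get_array_module` call are purely illustrative of the draft NEP, not an agreed API.)

    import warnings
    import numpy as np

    def fit(X):
        # An array-consuming library could warn end users before switching
        # from unconditional np.asarray(X) coercion to the duck-array path.
        if not isinstance(X, np.ndarray):
            warnings.warn(
                "non-ndarray inputs will no longer be coerced with np.asarray; "
                "pass np.asarray(X) explicitly to keep the old behaviour",
                FutureWarning, stacklevel=2)
        xp = np.get_array_module(X)   # draft NEP 37 API, see the sketch earlier in the thread
        return xp.mean(X, axis=0)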
URL: From shoyer at gmail.com Mon Feb 24 01:44:29 2020 From: shoyer at gmail.com (Stephan Hoyer) Date: Sun, 23 Feb 2020 22:44:29 -0800 Subject: [Numpy-discussion] NEP 37: A dispatch protocol for NumPy-like modules In-Reply-To: References: <21692339-9f4b-029c-d422-ea549acbe6c3@gmail.com> <1cfce715d48b847e91739c2a56b9750f15b1958f.camel@sipsolutions.net> Message-ID: On Sun, Feb 23, 2020 at 3:59 PM Ralf Gommers wrote: > > > On Sun, Feb 23, 2020 at 3:31 PM Stephan Hoyer wrote: > >> On Thu, Feb 6, 2020 at 12:20 PM Sebastian Berg < >> sebastian at sipsolutions.net> wrote: >> >>> >>> Another thing about backward compatibility: What is our vision there >>> actually? >>> This NEP will *not* give the *end user* the option to opt-in! Here, >>> opt-in is really reserved to the *library user* (e.g. sklearn). (I did >>> not realize this clearly before) >>> >>> Thinking about that for a bit now, that seems like the right choice. >>> But it also means that the library requires an easy way of giving a >>> FutureWarning, to notify the end-user of the upcoming change. The end- >>> user will easily be able to convert to a NumPy array to keep the old >>> behaviour. >>> Once this warning is given (maybe during `get_array_module()`, the >>> array module object/context would preferably be passed around, >>> hopefully even between libraries. That provides a reasonable way to >>> opt-in to the new behaviour without a warning (mainly for library >>> users, end-users can silence the warning if they wish so). >>> >> >> I don't think NumPy needs to do anything about warnings. It is >> straightforward for libraries that want to use use get_array_module() to >> issue their own warnings before calling get_array_module(), if desired. >> > >> Or alternatively, if a library is about to add a new __array_module__ >> method, it is straightforward to issue a warning inside the new >> __array_module__ method before returning the NumPy functions. >> > > I don't think this is quite enough. Sebastian points out a fairly > important issue. One of the main rationales for the whole NEP, and the > argument in multiple places ( > https://numpy.org/neps/nep-0037-array-module.html#opt-in-vs-opt-out-for-users) > is that it's now opt-in while __array_function__ was opt-out. This isn't > really true - the problem is simply *moved*, from the duck array libraries > to the array-consuming libraries. The end user will still see the backwards > incompatible change, with no way to turn it off. It will be easier with > __array_module__ to warn users, but this should be expanded on in the NEP. > Ralf, thanks for sharing your thoughts. I'm not quite I understand the concerns about backwards incompatibility: 1. The intention is that implementing a __array_module__ method should be backwards compatible with all current uses of NumPy. This satisfies backwards compatibility concerns for an array-implementing library like JAX. 2. In contrast, calling get_array_module() offers no guarantees about backwards compatibility. This seems nearly impossible, because the entire point of the protocol is to make it possible to opt-in to new behavior. So backwards compatibility isn't solved for Scikit-Learn switching to use get_array_module(), and after Scikit-Learn does so, adding __array_module__ to new types of arrays could potentially have backwards incompatible consequences for Scikit-Learn (unless sklearn uses default=None). Are you suggesting just adding something like what I'm writing here into the NEP? 
Perhaps along with advice to consider issuing warnings inside __array_module__ and falling back to legacy behavior when first implementing it on a new type? We could also potentially make a few changes to make backwards compatibility even easier, by making the protocol less aggressive about assuming that NumPy is a safe fallback. Some non-exclusive options: a. We could switch the default value of "default" on get_array_module() to None, so an exception is raised if nothing implements __array_module__. b. We could includes *all* argument types in "types", not just types that implement __array_module__. NumPy's ndarray.__array_module__ could then recognize and refuse to return an implementation if there are other arguments that might implement __array_module__ in the future (e.g., anything outside the standard library?). The downside of making either of these choices is that it would potentially make get_array_function() a bit less usable, because it is more likely to fail, e.g., if called on a float, or some custom type that should be treated as a scalar. Also, I'm still not sure I agree with the tone of the discussion on this > topic. It's very heavily inspired by what the JAX devs are telling you (the > NEP still says PyTorch and scipy.sparse as well, but that's not true in > both cases). If you ask Dask and CuPy for example, they're quite happy with > __array_function__ and there haven't been many complaints about backwards > compat breakage. > I'm linking to comments you wrote in reference to PyTorch and scipy.sparse in the current draft of the NEP, so I certainly want to make sure that you agree my characterization :). Would it be fair to say: - JAX is reluctant to implement __array_function__ because of concerns about breaking existing code. JAX developers think that when users use NumPy functions on JAX arrays, they are explicitly choosing to convert from JAX to NumPy. This model is fundamentally incompatible __array_function__, which we chose to override the existing numpy namespace. - PyTorch and scipy.sparse are not yet in position to implement __array_function__ (due to a lack of a direct implementation of NumPy's API), but these projects take backwards compatibility seriously. Does "take backwards compatibility seriously" sound about right to you? I'm very open to specific suggestions here. (TensorFlow could probably also be safely added to this second list.) Best, Stephan -------------- next part -------------- An HTML attachment was scrubbed... URL: From alt.tblt1 at gmail.com Mon Feb 24 15:00:47 2020 From: alt.tblt1 at gmail.com (A T) Date: Mon, 24 Feb 2020 21:00:47 +0100 Subject: [Numpy-discussion] Suggestion: prevent silent downcast in np.full_like Message-ID: Hello, This is my first activity on this mailing list, please let me know if I am doing anything improperly. I recently opened issue #15635 and it was suggested to me that it could be worth discussing here. Here is a summarized code example of how unsafe downcasting in np.full_like() resulted in issues in our scientific toolbox: t0 = 20.5 # We're trying to make a constant-valued "ufunc" temperature = lambda x: np.full_like(x, t0) print(temperature([0.1, 0.7, 2.3])) # [20.5 20.5 20.5] print(temperature(0)) # 20 This is consistent with the documentation (which even gives an example of this unsafe casting), and was obvious to fix once identified. But what seems especially problematic to me is the fact that the code /looks safe, but isn't/. 
There is no sketchy `dtype=...` to make you think twice about the possibility of downcasting, but it happens all the same. What are your thoughts on this topic? Should a special warning be given? Should the casting rule be made more strict? Alexis THIBAULT -------------- next part -------------- An HTML attachment was scrubbed... URL: From allanhaldane at gmail.com Mon Feb 24 15:29:00 2020 From: allanhaldane at gmail.com (Allan Haldane) Date: Mon, 24 Feb 2020 15:29:00 -0500 Subject: [Numpy-discussion] New DTypes: Are scalars a central concept in NumPy or not? In-Reply-To: References: Message-ID: <9716a1b7-871b-1776-11fe-59e095b89cf0@gmail.com> I have some thoughts on scalars from playing with ndarray ducktypes (__array_function__), eg a MaskedArray ndarray-ducktype, for which I wanted an associated "MaskedScalar" type. In summary, the ways scalars currently work makes ducktyping (duck-scalars) difficult: * numpy scalar types are not subclassable, so my duck-scalars aren't subclasses of numpy scalars and aren't in the type hierarchy * even if scalars were subclassable, I would have to subclass each scalar datatype individually to make masked versions * lots of code checks `np.isinstance(var, np.float64)` which breaks for my duck-scalars * it was difficult to distinguish between a duck-scalar and a duck-0d array. The method I used in the end seems hacky. This has led to some daydreams about how scalars should work, and also led me last to read through your NEPs 40/41 with specific focus on what you said about scalars, and was about to post there until I saw this discussion. I agree with what you said in the NEPs about not making scalars be dtype instances. Here is what ducktypes led me to: If we are able to do something like define a `np.numpy_scalar` type covering all numpy scalars, which has a `.dtype` attribute like you describe in the NEPs, then that would seem to solve the ducktype problems above. Ducktype implementors would need to make a "duck-scalar" type in parallel to their "duck-ndarray" type, but I found that to be pretty easy using an abstract class in my MaskedArray ducktype, since the MaskedArray and MaskedScalar share a lot of behavior. A numpy_scalar type would also help solve some object-array problems if the object scalars are wrapped in the np_scalar type. A long time ago I started to try to fix up various funny/strange behaviors of object datatypes, but there are lots of special cases, and the main problem was that the returned objects (eg from indexing) were not numpy types and did not support numpy attributes or indexing. Wrapping the returned object in `np.numpy_scalar` might add an extra slight annoyance to people who want to unwrap the object, but I think it would make object arrays less buggy and make code using object arrays easier to reason about and debug. Finally, a few random votes/comments based on the other emails on the list: I think scalars have a place in numpy (rather than just reusing 0d arrays), since there is a clear use in having hashable, immutable scalars. Structured scalars should probably be immutable. I agree with your suggestion that scalars should not be indexable. Thus, my duck-scalars (and proposed numpy_scalar) would not be indexable. However, I think they should encode their datatype though a .dtype attribute like ndarrays, rather than by inheritance. Also, something to think about is that currently numpy scalars satisfy the property `isinstance(np.float64(1), float)`, i.e they are within the python numerical type hierarchy. 
0d arrays do not have this property. My proposal above would break this. I'm not sure what to think about whether this is a good property to maintain or not. Cheers, Allan On 2/21/20 8:37 PM, Sebastian Berg wrote: > Hi all, > > When we create new datatypes, we have the option to make new choices > for the new datatypes [0] (not the existing ones). > > The question is: Should every NumPy datatype have a scalar associated > and should operations like indexing return a scalar or a 0-D array? > > This is in my opinion a complex, almost philosophical, question, and we > do not have to settle anything for a long time. But, if we do not > decide a direction before we have many new datatypes the decision will > make itself... > So happy about any ideas, even if its just a gut feeling :). > > There are various points. I would like to mostly ignore the technical > ones, but I am listing them anyway here: > > * Scalars are faster (although that can be optimized likely) > > * Scalars have a lower memory footprint > > * The current implementation incurs a technical debt in NumPy. > (I do not think that is a general issue, though. We could > automatically create scalars for each new datatype probably.) > > Advantages of having no scalars: > > * No need to keep track of scalars to preserve them in ufuncs, or > libraries using `np.asarray`, do they need `np.asarray_or_scalar`? > (or decide they return always arrays, although ufuncs may not) > > * Seems simpler in many ways, you always know the output will be an > array if it has to do with NumPy. > > Advantages of having scalars: > > * Scalars are immutable and we are used to them from Python. > A 0-D array cannot be used as a dictionary key consistently [1]. > > I.e. without scalars as first class citizen `dict[arr1d[0]]` > cannot work, `dict[arr1d[0].item()]` may (if `.item()` is defined, > and e.g. `dict[arr1d[0].frozen()]` could make a copy to work. [2] > > * Object arrays as we have them now make sense, `arr1d[0]` can > reasonably return a Python object. I.e. arrays feel more like > container if you can take elements out easily. > > Could go both ways: > > * Scalar math `scalar = arr1d[0]; scalar += 1` modifies the array > without scalars. With scalars `arr1d[0, ...]` clarifies the > meaning. (In principle it is good to never use `arr2d[0]` to > get a 1D slice, probably more-so if scalars exist.) > > Note: array-scalars (the current NumPy scalars) are not useful in my > opinion [3]. A scalar should not be indexed or have a shape. I do not > believe in scalars pretending to be arrays. > > I personally tend towards liking scalars. If Python was a language > where the array (array-programming) concept was ingrained into the > language itself, I would lean the other way. But users are used to > scalars, and they "put" scalars into arrays. Array objects are in some > ways strange in Python, and I feel not having scalars detaches them > further. > > Having scalars, however also means we should preserve them. I feel in > principle that is actually fairly straight forward. E.g. for ufuncs: > > * np.add(scalar, scalar) -> scalar > * np.add.reduce(arr, axis=None) -> scalar > * np.add.reduce(arr, axis=1) -> array (even if arr is 1d) > * np.add.reduce(scalar, axis=()) -> array > > Of course libraries that do `np.asarray` would/could basically chose to > not preserve scalars: Their signature is defined as taking strictly > array input. > > Cheers, > > Sebastian > > > [0] At best this can be a vision to decide which way they may evolve. > > [1] E.g. 
PyTorch uses `hash(tensor) == id(tensor)` which is arguably > strange. E.g. Quantity defines hash correctly, but does not fully > ensure immutability for 0-D Quantities. Ensuring immutability in a > world where "views" are a central concept requires a write-only copy. > > [2] Arguably `.item()` would always return a scalar, but it would be a > second class citizen. (Although if it returns a scalar, at least we > already have a scalar implementation.) > > [3] They are necessary due to technical debt for NumPy datatypes > though. > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > From stefanv at berkeley.edu Mon Feb 24 18:03:34 2020 From: stefanv at berkeley.edu (Stefan van der Walt) Date: Mon, 24 Feb 2020 15:03:34 -0800 Subject: [Numpy-discussion] Suggestion: prevent silent downcast in np.full_like In-Reply-To: References: Message-ID: <208093a9-dff0-4fb5-866d-f717fbd8acf0@www.fastmail.com> Hi Alexis, On Mon, Feb 24, 2020, at 12:00, A T wrote: > Here is a summarized code example of how unsafe downcasting in np.full_like() resulted in issues in our scientific toolbox: > > t0 = 20.5 > # We're trying to make a constant-valued "ufunc" > temperature = lambda x: np.full_like(x, t0) > print(temperature([0.1, 0.7, 2.3])) > # [20.5 20.5 20.5] > print(temperature(0)) > # 20 I agree that this behavior is counter-intuitive. When t0 is not of the same type as x, the user intent is likely to store t0 as-is. If a check is introduced, there is the question of how strict that check should be. Checking that dtypes are identical may be too strict (there are objects that cast to one another without loss of information). Perhaps, for now, a warning is the best way to flag that something is going on. Users who do not wish to see such a warning can first cast t0 to the correct dtype: np.full_like(x, x.dtype.type(t0)) or np.full_like(x, int(t0)) I imagine we'd want to show the warning even if dtype is specified. Best regards, Stéfan -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefano.miccoli at polimi.it Tue Feb 25 04:00:59 2020 From: stefano.miccoli at polimi.it (Stefano Miccoli) Date: Tue, 25 Feb 2020 09:00:59 +0000 Subject: [Numpy-discussion] New DTypes: Are scalars a central concept in NumPy or not? In-Reply-To: References: Message-ID: The fact that `isinstance(np.float64(1), float)` holds raises the problem that the current implementation of np.float64 scalars breaks the Liskov substitution principle: `sequence_or_array[round(x)]` works if `x` is a float, but breaks down if `x` is an np.float64. See https://github.com/numpy/numpy/issues/11810, where the issue is discussed in the broader setting of the semantics of `np.round` vs. python3 `round`. I do not have a strong opinion here, except that if np.float64's are within the python number hierarchy they should be PEP 3141 compliant (which currently they are not.) Stefano > On 25 Feb 2020, at 00:03, numpy-discussion-request at python.org wrote: > > Also, something to think about is that currently numpy scalars satisfy > the property `isinstance(np.float64(1), float)`, i.e. they are within the > python numerical type hierarchy. 0d arrays do not have this property. My > proposal above would break this. I'm not sure what to think about > whether this is a good property to maintain or not.
> > Cheers, > Allan From sebastian at sipsolutions.net Tue Feb 25 16:05:06 2020 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Tue, 25 Feb 2020 13:05:06 -0800 Subject: [Numpy-discussion] NumPy Development Meeting - Triage Focus Message-ID: Hi all, Our bi-weekly triage-focused NumPy development meeting is tomorrow (Wednesday, February 26) at 11 am Pacific Time. Everyone is invited to join in and edit the work-in-progress meeting topics and notes: https://hackmd.io/68i_JvOYQfy9ERiHgXMPvg I encourage everyone to notify us of issues or PRs that you feel should be prioritized or simply discussed briefly. Just comment on it so we can label it, or add your PR/issue to this week's topics for discussion. Best regards, Sebastian -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From einstein.edison at gmail.com Wed Feb 26 15:17:39 2020 From: einstein.edison at gmail.com (Hameer Abbasi) Date: Wed, 26 Feb 2020 20:17:39 +0000 Subject: [Numpy-discussion] Output type of round is inconsistent with python built-in Message-ID: Hello, Currently, the built-in Python round (which is different from np.round), when called on a np.float64, returns a np.float64, due to its __round__ method. A congruous statement is true for np.float32. However, since Python 3, the default behavior of round is to return a Python int when it operates on a Python float. This is a mismatch according to the Liskov Substitution Principle, as both these types subclass Python's float. This has been brought up in gh-15297. Here is the problem summed up in code: >>> type(round(np.float64(5))) <class 'numpy.float64'> >>> type(round(np.float32(5))) <class 'numpy.float32'> >>> type(round(float(5))) <class 'int'> This problem manifests itself most prominently when trying to index into collections: >>> np.arange(6)[round(float(5))] 5 >>> np.arange(6)[round(np.float64(5))] Traceback (most recent call last): File "<stdin>", line 1, in <module> IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices There still remains the question, do we return Python ints or np.int64s? * Python ints have the advantage of not overflowing. * If we decide to add __round__ to arrays in the future, Python ints may become inconsistent with our design, as such a method will return an int64 array. This issue was discussed in the weekly triage meeting today, and the following plan of action was proposed: * change scalar floats to return integers for __round__ (which integer type was not discussed, I propose np.int64) * not change anything else: not 0d arrays and not other numpy functionality Does anyone have any thoughts on the proposal? Best regards, Hameer Abbasi -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Wed Feb 26 16:36:31 2020 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 26 Feb 2020 16:36:31 -0500 Subject: [Numpy-discussion] Output type of round is inconsistent with python built-in In-Reply-To: References: Message-ID: On Wed, Feb 26, 2020 at 3:19 PM Hameer Abbasi wrote: > > There still remains the question, do we return Python ints or np.int64s? > > - Python ints have the advantage of not overflowing.
> > > > This was issue was discussed in the weekly triage meeting today, and the > following plan of action was proposed: > > - change scalar floats to return integers for __round__ (which integer > type was not discussed, I propose np.int64) > - not change anything else: not 0d arrays and not other numpy > functionality > > The only reason that float.__round__() was allowed to change to returning ints was because ints became unbounded. If we also change to returning an integer type, it should be a Python int. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Wed Feb 26 17:26:10 2020 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Wed, 26 Feb 2020 17:26:10 -0500 Subject: [Numpy-discussion] Output type of round is inconsistent with python built-in In-Reply-To: References: Message-ID: great another object array np.asarray([round(x_i.item()) for x_i in np.array([1, 2.5, 2e20, 2e200])]) array([1, 2, 200000000000000000000, 199999999999999993946624442502072331894900655091004725296483501900693696871108151068392676809412503736055024831947764816364271468736556969278770082094479755742047182133579963622363626612334257709776896], dtype=object) I would rather have numpy consistent with numpy than with python On Wed, Feb 26, 2020 at 4:38 PM Robert Kern wrote: > On Wed, Feb 26, 2020 at 3:19 PM Hameer Abbasi > wrote: > >> >> There still remains the question, do we return Python ints or np.int64s? >> >> - Python ints have the advantage of not overflowing. >> - If we decide to add __round__ to arrays in the future, Python ints >> may become inconsistent with our design, as such a method will return an >> int64 array. >> >> >> >> This was issue was discussed in the weekly triage meeting today, and the >> following plan of action was proposed: >> >> - change scalar floats to return integers for __round__ (which >> integer type was not discussed, I propose np.int64) >> - not change anything else: not 0d arrays and not other numpy >> functionality >> >> The only reason that float.__round__() was allowed to change to returning > ints was because ints became unbounded. If we also change to returning an > integer type, it should be a Python int. > > -- > Robert Kern > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ilhanpolat at gmail.com Wed Feb 26 17:28:39 2020 From: ilhanpolat at gmail.com (Ilhan Polat) Date: Wed, 26 Feb 2020 22:28:39 +0000 Subject: [Numpy-discussion] Output type of round is inconsistent with python built-in In-Reply-To: References: Message-ID: Does this mean that np.round(np.float32(5)) return a 64 bit upcasted int? That would be really awkward for many reasons pandas frame size being bloated just by rounding for an example. Or numpy array size growing for no apparent reason I am not really sure if I understand why LSP should hold in this case to be honest. Rounding is an operation specific for the number instance and not for the generic class. On Wed, Feb 26, 2020, 21:38 Robert Kern wrote: > On Wed, Feb 26, 2020 at 3:19 PM Hameer Abbasi > wrote: > >> >> There still remains the question, do we return Python ints or np.int64s? >> >> - Python ints have the advantage of not overflowing. 
>> - If we decide to add __round__ to arrays in the future, Python ints >> may become inconsistent with our design, as such a method will return an >> int64 array. >> >> >> >> This was issue was discussed in the weekly triage meeting today, and the >> following plan of action was proposed: >> >> - change scalar floats to return integers for __round__ (which >> integer type was not discussed, I propose np.int64) >> - not change anything else: not 0d arrays and not other numpy >> functionality >> >> The only reason that float.__round__() was allowed to change to returning > ints was because ints became unbounded. If we also change to returning an > integer type, it should be a Python int. > > -- > Robert Kern > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL:
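(To summarise the behaviour under discussion in a few lines: only the second line below would change under the proposal, to return an integer of a still-to-be-decided type; np.round/np.around are not affected.)

    import numpy as np

    round(2.5)                             # 2, a Python int (built-in round on a Python float)
    round(np.float64(2.5))                 # currently np.float64(2.0); the proposal makes this an integer
    np.round(np.float64(2.5))              # 2.0, stays a float
    np.around(np.array([0.5, 1.5, 2.5]))   # array([0., 2., 2.])

    # The motivating problem: a rounded NumPy scalar cannot be used as an index today.
    seq = [10, 20, 30, 40]
    seq[round(2.5)]                        # 30
    # seq[round(np.float64(2.5))]          # TypeError: list indices must be integers or slices, not numpy.float64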
Rounding is an operation specific for the number instance and > not for the generic class. > > > > > On Wed, Feb 26, 2020, 21:38 Robert Kern wrote: > >> On Wed, Feb 26, 2020 at 3:19 PM Hameer Abbasi >> wrote: >> >>> >>> There still remains the question, do we return Python ints or np.int64s? >>> >>> - Python ints have the advantage of not overflowing. >>> - If we decide to add __round__ to arrays in the future, Python ints >>> may become inconsistent with our design, as such a method will return an >>> int64 array. >>> >>> >>> >>> This was issue was discussed in the weekly triage meeting today, and the >>> following plan of action was proposed: >>> >>> - change scalar floats to return integers for __round__ (which >>> integer type was not discussed, I propose np.int64) >>> - not change anything else: not 0d arrays and not other numpy >>> functionality >>> >>> I think making numerical behavior different between arrays and numpy scalars with the same dtype, will create many happy debugging hours. (although I don't remember having been careful about the distinction between python scalars and numpy scalars in some time. I had some fun with integers in the scipy.stats discrete distributions, until they became floats) Josef > The only reason that float.__round__() was allowed to change to returning >> ints was because ints became unbounded. If we also change to returning an >> integer type, it should be a Python int. >> >> -- >> Robert Kern >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Wed Feb 26 18:02:53 2020 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 26 Feb 2020 18:02:53 -0500 Subject: [Numpy-discussion] Output type of round is inconsistent with python built-in In-Reply-To: References: Message-ID: On Wed, Feb 26, 2020 at 5:41 PM wrote: > > > On Wed, Feb 26, 2020 at 5:30 PM Ilhan Polat wrote: > >> Does this mean that np.round(np.float32(5)) return a 64 bit upcasted int? >> >> That would be really awkward for many reasons pandas frame size being >> bloated just by rounding for an example. Or numpy array size growing for no >> apparent reason >> >> I am not really sure if I understand why LSP should hold in this case to >> be honest. Rounding is an operation specific for the number instance and >> not for the generic class. >> >> >> >> >> On Wed, Feb 26, 2020, 21:38 Robert Kern wrote: >> >>> On Wed, Feb 26, 2020 at 3:19 PM Hameer Abbasi >>> wrote: >>> >>>> >>>> There still remains the question, do we return Python ints or np.int64 >>>> s? >>>> >>>> - Python ints have the advantage of not overflowing. >>>> - If we decide to add __round__ to arrays in the future, Python ints >>>> may become inconsistent with our design, as such a method will return an >>>> int64 array. 
>>>> >>>> >>>> >>>> This was issue was discussed in the weekly triage meeting today, and >>>> the following plan of action was proposed: >>>> >>>> - change scalar floats to return integers for __round__ (which >>>> integer type was not discussed, I propose np.int64) >>>> - not change anything else: not 0d arrays and not other numpy >>>> functionality >>>> >>>> > I think making numerical behavior different between arrays and numpy > scalars with the same dtype, will create many happy debugging hours. > round(some_ndarray) isn't implemented, so there is no difference to worry about. If you want the float->float rounding, use np.around(). That function should continue to behave like it currently does for both arrays and scalars. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Wed Feb 26 18:05:30 2020 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 26 Feb 2020 18:05:30 -0500 Subject: [Numpy-discussion] Output type of round is inconsistent with python built-in In-Reply-To: References: Message-ID: On Wed, Feb 26, 2020 at 5:30 PM Ilhan Polat wrote: > Does this mean that np.round(np.float32(5)) return a 64 bit upcasted int? > No. np.round() is an alias (which would be good to deprecate) for np.around(). No one has proposed changing np.around(). > That would be really awkward for many reasons pandas frame size being > bloated just by rounding for an example. Or numpy array size growing for no > apparent reason > > I am not really sure if I understand why LSP should hold in this case to > be honest. Rounding is an operation specific for the number instance and > not for the generic class. > The type of the return value is part of the type's interface, not the specific instance. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Wed Feb 26 18:07:57 2020 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 26 Feb 2020 18:07:57 -0500 Subject: [Numpy-discussion] Output type of round is inconsistent with python built-in In-Reply-To: References: Message-ID: On Wed, Feb 26, 2020 at 5:27 PM wrote: > great another object array > > np.asarray([round(x_i.item()) for x_i in np.array([1, 2.5, 2e20, 2e200])]) > array([1, 2, 200000000000000000000, > > 199999999999999993946624442502072331894900655091004725296483501900693696871108151068392676809412503736055024831947764816364271468736556969278770082094479755742047182133579963622363626612334257709776896], > dtype=object) > > > I would rather have numpy consistent with numpy than with python > Since round() (and the __round__() interface) is part of Python and not numpy, there is nothing in numpy to be consistent with. We only implement __round__() for the scalar types. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From josef.pktd at gmail.com Wed Feb 26 18:57:50 2020 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Wed, 26 Feb 2020 18:57:50 -0500 Subject: [Numpy-discussion] Output type of round is inconsistent with python built-in In-Reply-To: References: Message-ID: On Wed, Feb 26, 2020 at 6:09 PM Robert Kern wrote: > On Wed, Feb 26, 2020 at 5:27 PM wrote: > >> great another object array >> >> np.asarray([round(x_i.item()) for x_i in np.array([1, 2.5, 2e20, 2e200])]) >> array([1, 2, 200000000000000000000, >> >> 199999999999999993946624442502072331894900655091004725296483501900693696871108151068392676809412503736055024831947764816364271468736556969278770082094479755742047182133579963622363626612334257709776896], >> dtype=object) >> >> >> I would rather have numpy consistent with numpy than with python >> > > Since round() (and the __round__() interface) is part of Python and not > numpy, there is nothing in numpy to be consistent with. We only implement > __round__() for the scalar types. > Maybe I misunderstand I'm using np.round a lot. So maybe it's a question whether and how it will affect np.round. Does the following change with the proposal? np.round(np.array([1, 2.5, 2e20, 2e200])) array([1.e+000, 2.e+000, 2.e+020, 2.e+200]) np.round(np.array([1, 2.5, 2e20, 2e200])).astype(int) array([ 1, 2, -2147483648, -2147483648]) np.round(np.array([2e200])[0]) 2e+200 np.round(2e200) 2e+200 round(2e200) 199999999999999993946624442502072331894900655091004725296483501900693696871108151068392676809412503736055024831947764816364271468736556969278770082094479755742047182133579963622363626612334257709776896 Josef "around 100" sounds like "something all_close(100)" > > -- > Robert Kern > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Wed Feb 26 19:03:39 2020 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Wed, 26 Feb 2020 19:03:39 -0500 Subject: [Numpy-discussion] Output type of round is inconsistent with python built-in In-Reply-To: References: Message-ID: On Wed, Feb 26, 2020 at 6:57 PM wrote: > > > On Wed, Feb 26, 2020 at 6:09 PM Robert Kern wrote: > >> On Wed, Feb 26, 2020 at 5:27 PM wrote: >> >>> great another object array >>> >>> np.asarray([round(x_i.item()) for x_i in np.array([1, 2.5, 2e20, >>> 2e200])]) >>> array([1, 2, 200000000000000000000, >>> >>> 199999999999999993946624442502072331894900655091004725296483501900693696871108151068392676809412503736055024831947764816364271468736556969278770082094479755742047182133579963622363626612334257709776896], >>> dtype=object) >>> >>> >>> I would rather have numpy consistent with numpy than with python >>> >> >> Since round() (and the __round__() interface) is part of Python and not >> numpy, there is nothing in numpy to be consistent with. We only implement >> __round__() for the scalar types. >> > > > Maybe I misunderstand > > I'm using np.round a lot. So maybe it's a question whether and how it will > affect np.round. > > Does the following change with the proposal? 
> > np.round(np.array([1, 2.5, 2e20, 2e200])) > array([1.e+000, 2.e+000, 2.e+020, 2.e+200]) > > np.round(np.array([1, 2.5, 2e20, 2e200])).astype(int) > array([ 1, 2, -2147483648, -2147483648]) > > np.round(np.array([2e200])[0]) > 2e+200 > > np.round(2e200) > 2e+200 > > round(2e200) > > 199999999999999993946624442502072331894900655091004725296483501900693696871108151068392676809412503736055024831947764816364271468736556969278770082094479755742047182133579963622363626612334257709776896 > > Josef > "around 100" sounds like "something all_close(100)" > I guess I'm slow It only affects this case, as long as we don't have __round__ in arrays round(np.float64(2e200)) 2e+200 round(np.array([1, 2.5, 2e20, 2e200])) --------------------------------------------------------------------------- TypeError Traceback (most recent call last) in ----> 1 round(np.array([1, 2.5, 2e20, 2e200])) TypeError: type numpy.ndarray doesn't define __round__ method Josef > > >> >> -- >> Robert Kern >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Wed Feb 26 19:06:36 2020 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 26 Feb 2020 19:06:36 -0500 Subject: [Numpy-discussion] Output type of round is inconsistent with python built-in In-Reply-To: References: Message-ID: On Wed, Feb 26, 2020 at 6:59 PM wrote: > > > On Wed, Feb 26, 2020 at 6:09 PM Robert Kern wrote: > >> On Wed, Feb 26, 2020 at 5:27 PM wrote: >> >>> great another object array >>> >>> np.asarray([round(x_i.item()) for x_i in np.array([1, 2.5, 2e20, >>> 2e200])]) >>> array([1, 2, 200000000000000000000, >>> >>> 199999999999999993946624442502072331894900655091004725296483501900693696871108151068392676809412503736055024831947764816364271468736556969278770082094479755742047182133579963622363626612334257709776896], >>> dtype=object) >>> >>> >>> I would rather have numpy consistent with numpy than with python >>> >> >> Since round() (and the __round__() interface) is part of Python and not >> numpy, there is nothing in numpy to be consistent with. We only implement >> __round__() for the scalar types. >> > > > Maybe I misunderstand > > I'm using np.round a lot. So maybe it's a question whether and how it will > affect np.round. > Nope, not changing. > Does the following change with the proposal? > > np.round(np.array([1, 2.5, 2e20, 2e200])) > array([1.e+000, 2.e+000, 2.e+020, 2.e+200]) > > np.round(np.array([1, 2.5, 2e20, 2e200])).astype(int) > array([ 1, 2, -2147483648, -2147483648]) > > np.round(np.array([2e200])[0]) > 2e+200 > > np.round(2e200) > 2e+200 > No change. > round(2e200) > > 199999999999999993946624442502072331894900655091004725296483501900693696871108151068392676809412503736055024831947764816364271468736556969278770082094479755742047182133579963622363626612334257709776896 > Obviously, not under out control, but no, that's not changing. This is the only result that will change: round(np.float64(2e200)) 2e+200 > Josef > "around 100" sounds like "something all_close(100)" > I know. It's meant to be read as "array-round". We prefer the `around()` spelling to avoid shadowing the built-in. Early mistake that we're still living with. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ilhanpolat at gmail.com Thu Feb 27 00:03:39 2020 From: ilhanpolat at gmail.com (Ilhan Polat) Date: Thu, 27 Feb 2020 05:03:39 +0000 Subject: [Numpy-discussion] Output type of round is inconsistent with python built-in In-Reply-To: References: Message-ID: It's not about what I want but this changes the output of round. In my example I didn't use any arrays but a scalar type which looks like will upcasted. On Wed, Feb 26, 2020, 23:04 Robert Kern wrote: > On Wed, Feb 26, 2020 at 5:41 PM wrote: > >> >> >> On Wed, Feb 26, 2020 at 5:30 PM Ilhan Polat wrote: >> >>> Does this mean that np.round(np.float32(5)) return a 64 bit upcasted int? >>> >>> That would be really awkward for many reasons pandas frame size being >>> bloated just by rounding for an example. Or numpy array size growing for no >>> apparent reason >>> >>> I am not really sure if I understand why LSP should hold in this case to >>> be honest. Rounding is an operation specific for the number instance and >>> not for the generic class. >>> >>> >>> >>> >>> On Wed, Feb 26, 2020, 21:38 Robert Kern wrote: >>> >>>> On Wed, Feb 26, 2020 at 3:19 PM Hameer Abbasi < >>>> einstein.edison at gmail.com> wrote: >>>> >>>>> >>>>> There still remains the question, do we return Python ints or np.int64 >>>>> s? >>>>> >>>>> - Python ints have the advantage of not overflowing. >>>>> - If we decide to add __round__ to arrays in the future, Python ints >>>>> may become inconsistent with our design, as such a method will return an >>>>> int64 array. >>>>> >>>>> >>>>> >>>>> This was issue was discussed in the weekly triage meeting today, and >>>>> the following plan of action was proposed: >>>>> >>>>> - change scalar floats to return integers for __round__ (which >>>>> integer type was not discussed, I propose np.int64) >>>>> - not change anything else: not 0d arrays and not other numpy >>>>> functionality >>>>> >>>>> >> I think making numerical behavior different between arrays and numpy >> scalars with the same dtype, will create many happy debugging hours. >> > > round(some_ndarray) isn't implemented, so there is no difference to worry > about. > > If you want the float->float rounding, use np.around(). That function > should continue to behave like it currently does for both arrays and > scalars. > > -- > Robert Kern > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Thu Feb 27 00:10:40 2020 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 27 Feb 2020 00:10:40 -0500 Subject: [Numpy-discussion] Output type of round is inconsistent with python built-in In-Reply-To: References: Message-ID: Your example used np.round(), not the builtin round(). np.round() is not changing. If you want the dtype of the output to be the dtype of the input, you can certainly keep using np.round() (or its canonical spelling, np.around()). On Thu, Feb 27, 2020, 12:05 AM Ilhan Polat wrote: > It's not about what I want but this changes the output of round. In my > example I didn't use any arrays but a scalar type which looks like will > upcasted. > > On Wed, Feb 26, 2020, 23:04 Robert Kern wrote: > >> On Wed, Feb 26, 2020 at 5:41 PM wrote: >> >>> >>> >>> On Wed, Feb 26, 2020 at 5:30 PM Ilhan Polat >>> wrote: >>> >>>> Does this mean that np.round(np.float32(5)) return a 64 bit upcasted >>>> int? 
>>>> >>>> That would be really awkward for many reasons pandas frame size being >>>> bloated just by rounding for an example. Or numpy array size growing for no >>>> apparent reason >>>> >>>> I am not really sure if I understand why LSP should hold in this case >>>> to be honest. Rounding is an operation specific for the number instance and >>>> not for the generic class. >>>> >>>> >>>> >>>> >>>> On Wed, Feb 26, 2020, 21:38 Robert Kern wrote: >>>> >>>>> On Wed, Feb 26, 2020 at 3:19 PM Hameer Abbasi < >>>>> einstein.edison at gmail.com> wrote: >>>>> >>>>>> >>>>>> There still remains the question, do we return Python ints or >>>>>> np.int64s? >>>>>> >>>>>> - Python ints have the advantage of not overflowing. >>>>>> - If we decide to add __round__ to arrays in the future, Python >>>>>> ints may become inconsistent with our design, as such a method >>>>>> will return an int64 array. >>>>>> >>>>>> >>>>>> >>>>>> This was issue was discussed in the weekly triage meeting today, and >>>>>> the following plan of action was proposed: >>>>>> >>>>>> - change scalar floats to return integers for __round__ (which >>>>>> integer type was not discussed, I propose np.int64) >>>>>> - not change anything else: not 0d arrays and not other numpy >>>>>> functionality >>>>>> >>>>>> >>> I think making numerical behavior different between arrays and numpy >>> scalars with the same dtype, will create many happy debugging hours. >>> >> >> round(some_ndarray) isn't implemented, so there is no difference to worry >> about. >> >> If you want the float->float rounding, use np.around(). That function >> should continue to behave like it currently does for both arrays and >> scalars. >> >> -- >> Robert Kern >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From albuscode at gmail.com Thu Feb 27 02:30:33 2020 From: albuscode at gmail.com (Inessa Pawson) Date: Thu, 27 Feb 2020 17:30:33 +1000 Subject: [Numpy-discussion] help translating into Hindi Message-ID: Our collaboration with the students and faculty from the Master?s program in Survey Methodology at the University of Michigan and the University of Maryland at College Park is underway. We are looking for a volunteer to translate the survey questionnaire into Hindi. If you are available, or you know someone who would be interested to help, please leave a comment here: https://github.com/numpy/numpy-surveys/issues/1. -- Every good wish, *Inessa Pawson * Executive Director Albus Code -------------- next part -------------- An HTML attachment was scrubbed... URL: From ilhanpolat at gmail.com Thu Feb 27 02:41:29 2020 From: ilhanpolat at gmail.com (Ilhan Polat) Date: Thu, 27 Feb 2020 07:41:29 +0000 Subject: [Numpy-discussion] Output type of round is inconsistent with python built-in In-Reply-To: References: Message-ID: Oh sorry. That's trigger finger np-dotting. What i mean is if someone was using the round method on float32 or other small bit datatypes they would have a silent upcasting. Maybe not a big problem but can have significant impact. On Thu, Feb 27, 2020, 05:12 Robert Kern wrote: > Your example used np.round(), not the builtin round(). np.round() is not > changing. 
If you want the dtype of the output to be the dtype of the input, > you can certainly keep using np.round() (or its canonical spelling, > np.around()). > > On Thu, Feb 27, 2020, 12:05 AM Ilhan Polat wrote: > >> It's not about what I want but this changes the output of round. In my >> example I didn't use any arrays but a scalar type which looks like will >> upcasted. >> >> On Wed, Feb 26, 2020, 23:04 Robert Kern wrote: >> >>> On Wed, Feb 26, 2020 at 5:41 PM wrote: >>> >>>> >>>> >>>> On Wed, Feb 26, 2020 at 5:30 PM Ilhan Polat >>>> wrote: >>>> >>>>> Does this mean that np.round(np.float32(5)) return a 64 bit upcasted >>>>> int? >>>>> >>>>> That would be really awkward for many reasons pandas frame size being >>>>> bloated just by rounding for an example. Or numpy array size growing for no >>>>> apparent reason >>>>> >>>>> I am not really sure if I understand why LSP should hold in this case >>>>> to be honest. Rounding is an operation specific for the number instance and >>>>> not for the generic class. >>>>> >>>>> >>>>> >>>>> >>>>> On Wed, Feb 26, 2020, 21:38 Robert Kern wrote: >>>>> >>>>>> On Wed, Feb 26, 2020 at 3:19 PM Hameer Abbasi < >>>>>> einstein.edison at gmail.com> wrote: >>>>>> >>>>>>> >>>>>>> There still remains the question, do we return Python ints or >>>>>>> np.int64s? >>>>>>> >>>>>>> - Python ints have the advantage of not overflowing. >>>>>>> - If we decide to add __round__ to arrays in the future, Python >>>>>>> ints may become inconsistent with our design, as such a method >>>>>>> will return an int64 array. >>>>>>> >>>>>>> >>>>>>> >>>>>>> This was issue was discussed in the weekly triage meeting today, and >>>>>>> the following plan of action was proposed: >>>>>>> >>>>>>> - change scalar floats to return integers for __round__ (which >>>>>>> integer type was not discussed, I propose np.int64) >>>>>>> - not change anything else: not 0d arrays and not other numpy >>>>>>> functionality >>>>>>> >>>>>>> >>>> I think making numerical behavior different between arrays and numpy >>>> scalars with the same dtype, will create many happy debugging hours. >>>> >>> >>> round(some_ndarray) isn't implemented, so there is no difference to >>> worry about. >>> >>> If you want the float->float rounding, use np.around(). That function >>> should continue to behave like it currently does for both arrays and >>> scalars. >>> >>> -- >>> Robert Kern >>> _______________________________________________ >>> NumPy-Discussion mailing list >>> NumPy-Discussion at python.org >>> https://mail.python.org/mailman/listinfo/numpy-discussion >>> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mikofski at berkeley.edu Thu Feb 27 03:21:03 2020 From: mikofski at berkeley.edu (Dr. Mark Alexander Mikofski PhD) Date: Thu, 27 Feb 2020 00:21:03 -0800 Subject: [Numpy-discussion] [ANN] pvlib participating in GSoC Message-ID: Excited to announce that pvlib python is participating in its first ever Google Summer of Code GSoC under the NumFOCUS umbrella. 
If you are a student interested in modeling renewable solar energy please apply: https://summerofcode.withgoogle.com/organizations/4727917315096576/ For project ideas, take a look at the pvlib github wiki: https://summerofcode.withgoogle.com/organizations/4727917315096576/ and reach out on our Google Groups forum: https://groups.google.com/forum/#!forum/pvlib-python We look forward to hearing from you! -- Mark Mikofski, PhD (2005) *Fiat Lux* -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefano.miccoli at polimi.it Thu Feb 27 04:07:19 2020 From: stefano.miccoli at polimi.it (Stefano Miccoli) Date: Thu, 27 Feb 2020 09:07:19 +0000 Subject: [Numpy-discussion] Output type of round is inconsistent with python built-in In-Reply-To: References: Message-ID: <4571E446-E241-401B-A589-0F15F2767F75@polimi.it> There several mixed issues here. 1. PEP 3141 compliance. Numpy scalars are `numbers.Real` instances, and have to respect the `__round__` semantics defined by PEP 3141: def __round__(self, ndigits:Integral=None): """Rounds self to ndigits decimal places, defaulting to 0. If ndigits is omitted or None, returns an Integral, otherwise returns a Real, preferably of the same type as self. Types may choose which direction to round half. For example, float rounds half toward even. """ This means that if Real -> Real rounding is desired one should call `round(x, 0)` or `np.around(x)`. This semantics only dictates that the return type should be Integral, so for `round(x)` and `round(x, None)` np.float32 -> np.int32 np.float32 -> np.int64 np.float64 -> np.int64 np.floatXX -> int are all OK. I think also that it is perfectly OK to raise an overflow on `round(x)` 2. Liskov substitution principle `np.float64` floats are also `float` instances (but `np.float32` are not.) This means that strictly respecting LSP means that `np.float64` should round to python `int`, since `round(x)` never overflows for python `float`. Here we have several options. - round `np.float64` -> `int` and respect LSP. - relax LSP, and round `np.float64` -> `np.int64`. Who cares about `round(1e300)`? - decide that there is no reason for having `np.float64` a subclass of `float`, so that LSP does not apply. This all said, I think that these are the two most sensible choices for `round(x)`: np.float32 -> np.int32 np.float64 -> np.int64 drop np.float64 subclassing python float or np.float32 -> int np.float64 -> int keep np.float64 subclassing python float The second one seems to me the less disruptive one. Bye Stefano From einstein.edison at gmail.com Thu Feb 27 04:48:12 2020 From: einstein.edison at gmail.com (Hameer Abbasi) Date: Thu, 27 Feb 2020 09:48:12 +0000 Subject: [Numpy-discussion] Output type of round is inconsistent with python built-in In-Reply-To: References: Message-ID: Hello, Ilhan, From: NumPy-Discussion on behalf of Ilhan Polat Reply to: Discussion of Numerical Python Date: Thursday, 27. February 2020 at 08:41 To: Discussion of Numerical Python Subject: Re: [Numpy-discussion] Output type of round is inconsistent with python built-in Oh sorry. That's trigger finger np-dotting. What i mean is if someone was using the round method on float32 or other small bit datatypes they would have a silent upcasting. No they won?t. The only affected types would be scalars, and that too only with the built-in Python round. Arrays don?t define the __round__ method, and so won?t be affected. np.ndarray.round won?t be affected either. 
Only np_scalar_types.__round__ will be affected, which is what the Python round checks for. For illustration, in code:

>>> type(round(np_float))
>>> type(round(np_array_0d))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: type numpy.ndarray doesn't define __round__ method
>>> type(round(np_array_nd))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: type numpy.ndarray doesn't define __round__ method

The second and third cases would remain unaffected. Only the first case would return a builtin Python int with what Robert Kern is suggesting, and a np.int64 with what I'm suggesting. I do agree with something posted elsewhere on this thread that we should warn on overflow, but I prefer to be self-consistent and return a np.int64; it doesn't matter too much to me. Furthermore, the behavior of np.[a]round and np_arr.round(...) will not change. The only upcasting problem here is if someone does this in a loop, in which case they're probably using Python objects and don't care about memory anyway.

Maybe not a big problem but can have significant impact.

Best regards,
Hameer Abbasi
-------------- next part -------------- An HTML attachment was scrubbed...
URL: From robert.kern at gmail.com Thu Feb 27 09:46:23 2020 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 27 Feb 2020 09:46:23 -0500 Subject: [Numpy-discussion] Output type of round is inconsistent with python built-in In-Reply-To: References: Message-ID: On Thu, Feb 27, 2020 at 4:49 AM Hameer Abbasi wrote:
> Hello, Ilhan,
>
> *From: *NumPy-Discussion on behalf of Ilhan Polat
> *Reply to: *Discussion of Numerical Python
> *Date: *Thursday, 27. February 2020 at 08:41
> *To: *Discussion of Numerical Python
> *Subject: *Re: [Numpy-discussion] Output type of round is inconsistent
> with python built-in
>
> Oh sorry. That's trigger finger np-dotting.
>
> What I mean is if someone was using the round method on float32 or other
> small bit datatypes they would have a silent upcasting.
>
> No they won't. The only affected types would be scalars, and that too only
> with the built-in Python round.
>
Just to be clear, his example _did_ use numpy scalars.

-- 
Robert Kern
-------------- next part -------------- An HTML attachment was scrubbed...
URL: From robert.kern at gmail.com Thu Feb 27 09:47:58 2020 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 27 Feb 2020 09:47:58 -0500 Subject: [Numpy-discussion] Output type of round is inconsistent with python built-in In-Reply-To: References: Message-ID: On Thu, Feb 27, 2020 at 2:43 AM Ilhan Polat wrote:
> Oh sorry. That's trigger finger np-dotting.
>
> What I mean is if someone was using the round method on float32 or other
> small bit datatypes they would have a silent upcasting.
>
> Maybe not a big problem but can have significant impact.
>
np.round()/np.around() will still exist and behave as you would want it to in such cases (float32->float32, float64->float64).

-- 
Robert Kern
-------------- next part -------------- An HTML attachment was scrubbed...
URL: From mikofski at berkeley.edu Thu Feb 27 16:27:16 2020 From: mikofski at berkeley.edu (Dr. Mark Alexander Mikofski PhD) Date: Thu, 27 Feb 2020 13:27:16 -0800 Subject: [Numpy-discussion] [ANN] pvlib participating in GSoC In-Reply-To: References: Message-ID: Sorry, the correct link to the wiki list of project ideas is here: https://github.com/pvlib/pvlib-python/wiki/GSoC-2020-Project On Thu, Feb 27, 2020 at 12:21 AM Dr.
Mark Alexander Mikofski PhD < mikofski at berkeley.edu> wrote: > Excited to announce that pvlib python > is participating in its > first ever Google Summer of Code GSoC under the NumFOCUS umbrella. If you > are a student interested in modeling renewable solar energy please apply: > > https://summerofcode.withgoogle.com/organizations/4727917315096576/ > > > For project ideas, take a look at the pvlib github wiki: > > https://summerofcode.withgoogle.com/organizations/4727917315096576/ > > > and reach out on our Google Groups forum: > > https://groups.google.com/forum/#!forum/pvlib-python > > > We look forward to hearing from you! > > -- > Mark Mikofski, PhD (2005) > *Fiat Lux* > -- Mark Mikofski, PhD (2005) *Fiat Lux* -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Thu Feb 27 17:41:15 2020 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Thu, 27 Feb 2020 17:41:15 -0500 Subject: [Numpy-discussion] NEP 37: A dispatch protocol for NumPy-like modules In-Reply-To: References: <21692339-9f4b-029c-d422-ea549acbe6c3@gmail.com> <1cfce715d48b847e91739c2a56b9750f15b1958f.camel@sipsolutions.net> Message-ID: On Sun, 2020-02-23 at 22:44 -0800, Stephan Hoyer wrote: > > > On Sun, Feb 23, 2020 at 3:59 PM Ralf Gommers > wrote: > > > Also, I'm still not sure I agree with the tone of the discussion on > > this topic. It's very heavily inspired by what the JAX devs are > > telling you (the NEP still says PyTorch and scipy.sparse as well, > > but that's not true in both cases). If you ask Dask and CuPy for > > example, they're quite happy with __array_function__ and there > > haven't been many complaints about backwards compat breakage. > > > > I'm linking to comments you wrote in reference to PyTorch and > scipy.sparse in the current draft of the NEP, so I certainly want to > make sure that you agree my characterization :). > > Would it be fair to say: > - JAX is reluctant to implement __array_function__ because of > concerns about breaking existing code. JAX developers think that when > users use NumPy functions on JAX arrays, they are explicitly choosing > to convert from JAX to NumPy. This model is fundamentally > incompatible __array_function__, which we chose to override the > existing numpy namespace. > - PyTorch and scipy.sparse are not yet in position to implement > __array_function__ (due to a lack of a direct implementation of > NumPy's API), but these projects take backwards compatibility > seriously. > > Does "take backwards compatibility seriously" sound about right to > you? I'm very open to specific suggestions here. (TensorFlow could > probably also be safely added to this second list.) > Just to be clear, the way scikit-learn would probably be handling backward compatibility concerns is by adding it to their configuration context manager, see: https://github.com/scikit-learn/scikit-learn/pull/16574 So the backward compat is in a sense solved (but there are project specific context managers involved ? which is not perfect maybe, but OK). I am willing to consider pushing this off into its own namespace (and package, preferably in the NumPy org though) if necessary, the idea being that we keep it super minimal, and expand it as we go to keep up with scikit-learn needs. Possibly even with a function registration approach, so that you could have import time checks on function availability and signature mismatch easier. 
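To make the "function registration" idea slightly more concrete, a very rough sketch could look like the following (all names here are made up for illustration; nothing like this exists in NumPy today, and it is not what the NEP itself specifies):

    import inspect
    import numpy as np

    _REGISTRY = {}

    def register(numpy_func):
        # Register a backend's implementation of one NumPy function and
        # check, when the backend is imported, that its parameter names
        # match the NumPy function it claims to implement.
        def decorator(impl):
            expected = list(inspect.signature(numpy_func).parameters)
            got = list(inspect.signature(impl).parameters)
            if expected != got:
                raise TypeError(
                    "%s.%s does not match the signature of numpy.%s"
                    % (impl.__module__, impl.__name__, numpy_func.__name__))
            _REGISTRY.setdefault(numpy_func.__name__, {})[impl.__module__] = impl
            return impl
        return decorator

    # A backend would register its functions at import time:
    @register(np.sum)
    def sum(a, axis=None, dtype=None, out=None, keepdims=False, initial=0,
            where=True):
        ...  # the backend's own implementation goes here

A mismatched or missing function would then fail as soon as the backend is imported, rather than at call time deep inside user code.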
I still do not like the idea of context managers much, though; I think I prefer the returned (bound) namespace a lot. Also I think we should *not* do implicit dispatching. Consider this case:

    def numpy_only(x):
        x = np.asarray(x)
        return x + _helper(len(x))

    def generic(x):
        module = np.get_array_module(x)
        x = module.asarray(x)
        return x + _helper(len(x))

    def _helper(n, module=np):
        return module.random.uniform(size=n)

If you try to make the above work with context managers, you _still_ need to pass in the module to _helper [1], because otherwise you would have to change the `numpy_only` function to ensure an outside context does not change its behaviour.

- Sebastian

[1] If "module" had a `module.set_backend()` and was a global instead, `_helper` using the global module would do the wrong thing for `numpy_only`. This is of course also a bit of an issue with the sklearn context manager, but it seems to me _much_ less so, and probably not if most libraries slowly switch over and currently use `np.asarray`.

> Best,
> Stephan
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
-------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From allanhaldane at gmail.com Fri Feb 28 11:28:28 2020 From: allanhaldane at gmail.com (Allan Haldane) Date: Fri, 28 Feb 2020 11:28:28 -0500 Subject: [Numpy-discussion] NEP 37: A dispatch protocol for NumPy-like modules In-Reply-To: References: <21692339-9f4b-029c-d422-ea549acbe6c3@gmail.com> <1cfce715d48b847e91739c2a56b9750f15b1958f.camel@sipsolutions.net> Message-ID: <39784d9f-17c0-d5b4-4575-3ad2826ea3ce@gmail.com> On 2/23/20 6:59 PM, Ralf Gommers wrote:
> One of the main rationales for the whole NEP, and the argument in
> multiple places
> (https://numpy.org/neps/nep-0037-array-module.html#opt-in-vs-opt-out-for-users)
> is that it's now opt-in while __array_function__ was opt-out. This isn't
> really true - the problem is simply *moved*, from the duck array
> libraries to the array-consuming libraries. The end user will still see
> the backwards incompatible change, with no way to turn it off. It will
> be easier with __array_module__ to warn users, but this should be
> expanded on in the NEP.

Might it be possible to flip this NEP back to opt-out while keeping the nice simplifications and configurable array-creation routines, relative to __array_function__?

That is, what if we define two modules, "numpy" and "numpy_strict". "numpy_strict" would raise an exception on duck-arrays defining __array_module__ (as numpy currently does). "numpy" would be a wrapper around "numpy_strict" that decorates all numpy methods with a call to "get_array_module(inputs).func(inputs)".

Then end-user code that did "import numpy as np" would accept ducktypes by default, while library developers who want to signal they don't support ducktypes can opt out by doing "import numpy_strict as np". Issues with `np.asarray` seem mitigated compared to __array_function__ since that method would now be ducktype-aware.
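Roughly, I imagine the wrapping working something like this (just a sketch; "numpy_strict" does not exist, and the helper below only approximates NEP 37's get_array_module for illustration):

    import functools
    import numpy as np  # standing in for the hypothetical "numpy_strict"

    def _get_array_module(*args, default=np):
        # Simplified stand-in for NEP 37's get_array_module(): defer to the
        # first positional argument whose type defines __array_module__,
        # otherwise fall back to plain (strict) NumPy.
        for arg in args:
            handler = getattr(type(arg), "__array_module__", None)
            if handler is not None:
                return handler(arg, tuple(type(a) for a in args))
        return default

    def _dispatching(strict_func):
        # Wrap one strict function so the permissive "numpy" namespace
        # re-dispatches it based on the argument types.
        @functools.wraps(strict_func)
        def wrapper(*args, **kwargs):
            module = _get_array_module(*args, default=np)
            return getattr(module, strict_func.__name__)(*args, **kwargs)
        return wrapper

    # The permissive module would simply re-export wrapped versions:
    concatenate = _dispatching(np.concatenate)
    mean = _dispatching(np.mean)

With something like that, mean(duck_array) goes through the duck type's own module, while mean(np.ones(3)) falls through to the strict implementation unchanged.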
Cheers,
-Allan

From albuscode at gmail.com Fri Feb 28 23:16:23 2020 From: albuscode at gmail.com (Inessa Pawson) Date: Sat, 29 Feb 2020 14:16:23 +1000 Subject: [Numpy-discussion] help translating into Russian Message-ID: Our collaboration with the students and faculty from the Master's program in Survey Methodology at the University of Michigan and the University of Maryland is underway. We are looking for a volunteer to translate the survey questionnaire into Russian. If you are available, or you know someone who would be interested to help, please leave a comment here: https://github.com/numpy/numpy-surveys/issues/1.
-- 
Every good wish,
*Inessa Pawson *
Executive Director
Albus Code
-------------- next part -------------- An HTML attachment was scrubbed...
URL: